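【TiDB environment】Production
【TiDB version】5.4
【CDC version】6.5.9
【Reproduction path】One TiKV node hit OOM and was restarted. After that, TiCDC got stuck, and restarting TiCDC did not help. TiCDC recovered only after that TiKV node was removed and TiCDC was restarted again.
【Problem: symptoms and impact】
TiCDC is stuck. The logs contain a large number of "region failed" messages, along with many cancel logs, but I cannot determine why the cancels happen. In addition, the log of one CDC node contains many WARN entries reading "LastSyncedTs should not be greater than newLastSyncedTs".
【Resource configuration】Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of that page
【Attachments: screenshots/logs/monitoring】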
[2024/10/28 16:31:21.117 +08:00] [INFO] [client.go:613] ["region failed"] [span="[7480000000000005ffcb5f7280000000e8ffc0771a0000000000fa, 7480000000000005ffcb5f7280000000e8ffc5a1560000000000fa)"] [regionId=232179] [error="[CDC:ErrEventFeedAborted]single event feed aborted"]
[2024/10/28 16:31:33.572 +08:00] [WARN] [changefeed.go:373] ["LastSyncedTs should not be greater than newLastSyncedTs"] [c.LastSyncedTs=453536364393136132] [newLastSyncedTs=0]
[2024/10/28 16:31:33.572 +08:00] [WARN] [changefeed.go:373] ["LastSyncedTs should not be greater than newLastSyncedTs"] [c.LastSyncedTs=453536365664534601] [newLastSyncedTs=0]
[2024/10/28 16:31:33.572 +08:00] [WARN] [changefeed.go:373] ["LastSyncedTs should not be greater than newLastSyncedTs"] [c.LastSyncedTs=452650171763785820] [newLastSyncedTs=0]
[2024/10/28 16:31:33.572 +08:00] [WARN] [changefeed.go:373] ["LastSyncedTs should not be greater than newLastSyncedTs"] [c.LastSyncedTs=453536365638320213] [newLastSyncedTs=0]
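To relate the LastSyncedTs values above to wall-clock time, it can help to decode them. The following is a minimal sketch, assuming these values are ordinary TiDB TSOs (physical milliseconds shifted left by 18 bits, plus a logical counter in the low 18 bits). The helper names `decode_tso` and `tso_to_datetime` are made up for illustration and are not part of TiCDC.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the value is a standard TiDB TSO, i.e.
# (physical time in ms since the Unix epoch) << 18 | logical counter.
LOGICAL_BITS = 18


def decode_tso(tso):
    """Split a TSO into (physical_ms, logical)."""
    physical_ms = tso >> LOGICAL_BITS
    logical = tso & ((1 << LOGICAL_BITS) - 1)
    return physical_ms, logical


def tso_to_datetime(tso, tz=timezone(timedelta(hours=8))):
    """Convert the physical part of a TSO to a datetime (default +08:00, as in the logs)."""
    physical_ms, _ = decode_tso(tso)
    return datetime.fromtimestamp(physical_ms / 1000, tz=tz)


if __name__ == "__main__":
    # LastSyncedTs values copied from the WARN lines above.
    for ts in (453536364393136132, 453536365664534601,
               452650171763785820, 453536365638320213):
        print(ts, tso_to_datetime(ts).isoformat())
```

Printing the decoded times next to the log timestamps makes it easier to see which tables stopped advancing around the time of the TiKV OOM and which had already been idle for a long time.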
Additional cancel logs:
[2024/10/28 16:28:21.088 +08:00] [ERROR] [client.go:1068] ["region worker exited with error"] [namespace=default] [xxxxx] [tableID=1483] [tableName=`xxx`.`xxx`] [store=xxxxx] [storeID=4] [streamID=2981] [error="context canceled"] [errorVerbose="context canceled\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20220729040631-518f63d66278/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/errors@v0.11.5-0.20220729040631-518f63d66278/juju_adaptor.go:15\ngithub.com/pingcap/tiflow/cdc/kv.(*regionWorker).eventHandler\n\tgithub.com/pingcap/tiflow/cdc/kv/region_worker.go:480\ngithub.com/pingcap/tiflow/cdc/kv.(*regionWorker).run.func4\n\tgithub.com/pingcap/tiflow/cdc/kv/region_worker.go:654\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.5.0/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1594"]
[2024/10/28 16:28:21.088 +08:00] [ERROR] [client.go:1068] ["region worker exited with error"] [namespace=default] [changefeed=xxxxxx] [tableID=1483] [tableName=`xxxx`.`xxxx`] [store=xxxxx] [storeID=4489703] [streamID=2977] [error="context canceled"] [errorVerbose="context canceled\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20220729040631-518f63d66278/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/errors@v0.11.5-0.20220729040631-518f63d66278/juju_adaptor.go:15\ngithub.com/pingcap/tiflow/cdc/kv.(*regionWorker).eventHandler\n\tgithub.com/pingcap/tiflow/cdc/kv/region_worker.go:480\ngithub.com/pingcap/tiflow/cdc/kv.(*regionWorker).run.func4\n\tgithub.com/pingcap/tiflow/cdc/kv/region_worker.go:654\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.5.0/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1594"]
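One way to check whether these errors concentrate on the TiKV node that hit OOM is to tally them per store. The sketch below assumes the log layout shown above (`["message"]` followed by `[key=value]` fields). The function name `tally` and the file name `cdc.log` are placeholders for illustration, and the regular expressions are deliberately simple.

```python
import re
from collections import Counter

# The first ["..."] group on a line is treated as the log message.
MSG_RE = re.compile(r'\["([^"]*)"\]')
# Simple [key=value] matcher. Values that contain "]" (such as the quoted
# error text) are truncated, which is acceptable for numeric IDs.
FIELD_RE = re.compile(r"\[(\w+)=([^\]]*)\]")


def tally(lines, message, field):
    """Count lines whose message equals `message`, grouped by the value of `field`."""
    counts = Counter()
    for line in lines:
        match = MSG_RE.search(line)
        if match is None or match.group(1) != message:
            continue
        fields = dict(FIELD_RE.findall(line))
        if field in fields:
            counts[fields[field]] += 1
    return counts


if __name__ == "__main__":
    # "cdc.log" is a placeholder path for the TiCDC log file.
    with open("cdc.log", encoding="utf-8") as f:
        lines = f.readlines()
    print(tally(lines, "region worker exited with error", "storeID").most_common(10))
    print(tally(lines, "region failed", "regionId").most_common(10))
```

If most "region worker exited with error" entries carry the storeID of the restarted TiKV node, that would support the suspicion that the problem is tied to that store. An even spread across stores would point elsewhere.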
Does this look like a TiKV problem?
region not receiving resolved event from tikv or resolved ts is not pushing for too long time, try to resolve lock
Adding monitoring.
Adding some more monitoring.