TiCDC stuck with many "region failed" log entries; recovered after removing a TiKV node

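TiDBer_gLV5ml22 · October 29, 2024, 04:42

[TiDB environment] Production
[TiDB version] 5.4
[CDC version] 6.5.9
[Reproduction path] One TiKV node hit OOM and was restarted. After that, TiCDC got stuck, and restarting TiCDC did not help. Replication recovered only after we removed that TiKV node and restarted TiCDC.
[Problem: symptoms and impact]
TiCDC is stuck and its logs contain a large number of "region failed" messages. The logs also show many cancel entries, but we cannot determine why the cancellation happens. In addition, the log of one CDC node contains many "LastSyncedTs should not be greater than newLastSyncedTs" WARN entries.

[Resource configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: screenshots/logs/monitoring]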

[2024/10/28 16:31:21.117 +08:00] [INFO] [client.go:613] ["region failed"] [span="[7480000000000005ffcb5f7280000000e8ffc0771a0000000000fa, 7480000000000005ffcb5f7280000000e8ffc5a1560000000000fa)"] [regionId=232179] [error="[CDC:ErrEventFeedAborted]single event feed aborted"]

[2024/10/28 16:31:33.572 +08:00] [WARN] [changefeed.go:373] ["LastSyncedTs should not be greater than newLastSyncedTs"] [c.LastSyncedTs=453536364393136132] [newLastSyncedTs=0]
[2024/10/28 16:31:33.572 +08:00] [WARN] [changefeed.go:373] ["LastSyncedTs should not be greater than newLastSyncedTs"] [c.LastSyncedTs=453536365664534601] [newLastSyncedTs=0]
[2024/10/28 16:31:33.572 +08:00] [WARN] [changefeed.go:373] ["LastSyncedTs should not be greater than newLastSyncedTs"] [c.LastSyncedTs=452650171763785820] [newLastSyncedTs=0]
[2024/10/28 16:31:33.572 +08:00] [WARN] [changefeed.go:373] ["LastSyncedTs should not be greater than newLastSyncedTs"] [c.LastSyncedTs=453536365638320213] [newLastSyncedTs=0]
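
For anyone reading these WARN lines, it can help to turn the raw ts values into wall-clock times. The sketch below assumes that c.LastSyncedTs is an ordinary TiDB TSO (physical milliseconds shifted left by 18 bits, plus an 18-bit logical counter); the thread does not confirm that, and the helper names split_tso and tso_to_datetime are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# TiDB TSO layout: (physical time in ms since the Unix epoch) << 18 | logical counter.
LOGICAL_BITS = 18
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def split_tso(tso: int) -> tuple[int, int]:
    """Split a TSO into (physical_ms, logical)."""
    return tso >> LOGICAL_BITS, tso & ((1 << LOGICAL_BITS) - 1)


def tso_to_datetime(tso: int) -> datetime:
    """Convert the physical part of a TSO to a UTC datetime."""
    physical_ms, _ = split_tso(tso)
    return EPOCH + timedelta(milliseconds=physical_ms)


if __name__ == "__main__":
    # Values copied from the WARN lines above; treating them as TSOs is an assumption.
    for ts in (
        453536364393136132,
        453536365664534601,
        452650171763785820,
        453536365638320213,
    ):
        print(ts, tso_to_datetime(ts).isoformat())
```

Under that assumption, the third value (452650171763785820) decodes to a time about 39 days earlier than the other three, so that changefeed may be worth checking separately.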

Additional cancel logs:

[2024/10/28 16:28:21.088 +08:00] [ERROR] [client.go:1068] ["region worker exited with error"] [namespace=default] [xxxxx] [tableID=1483] [tableName=`xxx`.`xxx`] [store=xxxxx] [storeID=4] [streamID=2981] [error="context canceled"] [errorVerbose="context canceled\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20220729040631-518f63d66278/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/errors@v0.11.5-0.20220729040631-518f63d66278/juju_adaptor.go:15\ngithub.com/pingcap/tiflow/cdc/kv.(*regionWorker).eventHandler\n\tgithub.com/pingcap/tiflow/cdc/kv/region_worker.go:480\ngithub.com/pingcap/tiflow/cdc/kv.(*regionWorker).run.func4\n\tgithub.com/pingcap/tiflow/cdc/kv/region_worker.go:654\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.5.0/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1594"]
[2024/10/28 16:28:21.088 +08:00] [ERROR] [client.go:1068] ["region worker exited with error"] [namespace=default] [changefeed=xxxxxx] [tableID=1483] [tableName=`xxxx`.`xxxx`] [store=xxxxx] [storeID=4489703] [streamID=2977] [error="context canceled"] [errorVerbose="context canceled\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20220729040631-518f63d66278/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/errors@v0.11.5-0.20220729040631-518f63d66278/juju_adaptor.go:15\ngithub.com/pingcap/tiflow/cdc/kv.(*regionWorker).eventHandler\n\tgithub.com/pingcap/tiflow/cdc/kv/region_worker.go:480\ngithub.com/pingcap/tiflow/cdc/kv.(*regionWorker).run.func4\n\tgithub.com/pingcap/tiflow/cdc/kv/region_worker.go:654\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.5.0/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1594"]
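
To see whether these errors cluster on one TiKV store, the entries can be tallied by message and storeID. This is a minimal sketch that only understands the bracketed line layout shown in the logs above; the function name tally and the choice of fields are illustrative, not part of any TiCDC tooling.

```python
import re
from collections import Counter

# Header of a log line: [time] [LEVEL] [file:line] ["message"] or [message]
HEADER = re.compile(
    r'^\[(?P<time>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<src>[^\]]+)\] '
    r'\[(?:"(?P<quoted>[^"]*)"|(?P<bare>[^\]]*))\]'
)
# Numeric key=value fields that appear in the lines quoted in this thread.
FIELD = re.compile(r'\[(storeID|regionId|tableID)=(\d+)\]')


def tally(lines, field="storeID"):
    """Count log entries grouped by (message, value of `field`)."""
    counts = Counter()
    for line in lines:
        m = HEADER.match(line)
        if m is None:
            continue  # not a log header line (for example a continuation line)
        msg = m["quoted"] if m["quoted"] is not None else m["bare"]
        fields = dict(FIELD.findall(line))
        counts[(msg, fields.get(field, "-"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python tally.py <path to a TiCDC log file>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        for (msg, value), n in tally(f).most_common(20):
            print(f"{n:8d}  {msg}  storeID={value}")
```

Passing field="regionId" groups the "region failed" entries by region instead of by store.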

Does this look like a TiKV problem?

region not receiving resolved event from tikv or resolved ts is not pushing for too long time, try to resolve lock

Additional monitoring:

WalterWj · October 29, 2024, 06:47

[CDC version] 6.5.9
Why are you still running mixed versions?

TiDBer_gLV5ml22 · October 29, 2024, 06:56

Because the business database can't be touched. CDC performance was insufficient before, and 5.4 has some bugs, so we upgraded CDC on its own.

xfworld · October 30, 2024, 01:10

CDC follows the TiDB version; running mixed versions will probably be difficult…

Above all, troubleshooting will be very difficult.

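TiDBer_gLV5ml22 · October 30, 2024, 01:38

TiCDC has a minimum supported version; I checked before upgrading and 5.4 was supported…
Upgrading CDC was a last resort: business QPS is going up to 10x, and 5.4 can't keep up.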

TiDBer_gLV5ml22 · October 30, 2024, 01:38

By "difficult to troubleshoot", do you mean monitoring and logs?

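xfworld · October 30, 2024, 02:39

Features and behavior differ a lot between major TiDB versions, and even more so for CDC.

CDC in 6.1.x and 6.5.x is fairly stable; CDC in 5.x has more problems…

If you get the chance, it would be better to upgrade the TiDB cluster first.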

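TiDBer_gLV5ml22 · October 30, 2024, 02:52

Right. Our thinking is that we at least need to know why this went wrong before we can get the business side to upgrade. The logs contain entries such as "table stuck" and "region not receiving resolved event", which suggests the CDC module in TiKV is misbehaving, but we still haven't been able to pin down the root cause.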

TiDBer_gLV5ml22 · October 30, 2024, 06:54

We seem to have found something suspicious: there is a TiKV node that was not restarted, and it is the one whose ts is lagging.

Ti青涩 · October 30, 2024, 08:37

This TiKV node's ts is lagging. Does the TiKV component need to be restarted to fix it?

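TiDBer_gLV5ml22 · October 30, 2024, 12:45

We fixed it by removing the other node, the one that hit OOM and restarted (its monitoring showed no lag), and then restarting the CDC service.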

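Jellybean · October 30, 2024, 14:30

We also had insufficient CDC performance on 5.4, and tuning gave us a very large performance improvement.

You can refer to my earlier optimization experience and give it a try: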

xiaohaozifeifeifei · October 31, 2024, 05:42

It feels like this is mostly a version issue. How about upgrading everything in a test environment and seeing what happens?