TiCDC同步卡住,提示:GetSnapshot is taking too long, DDL puller stuck?

【TiDB 版本】 --源
V5.2.1

【TiCDC 版本】
V5.2.2

【TiDB 版本】 --目标
V5.2.2

【故障说明】

  1. 一共有3个changefeed,1个同步到mysql,2个同步到TIDB;同步时间和tso都卡在2021-11-11 11:36:28.025、429025339120025644,延迟有2小时。
  2. 源端最近一次DDL在11:05分,我看目标库早已经同步过来。
  3. 源端查过没有正在执行的大事务,也没有锁冲突;目标库无业务运行;
  4. TiCDC无错误日志,运行日志有大量WARN信息:GetSnapshot is taking too long, DDL puller stuck?

【故障排查材料】

1.TiCDC运行日志
[2021/11/11 13:56:59.913 +08:00] [WARN] [schema_storage.go:726] [“GetSnapshot is taking too long, DDL puller stuck?”] [ts=429025521952358428]
[2021/11/11 13:56:59.914 +08:00] [WARN] [schema_storage.go:726] [“GetSnapshot is taking too long, DDL puller stuck?”] [ts=429025556830617610]
[2021/11/11 13:56:59.916 +08:00] [WARN] [schema_storage.go:726] [“GetSnapshot is taking too long, DDL puller stuck?”] [ts=429025341059891213]
[2021/11/11 13:56:59.919 +08:00] [WARN] [schema_storage.go:726] [“GetSnapshot is taking too long, DDL puller stuck?”] [ts=429025339670790159]
[2021/11/11 13:56:59.920 +08:00] [WARN] [schema_storage.go:726] [“GetSnapshot is taking too long, DDL puller stuck?”] [ts=429025344074547264]
[2021/11/11 13:56:59.920 +08:00] [WARN] [schema_storage.go:726] [“GetSnapshot is taking too long, DDL puller stuck?”] [ts=429025560605491251]
[2021/11/11 13:56:59.933 +08:00] [WARN] [schema_storage.go:726] [“GetSnapshot is taking too long, DDL puller stuck?”] [ts=429025412638310434]
[2021/11/11 13:56:59.935 +08:00] [WARN] [schema_storage.go:726] [“GetSnapshot is taking too long, DDL puller stuck?”] [ts=429025359305375750]
[2021/11/11 13:56:59.937 +08:00] [WARN] [schema_storage.go:726] [“GetSnapshot is taking too long, DDL puller stuck?”] [ts=429025396175929366]
[2021/11/11 13:56:59.938 +08:00] [WARN] [schema_storage.go:726] [“GetSnapshot is taking too long, DDL puller stuck?”] [ts=429025343917522948]
[2021/11/11 13:56:59.938 +08:00] [WARN] [schema_storage.go:726] [“GetSnapshot is taking too long, DDL puller stuck?”] [ts=429025386030956550]
[2021/11/11 13:56:59.939 +08:00] [WARN] [schema_storage.go:726] [“GetSnapshot is taking too long, DDL puller stuck?”] [ts=429025439888179820]
[2021/11/11 13:56:59.940 +08:00] [WARN] [schema_storage.go:726] [“GetSnapshot is taking too long, DDL puller stuck?”] [ts=429025449888972871]
[2021/11/11 13:56:59.941 +08:00] [WARN] [schema_storage.go:726] [“GetSnapshot is taking too long, DDL puller stuck?”] [ts=429025398980870154]
[2021/11/11 13:56:59.941 +08:00] [WARN] [schema_storage.go:726] [“GetSnapshot is taking too long, DDL puller stuck?”] [ts=429025350798540831]
[2021/11/11 13:56:59.944 +08:00] [WARN] [schema_storage.go:726] [“GetSnapshot is taking too long, DDL puller stuck?”] [ts=429025396175929366]
[2021/11/11 13:56:59.946 +08:00] [WARN] [schema_storage.go:726] [“GetSnapshot is taking too long, DDL puller stuck?”] [ts=429025342423302158]
[2021/11/11 13:56:59.946 +08:00] [WARN] [schema_storage.go:726] [“GetSnapshot is taking too long, DDL puller stuck?”] [ts=429025521952358428]
[2021/11/11 13:56:59.949 +08:00] [WARN] [schema_storage.go:726] [“GetSnapshot is taking too long, DDL puller stuck?”] [ts=429025522437586972]
[2021/11/11 13:56:59.952 +08:00] [WARN] [schema_storage.go:726] [“GetSnapshot is taking too long, DDL puller stuck?”] [ts=429025443322527752]
[2021/11/11 13:56:59.955 +08:00] [WARN] [schema_storage.go:726] [“GetSnapshot is taking too long, DDL puller stuck?”] [ts=429025407329894432]

大概过滤下:
$ grep WARN cdc.log |awk -F ‘[’ ‘{print $5}’ |awk -F’]’ ‘{print $1}’ |sort |uniq -c
828454 “GetSnapshot is taking too long, DDL puller stuck?”
285284 “region not receiving resolved event from tikv or resolved ts is not pushing for too long time, try to resolve lock”
1 “send request to stream failed”
6598 “The resolvedTs is fallen back in kvclient”

这是一天的日志信息:
cdc.log1.tar.gz (14.6 MB)
cdc-2021-11-11T13-16-42.043.rar (10.6 MB)
cdc-2021-11-11T13-37-34.453.rar (10.2 MB)
cdc-2021-11-11T14-04-29.518.rar (10.0 MB)
cdc-2021-11-11T12-28-03.407.rar (11.4 MB)
cdc-2021-11-11T12-55-02.717.rar (10.9 MB)

TiCDC监控快照
tidb-ts-prd-TiCDC_2021-11-11T06_11_42.715Z.json (151.6 KB)

3.changgefeed截图

4.capture截图

  1. 源库监控快照
    tidb-ts-prd-TiDB_2021-11-11T06_18_55.213Z.json (9.0 MB)

6.目标监控快照
这个没业务跑,就不提交了

稍等,我们分析一下

1 个赞

有结果了吗

有个怀疑,但需要和我们研发确认一下https://github.com/tikv/tikv/pull/10976,看看怎么解决比较好

另外,能提供一下 tikv 的日志不(开始卡住时 的日志就行,不需要提供到现在的)

源节点

tikv.log.tar.gz (25.2 MB)

对 ,提供一台就行,现在还没好,对吧

我们确认了,就是上面的 https://github.com/tikv/tikv/pull/10976 这个原因,目前解决办法:轮着重启一下 tikv(彻底解决需要在 v5.3 里解决了,月底才发布:我看看能合并到 v5.2 不)

2 个赞

目标节点有抽数,16点30的时候我重启了cdc,目前已经恢复了。 不过这问题两天时间出现3次,不重启CDC集群,就会一直卡着

不用重启kv,cdc多重启几次基本都能恢复

@levy 这个我们研发反馈,已经在 v5.2.2 修复了,咱们是否考虑 上游升级到 v5.2.2 啊

这个得评估下,怕有新的bug

谢谢老师了

好滴,客气。

此话题已在最后回复的 1 分钟后被自动关闭。不再允许新回复。