TiCDC: changefeeds cannot be removed, replication lags, and status does not update

brightwen · October 18, 2021, 12:52

Issue 1:
Version: TiDB 4.0.14
Description:
TiCDC cannot remove a changefeed.
Command used:
tiup cdc cli changefeed remove --pd=http://172.29.1.23:2379 --changefeed-id kafka-ote-userquestionchoiceitem --force

Issue 2: Version: TiDB 4.0.14
Description:
After the changefeed is recreated, its replication status never updates.
-- Remove the changefeed:
tiup cdc cli changefeed remove --pd=http://172.29.1.23:2379 --changefeed-id kafka-ote-userquestionchoiceitem --force
-- Recreate the changefeed:
tiup cdc cli changefeed create --pd=http://172.29.1.23:2379 --sink-uri="kafka://172.29.1.47:9092/ote_userquestionchoiceitem?kafka-version=2.2.0&partition-num=1&max-message-bytes=10485760&replication-factor=1&enable-old-value=true&protocol=canal-json" --changefeed-id="kafka-ote-userquestionchoiceitem" --config=ote_userquestionchoiceitem.toml --sort-engine="unified"

The changefeed status never refreshes:
tiup cdc cli changefeed list --pd=http://172.29.1.23:2379
[
  {
    "id": "kafka-ote-userquestionchoiceitem",
    "summary": {
      "state": "normal",
      "tso": 0,
      "checkpoint": "",
      "error": null
    }
  }
]
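
A minimal sketch for spotting this symptom across many changefeeds. It assumes the list output is pretty-printed with one key per line, as in the output above. The function name `stalled_changefeeds` is made up for illustration and is not part of the cdc CLI.

```shell
# Print the ID of every changefeed whose summary reports "tso": 0.
# Reads `cdc cli changefeed list` JSON on stdin.
# Assumption: one key per line, with "id" appearing before "tso" in each entry.
stalled_changefeeds() {
  awk '
    /"id":/  { gsub(/[",]/, "", $2); id = $2 }          # remember the latest changefeed ID
    /"tso":/ { gsub(/,/, "", $2); if ($2 == "0") print id }
  '
}
```

It could be used as `tiup cdc cli changefeed list --pd=http://172.29.1.23:2379 | stalled_changefeeds`. A changefeed that has only just been created may also show a tso of 0 for a moment, so an ID is only worth investigating if it keeps appearing.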

Checking the cdc log:
[2021/10/18 14:58:42.782 +08:00] [INFO] [owner.go:620] ["stale task status is not deleted, wait metadata cleaned to create new changefeed"] ["task status"="{"tables":{"53":{"start-ts":428484117658861579,"mark-table-id":0}},"operation":{"53":{"done":false,"delete":false,"boundary_ts":428484117658861579}},"admin-job-type":3}"] [changefeed=kafka-ote-userquestionchoiceitem]
The log shows the owner waiting indefinitely for the old task metadata to be cleaned up. The changefeed status only returns to normal after the cdc service is restarted.

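Issue 3: Version: TiDB 5.1.0
Description:
TiCDC prevents PD from starting.
Error log:
[2021/10/14 20:13:14.445 +08:00] [ERROR] [server.go:1203] ["campaign pd leader meets error due to etcd error"] [campaign-pd-leader-name=pd-10.70.5.14-2379] [error="[PD:etcd:ErrEtcdGrantLease]etcdserver: mvcc: database space exceeded"]
Likely cause: TiCDC filled up the etcd storage space.
Following similar documentation, the only way we could recover the TiDB cluster was to compact the etcd data and then clean it up. How can we prevent TiCDC from using so much etcd space that PD goes down?
Current usage: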
[tidb@tidb-deploy-center kafka-rpt-task]$ etcdctl --write-out=table --endpoints=$ENDPOINTS endpoint status
+-----------------+------------------+---------+---------+-----------+-----------+------------+
|    ENDPOINT     |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| 10.70.5.14:2379 | accdac886f1f1196 |   3.4.3 |  5.7 GB |     false |        16 |  249110805 |
| 10.70.5.18:2379 | b59661359d753ad7 |   3.4.3 |  5.7 GB |     false |        16 |  249110805 |
| 10.70.5.28:2379 | dd900bc98bc4e4ce |   3.4.3 |  5.7 GB |      true |        16 |  249110805 |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
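
For reference, a sketch of the "compact, then clean up" recovery mentioned above. It follows the generic procedure from the etcd maintenance documentation (compact, defragment, disarm the alarm), not necessarily the exact steps used on this cluster. `ENDPOINTS` is the same variable as in the command above; the function names are made up for illustration.

```shell
# Extract the current revision from `etcdctl endpoint status --write-out=json` (read on stdin).
# With several endpoints, only the first reported revision is used.
extract_revision() {
  grep -o '"revision":[0-9]*' | head -n 1 | cut -d: -f2
}

# Compact history up to the current revision, defragment every listed member,
# then clear the NOSPACE alarm so writes are accepted again.
compact_and_defrag() {
  local rev
  rev=$(ETCDCTL_API=3 etcdctl --endpoints="$ENDPOINTS" endpoint status --write-out=json | extract_revision)
  [ -n "$rev" ] || { echo "could not read the etcd revision" >&2; return 1; }
  ETCDCTL_API=3 etcdctl --endpoints="$ENDPOINTS" compact "$rev"
  ETCDCTL_API=3 etcdctl --endpoints="$ENDPOINTS" defrag
  ETCDCTL_API=3 etcdctl --endpoints="$ENDPOINTS" alarm disarm
}
```

Nothing runs until `compact_and_defrag` is called. This only frees space after the fact and does not stop the data from growing back, so it is a recovery step rather than a way to avoid the problem.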

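Issue 4: Version: TiDB 5.1.0
Description: a TiCDC changefeed cannot be removed after it enters an error state, similar to Issue 1.
-- Check the changefeed status: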
tiup cdc cli changefeed list --pd=http://10.70.5.14:2379
"id": "kafka-ote-arrange-control",
"summary": {
  "state": "error",
  "tso": 428372373893808323,
  "checkpoint": "2021-10-13 15:42:03.362",
  "error": {
    "addr": "10.70.5.27:8303",
    "code": "CDC:ErrGCTTLExceeded",
    "message": "[CDC:ErrGCTTLExceeded]the checkpoint-ts(428372373893808323) lag of the changefeed(kafka-ote-arrange-control) %!d(MISSING) has exceeded the GC TTL"
Error log:
[2021/10/14 21:28:54.106 +08:00] [ERROR] [changefeed.go:106] ["an error occurred in Owner"] [changefeedID=kafka-util-tag] [error="[CDC:ErrGCTTLExceeded]the checkpoint-ts(428372373893808323) lag of the changefeed(kafka-util-tag) %!d(MISSING) has exceeded the GC TTL"] [errorVerbose="[CDC:ErrGCTTLExceeded]the checkpoint-ts(428372373893808323) lag of the changefeed(kafka-util-tag) %!d(MISSING) has exceeded the GC TTL\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.

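The changefeed has probably fallen more than 24 hours behind the PD service GC safepoint. We tried to recreate the changefeed, but the old one could not be removed either:
tiup cdc cli changefeed remove --pd=http://10.70.5.14:2379 --changefeed-id "kafka-ote-arrange-control" --force
We also tried using reset to clear all cdc metadata:
tiup cdc cli unsafe reset --pd=http://10.70.5.14:2379
That approach is risky because it affects the other changefeeds, so the problem of not being able to remove a single changefeed remains.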
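
To check how far a checkpoint-ts is behind, the TSO can be decoded by hand. A TiDB TSO keeps the physical time in Unix milliseconds in its high bits and an 18-bit logical counter in its low bits. The sketch below assumes a GC TTL of 86400 seconds, i.e. the 24 hours mentioned above; adjust `GC_TTL_SECONDS` if the cdc server was started with a different `gc-ttl`. All function names are made up for illustration.

```shell
GC_TTL_SECONDS=86400  # assumption: 24 hours, as described in the post

# Physical part of a TSO, in Unix milliseconds.
tso_physical_ms() { echo $(( $1 >> 18 )); }

# Logical counter stored in the low 18 bits of a TSO.
tso_logical() { echo $(( $1 & ((1 << 18) - 1) )); }

# Seconds between a checkpoint TSO and a reference time (Unix seconds, default: now).
checkpoint_lag_seconds() {
  local tso=$1 now=${2:-$(date +%s)}
  echo $(( now - $(tso_physical_ms "$tso") / 1000 ))
}

# Succeeds when the checkpoint is further behind than GC_TTL_SECONDS.
exceeds_gc_ttl() {
  [ "$(checkpoint_lag_seconds "$1" "${2:-}")" -gt "$GC_TTL_SECONDS" ]
}
```

For the changefeed above, `tso_physical_ms 428372373893808323` prints 1634110923362, which is 2021-10-13 07:42:03.362 UTC and matches the reported checkpoint of 15:42:03.362 at +08:00. Calling `exceeds_gc_ttl 428372373893808323` then tells whether the checkpoint is currently more than 24 hours behind.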

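liuzix · October 18, 2021, 14:34

Issues 1 and 2: Could you provide the logs from every CDC node at the time the problem occurred?
Issue 3: Was TiCDC working normally before this error? How many changefeeds are there in total, and how many tables does each one replicate?
Issue 4: This is a known issue and has been fixed in 5.1.2. We will add more detailed solutions to the documentation.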

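zhangji · October 19, 2021, 02:12

Hello,
Issue 1: The corresponding logs are no longer available. Are Issue 1 and Issue 4 the same kind of problem? Both are cases where a cdc changefeed cannot be removed.
Issue 2: I captured part of the logs from around 14:58 on 2021-10-18. The attachment is cdc_20211018_1458.log.tar.gz.
cdc_20211018_1458.zip (637.7 KB)
Issue 3: TiCDC was replicating normally until etcd reported the no space error. There are 34 changefeeds in total, and each one replicates a single table to its own Kafka topic.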

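yilong · October 19, 2021, 06:23

Could you share the logs from before and after etcd reported no space?
If you have TiCDC monitoring, please provide that as well (the same time range as the logs is enough). Thanks.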

zhangji · October 19, 2021, 07:02

Issue 3: The problem occurred at around 2021-10-14 20:13.
PD reported a large number of errors: error="[PD:etcd:ErrEtcdGrantLease]etcdserver: mvcc: database space exceeded"
pd-2021-10-14T20-06-16.673.zip (19.5 MB)

TiCDC also reported a large number of errors: error="[CDC:ErrGCTTLExceeded]the checkpoint-ts(428372373736522012) lag of the changefeed(kafka-udp-dept) %!d(MISSING) has exceeded the GC TTL"
cdc-2021-10-14T21-39-03.858.zip (4.0 MB)

TiCDC monitoring for 2021-10-14, 16:00:00 to 23:00

Please take a look.

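yilong · October 25, 2021, 04:04

Issue 1: This is a known issue with the old owner in 4.0.x.
Workaround: run cdc cli unsafe reset --pd=http://10.70.5.14:2379 to recover.

Issue 2: This is a newly discovered issue in 4.0.x. Because the old owner will be retired in the next minor release, we do not currently plan to fix it. It is triggered by creating a changefeed immediately after a force remove.
Workaround: restart the whole cluster. The issue no longer occurs after the refactoring in the newer version.

Issue 3: The cause has been identified and a fix is being prepared: https://github.com/pingcap/ticdc/issues/3112
It will be fixed in an upcoming release.
Workaround: cleaning up etcd works, or reduce the number of changefeeds.

Issue 4: Being unable to remove the changefeed in this situation is a known issue and has been fixed in v5.1.2: https://github.com/pingcap/ticdc/issues/2391
Workaround: run cdc cli unsafe reset --pd=http://10.70.5.14:2379 to recover.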

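zhangji · October 25, 2021, 05:25

OK, thank you very much. We will keep a record of these issues.
For Issues 1 and 2, is the recommendation to upgrade to a version later than 4.0.14?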
