v6.5.0: TiCDC replication occasionally lags for a long time; it recovers immediately after restarting TiCDC.

【TiDB Environment】Production / Test / PoC
Production
【TiDB Version】
v6.5.0
【Reproduction Steps】What operations were performed before the issue appeared
Unknown; I have not found a way to reproduce it.
【Problem: symptoms and impact】
Symptom: under normal conditions there is almost no replication lag, at most about 1 minute. Occasionally, however, TiCDC falls far behind and checkpointLag keeps growing. After restarting TiCDC the lag disappears immediately.

[2023/02/15 20:31:30.208 +08:00] [INFO] [replication_manager.go:551] ["schedulerv3: slow table"] [namespace=default] [changefeed=repl-to-tenxunguangzhou] [tableID=2873] [tableStatus="region_count:10 current_ts:439475054874722304 stage_checkpoints:<key:\"puller-egress\" value:<checkpoint_ts:439474753250787343 resolved_ts:439474716537520134 > > stage_checkpoints:<key:\"puller-ingress\" value:<checkpoint_ts:439474753250787343 resolved_ts:439475054834876421 > > stage_checkpoints:<key:\"sink\" value:<checkpoint_ts:439474716537520134 resolved_ts:439474716537520134 > > stage_checkpoints:<key:\"sorter-egress\" value:<checkpoint_ts:439474716537520134 resolved_ts:439474716537520134 > > stage_checkpoints:<key:\"sorter-ingress\" value:<checkpoint_ts:439474753250787343 resolved_ts:439474716537520134 > > barrier_ts:439475054834876418 "] [checkpointTs=439474716537520134] [resolvedTs=439474716537520134] [checkpointLag=21m30.706968307s]
[2023/02/15 20:31:30.209 +08:00] [INFO] [replication_manager.go:551] ["schedulerv3: slow table"] [namespace=default] [changefeed=repl-to-zp] [tableID=3082] [tableStatus="region_count:1 current_ts:439475054887567360 stage_checkpoints:<key:\"puller-egress\" value:<checkpoint_ts:439474977803337743 resolved_ts:439474977803337743 > > stage_checkpoints:<key:\"puller-ingress\" value:<checkpoint_ts:439474977803337743 resolved_ts:439474977803337743 > > stage_checkpoints:<key:\"sink\" value:<checkpoint_ts:439474977737801745 resolved_ts:439474977737801745 > > stage_checkpoints:<key:\"sorter-egress\" value:<checkpoint_ts:439474977737801745 resolved_ts:439474977737801745 > > stage_checkpoints:<key:\"sorter-ingress\" value:<checkpoint_ts:439474977737801745 resolved_ts:439474977737801745 > > barrier_ts:439475054834876418 "] [checkpointTs=439474977737801745] [resolvedTs=439474977803337743] [checkpointLag=4m54.307444085s]
[2023/02/15 20:31:30.209 +08:00] [INFO] [replication_manager.go:551] ["schedulerv3: slow table"] [namespace=default] [changefeed=repl-to-zp] [tableID=3040] [tableStatus="region_count:1 current_ts:439475054888091648 stage_checkpoints:<key:\"puller-egress\" value:<checkpoint_ts:439474977803337743 resolved_ts:439474977803337743 > > stage_checkpoints:<key:\"puller-ingress\" value:<checkpoint_ts:439474977803337743 resolved_ts:439474977803337743 > > stage_checkpoints:<key:\"sink\" value:<checkpoint_ts:439474977737801745 resolved_ts:439474977737801745 > > stage_checkpoints:<key:\"sorter-egress\" value:<checkpoint_ts:439474977737801745 resolved_ts:439474977737801745 > > stage_checkpoints:<key:\"sorter-ingress\" value:<checkpoint_ts:439474977737801745 resolved_ts:439474977737801745 > > barrier_ts:439475054834876418 "] [checkpointTs=439474977737801745] [resolvedTs=439474977803337743] [checkpointLag=4m54.307444085s]
[2023/02/15 20:31:30.209 +08:00] [INFO] [replication_manager.go:551] ["schedulerv3: slow table"] [namespace=default] [changefeed=repl-to-zp] [tableID=2887] [tableStatus="region_count:5 current_ts:439475054887567360 stage_checkpoints:<key:\"puller-egress\" value:<checkpoint_ts:439474753250787343 resolved_ts:439474732082659333 > > stage_checkpoints:<key:\"puller-ingress\" value:<checkpoint_ts:439474753250787343 resolved_ts:439475054873673746 > > stage_checkpoints:<key:\"sink\" value:<checkpoint_ts:439474732082659333 resolved_ts:439474732082659333 > > stage_checkpoints:<key:\"sorter-egress\" value:<checkpoint_ts:439474732082659333 resolved_ts:439474732082659333 > > stage_checkpoints:<key:\"sorter-ingress\" value:<checkpoint_ts:439474753250787343 resolved_ts:439474732082659333 > > barrier_ts:439475054834876418 "] [checkpointTs=439474732082659333] [resolvedTs=439474732082659333] [checkpointLag=20m31.407444085s]

【Resource Configuration】
cdc.zip (612.6 KB)

【Attachments: screenshots / logs / monitoring】
The attached file contains the CDC owner logs from 20:00 to 21:20 on 2023/02/15.

Write a single row upstream, then check downstream to see whether the data really does arrive only after such a long delay.
Rule out a monitoring/display issue first.
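A minimal sketch of that probe, assuming a dedicated `cdc_probe` table already exists on both clusters and is covered by the changefeed's table filter (the hosts, credentials, and table name below are placeholders, not from this thread): it inserts a marker row upstream and polls the downstream until the row appears, so you measure the actual end-to-end lag rather than what the dashboard reports.

```python
import time
import uuid
import pymysql

def connect(host, port=4000):
    # Placeholder connection settings; adjust to your clusters.
    return pymysql.connect(host=host, port=port, user="root", password="",
                           database="test", autocommit=True)

up = connect("upstream-tidb")
down = connect("downstream-tidb")

marker = uuid.uuid4().hex
with up.cursor() as cur:
    # Assumes `cdc_probe(id VARCHAR(64) PRIMARY KEY, ts DATETIME)` exists on both ends.
    cur.execute("INSERT INTO cdc_probe VALUES (%s, NOW())", (marker,))
start = time.time()

with down.cursor() as cur:
    # Each iteration is its own autocommitted query, so it sees fresh data.
    while not cur.execute("SELECT 1 FROM cdc_probe WHERE id = %s", (marker,)):
        time.sleep(1)

print(f"observed end-to-end replication lag: {time.time() - start:.1f}s")
```

If the row shows up within seconds while checkpointLag still reports tens of minutes, the problem is in what TiCDC reports, not in the replication itself.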

Replication is currently normal with no lag. The large-lag issue is intermittent; I'll run that test the next time it happens.

One question: after TiCDC is restarted and recovers from a large lag, is any of the replicated data lost?

I've verified with sync-diff-inspector several times and the results were always consistent, so there should be no data loss.
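sync-diff-inspector is the authoritative check; between full runs, a rough spot check like the sketch below is what I use for a quick sanity pass. It only compares a row count and a CRC-based checksum of one table, and the hosts, table, and column names are placeholders.

```python
import pymysql

def checksum(host, table):
    # Placeholder connection settings and column list (id, val).
    conn = pymysql.connect(host=host, port=4000, user="root", password="",
                           database="test")
    try:
        with conn.cursor() as cur:
            cur.execute(
                f"SELECT COUNT(*), BIT_XOR(CRC32(CONCAT_WS(',', id, val))) FROM {table}"
            )
            return cur.fetchone()
    finally:
        conn.close()

up = checksum("upstream-tidb", "orders")
down = checksum("downstream-tidb", "orders")
print("upstream:", up, "downstream:", down, "match:", up == down)
```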

Then we probably do have to wait for it to recur and check two things:
(1) As suggested above, whether the data is actually delayed; from your description it most likely isn't.
(2) Query the actual changefeed status on the TiCDC host itself; the reporting may be the problem (see the sketch below).
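For (2), a minimal sketch that shells out to `cdc cli changefeed query` on the TiCDC host and prints the returned JSON. The PD address and changefeed ID are placeholders, and the exact field names in the output vary a bit across TiCDC versions, so the sketch just dumps the whole document for inspection.

```python
import json
import subprocess

PD_ADDR = "http://127.0.0.1:2379"   # placeholder PD endpoint
CHANGEFEED_ID = "repl-to-zp"        # one of the changefeeds from the logs

result = subprocess.run(
    ["cdc", "cli", "changefeed", "query",
     f"--pd={PD_ADDR}", f"--changefeed-id={CHANGEFEED_ID}"],
    capture_output=True, text=True, check=True,
)

info = json.loads(result.stdout)
# Look for the checkpoint ts / checkpoint time fields reported by the owner
# and compare them with what Grafana shows.
print(json.dumps(info, indent=2, ensure_ascii=False))
```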

Hi, could you upload the complete TiCDC monitoring data and the logs from all TiCDC nodes?

The TiCDC monitoring data for 2023/02/15 is gone; Prometheus doesn't retain data that long.
On 2023/02/23, between 19:00 and 20:00, a 30-minute lag appeared. I created a test table and inserted some data, and the lag disappeared immediately, without restarting CDC. I'm uploading yesterday's logs and monitoring data.
cdc.zip (1.3 MB)
The attachment includes the TiCDC Grafana JSON file, exported following this guide:
[FAQ] Exporting and importing Grafana Metrics pages - Ops Guide / TiDB FAQ - TiDB community forum (asktug.com)

The database did indeed have large transactions around that time. I'll deal with the large transactions first and then keep watching to see whether the intermittent large lag still occurs.
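The rough approach I plan to use is to break the big writes into many small transactions so no single transaction becomes "large" for TiCDC. A minimal sketch under assumed placeholders (the `order_history` table, its columns, and the connection settings are hypothetical):

```python
import pymysql

BATCH_SIZE = 5000

conn = pymysql.connect(host="127.0.0.1", port=4000, user="root",
                       password="", database="test", autocommit=True)
try:
    with conn.cursor() as cur:
        while True:
            # With autocommit on, each DELETE ... LIMIT is its own small
            # transaction instead of one huge multi-million-row delete.
            deleted = cur.execute(
                "DELETE FROM order_history WHERE created_at < %s LIMIT %s",
                ("2022-01-01", BATCH_SIZE),
            )
            if deleted == 0:
                break
finally:
    conn.close()
```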

Large TiCDC lag is usually caused by large transactions. Given that it recovers immediately after a restart, it looks like the data has actually been replicated and the problem is in the state CDC reports to the monitoring system.

No, it won't. The replication progress (checkpoint) is stored in etcd, and after a restart TiCDC resumes pulling data from that point.
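You can also see where it will resume from directly in the TSOs from your logs: a TiDB TSO carries the physical time in its upper bits (physical milliseconds shifted left by 18 bits), so the checkpoint TSO converts straight to wall-clock time, and the difference between current_ts and checkpointTs reproduces the reported checkpointLag. A small sketch using the values from the first slow-table log line:

```python
from datetime import datetime, timezone, timedelta

def tso_to_time(tso: int) -> datetime:
    """Convert a TiDB TSO to wall-clock time (physical part = tso >> 18, in ms)."""
    physical_ms = tso >> 18
    cst = timezone(timedelta(hours=8))  # the logs use +08:00
    return datetime.fromtimestamp(physical_ms / 1000, tz=cst)

current_ts = 439475054874722304      # current_ts from the log line
checkpoint_ts = 439474716537520134   # checkpointTs of repl-to-tenxunguangzhou

print(tso_to_time(checkpoint_ts))    # about 20:10 +08:00 on 2023/02/15
lag_s = ((current_ts >> 18) - (checkpoint_ts >> 18)) / 1000
print(f"lag: {lag_s:.1f}s")          # ~1290s, matching checkpointLag=21m30s
```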