升级到DM5.4后,worker打印大量日志:[syncer.go:1114] ["checkpoint has no change, skip sync flush checkpoint"]

【问题】 DM升级到5.4后,secondsBehindMaster的值变化较大,并且dm-worker打印大量如下日志:

[2022/02/17 10:25:42.162 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.190 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.221 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.258 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.293 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.326 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.360 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.390 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.427 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.456 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.486 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.518 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.549 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.583 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.615 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.644 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.674 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.704 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.734 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.764 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.795 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.834 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.872 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.906 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.939 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:42.972 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:43.006 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:43.035 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:43.065 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:43.098 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:43.133 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:43.166 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:43.206 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:43.267 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:43.314 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:43.351 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:43.391 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:43.425 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:43.456 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:43.486 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:43.517 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:43.550 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:43.580 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:43.610 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:43.643 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:43.681 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
[2022/02/17 10:25:43.715 +08:00] [INFO] [syncer.go:1114] [“checkpoint has no change, skip sync flush checkpoint”]
dm-worker.log.gz (105.1 KB)

这个变化具体是指怎样变化很大?请举一个『变化很大的示例』

另外,下面这段日志,对你的 DM 数据增量同步造成了怎样的影响?

此类日志产生过快,大量的此类日志干扰了拍错。从query-status看明显又延迟的,但是显示的延时为0,相关监控如下:


另外升级到5.4后,延迟一直下不去,从tidb监控看,感觉没同步数据;降级到5.3版本,延时显示正常了,tidb的也开始有数据写入了


从query-status看明显又延迟的,但是显示的延时为0

这个是因为 dm 内部监控 lag 的逻辑和查看当前 binlog pos 的逻辑不是同步的
dmctl query-satatus 显示的结果会有一段时间的延迟,grafana 上的是实时的


另外从监控的图上看,升级到 5.4 之后就只能处理 delete 语句了,其他的语句都没有了,能提供较为完整的 v5.4 版本的 dm-worker日志供我们排查问题么?

上面的日志附件就是完整了的。5.4上grafana也是不正常的:

您上游在卡住的 syncerBinlog(binlog 文件 4921 那个位置)后面的 binlog 有什么特性吗,比如是一个大事务、是涉及黑名单表的、是关于 online ddl 表的

同步没有做过滤,也没有做online ddl,上游的库在凌晨会有load data infile操作,有大量的导入,这个是一直有的,5.3之前版本同步都正常。

您方便给一下卡住位置后面的 binlog event 的类型以及其 position 吗

update:如果是 dml 相关的 event,还希望看一下这个表是不是在同步的白名单里


上面那个 worker-log 文件不全,请问有收集完整的么,从新 worker 起来到下掉之间的日志
另外能提供一下 task 的配置文件么?

可以的话最好再提供一下使用 5.4版本期间的grafana监控,可以利用 https://metricstool.pingcap.com/ 导出一份 json 就行

相关的binlog如下,当时有个大事务删除操作:

任务配置如下:

感谢反馈,我们会尝试复现。

您这边能估计一下这个大事务删除了多少行吗,比如一百万行

一百多万行吧,因为我们这个是报表库,大的删除也有4~5百万;
若有结果辛苦反馈下,多谢

感谢反馈。我们初步判断是处理大事务时,由于事务没有结束以及其他原因无法更新 checkpoint,DM 5.4 版本中对于 checkpoint 长时间没有更新新加了这样一条日志,日志写盘进一步拖慢了同步进度。您可以尝试将日志级别调高来避免。

想再跟您确认一下,您上面的 binlog event 截图是一个事务中包含了多个 event,每个大约 10kB。后面有没有一个巨大的 event,大约是 MB 级别的?

这个删除操作后面,是个批量插入的操作,估计有十几兆。

这个大事务处理的问题,是5.4版本首次出现,还是以前版本也有这个问题?

5.4 新增日志。您可以尝试将 worker 日志级别改为 WARN 看看有没有缓解

我们在 https://github.com/pingcap/tiflow/issues/4619 更新这个 BUG 的进展

1 个赞

Hi,您这边这个同步任务包含了多少张表?

500多张表,我觉得磁盘写日志对同步影响不会这么大,估计是其他问题导致同步异常

500 多张表的话,确实可能是另外的问题。在 5.3 版本中,检查 checkpoint 是否要 flush 是只看的 global checkpoint, 5.4 的话还会看 table checkpoint(这样就多了 500 个 table checkpoint)。您这个 GTID 又比较复杂,对比 table checkpoint 可能就更慢了。

感谢反馈,我们会在 5.4 下一个版本尝试修复。目前如果您使用 5.4 的话可以试试任务同步的表少一些,多几个任务

4 个赞

升级到5.4.0遇到同样问题,所有跟这个源相关的同步任务都动不了了