Binlog Drainer Event interrupted for 10 minutes

To help us respond faster, please provide the following information when asking a question; clearly described problems can be prioritized.

  • 【TiDB version】: 4.0.4
  • 【Problem description】: Between 14:02 and 14:12 today, the Drainer Event was interrupted for 10 minutes. The Pump logs from that window contain a large number of "Find no MVCC record for a young txn" messages.

For performance tuning or troubleshooting questions, please download and run the diagnostic script, then select all of the terminal output, copy it, and upload it.
screencapture-172-16-12-71-3000-d-RDdDTFvZz-pre-tidb-cluster-binlog-2020-09-08-14_25_13.pdf (3.5 MB)

The Pump log screenshot is as follows:

Hi, after the interruption, did the drainer recover on its own, or did you take some action?
Also, please provide the drainer logs from before and after the interruption. The pump logs show a `transport is closing` error, so it is also worth checking whether pump itself was healthy at the time.
Please also provide the tidb-binlog version.

It recovered automatically after 10 minutes.

Please check the network first. `rpc error: code = Unavailable desc = transport is closing` is often caused by a network problem. Could you also provide the information requested above?

This replication link was newly set up today, and the Drainer Event was just interrupted again for 10 minutes starting at 11:00.

Pump logs from around 11:00:
binlog同步中断10分钟.log (82.2 KB)

Drainer logs between 11:00 and 11:10:
[2020/09/22 11:00:12.825 +08:00] [INFO] [syncer.go:257] ["write save point"] [ts=419625333509849092]
[2020/09/22 11:00:15.836 +08:00] [INFO] [syncer.go:257] ["write save point"] [ts=419625334374924289]
[2020/09/22 11:00:18.846 +08:00] [INFO] [syncer.go:257] ["write save point"] [ts=419625335174463509]
[2020/09/22 11:00:49.530 +08:00] [INFO] [syncer.go:257] ["write save point"] [ts=419625335213785091]
[2020/09/22 11:03:26.050 +08:00] [INFO] [client.go:772] ["[sarama] client/metadata fetching metadata for [aliprod-tidb-binlog] from broker 172.16.12.153:9092\n"]
[2020/09/22 11:10:20.080 +08:00] [INFO] [syncer.go:257] ["write save point"] [ts=419625335239999526]
[2020/09/22 11:10:20.804 +08:00] [INFO] [async_producer.go:813] ["[sarama] producer/broker/3 maximum request accumulated, waiting for space\n"]
[2020/09/22 11:10:21.272 +08:00] [INFO] [async_producer.go:813] ["[sarama] producer/broker/3 maximum request accumulated, waiting for space\n"]
[2020/09/22 11:10:22.856 +08:00] [INFO] [async_producer.go:813] ["[sarama] producer/broker/3 maximum request accumulated, waiting for space\n"]
[2020/09/22 11:10:22.951 +08:00] [INFO] [async_producer.go:813] ["[sarama] producer/broker/3 maximum request accumulated, waiting for space\n"]
[2020/09/22 11:10:23.128 +08:00] [INFO] [syncer.go:257] ["write save point"] [ts=419625415796850713]
[2020/09/22 11:10:23.237 +08:00] [INFO] [async_producer.go:813] ["[sarama] producer/broker/3 maximum request accumulated, waiting for space\n"]
[2020/09/22 11:10:23.338 +08:00] [INFO] [async_producer.go:813] ["[sarama] producer/broker/3 maximum request accumulated, waiting for space\n"]
[2020/09/22 11:10:23.499 +08:00] [INFO] [async_producer.go:813] ["[sarama] producer/broker/3 maximum request accumulated, waiting for space\n"]
[2020/09/22 11:10:23.962 +08:00] [INFO] [async_producer.go:813] ["[sarama] producer/broker/3 maximum request accumulated, waiting for space\n"]
[2020/09/22 11:10:24.077 +08:00] [INFO] [async_producer.go:813] ["[sarama] producer/broker/3 maximum request accumulated, waiting for space\n"]
[2020/09/22 11:10:24.412 +08:00] [INFO] [async_producer.go:813] ["[sarama] producer/broker/3 maximum request accumulated, waiting for space\n"]
[2020/09/22 11:10:24.847 +08:00] [INFO] [async_producer.go:813] ["[sarama] producer/broker/3 maximum request accumulated, waiting for space\n"]
[2020/09/22 11:10:26.174 +08:00] [INFO] [syncer.go:257] ["write save point"] [ts=419625494295871500]
[2020/09/22 11:10:29.180 +08:00] [INFO] [syncer.go:257] ["write save point"] [ts=419625495174053917]
[2020/09/22 11:10:32.186 +08:00] [INFO] [syncer.go:257] ["write save point"] [ts=419625495960485912]
[2020/09/22 11:10:35.192 +08:00] [INFO] [syncer.go:257] ["write save point"] [ts=419625496746917933]
[2020/09/22 11:10:38.198 +08:00] [INFO] [syncer.go:257] ["write save point"] [ts=419625497507135507]
[2020/09/22 11:10:41.199 +08:00] [INFO] [syncer.go:257] ["write save point"] [ts=419625498332889105]
[2020/09/22 11:10:44.201 +08:00] [INFO] [syncer.go:257] ["write save point"] [ts=419625499093106692]
[2020/09/22 11:10:47.208 +08:00] [INFO] [syncer.go:257] ["write save point"] [ts=419625499892645894]
[2020/09/22 11:10:50.222 +08:00] [INFO] [syncer.go:257] ["write save point"] [ts=419625500679077917]
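As an aside, the `ts` values in the "write save point" lines above are TiDB TSO timestamps: the upper bits hold the physical Unix time in milliseconds and the low 18 bits are a logical counter. A small sketch to recover the wall-clock time of a savepoint, useful for matching the `ts` gap to the 11:00–11:10 interruption window:

```python
from datetime import datetime, timezone, timedelta

def tso_to_datetime(tso: int) -> datetime:
    """Decode a TiDB TSO: physical milliseconds since the epoch in the
    upper bits, an 18-bit logical counter in the lower bits."""
    physical_ms = tso >> 18
    # Render in +08:00 to match the log timestamps above.
    return datetime.fromtimestamp(physical_ms / 1000,
                                  tz=timezone(timedelta(hours=8)))

# First savepoint ts from the drainer log above:
print(tso_to_datetime(419625333509849092))  # 2020-09-22 11:00:12 +08:00
```

Decoding the first and last `ts` this way confirms the savepoints jump from roughly 11:00:49 to 11:10, matching the gap seen in Grafana.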

Is this the latest drainer error? It looks like replication was normal around 11:00. Are there any other, more recent error messages? Please share them.

The Pump logs do contain errors. In Grafana, the Drainer Event data was interrupted between 11:00 and 11:10.

This log error means a transaction is in an undetermined state: it neither committed successfully nor rolled back. There are many possible causes, because the commit binlog (c-binlog) is written asynchronously. For example, the upstream TiDB may have restarted abnormally or crashed, or the network may have been interrupted.

If the transaction in the error actually committed successfully but the c-binlog failed to reach pump, you should use sync-diff-inspector to check whether the upstream and downstream data are consistent. If they are not, use sync-diff-inspector to repair the differences.
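For reference, a minimal sync-diff-inspector configuration might look like the sketch below (field names follow the sync-diff-inspector shipped with tidb-tools in the 4.x era; all hosts, ports, and schema/table names are placeholders you would replace with your own, and the downstream must be a MySQL-protocol database):

```toml
# sync-diff-inspector config sketch -- all values are placeholders.
log-level = "info"
chunk-size = 1000
check-thread-count = 4
use-checksum = true
# SQL statements to repair any differences are written here:
fix-sql-file = "fix.sql"

[[check-tables]]
schema = "test"
tables = ["~.*"]          # check every table in the schema

[[source-db]]
host = "172.16.12.71"     # upstream TiDB
port = 4000
user = "root"
password = ""

[target-db]
host = "127.0.0.1"        # downstream database
port = 3306
user = "root"
password = ""
```

If differences are found, the statements in `fix-sql-file` can be reviewed and applied to the downstream to bring it back in sync.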

Can this be understood as a large transaction causing the commit to fail, which in turn broke the binlog components?

No, this is probably not caused by a large transaction. Please still check whether the network or the tidb-server instances had any anomalies.

Recently there was an UPDATE that scanned too much data and caused TiDB to OOM repeatedly. Could that be related?

That is likely related. It would mean the transaction did not commit successfully and was then rolled back.