tidb-binlog drainer suddenly crashed and won't start

To help us respond efficiently, please provide the information below when asking a question; clearly described issues can be prioritized.

  • 【TiDB version】: v4.0.7
  • 【Problem description】: drainer suddenly crashed and won't start

Kafka has already been confirmed to have no issues.

Errors from the drainer log:

[2020/10/16 23:42:26.175 +08:00] [INFO] [pump.go:166] ["receive big size binlog"] [size="108 MB"]
[2020/10/16 23:42:41.757 +08:00] [INFO] [broker.go:212] ["[sarama] Connected to broker at 10.40.14.11:9092 (registered as #1)\n"]
[2020/10/16 23:42:41.977 +08:00] [INFO] [async_producer.go:971] ["[sarama] producer/broker/1 state change to [closing] because write tcp 10.40.195.229:59076->10.40.14.11:9092: write: connection reset by peer\n"]
[2020/10/16 23:42:41.978 +08:00] [INFO] [broker.go:253] ["[sarama] Closed connection to broker 10.40.14.11:9092\n"]
[2020/10/16 23:42:41.978 +08:00] [INFO] [async_producer.go:578] ["[sarama] producer/leader/bi2b_tidb_obinlog/0 state change to [retrying-7]\n"]
[2020/10/16 23:42:41.978 +08:00] [INFO] [async_producer.go:588] ["[sarama] producer/leader/bi2b_tidb_obinlog/0 abandoning broker 1\n"]
[2020/10/16 23:42:41.978 +08:00] [INFO] [async_producer.go:717] ["[sarama] producer/broker/1 input chan closed\n"]
[2020/10/16 23:42:41.978 +08:00] [INFO] [async_producer.go:801] ["[sarama] producer/broker/1 shut down\n"]
[2020/10/16 23:42:42.478 +08:00] [INFO] [client.go:772] ["[sarama] client/metadata fetching metadata for [bi2b_tidb_obinlog] from broker 10.40.75.137:9092\n"]
[2020/10/16 23:42:42.480 +08:00] [INFO] [async_producer.go:711] ["[sarama] producer/broker/1 starting up\n"]
[2020/10/16 23:42:42.480 +08:00] [INFO] [async_producer.go:727] ["[sarama] producer/broker/1 state change to [open] on bi2b_tidb_obinlog/0\n"]
[2020/10/16 23:42:42.480 +08:00] [INFO] [async_producer.go:570] ["[sarama] producer/leader/bi2b_tidb_obinlog/0 selected broker 1\n"]
[2020/10/16 23:42:42.480 +08:00] [INFO] [async_producer.go:594] ["[sarama] producer/leader/bi2b_tidb_obinlog/0 state change to [flushing-7]\n"]
[2020/10/16 23:42:42.480 +08:00] [INFO] [async_producer.go:616] ["[sarama] producer/leader/bi2b_tidb_obinlog/0 state change to [normal]\n"]

Symptom: after a few dozen retries, drainer simply exits.


Based on testing, this seems related to ["receive big size binlog"] [size="108 MB"]: if I move the savepoint away and restart, drainer works normally; if I restore the previous savepoint, it fails again.

It feels related to this PR: https://github.com/pingcap/tidb-binlog/pull/333. Could you help take a look?

With debug mode enabled, I found that the 108 MB binlog consists entirely of DDL statements, presumably the DDL for the 1700 tables created during the DM full import.

Is the drainer version also 4.0.7? The issue fixed by that PR (https://github.com/pingcap/tidb-binlog/pull/333) is fairly old, so this is probably not the same problem.

Are there any other errors in the drainer log? Is there any output in stderr.log? You can also run journalctl -eu drainer-8249.service to check the systemd log.

There is nothing useful in the journal log...

Hello,
It would be best to share the drainer.log so we can check whether it contains any other errors, and in particular to confirm the maximum message size configured on the downstream Kafka.
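
For reference, here is a rough sketch of how the topic-level limit could be checked with the sarama client (the same library the drainer uses, per the log above). The broker address and topic name are simply taken from the log in this thread, so adjust them as needed; max.message.bytes is the standard topic-level Kafka setting, and if it is unset the broker-level message.max.bytes applies instead.

```go
package main

import (
	"fmt"
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	// Admin requests need a broker protocol version of at least 0.11.
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_0_0_0

	// Broker address taken from the drainer log above; adjust as needed.
	admin, err := sarama.NewClusterAdmin([]string{"10.40.14.11:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer admin.Close()

	// Query the topic-level max.message.bytes for the drainer's topic.
	entries, err := admin.DescribeConfig(sarama.ConfigResource{
		Type:        sarama.TopicResource,
		Name:        "bi2b_tidb_obinlog",
		ConfigNames: []string{"max.message.bytes"},
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		fmt.Printf("%s = %s\n", e.Name, e.Value)
	}
}
```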

The downstream has been changed to 1 GB. From reading the drainer code, when Kafka is used as the syncer.to downstream, the maximum message size is 1 GB.
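
For context, a minimal sketch of what that producer-side cap looks like with sarama; this is only an illustration of the 1 GB limit described above, not a copy of the drainer source, and the broker address and topic name are again taken from the log in this thread.

```go
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	// Producer-side cap of 1 GB, as described for the kafka syncer.to path.
	// A message can still be rejected by the Kafka side if it exceeds the
	// broker/topic limits (message.max.bytes / max.message.bytes).
	cfg.Producer.MaxMessageBytes = 1 << 30
	cfg.Producer.Return.Successes = true
	cfg.Producer.Return.Errors = true

	producer, err := sarama.NewAsyncProducer([]string{"10.40.14.11:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	// Send one message and wait for either an ack or a delivery error.
	producer.Input() <- &sarama.ProducerMessage{
		Topic: "bi2b_tidb_obinlog",
		Value: sarama.ByteEncoder([]byte("payload")),
	}
	select {
	case msg := <-producer.Successes():
		log.Printf("delivered at offset %d", msg.Offset)
	case err := <-producer.Errors():
		log.Printf("delivery failed: %v", err)
	}
}
```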