tidb-binlog drainer suddenly crashed and won't come back up

To speed things up, please provide the following information when asking; clearly described problems get a faster response.

  • [TiDB version]: v4.0.7
  • [Problem description]: drainer suddenly crashed and won't come back up

We have already confirmed that Kafka itself is healthy.

Drainer log at the time of the error:

[2020/10/16 23:42:26.175 +08:00] [INFO] [pump.go:166] ["receive big size binlog"] [size="108 MB"]
[2020/10/16 23:42:41.757 +08:00] [INFO] [broker.go:212] ["[sarama] Connected to broker at 10.40.14.11:9092 (registered as #1)"]
[2020/10/16 23:42:41.977 +08:00] [INFO] [async_producer.go:971] ["[sarama] producer/broker/1 state change to [closing] because write tcp 10.40.195.229:59076->10.40.14.11:9092: write: connection reset by peer"]
[2020/10/16 23:42:41.978 +08:00] [INFO] [broker.go:253] ["[sarama] Closed connection to broker 10.40.14.11:9092"]
[2020/10/16 23:42:41.978 +08:00] [INFO] [async_producer.go:578] ["[sarama] producer/leader/bi2b_tidb_obinlog/0 state change to [retrying-7]"]
[2020/10/16 23:42:41.978 +08:00] [INFO] [async_producer.go:588] ["[sarama] producer/leader/bi2b_tidb_obinlog/0 abandoning broker 1"]
[2020/10/16 23:42:41.978 +08:00] [INFO] [async_producer.go:717] ["[sarama] producer/broker/1 input chan closed"]
[2020/10/16 23:42:41.978 +08:00] [INFO] [async_producer.go:801] ["[sarama] producer/broker/1 shut down"]
[2020/10/16 23:42:42.478 +08:00] [INFO] [client.go:772] ["[sarama] client/metadata fetching metadata for [bi2b_tidb_obinlog] from broker 10.40.75.137:9092"]
[2020/10/16 23:42:42.480 +08:00] [INFO] [async_producer.go:711] ["[sarama] producer/broker/1 starting up"]
[2020/10/16 23:42:42.480 +08:00] [INFO] [async_producer.go:727] ["[sarama] producer/broker/1 state change to [open] on bi2b_tidb_obinlog/0"]
[2020/10/16 23:42:42.480 +08:00] [INFO] [async_producer.go:570] ["[sarama] producer/leader/bi2b_tidb_obinlog/0 selected broker 1"]
[2020/10/16 23:42:42.480 +08:00] [INFO] [async_producer.go:594] ["[sarama] producer/leader/bi2b_tidb_obinlog/0 state change to [flushing-7]"]
[2020/10/16 23:42:42.480 +08:00] [INFO] [async_producer.go:616] ["[sarama] producer/leader/bi2b_tidb_obinlog/0 state change to [normal]"]

Symptom: after a few dozen retries, drainer simply exits.


Testing suggests this is related to the ["receive big size binlog"] [size="108 MB"] entry: if I move the savepoint away and restart, drainer works normally; if I restore the previous savepoint, it fails again.

This feels related to this PR: drainer/pump.go: fix when msg bigger than 4M by july2993 · Pull Request #333 · pingcap/tidb-binlog · GitHub. Could you help take a look?

With debug mode on, the 108 MB turned out to be entirely DDL statements, presumably the DDL for the roughly 1,700 tables created during the DM full import.

Is the drainer version also 4.0.7? The issue fixed by that PR (https://github.com/pingcap/tidb-binlog/pull/333) is fairly old, so it's probably not the cause here.

Are there any other errors in the drainer log? Is there any output in stderr.log? You can also check the systemd log with journalctl -eu drainer-8249.service.

There is nothing useful in the journal log...

Hello,
It would be best to share the drainer.log so we can check whether there are other errors in it, and in particular please check the maximum message size configured on the downstream Kafka.

The downstream has been changed to 1 GB. Looking at the drainer code, when Kafka is used as the syncer.to target the maximum message size is 1 GB.
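For scale, a quick sketch of the sizes involved (the 1 GB figure is the client-side cap drainer's Kafka syncer passes to the sarama producer, per the kafka.go linked below; the constant names here are illustrative):

```python
# Client-side cap: drainer's Kafka syncer allows a single message up to 1 GB.
DRAINER_MAX_MESSAGE_BYTES = 1 << 30  # 1073741824 bytes

# The "big size binlog" reported in the drainer log above.
BINLOG_SIZE = 108 * 1024 * 1024  # 113246208 bytes

# The 108 MB binlog is well under the client-side cap, so the producer
# itself is willing to send it in one request.
print(BINLOG_SIZE < DRAINER_MAX_MESSAGE_BYTES)  # True
```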

Has drainer recovered yet?

  1. Please share the drainer log.
  2. Check the current status with -cmd drainers, then try restarting drainer with tiup and look at the log output.

Please provide the requested information; otherwise this is difficult to troubleshoot.
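For step 2 above, the status check is an ops fragment along these lines (the PD address is a placeholder for this cluster's actual PD endpoint):

```shell
# List registered drainers and their state via binlogctl;
# -pd-urls must point at the cluster's PD (placeholder address below).
binlogctl -pd-urls=http://127.0.0.1:2379 -cmd drainers
```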

It has recovered; as long as the binlog size stays under 100 MB everything is fine.

Yes, this is the place: tidb-binlog/drainer/sync/kafka.go at cb11afb39b7dfe38ba83a6b3aa3f4f5b06df24a2 · pingcap/tidb-binlog · GitHub

If it's inconvenient to share the drainer log, that's fine; keep observing for now.

@troy_wang
In the Kafka configuration, change

socket.request.max.bytes=104857600

to 1073741824 (1 GB). This value defaults to 100 MB, and any request larger than that causes the broker to reset the connection.
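That matches the log: the 108 MB produce request exceeds the 104857600-byte default, so the broker drops the connection. The broker-side change can be sketched as follows (in the broker's server.properties; depending on the setup, the related broker settings message.max.bytes and replica.fetch.max.bytes may also need raising for messages this large):

```properties
# Kafka broker config (server.properties).
# Default socket.request.max.bytes is 104857600 (100 MB); a 108 MB
# produce request exceeds it and the broker resets the connection.
socket.request.max.bytes=1073741824
```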