Drainer crashed unexpectedly: Failed to close syncer

[TiDB deployment environment]
CentOS

[Symptoms] Business and database symptoms
The drainer crashed unexpectedly.

[TiDB version]
v4.0.7

The drainer replicates to Kafka. At around 4:00 this morning, one drainer instance crashed unexpectedly.

The key log entries are as follows:
[2021/12/22 04:07:53.936 +08:00] [INFO] [async_producer.go:813] ["[sarama] producer/broker/0 maximum request accumulated, waiting for space"]
[2021/12/22 04:07:58.490 +08:00] [INFO] [async_producer.go:918] ["[sarama] producer/broker/0 state change to [retrying] on bi2b_tidb_yecai/0 because kafka server: Request exceeded the user-specified time limit in the request."]
[2021/12/22 04:07:58.492 +08:00] [INFO] [async_producer.go:578] ["[sarama] producer/leader/bi2b_tidb_yecai/0 state change to [retrying-1]"]
[2021/12/22 04:07:58.492 +08:00] [INFO] [async_producer.go:588] ["[sarama] producer/leader/bi2b_tidb_yecai/0 abandoning broker 0"]
[2021/12/22 04:07:58.492 +08:00] [INFO] [async_producer.go:743] ["[sarama] producer/broker/0 state change to [closed] on bi2b_tidb_yecai/0"]
[2021/12/22 04:07:58.492 +08:00] [INFO] [async_producer.go:717] ["[sarama] producer/broker/0 input chan closed"]
[2021/12/22 04:07:58.493 +08:00] [INFO] [async_producer.go:801] ["[sarama] producer/broker/0 shut down"]
[2021/12/22 04:07:58.993 +08:00] [INFO] [client.go:772] ["[sarama] client/metadata fetching metadata for [bi2b_tidb_yecai] from broker kfk_ip:9092"]
[2021/12/22 04:07:58.995 +08:00] [INFO] [async_producer.go:711] ["[sarama] producer/broker/0 starting up"]
[2021/12/22 04:07:58.995 +08:00] [INFO] [async_producer.go:727] ["[sarama] producer/broker/0 state change to [open] on bi2b_tidb_yecai/0"]
[2021/12/22 04:07:58.995 +08:00] [INFO] [async_producer.go:570] ["[sarama] producer/leader/bi2b_tidb_yecai/0 selected broker 0"]
[2021/12/22 04:07:58.997 +08:00] [INFO] [async_producer.go:813] ["[sarama] producer/broker/0 maximum request accumulated, waiting for space"]
[2021/12/22 04:08:08.999 +08:00] [INFO] [async_producer.go:918] ["[sarama] producer/broker/0 state change to [retrying] on bi2b_tidb_yecai/0 because kafka server: Request exceeded the user-specified time limit in the request."]
[2021/12/22 04:08:09.000 +08:00] [INFO] [async_producer.go:594] ["[sarama] producer/leader/bi2b_tidb_yecai/0 state change to [flushing-1]"]
[2021/12/22 04:08:09.001 +08:00] [INFO] [async_producer.go:616] ["[sarama] producer/leader/bi2b_tidb_yecai/0 state change to [normal]"]
[2021/12/22 04:08:09.001 +08:00] [INFO] [async_producer.go:578] ["[sarama] producer/leader/bi2b_tidb_yecai/0 state change to [retrying-2]"]
[2021/12/22 04:08:09.001 +08:00] [INFO] [async_producer.go:588] ["[sarama] producer/leader/bi2b_tidb_yecai/0 abandoning broker 0"]
[2021/12/22 04:08:09.001 +08:00] [INFO] [async_producer.go:743] ["[sarama] producer/broker/0 state change to [closed] on bi2b_tidb_yecai/0"]
[2021/12/22 04:08:09.001 +08:00] [INFO] [async_producer.go:717] ["[sarama] producer/broker/0 input chan closed"]
[2021/12/22 04:08:09.001 +08:00] [INFO] [async_producer.go:801] ["[sarama] producer/broker/0 shut down"]
[2021/12/22 04:08:09.501 +08:00] [INFO] [client.go:772] ["[sarama] client/metadata fetching metadata for [bi2b_tidb_yecai] from broker kfk_ip:9092"]
[2021/12/22 04:08:09.503 +08:00] [INFO] [async_producer.go:711] ["[sarama] producer/broker/0 starting up"]
[2021/12/22 04:08:09.504 +08:00] [INFO] [async_producer.go:727] ["[sarama] producer/broker/0 state change to [open] on bi2b_tidb_yecai/0"]
[2021/12/22 04:08:09.504 +08:00] [INFO] [async_producer.go:570] ["[sarama] producer/leader/bi2b_tidb_yecai/0 selected broker 0"]
[2021/12/22 04:08:09.505 +08:00] [INFO] [async_producer.go:813] ["[sarama] producer/broker/0 maximum request accumulated, waiting for space"]
[2021/12/22 04:08:19.127 +08:00] [INFO] [syncer.go:257] ["write save point"] [ts=429946901167341601]
[2021/12/22 04:08:19.129 +08:00] [ERROR] [syncer.go:457] ["Failed to close syncer"] [error="fail to push msg to kafka after 30s, check if kafka is up and working"] [errorVerbose="fail to push msg to kafka after 30s, check if kafka is up and working
github.com/pingcap/tidb-binlog/drainer/sync.(*KafkaSyncer).run
\t/home/jenkins/agent/workspace/build_binlog_multi_branch_v4.0.7/go/src/github.com/pingcap/tidb-binlog/drainer/sync/kafka.go:236
runtime.goexit
\t/usr/local/go/src/runtime/asm_amd64.s:1357"]
[2021/12/22 04:08:19.507 +08:00] [INFO] [async_producer.go:918] ["[sarama] producer/broker/0 state change to [retrying] on bi2b_tidb_yecai/0 because kafka server: Request exceeded the user-specified time limit in the request."]
[2021/12/22 04:08:19.509 +08:00] [INFO] [async_producer.go:594] ["[sarama] producer/leader/bi2b_tidb_yecai/0 state change to [flushing-2]"]
[2021/12/22 04:08:19.509 +08:00] [INFO] [async_producer.go:616] ["[sarama] producer/leader/bi2b_tidb_yecai/0 state change to [normal]"]
[2021/12/22 04:08:19.509 +08:00] [INFO] [async_producer.go:578] ["[sarama] producer/leader/bi2b_tidb_yecai/0 state change to [retrying-3]"]
[2021/12/22 04:08:19.509 +08:00] [INFO] [async_producer.go:588] ["[sarama] producer/leader/bi2b_tidb_yecai/0 abandoning broker 0"]
[2021/12/22 04:08:19.509 +08:00] [INFO] [async_producer.go:743] ["[sarama] producer/broker/0 state change to [closed] on bi2b_tidb_yecai/0"]
[2021/12/22 04:08:19.509 +08:00] [INFO] [async_producer.go:717] ["[sarama] producer/broker/0 input chan closed"]
[2021/12/22 04:08:19.511 +08:00] [INFO] [async_producer.go:801] ["[sarama] producer/broker/0 shut down"]
[2021/12/22 04:08:20.009 +08:00] [INFO] [client.go:772] ["[sarama] client/metadata fetching metadata for [bi2b_tidb_yecai] from broker ip:9092"]
[2021/12/22 04:08:20.012 +08:00] [INFO] [async_producer.go:711] ["[sarama] producer/broker/0 starting up"]
[2021/12/22 04:08:20.012 +08:00] [INFO] [async_producer.go:727] ["[sarama] producer/broker/0 state change to [open] on bi2b_tidb_yecai/0"]
[2021/12/22 04:08:20.012 +08:00] [INFO] [async_producer.go:570] ["[sarama] producer/leader/bi2b_tidb_yecai/0 selected broker 0"]
[2021/12/22 04:08:20.013 +08:00] [INFO] [async_producer.go:813] ["[sarama] producer/broker/0 maximum request accumulated, waiting for space"]
[2021/12/22 04:08:20.018 +08:00] [INFO] [async_producer.go:594] ["[sarama] producer/leader/bi2b_tidb_yecai/0 state change to [flushing-3]"]
[2021/12/22 04:08:20.018 +08:00] [INFO] [async_producer.go:616] ["[sarama] producer/leader/bi2b_tidb_yecai/0 state change to [normal]"]
[2021/12/22 04:08:29.129 +08:00] [INFO] [server.go:451] ["begin to close drainer server"]
[2021/12/22 04:08:29.130 +08:00] [ERROR] [util.go:65] ["Recovered from panic"] [err=""Waiting too long for Syncer.run to quit.""] ["real stack"="github.com/pingcap/tidb-binlog/drainer.(*taskGroup).start.func1.1
\t/home/jenkins/agent/workspace/build_binlog_multi_branch_v4.0.7/go/src/github.com/pingcap/tidb-binlog/drainer/util.go:67
runtime.gopanic
\t/usr/local/go/src/runtime/panic.go:679
github.com/pingcap/tidb-binlog/drainer.(*Syncer).run
\t/home/jenkins/agent/workspace/build_binlog_multi_branch_v4.0.7/go/src/github.com/pingcap/tidb-binlog/drainer/syncer.go:463
github.com/pingcap/tidb-binlog/drainer.(*Syncer).Start
\t/home/jenkins/agent/workspace/build_binlog_multi_branch_v4.0.7/go/src/github.com/pingcap/tidb-binlog/drainer/syncer.go:132
github.com/pingcap/tidb-binlog/drainer.(*Server).Start.func4
\t/home/jenkins/agent/workspace/build_binlog_multi_branch_v4.0.7/go/src/github.com/pingcap/tidb-binlog/drainer/server.go:288
github.com/pingcap/tidb-binlog/drainer.(*taskGroup).start.func1
\t/home/jenkins/agent/workspace/build_binlog_multi_branch_v4.0.7/go/src/github.com/pingcap/tidb-binlog/drainer/util.go:75"] [name=syncer]
[2021/12/22 04:08:29.130 +08:00] [INFO] [util.go:72] [Exit] [name=syncer]
[2021/12/22 04:08:29.131 +08:00] [INFO] [server.go:416] ["has already update status"] [id=ip:8249]
[2021/12/22 04:08:29.131 +08:00] [INFO] [server.go:455] ["commit status done"]
[2021/12/22 04:08:29.131 +08:00] [INFO] [collector.go:135] ["publishBinlogs quit"]
[2021/12/22 04:08:29.131 +08:00] [INFO] [pump.go:77] ["pump is closing"] [id=ip:8250]
[2021/12/22 04:08:29.131 +08:00] [INFO] [util.go:72] [Exit] [name=heartbeat]
[2021/12/22 04:08:29.131 +08:00] [INFO] [pump.go:77] ["pump is closing"] [id=ip:8250]
[2021/12/22 04:08:29.131 +08:00] [INFO] [pump.go:77] ["pump is closing"] [id=ip:8250]
[2021/12/22 04:08:29.134 +08:00] [INFO] [util.go:72] [Exit] [name=collect]
[2021/12/22 04:08:29.134 +08:00] [INFO] [main.go:73] ["drainer exit"]
[2021/12/22 04:08:29.134 +08:00] [INFO] [server.go:470] ["drainer exit"]
[2021/12/22 04:08:29.134 +08:00] [INFO] [server.go:322] ["drainer http server stopped"] [error="mux: listener closed"]

The key monitoring graphs are in the two attached Grafana screenshots.

The memory monitoring shows no OOM; it just looks like Kafka couldn't keep up with the writes, and I don't understand why that would bring the drainer down. I read another user's thread about this where the fix adopted was to switch to TiCDC, but I'd like to know how to solve this while staying on TiDB Binlog.
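
For reference, the drainer-to-Kafka sink settings involved here live in drainer.toml and look roughly like this (a minimal sketch, not my full config: the kafka-version shown is a placeholder and kafka-max-messages is just the documented default rather than a tuned value; only the broker address and topic name match the log above):

[syncer]
db-type = "kafka"

[syncer.to]
# Kafka brokers the drainer pushes binlog messages to
kafka-addrs = "kfk_ip:9092"
# should match the Kafka cluster version (placeholder value)
kafka-version = "0.8.2.0"
# max number of messages batched toward the broker (documented default)
kafka-max-messages = 1024
topic-name = "bi2b_tidb_yecai"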


Could you check whether there is a large transaction whose binlog exceeds Kafka's message size limit?

The DDL job count keeps growing. Is there any way to check for large transaction binlogs?

A growing DDL job count is normal. As for large binlogs, you can look for them in the drainer log.
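
For instance, grepping the drainer log for the two errors that usually show up in this situation might look like this (the log path below is an assumption, adjust it to your deployment):

grep -E "Message was too large|Request exceeded the user-specified time limit" /tidb-deploy/drainer-8249/log/drainer.log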

Something like this (screenshot attached).

Thanks. I checked the drainer that crashed, and there were no large transactions at the corresponding time; only after the restart did a few large transactions appear. The only abnormal entries in the log around the crash window are the ones I listed above. I went through every drainer node, and none of them had a large transaction at the time of the failure.

In that case the only thing left is to check whether Kafka hit some limit on message writes. I remember there was a similar case before.

I checked Kafka as well. Maybe my approach was wrong, but I couldn't find the logs for that topic. The only thing I found was this:
java.io.IOException: Connection to 0 was disconnected before the response was read
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:114)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
at scala.Option.foreach(Option.scala:257)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:136)
at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:142)
at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:249)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:234)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
There is no "max message" or similar error anywhere.

https://docs.pingcap.com/zh/tidb/stable/handle-tidb-binlog-errors#drainer-同步数据到-kafka-时报错-kafka-server-message-was-too-large-server-rejected-it-to-avoid-allocation-error

message.max.bytes=1073741824
replica.fetch.max.bytes=1073741824
fetch.message.max.bytes=1073741824

Do you have these configured?
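
If it helps, this is roughly how I'd verify them (the paths and the ZooKeeper address below are assumptions, adjust to your deployment): message.max.bytes and replica.fetch.max.bytes are broker-side settings in server.properties, while the topic may additionally carry a per-topic max.message.bytes override:

grep -E "message.max.bytes|replica.fetch.max.bytes" /opt/kafka/config/server.properties
kafka-configs.sh --zookeeper zk_host:2181 --entity-type topics --entity-name bi2b_tidb_yecai --describe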

Hi, thanks for the reply. I do have these parameters configured. They were set to handle a "message was too large" error we ran into earlier, and that fix worked at the time. This time Kafka reports the error I posted above. I searched around and found all kinds of possible causes, but I'm not very familiar with Kafka; some posts say the problem can be that these values are set too large.

Bumping this myself: the same error has appeared again. Could anyone take another look?

Do you have Kafka monitoring?

Kafka only has host-level monitoring, no process-level monitoring. I checked the machine and there are plenty of resources left, and neither the monitoring nor the system logs show an OOM. I'm not sure what else to look at.
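
One thing that can still be checked from the command line without extra tooling is whether the topic's replicas were falling behind, since the broker-side exception above comes from a replica fetcher thread (the ZooKeeper address below is a placeholder):

kafka-topics.sh --zookeeper zk_host:2181 --describe --topic bi2b_tidb_yecai
kafka-topics.sh --zookeeper zk_host:2181 --describe --under-replicated-partitions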

You could try kafka-eagle.

OK, I'll look into it. Thanks!
