Drainer crashed unexpectedly: Failed to close syncer

[TiDB deployment environment]
CentOS

[Symptoms] Business and database symptoms
The drainer crashed unexpectedly.

[TiDB version]
v4.0.7

The drainer replicates to Kafka. At around 4:00 this morning, one drainer instance crashed unexpectedly.

The key log entries are as follows:
[2021/12/22 04:07:53.936 +08:00] [INFO] [async_producer.go:813] ["[sarama] producer/broker/0 maximum request accumulated, waiting for space"]
[2021/12/22 04:07:58.490 +08:00] [INFO] [async_producer.go:918] ["[sarama] producer/broker/0 state change to [retrying] on bi2b_tidb_yecai/0 because kafka server: Request exceeded the user-specified time limit in the request."]
[2021/12/22 04:07:58.492 +08:00] [INFO] [async_producer.go:578] ["[sarama] producer/leader/bi2b_tidb_yecai/0 state change to [retrying-1]"]
[2021/12/22 04:07:58.492 +08:00] [INFO] [async_producer.go:588] ["[sarama] producer/leader/bi2b_tidb_yecai/0 abandoning broker 0"]
[2021/12/22 04:07:58.492 +08:00] [INFO] [async_producer.go:743] ["[sarama] producer/broker/0 state change to [closed] on bi2b_tidb_yecai/0"]
[2021/12/22 04:07:58.492 +08:00] [INFO] [async_producer.go:717] ["[sarama] producer/broker/0 input chan closed"]
[2021/12/22 04:07:58.493 +08:00] [INFO] [async_producer.go:801] ["[sarama] producer/broker/0 shut down"]
[2021/12/22 04:07:58.993 +08:00] [INFO] [client.go:772] ["[sarama] client/metadata fetching metadata for [bi2b_tidb_yecai] from broker kfk_ip:9092"]
[2021/12/22 04:07:58.995 +08:00] [INFO] [async_producer.go:711] ["[sarama] producer/broker/0 starting up"]
[2021/12/22 04:07:58.995 +08:00] [INFO] [async_producer.go:727] ["[sarama] producer/broker/0 state change to [open] on bi2b_tidb_yecai/0"]
[2021/12/22 04:07:58.995 +08:00] [INFO] [async_producer.go:570] ["[sarama] producer/leader/bi2b_tidb_yecai/0 selected broker 0"]
[2021/12/22 04:07:58.997 +08:00] [INFO] [async_producer.go:813] ["[sarama] producer/broker/0 maximum request accumulated, waiting for space"]
[2021/12/22 04:08:08.999 +08:00] [INFO] [async_producer.go:918] ["[sarama] producer/broker/0 state change to [retrying] on bi2b_tidb_yecai/0 because kafka server: Request exceeded the user-specified time limit in the request."]
[2021/12/22 04:08:09.000 +08:00] [INFO] [async_producer.go:594] ["[sarama] producer/leader/bi2b_tidb_yecai/0 state change to [flushing-1]"]
[2021/12/22 04:08:09.001 +08:00] [INFO] [async_producer.go:616] ["[sarama] producer/leader/bi2b_tidb_yecai/0 state change to [normal]"]
[2021/12/22 04:08:09.001 +08:00] [INFO] [async_producer.go:578] ["[sarama] producer/leader/bi2b_tidb_yecai/0 state change to [retrying-2]"]
[2021/12/22 04:08:09.001 +08:00] [INFO] [async_producer.go:588] ["[sarama] producer/leader/bi2b_tidb_yecai/0 abandoning broker 0"]
[2021/12/22 04:08:09.001 +08:00] [INFO] [async_producer.go:743] ["[sarama] producer/broker/0 state change to [closed] on bi2b_tidb_yecai/0"]
[2021/12/22 04:08:09.001 +08:00] [INFO] [async_producer.go:717] ["[sarama] producer/broker/0 input chan closed"]
[2021/12/22 04:08:09.001 +08:00] [INFO] [async_producer.go:801] ["[sarama] producer/broker/0 shut down"]
[2021/12/22 04:08:09.501 +08:00] [INFO] [client.go:772] ["[sarama] client/metadata fetching metadata for [bi2b_tidb_yecai] from broker kfk_ip:9092"]
[2021/12/22 04:08:09.503 +08:00] [INFO] [async_producer.go:711] ["[sarama] producer/broker/0 starting up"]
[2021/12/22 04:08:09.504 +08:00] [INFO] [async_producer.go:727] ["[sarama] producer/broker/0 state change to [open] on bi2b_tidb_yecai/0"]
[2021/12/22 04:08:09.504 +08:00] [INFO] [async_producer.go:570] ["[sarama] producer/leader/bi2b_tidb_yecai/0 selected broker 0"]
[2021/12/22 04:08:09.505 +08:00] [INFO] [async_producer.go:813] ["[sarama] producer/broker/0 maximum request accumulated, waiting for space"]
[2021/12/22 04:08:19.127 +08:00] [INFO] [syncer.go:257] ["write save point"] [ts=429946901167341601]
[2021/12/22 04:08:19.129 +08:00] [ERROR] [syncer.go:457] ["Failed to close syncer"] [error="fail to push msg to kafka after 30s, check if kafka is up and working"] [errorVerbose="fail to push msg to kafka after 30s, check if kafka is up and working
github.com/pingcap/tidb-binlog/drainer/sync.(*KafkaSyncer).run
\t/home/jenkins/agent/workspace/build_binlog_multi_branch_v4.0.7/go/src/github.com/pingcap/tidb-binlog/drainer/sync/kafka.go:236
runtime.goexit
\t/usr/local/go/src/runtime/asm_amd64.s:1357"]
[2021/12/22 04:08:19.507 +08:00] [INFO] [async_producer.go:918] ["[sarama] producer/broker/0 state change to [retrying] on bi2b_tidb_yecai/0 because kafka server: Request exceeded the user-specified time limit in the request."]
[2021/12/22 04:08:19.509 +08:00] [INFO] [async_producer.go:594] ["[sarama] producer/leader/bi2b_tidb_yecai/0 state change to [flushing-2]"]
[2021/12/22 04:08:19.509 +08:00] [INFO] [async_producer.go:616] ["[sarama] producer/leader/bi2b_tidb_yecai/0 state change to [normal]"]
[2021/12/22 04:08:19.509 +08:00] [INFO] [async_producer.go:578] ["[sarama] producer/leader/bi2b_tidb_yecai/0 state change to [retrying-3]"]
[2021/12/22 04:08:19.509 +08:00] [INFO] [async_producer.go:588] ["[sarama] producer/leader/bi2b_tidb_yecai/0 abandoning broker 0"]
[2021/12/22 04:08:19.509 +08:00] [INFO] [async_producer.go:743] ["[sarama] producer/broker/0 state change to [closed] on bi2b_tidb_yecai/0"]
[2021/12/22 04:08:19.509 +08:00] [INFO] [async_producer.go:717] ["[sarama] producer/broker/0 input chan closed"]
[2021/12/22 04:08:19.511 +08:00] [INFO] [async_producer.go:801] ["[sarama] producer/broker/0 shut down"]
[2021/12/22 04:08:20.009 +08:00] [INFO] [client.go:772] ["[sarama] client/metadata fetching metadata for [bi2b_tidb_yecai] from broker ip:9092"]
[2021/12/22 04:08:20.012 +08:00] [INFO] [async_producer.go:711] ["[sarama] producer/broker/0 starting up"]
[2021/12/22 04:08:20.012 +08:00] [INFO] [async_producer.go:727] ["[sarama] producer/broker/0 state change to [open] on bi2b_tidb_yecai/0"]
[2021/12/22 04:08:20.012 +08:00] [INFO] [async_producer.go:570] ["[sarama] producer/leader/bi2b_tidb_yecai/0 selected broker 0"]
[2021/12/22 04:08:20.013 +08:00] [INFO] [async_producer.go:813] ["[sarama] producer/broker/0 maximum request accumulated, waiting for space"]
[2021/12/22 04:08:20.018 +08:00] [INFO] [async_producer.go:594] ["[sarama] producer/leader/bi2b_tidb_yecai/0 state change to [flushing-3]"]
[2021/12/22 04:08:20.018 +08:00] [INFO] [async_producer.go:616] ["[sarama] producer/leader/bi2b_tidb_yecai/0 state change to [normal]"]
[2021/12/22 04:08:29.129 +08:00] [INFO] [server.go:451] ["begin to close drainer server"]
[2021/12/22 04:08:29.130 +08:00] [ERROR] [util.go:65] ["Recovered from panic"] [err=""Waiting too long for Syncer.run to quit.""] ["real stack"="github.com/pingcap/tidb-binlog/drainer.(*taskGroup).start.func1.1
\t/home/jenkins/agent/workspace/build_binlog_multi_branch_v4.0.7/go/src/github.com/pingcap/tidb-binlog/drainer/util.go:67
runtime.gopanic
\t/usr/local/go/src/runtime/panic.go:679
github.com/pingcap/tidb-binlog/drainer.(*Syncer).run
\t/home/jenkins/agent/workspace/build_binlog_multi_branch_v4.0.7/go/src/github.com/pingcap/tidb-binlog/drainer/syncer.go:463
github.com/pingcap/tidb-binlog/drainer.(*Syncer).Start
\t/home/jenkins/agent/workspace/build_binlog_multi_branch_v4.0.7/go/src/github.com/pingcap/tidb-binlog/drainer/syncer.go:132
github.com/pingcap/tidb-binlog/drainer.(*Server).Start.func4
\t/home/jenkins/agent/workspace/build_binlog_multi_branch_v4.0.7/go/src/github.com/pingcap/tidb-binlog/drainer/server.go:288
github.com/pingcap/tidb-binlog/drainer.(*taskGroup).start.func1
\t/home/jenkins/agent/workspace/build_binlog_multi_branch_v4.0.7/go/src/github.com/pingcap/tidb-binlog/drainer/util.go:75"] [name=syncer]
[2021/12/22 04:08:29.130 +08:00] [INFO] [util.go:72] [Exit] [name=syncer]
[2021/12/22 04:08:29.131 +08:00] [INFO] [server.go:416] ["has already update status"] [id=ip:8249]
[2021/12/22 04:08:29.131 +08:00] [INFO] [server.go:455] ["commit status done"]
[2021/12/22 04:08:29.131 +08:00] [INFO] [collector.go:135] ["publishBinlogs quit"]
[2021/12/22 04:08:29.131 +08:00] [INFO] [pump.go:77] ["pump is closing"] [id=ip:8250]
[2021/12/22 04:08:29.131 +08:00] [INFO] [util.go:72] [Exit] [name=heartbeat]
[2021/12/22 04:08:29.131 +08:00] [INFO] [pump.go:77] ["pump is closing"] [id=ip:8250]
[2021/12/22 04:08:29.131 +08:00] [INFO] [pump.go:77] ["pump is closing"] [id=ip:8250]
[2021/12/22 04:08:29.134 +08:00] [INFO] [util.go:72] [Exit] [name=collect]
[2021/12/22 04:08:29.134 +08:00] [INFO] [main.go:73] ["drainer exit"]
[2021/12/22 04:08:29.134 +08:00] [INFO] [server.go:470] ["drainer exit"]
[2021/12/22 04:08:29.134 +08:00] [INFO] [server.go:322] ["drainer http server stopped"] [error="mux: listener closed"]

The key monitoring graphs are in the two attached Grafana screenshots.

The memory monitoring shows no OOM; it just looks like Kafka couldn't keep up with the writes, and I don't understand why that would bring the drainer down. I read another user's thread about this where the fix adopted was to switch to TiCDC, but I'd like to know how to solve this while staying on TiDB Binlog.
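
For reference, the drainer-to-Kafka sink settings involved here live in drainer.toml and look roughly like this (a minimal sketch, not my full config: the kafka-version shown is a placeholder and kafka-max-messages is just the documented default rather than a tuned value; only the broker address and topic name match the log above):

[syncer]
db-type = "kafka"

[syncer.to]
# Kafka brokers the drainer pushes binlog messages to
kafka-addrs = "kfk_ip:9092"
# should match the Kafka cluster version (placeholder value)
kafka-version = "0.8.2.0"
# max number of messages batched toward the broker (documented default)
kafka-max-messages = 1024
topic-name = "bi2b_tidb_yecai"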


Could you check whether there is a large transaction whose binlog exceeds Kafka's message size limit?

The DDL job count keeps growing. Is there any way to check for large transaction binlogs?

A growing DDL job count is normal. As for large binlogs, you can look for them in the drainer log.
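
For instance, grepping the drainer log for the two errors that usually show up in this situation might look like this (the log path below is an assumption, adjust it to your deployment):

grep -E "Message was too large|Request exceeded the user-specified time limit" /tidb-deploy/drainer-8249/log/drainer.log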

Something like this (screenshot attached).

Thanks. I checked the drainer that crashed, and there were no large transactions at the corresponding time; only after the restart did a few large transactions appear. The only abnormal entries in the log around the crash window are the ones I listed above. I went through every drainer node, and none of them had a large transaction at the time of the failure.

In that case the only thing left is to check whether Kafka hit some limit on message writes. I remember there was a similar case before.

I checked Kafka as well. Maybe my approach was wrong, but I couldn't find the logs for that topic. The only thing I found was this:
java.io.IOException: Connection to 0 was disconnected before the response was read
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:114)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
at scala.Option.foreach(Option.scala:257)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:136)
at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:142)
at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:249)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:234)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
There is no "max message" or similar error anywhere.

https://docs.pingcap.com/zh/tidb/stable/handle-tidb-binlog-errors#drainer-同步数据到-kafka-时报错-kafka-server-message-was-too-large-server-rejected-it-to-avoid-allocation-error

message.max.bytes=1073741824
replica.fetch.max.bytes=1073741824
fetch.message.max.bytes=1073741824

Do you have these configured?
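
If it helps, this is roughly how I'd verify them (the paths and the ZooKeeper address below are assumptions, adjust to your deployment): message.max.bytes and replica.fetch.max.bytes are broker-side settings in server.properties, while the topic may additionally carry a per-topic max.message.bytes override:

grep -E "message.max.bytes|replica.fetch.max.bytes" /opt/kafka/config/server.properties
kafka-configs.sh --zookeeper zk_host:2181 --entity-type topics --entity-name bi2b_tidb_yecai --describe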

Hi, thanks for the reply. I do have these parameters configured. They were set to handle a "message was too large" error we ran into earlier, and that fix worked at the time. This time Kafka reports the error I posted above. I searched around and found all kinds of possible causes, but I'm not very familiar with Kafka; some posts say the problem can be that these values are set too large.

Bumping this myself: the same error has appeared again. Could anyone take another look?

Do you have Kafka monitoring?

Kafka only has host-level monitoring, no process-level monitoring. I checked the machine and there are plenty of resources left, and neither the monitoring nor the system logs show an OOM. I'm not sure what else to look at.
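
One thing that can still be checked from the command line without extra tooling is whether the topic's replicas were falling behind, since the broker-side exception above comes from a replica fetcher thread (the ZooKeeper address below is a placeholder):

kafka-topics.sh --zookeeper zk_host:2181 --describe --topic bi2b_tidb_yecai
kafka-topics.sh --zookeeper zk_host:2181 --describe --under-replicated-partitions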

You could try kafka-eagle.

OK, I'll look into it. Thanks!
