Drainer stops writing messages to Kafka after scale-out

I haven't tried this operation and I'm not sure how risky it is, so be careful if this is a production environment. Check whether Kafka's port 9092 is open to the new drainer, and look into network issues such as firewalls.

After that I ran into quite a few follow-up problems. As of now it has been running for more than half a month without issues, so I'm coming back to this thread to write up a summary.

Problems encountered

Version

v4.0.4

Topology

node-01: pump, kafka
node-02: pump, kafka
node-03: pump, kafka, drainer

Configuration before the problems started

Kafka version: 2.11-1.0.2

Kafka settings that had already been adjusted (a server.properties sketch follows this list):

  • message.max.bytes=1073741824
  • replica.fetch.max.bytes=1073741824
  • fetch.message.max.bytes=1073741824
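
For reference, a sketch of how these look in each broker's config/server.properties. The values are the ones used here; the comments are mine, and fetch.message.max.bytes is really an old consumer-side property rather than a broker setting:

# config/server.properties on each broker (sketch)
message.max.bytes=1073741824         # largest single message the broker will accept (~1 GiB)
replica.fetch.max.bytes=1073741824   # replica fetch limit; keep it >= message.max.bytes so large messages still replicate
fetch.message.max.bytes=1073741824   # legacy consumer fetch limit; listed here only because it was set alongside the others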

First wave of symptoms

Drainer reported timeouts when pushing data to Kafka and then crashed. The messages that barely made it into Kafka could not be consumed either; consumers would just hang.

Kafka error:

[2023-04-02 20:00:22,726] ERROR Processor got uncaught exception. (kafka.network.Processor)
java.lang.OutOfMemoryError: Direct buffer memory
        at java.nio.Bits.reserveMemory(Bits.java:694)
        at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
        at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
        at sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:241)
        at sun.nio.ch.IOUtil.read(IOUtil.java:195)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        at org.apache.kafka.common.network.PlaintextTransportLayer.read(PlaintextTransportLayer.java:104)
        at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:145)
        at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:93)
        at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:235)
        at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:196)
        at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:547)
        at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:483)
        at org.apache.kafka.common.network.Selector.poll(Selector.java:412)
        at kafka.network.Processor.poll(SocketServer.scala:575)
        at kafka.network.Processor.run(SocketServer.scala:492)
        at java.lang.Thread.run(Thread.java:748)

Drainer error:

[2023/04/10 14:42:43.573 +08:00] [ERROR] [syncer.go:457] ["Failed to close syncer"] [error="fail to push msg to kafka after 30s, check if kafka is up and working"] [errorVerbose="fail to push msg to kafka after 30s, check if kafka is up and working\ngithub.com/pingcap/tidb-binlog/drainer/sync.(*KafkaSyncer).run\n\t/home/jenkins/agent/workspace/build_binlog_multi_branch_v4.0.4/go/src/github.com/pingcap/tidb-binlog/drainer/sync/kafka.go:236\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"]

Shortly after startup, drainer consumed all of the host's memory, the host locked up, and the process was killed by the system.

I judged this to be insufficient host resources. Since forum posts mention drainer suddenly consuming very large amounts of memory, I was worried that adding memory to the original drainer host wouldn't be enough either, so I decided to migrate it to a new host with 64 GB of RAM.

Migration steps:

  • Scaled out a new drainer with tiup cluster scale-out (a topology sketch follows this list)
  • Left the old drainer in place for the time being
  • To deal with the OutOfMemoryError, increased the startup memory (heap) of the existing Kafka brokers
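
A rough sketch of what the scale-out looked like; node-04 is a hypothetical name for the new 64 GB host, the paths are placeholders, and the drainer config keys shown are the usual Kafka-sink ones rather than an exact copy of my topology:

# scale-out.yaml (sketch)
drainer_servers:
  - host: node-04                     # the new 64 GB host (hypothetical name)
    port: 8249
    data_dir: /data/tidb-data/drainer-8249
    config:
      syncer.db-type: "kafka"
      syncer.to.kafka-addrs: "node-01:9092,node-02:9092,node-03:9092"
      syncer.to.topic-name: "tidb-binlog-pre"

# applied with: tiup cluster scale-out <cluster-name> scale-out.yaml

As for the Kafka startup memory, the usual knob is the KAFKA_HEAP_OPTS environment variable read by kafka-server-start.sh, e.g. export KAFKA_HEAP_OPTS="-Xms4g -Xmx4g" (the sizes here are placeholders).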

It started up normally.

Second wave of symptoms

After messages were written for a short while, Kafka began to repeatedly report that it could not connect to the other brokers. The errors rotated around: for example, broker 1 would report that broker 0 was unreachable, while broker 2 reported that broker 1 was unreachable. Restarting Kafka did not help.

java.io.IOException: Connection to 1 was disconnected before the response was read
        at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:95)
        at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:96)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:209)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:41)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:149)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:113)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)

At this point drainer stopped delivering messages to Kafka altogether; it just printed line after line of INFO logs about saving the ts, and no amount of restarting helped.

This felt like a Kafka problem. Posts online said it was a bug in older Kafka versions, so I upgraded Kafka to 2.12-2.3.1 and the Kafka errors stopped.
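
The upgrade itself followed the standard Kafka rolling-upgrade procedure; roughly (this is the documented procedure, not a transcript of my exact commands):

# rolling upgrade from kafka_2.11-1.0.2 to kafka_2.12-2.3.1 (sketch)
# 1. on each broker, pin the protocol to the old version in config/server.properties:
#      inter.broker.protocol.version=1.0
# 2. stop that broker, unpack kafka_2.12-2.3.1, reuse the existing server.properties and log dirs, start it
# 3. repeat broker by broker; once all brokers run 2.3.1, set inter.broker.protocol.version=2.3
#    and do one more rolling restart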

But drainer stayed the same: the logs kept printing as usual, yet no messages were delivered. At this point I had two drainer nodes, and I tried using binlogctl to mark the old drainer as offline (see the sketch below), but that didn't help either.
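
For reference, marking a drainer offline with binlogctl looks roughly like this (the PD address is a placeholder; node-03:8249 is the old drainer from the topology above):

# list the registered drainers, then take the old one offline
./binlogctl -pd-urls=http://<pd-host>:2379 -cmd drainers
./binlogctl -pd-urls=http://<pd-host>:2379 -cmd offline-drainer -node-id node-03:8249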

Comparing old and new logs, I found that drainer was no longer printing the broker-connection messages (the lines below are from when messages were still being delivered to Kafka normally):

[2023/04/10 14:41:39.096 +08:00] [INFO] [broker.go:214] ["[sarama] Connected to broker at node-01:9092 (unregistered)\n"]
[2023/04/10 14:41:39.100 +08:00] [INFO] [client.go:534] ["[sarama] client/brokers registered new broker #0 at node-02:9092"]
[2023/04/10 14:41:39.100 +08:00] [INFO] [client.go:534] ["[sarama] client/brokers registered new broker #2 at node-03:9092"]
[2023/04/10 14:41:39.100 +08:00] [INFO] [client.go:534] ["[sarama] client/brokers registered new broker #1 at node-01:9092"]
[2023/04/10 14:42:21.176 +08:00] [INFO] [client.go:772] ["[sarama] client/metadata fetching metadata for [tidb-binlog-pre] from broker node-01:9092\n"]

Instead, it only printed:

[INFO] [client.go:716] ["[sarama] Client background metadata update:kafka: no specific topics to update metadata"]

So I made up my mind to remove all the drainer nodes. After cleaning everything up, I deployed a single drainer on the new large-memory host, and only that one.

After it started, the Kafka-related log lines were printed again.

Third wave of symptoms

After a while, Kafka again hit the 30s timeout with no response. Checking the Kafka logs, I found this error:

org.apache.kafka.common.network.InvalidReceiveException: Invalid receive (size = 194889327 larger than 104857600)
        at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:104)
        at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:424)
        at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:385)
        at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:651)
        at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:572)
        at org.apache.kafka.common.network.Selector.poll(Selector.java:483)
        at kafka.network.Processor.poll(SocketServer.scala:863)
        at kafka.network.Processor.run(SocketServer.scala:762)
        at java.lang.Thread.run(Thread.java:748)
[2023-04-10 14:42:14,740] WARN [SocketServer brokerId=2] Unexpected error from /172.16.65.123; closing connection (org.apache.kafka.common.network.Selector)

This 104857600 does not match the message.max.bytes=1073741824 I had configured; it is in fact the default value of socket.request.max.bytes (100 MB), the limit the broker checks incoming requests against.

The original three Kafka settings alone are not enough; this one also has to be added (sketch after the list):

  • socket.request.max.bytes=1572864000
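
In other words, each broker's config/server.properties gained one more line (sketch), followed by a broker restart:

socket.request.max.bytes=1572864000   # raises the request-size cap from the 100 MB default (104857600) seen in the error above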

Fourth wave of symptoms

But after running normally for half a day, drainer stopped delivering messages again.

This time the error was:

[2023/04/11 17:45:47.134 +08:00] [ERROR] [pump.go:152] ["pump receive binlog failed"] [id=node-03:8250] [error="rpc error: code = ResourceExhausted desc = grpc: received message larger than max (1530891262 vs. 1073741824)"]

This time it did not look like a Kafka problem: Kafka reported no errors, and the message-size limits I had configured on Kafka did not correspond to this 1073741824. Searching other answers on the forum showed that drainer needs the kafka-max-message-size setting.

But as soon as that setting was added, drainer would not start at all. My version is v4.0.4; one post suggested modifying the drainer source code and recompiling, but I don't know Go, so I gave that up.

After downloading and trying several packages, I found that the v4.0.3-dirty build does accept this parameter. So I replaced only the drainer binary under the bin directory, added the kafka-max-message-size setting, and it started successfully; that error was gone.
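
For reference, a sketch of where the setting lives on the drainer side. In drainer.toml it sits in the [syncer.to] section of the Kafka sink; the addresses and topic below are the ones from the logs above, the rest is illustrative:

[syncer.to]
kafka-addrs = "node-01:9092,node-02:9092,node-03:9092"
kafka-version = "2.3.1"
topic-name = "tidb-binlog-pre"
# cap on a single message drainer produces to Kafka; the v4.0.4 binary here refused to start
# with this line present, while the v4.0.3-dirty binary accepted it
kafka-max-message-size = 1073741824

With a tiup-managed cluster the same keys can also be set through tiup cluster edit-config under the drainer's config section instead of editing drainer.toml by hand.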

Fifth wave of symptoms

But after running for a few hours drainer crashed again, this time with:

[ERROR] [server.go:289] ["syncer exited abnormal"] [error="filterTable failed: not found table id: 88262"]

I had run into this error before, and chasing the table id leads nowhere; last time I recovered by restarting the TiDB components. So this time I restarted tidb, pump and drainer, and everything ran normally again.
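
With tiup the restart can be limited to the affected roles, roughly:

# cluster name is a placeholder
tiup cluster restart <cluster-name> -R tidb
tiup cluster restart <cluster-name> -R pump
tiup cluster restart <cluster-name> -R drainer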

It then ran for two days without crashing.

Sixth wave of symptoms

Four days later, drainer showed the following:

  • Binlog for SQL executed from the application code was read and synced by drainer
  • But SQL executed from the command line or a client tool was not picked up, and no log line like the following was printed:
["write save point"] [ts=440847041016365075]

Restarting tidb, pump and drainer made it work again, but the SQL executed manually before the restart was never synced by drainer, and there is no way to tell how many statements were missed.

Fortunately this is our pre-production TiDB, which we can restart at will; I don't dare imagine this happening in production.

Since the pre-production environment generates no data when no test jobs are running, I set up a script that inserts a row into the upstream TiDB every minute and checks the downstream ten seconds later. So far it has not caught anything abnormal.
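
The check script itself is nothing fancy. A minimal sketch of the idea, assuming the data eventually consumed from the Kafka topic lands in a MySQL-compatible downstream and that a dedicated heartbeat table exists on both sides (hosts, credentials and the table name are all placeholders):

#!/bin/bash
# Insert a heartbeat row upstream every minute; 10 seconds later, check that it reached the downstream.
while true; do
  ts=$(date +%s)
  mysql -h upstream-tidb -P 4000 -u checker -p"$CHECKER_PASS" \
    -e "INSERT INTO monitor.binlog_heartbeat (id, created_at) VALUES ($ts, NOW())"
  sleep 10
  found=$(mysql -h downstream-db -u checker -p"$CHECKER_PASS" -N \
    -e "SELECT COUNT(*) FROM monitor.binlog_heartbeat WHERE id = $ts")
  if [ "$found" != "1" ]; then
    echo "$(date '+%F %T') heartbeat $ts not found downstream" >> binlog_heartbeat.log
  fi
  sleep 50                            # next insert roughly one minute after the previous one
done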

Summary

Give it as much memory as you can. My feeling is that all of these problems came from being too stingy with resources at the start.
