I haven't tried this operation myself and I'm not sure how risky it is, so be careful if this is a production environment. Check whether Kafka's port 9092 has been opened to the new drainer, and look into network issues such as firewall rules.
Since then I ran into quite a few follow-up problems. It has now been running without issues for more than half a month, so I'm coming back to this thread to summarize.
Problems encountered
Version
v4.0.4
Topology
| Node | Components |
|---|---|
| node-01 | pump, kafka |
| node-02 | pump, kafka |
| node-03 | pump, kafka, drainer |
Configuration before the problems started
Kafka version: 2.11-1.0.2
Kafka settings already adjusted:
- message.max.bytes=1073741824
- replica.fetch.max.bytes=1073741824
- fetch.message.max.bytes=1073741824
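For reference, this is roughly where those limits live as I understand it: the broker-side ones go into each broker's config/server.properties and need a broker restart, while fetch.message.max.bytes is, as far as I know, an (old) consumer-side setting rather than a broker one.

```properties
# config/server.properties on each broker (values from the list above; broker restart required)
message.max.bytes=1073741824          # largest single message the broker will accept
replica.fetch.max.bytes=1073741824    # must be >= message.max.bytes so followers can replicate large messages

# consumer side (e.g. the properties of whatever reads the topic) -- not a broker setting
fetch.message.max.bytes=1073741824
```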
First wave of symptoms
The drainer timed out while pushing data to Kafka and then crashed. The few messages that did make it into Kafka could not be consumed either; consumers would simply hang.
Kafka error:
[2023-04-02 20:00:22,726] ERROR Processor got uncaught exception. (kafka.network.Processor)
java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:694)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
at sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:241)
at sun.nio.ch.IOUtil.read(IOUtil.java:195)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.kafka.common.network.PlaintextTransportLayer.read(PlaintextTransportLayer.java:104)
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:145)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:93)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:235)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:196)
at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:547)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:483)
at org.apache.kafka.common.network.Selector.poll(Selector.java:412)
at kafka.network.Processor.poll(SocketServer.scala:575)
at kafka.network.Processor.run(SocketServer.scala:492)
at java.lang.Thread.run(Thread.java:748)
Drainer error:
[2023/04/10 14:42:43.573 +08:00] [ERROR] [syncer.go:457] ["Failed to close syncer"] [error="fail to push msg to kafka after 30s, check if kafka is up and working"] [errorVerbose="fail to push msg to kafka after 30s, check if kafka is up and working\ngithub.com/pingcap/tidb-binlog/drainer/sync.(*KafkaSyncer).run\n\t/home/jenkins/agent/workspace/build_binlog_multi_branch_v4.0.4/go/src/github.com/pingcap/tidb-binlog/drainer/sync/kafka.go:236\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"]
Shortly after the drainer started, it ate all of the host's memory; the host locked up and the process was killed by the system.
My judgment was that the host lacked resources. Since I had seen forum posts about the drainer suddenly consuming a very large amount of memory, I worried that simply adding memory to the original drainer host would not help, so I decided to migrate it to a new host with 64 GB of RAM.
Migration:
- Used tiup cluster scale-out to add the new drainer (see the sketch below)
- Left the old drainer in place for the time being
- To work around the OutOfMemoryError, increased the startup heap of the existing Kafka brokers
It started up normally.
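For reference, the scale-out and the Kafka heap change looked roughly like the sketch below. The cluster name, the new host's name, directories, topic name, and heap sizes are placeholders, not exact values from my environment.

```yaml
# scale-out.yaml -- one drainer on the new 64 GB host (hostname and dirs are placeholders)
drainer_servers:
  - host: node-04            # the new 64 GB host
    port: 8249
    data_dir: /data/tidb-data/drainer-8249
    config:
      syncer.db-type: "kafka"
      syncer.to.kafka-addrs: "node-01:9092,node-02:9092,node-03:9092"
      syncer.to.topic-name: "tidb-binlog-pre"
```

```bash
# add the new drainer without touching the old one (cluster name is a placeholder)
tiup cluster scale-out <cluster-name> scale-out.yaml

# on each existing Kafka broker: raise the JVM heap before restarting the broker
export KAFKA_HEAP_OPTS="-Xms4G -Xmx8G"
bin/kafka-server-start.sh -daemon config/server.properties
```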
Second wave of symptoms
After messages flowed in briefly, Kafka began to repeatedly report that it could not reach the other brokers. The error moved around between nodes: for example, broker 1 would report that broker 0 had disconnected, and then broker 2 would report that broker 1 had disconnected. Restarting Kafka did not help.
java.io.IOException: Connection to 1 was disconnected before the response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:95)
at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:96)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:209)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:41)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:149)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:113)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)
At this point the drainer stopped sending messages to Kafka and just printed, line after line, INFO logs about saving the ts; no amount of restarting helped.
It now felt like a Kafka problem. I read online that this is a bug in older Kafka versions, so I upgraded Kafka to 2.12-2.3.1, and the Kafka errors stopped.
But the drainer behaved the same: the logs kept printing, yet no messages were sent. At that point I had two drainer nodes, and I tried using binlogctl to mark the old drainer as offline (a sketch of the command is below), but that did not help either.
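For reference, marking the old drainer offline with binlogctl looked roughly like this; the PD address and node ID are examples from my topology:

```bash
# list the drainer nodes registered in PD to get the node-id of the old one
bin/binlogctl -pd-urls=http://<pd-host>:2379 -cmd drainers

# mark the old drainer's state as offline (node-id is an example)
bin/binlogctl -pd-urls=http://<pd-host>:2379 -cmd update-drainer -node-id node-03:8249 -state offline
```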
Comparing the new logs with the old ones, I found that the drainer no longer printed the broker connection messages (the following is from when messages were still flowing into Kafka normally):
[2023/04/10 14:41:39.096 +08:00] [INFO] [broker.go:214] ["[sarama] Connected to broker at node-01:9092 (unregistered)\n"]
[2023/04/10 14:41:39.100 +08:00] [INFO] [client.go:534] ["[sarama] client/brokers registered new broker #0 at node-02:9092"]
[2023/04/10 14:41:39.100 +08:00] [INFO] [client.go:534] ["[sarama] client/brokers registered new broker #2 at node-03:9092"]
[2023/04/10 14:41:39.100 +08:00] [INFO] [client.go:534] ["[sarama] client/brokers registered new broker #1 at node-01:9092"]
[2023/04/10 14:42:21.176 +08:00] [INFO] [client.go:772] ["[sarama] client/metadata fetching metadata for [tidb-binlog-pre] from broker node-01:9092\n"]
Instead it only printed:
[INFO] [client.go:716] ["[sarama] Client background metadata update:kafka: no specific topics to update metadata"]
So I decided to remove every drainer. After cleaning everything up, I installed a single drainer on the new large-memory host.
After it started, the Kafka-related log lines were printed again.
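For reference, removing the drainers and re-adding a single one was roughly the following; the cluster name and node IDs are placeholders:

```bash
# remove every existing drainer from the cluster
tiup cluster scale-in <cluster-name> --node node-03:8249
tiup cluster scale-in <cluster-name> --node node-04:8249
# clean up anything left in Tombstone state
tiup cluster prune <cluster-name>
# then add back a single drainer on the new 64 GB host with the same scale-out.yaml as before
tiup cluster scale-out <cluster-name> scale-out.yaml
```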
Third wave of symptoms
After a while, Kafka again stopped responding within the 30 s timeout. Checking the Kafka logs, I found this error:
org.apache.kafka.common.network.InvalidReceiveException: Invalid receive (size = 194889327 larger than 104857600)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:104)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:424)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:385)
at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:651)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:572)
at org.apache.kafka.common.network.Selector.poll(Selector.java:483)
at kafka.network.Processor.poll(SocketServer.scala:863)
at kafka.network.Processor.run(SocketServer.scala:762)
at java.lang.Thread.run(Thread.java:748)
[2023-04-10 14:42:14,740] WARN [SocketServer brokerId=2] Unexpected error from /172.16.65.123; closing connection (org.apache.kafka.common.network.Selector)
That 104857600 does not match the message.max.bytes=1073741824 I had configured; it is in fact the default value of socket.request.max.bytes (100 MB).
So the original three Kafka settings were not enough; this one also had to be added:
- socket.request.max.bytes=1572864000
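For reference, this also goes into each broker's server.properties, followed by a broker restart:

```properties
# config/server.properties on each broker, in addition to the three settings above
# (104857600 = 100 MB is the default of this setting, which is what the ~194 MB request was hitting)
socket.request.max.bytes=1572864000
```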
Fourth wave of symptoms
But after running normally for half a day, the drainer stopped delivering messages again.
This time the error was:
[2023/04/11 17:45:47.134 +08:00] [ERROR] [pump.go:152] ["pump receive binlog failed"] [id=node-03:8250] [error="rpc error: code = ResourceExhausted desc = grpc: received message larger than max (1530891262 vs. 1073741824)"]
This time it did not feel Kafka-related: Kafka reported no errors, and the message-size limits I had configured there do not match this 1073741824. Searching other answers on the forum, I found that the drainer needs the kafka-max-message-size setting.
But as soon as that setting was added, the drainer would not start at all. My version is v4.0.4. One post suggested modifying the drainer source code and recompiling it, but I don't know Golang, so I gave that up.
After downloading and trying several packages, I found that the v4.0.3-dirty build does accept this parameter. So I replaced only the drainer binary under the bin directory, added the kafka-max-message-size setting (roughly as sketched below), and it started successfully; that error was gone.
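For reference, the relevant part of the drainer configuration ends up looking roughly like this; the addresses and topic match my setup, but the size value shown is only an example, and kafka-max-message-size is the only newly added key:

```toml
# drainer.toml, [syncer.to] section for the Kafka downstream
# (the size value below is only an example)
[syncer.to]
kafka-addrs = "node-01:9092,node-02:9092,node-03:9092"
topic-name = "tidb-binlog-pre"
# the key the v4.0.4 binary refused to start with, but the v4.0.3-dirty binary accepts
kafka-max-message-size = 1610612736
```

In a tiup-managed cluster the same keys should also be settable through tiup cluster edit-config under the drainer's config section.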
Fifth wave of symptoms
But after running for a few hours the drainer died again, this time with:
[ERROR] [server.go:289] ["syncer exited abnormal"] [error="filterTable failed: not found table id: 88262"]
I had run into this error before, and chasing the table id is pointless; last time I recovered by restarting the TiDB components. So this time I restarted tidb, pump, and drainer, and everything ran normally again.
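For reference, the restart was roughly the following; the cluster name is a placeholder:

```bash
# restart only the affected components
tiup cluster restart <cluster-name> -R tidb,pump,drainer
```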
It then ran for two days without crashing.
Sixth wave of symptoms
Four days later the drainer showed this behavior:
- Binlogs of SQL executed from the application code were read and replicated by the drainer
- But SQL executed from the command line or a client tool was not picked up, and no log line like the following was printed:
["write save point"] [ts=440847041016365075]
Restarting tidb, pump, and drainer made it work again, but the SQL executed manually before the restart was never replicated by the drainer, and there is no way to know how many statements were missed.
Fortunately this is our pre-production TiDB, which we can restart at will; I don't dare imagine this happening in production.
Since the pre-production environment produces no data unless a test job is running, I set up a script that inserts a row into the upstream TiDB every minute and checks the downstream ten seconds later; so far it has not caught any anomalies. A sketch of the script follows.
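Everything in the sketch is a placeholder: hostnames, credentials, the sync_check.heartbeat table, and the downstream connection (whatever store ends up consuming the Kafka topic in your setup).

```bash
#!/bin/bash
# heartbeat.sh -- run from cron every minute:  * * * * * /path/to/heartbeat.sh
# Inserts a marker row into the upstream TiDB, waits 10 seconds, then checks
# that the row has arrived in the downstream store fed by the Kafka topic.
TS=$(date +%s)

# upstream TiDB (host/user/password/table are placeholders)
mysql -h upstream-tidb -P 4000 -u checker -p'***' -e \
  "INSERT INTO sync_check.heartbeat (id, created_at) VALUES (${TS}, NOW());"

sleep 10

# downstream store written by the Kafka consumer (placeholder connection)
ROWS=$(mysql -h downstream-db -P 3306 -u checker -p'***' -N -e \
  "SELECT COUNT(*) FROM sync_check.heartbeat WHERE id = ${TS};")

if [ "${ROWS:-0}" -eq 0 ]; then
  echo "$(date '+%F %T') heartbeat ${TS} not found downstream" >> /var/log/binlog_heartbeat.log
fi
```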
Summary
Give it as much memory as you possibly can; I feel all of these problems ultimately came from being too stingy with resources at the start.