dm同步卡死,tidb数据库不在写入数据。drainer一直像kafka中写数据

为提高效率,提问时请提供以下信息,问题描述清晰可优先响应。

  • 【TiDB 版本】:3.0.5
  • 【问题描述】: dm同步过程中由于sql异常中断,tidb数据库暂停同步。
    tidb数据库暂时没有连接向其输入数据,查询对应表中数据一直为空。
    目前drainer一直像kafka中写数据,数据量已过万,但对应数据库并未写入数据。

drainer_error.log

goroutine 262 [running]:
github.com/pingcap/tidb-binlog/drainer/sync.(*KafkaSyncer).run.func2(0xc004684030, 0xc0022e0ee0)
	/home/jenkins/agent/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/sync/kafka.go:188 +0xa9
created by github.com/pingcap/tidb-binlog/drainer/sync.(*KafkaSyncer).run
	/home/jenkins/agent/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/sync/kafka.go:184 +0xd0
2020/03/27 18:52:06 Connected to 192.168.10.203:2181
2020/03/27 18:52:06 Authenticated: id=216314377583788199, timeout=40000
2020/03/27 18:52:06 Re-submitting `0` credentials after reconnect
2020/03/27 18:52:06 Recv loop terminated: err=EOF
2020/03/27 18:52:06 Send loop terminated: err=<nil>
panic: kafka: Failed to produce message to topic kafka_binlog: kafka server: Message was too large, server rejected it to avoid allocation erro
r.

goroutine 260 [running]:
github.com/pingcap/tidb-binlog/drainer/sync.(*KafkaSyncer).run.func2(0xc004e00000, 0xc004da8ee0)
	/home/jenkins/agent/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/sync/kafka.go:188 +0xa9
created by github.com/pingcap/tidb-binlog/drainer/sync.(*KafkaSyncer).run
	/home/jenkins/agent/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/sync/kafka.go:184 +0xd0
2020/03/27 18:52:27 Connected to 192.168.10.204:2181
2020/03/27 18:52:27 Authenticated: id=288371984454123681, timeout=40000
2020/03/27 18:52:27 Re-submitting `0` credentials after reconnect
2020/03/27 18:52:27 Recv loop terminated: err=EOF
2020/03/27 18:52:27 Send loop terminated: err=<nil>
panic: kafka: Failed to produce message to topic kafka_binlog: kafka server: Message was too large, server rejected it to avoid allocation erro
r.

goroutine 211 [running]:
github.com/pingcap/tidb-binlog/drainer/sync.(*KafkaSyncer).run.func2(0xc005010000, 0xc004ff8770)
	/home/jenkins/agent/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/sync/kafka.go:188 +0xa9
created by github.com/pingcap/tidb-binlog/drainer/sync.(*KafkaSyncer).run
	/home/jenkins/agent/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/sync/kafka.go:184 +0xd0
2020/03/27 18:52:48 Connected to 192.168.10.203:2181
2020/03/27 18:52:48 Authenticated: id=216314377583788200, timeout=40000
2020/03/27 18:52:48 Re-submitting `0` credentials after reconnect
2020/03/27 18:52:48 Recv loop terminated: err=EOF
2020/03/27 18:52:48 Send loop terminated: err=<nil>
panic: kafka: Failed to produce message to topic kafka_binlog: kafka server: Message was too large, server rejected it to avoid allocation erro
r.

goroutine 236 [running]:
github.com/pingcap/tidb-binlog/drainer/sync.(*KafkaSyncer).run.func2(0xc004be2000, 0xc004acdf10)
	/home/jenkins/agent/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/sync/kafka.go:188 +0xa9
created by github.com/pingcap/tidb-binlog/drainer/sync.(*KafkaSyncer).run
	/home/jenkins/agent/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/sync/kafka.go:184 +0xd0
2020/03/27 18:53:09 Connected to 192.168.10.202:2181
2020/03/27 18:53:09 Authenticated: id=144262481109385386, timeout=40000
2020/03/27 18:53:09 Re-submitting `0` credentials after reconnect
2020/03/27 18:53:09 Recv loop terminated: err=EOF
2020/03/27 18:53:09 Send loop terminated: err=<nil>
panic: kafka: Failed to produce message to topic kafka_binlog: kafka server: Message was too large, server rejected it to avoid allocation erro
r.

goroutine 260 [running]:
github.com/pingcap/tidb-binlog/drainer/sync.(*KafkaSyncer).run.func2(0xc004a44000, 0xc003c20a80)
	/home/jenkins/agent/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/sync/kafka.go:188 +0xa9
created by github.com/pingcap/tidb-binlog/drainer/sync.(*KafkaSyncer).run
	/home/jenkins/agent/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/sync/kafka.go:184 +0xd0
2020/03/27 18:53:31 Connected to 192.168.10.203:2181
2020/03/27 18:53:31 Authenticated: id=216314377583788201, timeout=40000
2020/03/27 18:53:31 Re-submitting `0` credentials after reconnect
2020/03/27 18:53:31 Recv loop terminated: err=EOF
2020/03/27 18:53:31 Send loop terminated: err=<nil>
panic: kafka: Failed to produce message to topic kafka_binlog: kafka server: Message was too large, server rejected it to avoid allocation erro
r.

goroutine 254 [running]:
github.com/pingcap/tidb-binlog/drainer/sync.(*KafkaSyncer).run.func2(0xc004ea0310, 0xc003b768c0)
	/home/jenkins/agent/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/sync/kafka.go:188 +0xa9
created by github.com/pingcap/tidb-binlog/drainer/sync.(*KafkaSyncer).run
	/home/jenkins/agent/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/sync/kafka.go:184 +0xd0
2020/03/27 18:53:51 Connected to 192.168.10.201:2181
2020/03/27 18:53:51 Authenticated: id=72204371365855368, timeout=40000
2020/03/27 18:53:51 Re-submitting `0` credentials after reconnect
2020/03/27 18:53:51 Recv loop terminated: err=EOF
2020/03/27 18:53:51 Send loop terminated: err=<nil>
panic: kafka: Failed to produce message to topic kafka_binlog: kafka server: Message was too large, server rejected it to avoid allocation erro
r.

goroutine 266 [running]:
github.com/pingcap/tidb-binlog/drainer/sync.(*KafkaSyncer).run.func2(0xc0045e6040, 0xc0021d57a0)
	/home/jenkins/agent/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/sync/kafka.go:188 +0xa9
created by github.com/pingcap/tidb-binlog/drainer/sync.(*KafkaSyncer).run
	/home/jenkins/agent/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/sync/kafka.go:184 +0xd0
2020/03/27 18:54:12 Connected to 192.168.10.202:2181
2020/03/27 18:54:12 Authenticated: id=144262481109385387, timeout=40000
2020/03/27 18:54:12 Re-submitting `0` credentials after reconnect
2020/03/27 18:54:12 Recv loop terminated: err=EOF
2020/03/27 18:54:12 Send loop terminated: err=<nil>
panic: kafka: Failed to produce message to topic kafka_binlog: kafka server: Message was too large, server rejected it to avoid allocation erro
r.

goroutine 191 [running]:
github.com/pingcap/tidb-binlog/drainer/sync.(*KafkaSyncer).run.func2(0xc0019cc010, 0xc003eaaa80)
	/home/jenkins/agent/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/sync/kafka.go:188 +0xa9
created by github.com/pingcap/tidb-binlog/drainer/sync.(*KafkaSyncer).run
	/home/jenkins/agent/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/sync/kafka.go:184 +0xd0
2020/03/27 18:54:33 Connected to 192.168.10.202:2181
2020/03/27 18:54:33 Authenticated: id=144262481109385388, timeout=40000
2020/03/27 18:54:33 Re-submitting `0` credentials after reconnect
2020/03/27 18:54:33 Recv loop terminated: err=EOF
2020/03/27 18:54:33 Send loop terminated: err=<nil>
panic: kafka: Failed to produce message to topic kafka_binlog: kafka server: Message was too large, server rejected it to avoid allocation erro
r.

kafka 参数调整

目前的同步链路是 MySQL -> DM -> TiDB -> TiDB Binlog -> Kafka 的链路吗

从 Drainer 日志中看到是 Binlog 发送给 Kafka 接收数据的消息大小限制,可以参考官方文档调整看下

tidb 没有写入,drainber 一直写数据,也可以确认一下是否有大事务执行

确认是有大事务执行,对应的表并没有大事务执行,重复的插入到kafka的只有三条,重复插入

有调整 Kafka 参数尝试下吗

调整kafka参数,重启drainer后不再向kafka写入数据了,还是报错单条数据太大

kafka参数调整完报错: kafka: Failed to produce message to topic kafka_binlog: kafka server: Replication-factor is invalid.

关于上游 DM 卡死之后,TiDB 数据库不再写入数据,drainer 还是往下游 Kafka 写入数据,可以看下 Binlog 监控中已经同步到下游的 tso 时间,确认是否是有同步延迟的情况

调整完 Kafka 参数之后有重启启动 drainer,重新连接 Kafka 吗?

是的,好的

重新部署,然后把同步时间点设为-1 还是有报错:

goroutine 524 [chan receive]:
github.com/Shopify/sarama.(*Broker).responseReceiver(0xc0001dfc00)
	/home/jenkins/agent/workspace/release_tidb_3.0/go/pkg/mod/github.com/!shopify/sarama@v1.23.1/broker.go:791 +0xc7
github.com/Shopify/sarama.withRecover(0xc003c848f0)
	/home/jenkins/agent/workspace/release_tidb_3.0/go/pkg/mod/github.com/!shopify/sarama@v1.23.1/utils.go:45 +0x43
created by github.com/Shopify/sarama.(*Broker).Open.func1
	/home/jenkins/agent/workspace/release_tidb_3.0/go/pkg/mod/github.com/!shopify/sarama@v1.23.1/broker.go:216 +0x96a
2020/03/29 16:51:31 Connected to 192.168.10.204:2181
2020/03/29 16:51:31 Authenticated: id=288400992535707684, timeout=40000
2020/03/29 16:51:31 Re-submitting `0` credentials after reconnect
2020/03/29 16:51:31 Recv loop terminated: err=EOF
2020/03/29 16:51:31 Send loop terminated: err=<nil>
panic: kafka: Failed to produce message to topic kafka_binlog: write tcp 192.168.10.201:59818->192.168.10.201:9092: write: connection reset by 
peer

goroutine 273 [running]:
github.com/pingcap/tidb-binlog/drainer/sync.(*KafkaSyncer).run.func2(0xc002ad4730, 0xc003f35730)
	/home/jenkins/agent/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/sync/kafka.go:188 +0xa9
created by github.com/pingcap/tidb-binlog/drainer/sync.(*KafkaSyncer).run
	/home/jenkins/agent/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/sync/kafka.go:184 +0xd0
2020/03/29 18:48:21 Connected to 192.168.10.201:2181
2020/03/29 18:48:21 Authenticated: id=72228094038442038, timeout=40000
2020/03/29 18:48:21 Re-submitting `0` credentials after reconnect
2020/03/29 18:48:21 Recv loop terminated: err=EOF
2020/03/29 18:48:21 Send loop terminated: err=<nil>
2020/03/29 20:11:23 Connected to 192.168.10.203:2181
2020/03/29 20:11:24 Authenticated: id=216343377805377588, timeout=40000
2020/03/29 20:11:24 Re-submitting `0` credentials after reconnect
2020/03/29 20:11:24 Recv loop terminated: err=EOF
2020/03/29 20:11:24 Send loop terminated: err=<nil>

drainer.log (4.4 MB) drainer_stderr.log (190.8 KB)

应该是大事务的问题,但是tidb的数据全由dm同步过来的。业务库并没有1G以上的大事务执行,应该还是有问题的
而且tidb整个数据库重新同步了,然后drainer的同步点设为-1(官网上说-1代表最新的时间点开始同步),应该不会有大事务的。

MySQL [(none)]> show drainer status;
+------------+---------------------+--------+--------------------+---------------------+
| NodeID     | Address             | State  | Max_Commit_Ts      | Update_Time         |
+------------+---------------------+--------+--------------------+---------------------+
| tidb1:8249 | 192.168.10.201:8249 | paused | 415576863372476417 | 2020-03-29 20:52:32 |
+------------+---------------------+--------+--------------------+---------------------+
1 row in set (0.00 sec)

MySQL [(none)]> show pump status;
+------------+---------------------+--------+--------------------+---------------------+
| NodeID     | Address             | State  | Max_Commit_Ts      | Update_Time         |
+------------+---------------------+--------+--------------------+---------------------+
| tidb2:8250 | 192.168.10.202:8250 | online | 415626432535330818 | 2020-03-29 21:36:57 |
| tidb3:8250 | 192.168.10.203:8250 | online | 415626432561545219 | 2020-03-29 21:36:57 |
| tidb4:8250 | 192.168.10.204:8250 | online | 415626433151369217 | 2020-03-29 21:36:58 |
+------------+---------------------+--------+--------------------+---------------------+
3 rows in set (0.00 sec)

415626433151369217 是由于这个事务造成的?怎么查看这个事务内容

  1. 当前drainer的报错是tcp链接问题,请检查下下游的kafka是否正常. producer/broker/0 state change to [closing] because write tcp 192.168.10.201:44572->192.168.10.201:9092: write: connection reset by peer\n"]

  2. drainer日志里是否有记录,你给的日志不全,没有这个信息

1赞

是的,kafka报异常了,清理数据后重启解决了

:+1: