TiDB 5.4.X: CDC changefeed error: TiCDC cannot deliver messages when the `replication-factor` is less than `min.insync.replicas`

[TiDB Environment] Test
[TiDB Version] 5.4.0 - 5.4.3
[Problem Encountered] TiCDC cannot deliver messages when the replication-factor is less than min.insync.replicas
[Symptoms and Impact]
1. Symptoms
After upgrading from 4.0.15 to 5.4.3, CDC changefeeds that had been running normally started to fail.
Changefeed error:
"message": "[CDC:ErrKafkaNewSaramaProducer]new sarama producer: [CDC:ErrKafkaInvalidConfig]because TiCDC Kafka producer's request.required.acks defaults to -1, TiCDC cannot deliver messages when the replication-factor is less than min.insync.replicas: replication-factor cannot be smaller than the min.insync.replicas of topic"

cdc log error:
[ERROR] [changefeed.go:119] ["an error occurred in Owner"] [changefeed=testcdc0-testcdc-t5] [error="[CDC:ErrKafkaNewSaramaProducer]new sarama producer: [CDC:ErrKafkaInvalidConfig]because TiCDC Kafka producer's request.required.acks defaults to -1, TiCDC cannot deliver messages when the replication-factor is less than min.insync.replicas: replication-factor cannot be smaller than the min.insync.replicas of topic"]
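The error names two topic-level settings, and both can be read back from Kafka as a first check. A sketch using the Kafka 2.4 CLI that appears later in this thread (XXX is a placeholder broker address, test1 a placeholder topic):

# Show partition count, replication factor, and per-topic overrides
kafka-topics.sh --describe --bootstrap-server XXX:9092 --topic test1

# Show only the topic-level config overrides, e.g. min.insync.replicas
kafka-configs.sh --describe --bootstrap-server XXX:9092 --entity-type topics --entity-name test1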

2. Detailed testing
The message says that when request.required.acks is set to -1, the Kafka parameter replication-factor must not be smaller than min.insync.replicas. Actual testing, however, produced the following results (a reproduction sketch follows the list):

With min.insync.replicas = 1, replication works with replication-factor set to 1, 2, or 3.
With min.insync.replicas = 2, replication fails with replication-factor set to 1, 2, or 3.
With min.insync.replicas = 3, replication fails with replication-factor set to 1, 2, or 3.
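Each combination above can be reproduced by adjusting the topic-level override before retesting. A sketch with the same Kafka 2.4 CLI (broker address and topic names are placeholders; replication-factor is fixed at creation time, so each value under test needs its own topic):

# Change min.insync.replicas on an existing topic
kafka-configs.sh --alter --bootstrap-server XXX:9092 --entity-type topics --entity-name test1 --add-config min.insync.replicas=2

# Create a separate topic for each replication-factor under test
kafka-topics.sh --create --zookeeper XXX:2181 --replication-factor 2 --partitions 3 --topic test-rf2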

The real problem here: with min.insync.replicas at 2 and replication-factor at 3, every changefeed fails, whether newly created or pre-existing. Production Kafka clusters very commonly use exactly this configuration, so this error has a major impact on CDC.

What I also don't understand: with min.insync.replicas at 2 and replication-factor at 2, the changefeed still fails, yet with min.insync.replicas at 1 and replication-factor at 1 it works fine.

3. Affected versions
In my tests, the error appears both after upgrading to 5.4.X and on fresh 5.4.X installs.
Versions 4.0.X, and 5.0.X through 5.3.X, do not have this problem.


My guess:
In cdc 4.x, the cdc producer's request.required.acks defaulted to 1,
while cdc 5.4.x changed the producer default to -1, meaning a send only counts as successful once all followers have acked.
Also, min.insync.replicas only takes effect when request.required.acks = -1.
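The acks semantics can be observed directly with the console producer. A sketch, assuming a 3-broker cluster with one broker stopped so the topic's ISR drops below min.insync.replicas=2 (Kafka 2.4 flags; XXX is a placeholder):

# With acks=all (i.e. -1), sends fail with NOT_ENOUGH_REPLICAS once ISR < min.insync.replicas
kafka-console-producer.sh --broker-list XXX:9092 --topic test1 --producer-property acks=all

# With acks=1, only the leader must ack, so the same sends succeed
kafka-console-producer.sh --broker-list XXX:9092 --topic test1 --producer-property acks=1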

This doesn't look expected. What is your cluster's default replication-factor? What does the sink-uri of your changefeed look like? TiCDC only errors when replication-factor < min.insync.replicas; equal to or greater than it will not error. Is your topic newly created or pre-existing, and if pre-existing, what replication-factor was it created with?

The topic was created manually in advance:
kafka-topics.sh --create --zookeeper XXX --replication-factor 3 --partitions 3 --topic test1

The sink-uri looks like this:
/usr/bin/cdc cli changefeed create --pd=XXX --start-ts=XXX --sink-uri="kafka://XXX/test1?message.max.bytes=2147483648?partition-num=3

Try adding replication-factor=3 to the sink-uri.

I changed the sink-uri as follows, but it still errors:
/usr/bin/cdc cli changefeed create --pd=XXX --start-ts=XXX --sink-uri="kafka://XXX/test1?message.max.bytes=2147483648?partition-num=3?replication-factor=3

That's very odd. Can you paste the complete creation process and the error log after the change? Also query your topic's parameters while you're at it.

1. Create the Kafka topic:
/data/kafka/kafka_2.12-2.4.1/bin/kafka-topics.sh --create --zookeeper XXX:2181 --replication-factor 3 --partitions 3 --config min.insync.replicas=2 --topic test3

2. Inspect the topic:
/data/kafka/kafka_2.12-2.4.1/bin/kafka-topics.sh --describe --bootstrap-server XXX:9092 --topic test3

Topic: test3 PartitionCount: 3 ReplicationFactor: 3 Configs: min.insync.replicas=2,segment.bytes=1073741824
Topic: test3 Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
Topic: test3 Partition: 1 Leader: 2 Replicas: 2,0,1 Isr: 2,0,1
Topic: test3 Partition: 2 Leader: 0 Replicas: 0,1,2 Isr: 0,1,2

3. Create the cdc changefeed:
/usr/bin/cdc cli changefeed create --pd=XXX --start-ts=436935319064936449 --sink-uri="kafka://XXX:9092/test3?message.max.bytes=2147483648?partition-num=3?replication-factor=3" --changefeed-id="testcdc0" --config=/home/tidb/testcdc_yaml/testcdc0_testcdc_t0.yaml

4. Configuration file:
case-sensitive = true
enable-old-value = true

[filter]
rules = [
"testcdc0.testcdc_t0"
]

[mounter]
worker-num = 8

[sink]
dispatchers = [
{matcher = [
"testcdc0.testcdc_t0"
], dispatcher = "rowid"},
]
protocol = "canal-json"

[cyclic-replication]
enable = false
replica-id = 1

5. Creation output:
Create changefeed successfully!
ID: testcdc0
Info: {"sink-uri":"kafka://XXX:9092/test3?message.max.bytes=2147483648?partition-num=3?replication-factor=3","opts":{},"create-time":"2022-10-26T17:19:45.759358278+08:00","start-ts":436935319064936449,"target-ts":0,"admin-job-type":0,"sort-engine":"unified","sort-dir":"","config":{"case-sensitive":true,"enable-old-value":true,"force-replicate":false,"check-gc-safe-point":true,"filter":{"rules":["testcdc0.testcdc_t0"],"ignore-txn-start-ts":null},"mounter":{"worker-num":8},"sink":{"dispatchers":[{"matcher":["testcdc0.testcdc_t0"],"dispatcher":"rowid"}],"protocol":"canal-json"},"cyclic-replication":{"enable":false,"replica-id":1,"filter-replica-ids":null,"id-buckets":0,"sync-ddl":false},"scheduler":{"type":"table-number","polling-time":-1}},"state":"normal","history":null,"error":null,"sync-point-enabled":false,"sync-point-interval":600000000000,"creator-version":"v4.0.16"}

6. Changefeed error status:
cdc cli changefeed query -s --pd=http://XXX:2379 --changefeed-id=testcdc0
{
"state": "error",
"tso": 436935319064936449,
"checkpoint": "2022-10-26 17:19:26.892",
"error": {
"addr": "172.16.72.22:8300",
"code": "CDC:ErrKafkaNewSaramaProducer",
"message": "[CDC:ErrKafkaNewSaramaProducer]new sarama producer: [CDC:ErrKafkaInvalidConfig]because TiCDC Kafka producer's request.required.acks defaults to -1, TiCDC cannot deliver messages when the replication-factor is less than min.insync.replicas: replication-factor cannot be smaller than the min.insync.replicas of topic"
}
}

7. Log errors:
[2022/10/26 17:20:16.193 +08:00] [ERROR] [kafka.go:571] ["replication-factor cannot be smaller than the min.insync.replicas of topic"] [replicationFactor=1] [minInsyncReplicas=2]
[2022/10/26 17:20:16.581 +08:00] [ERROR] [changefeed.go:119] ["an error occurred in Owner"] [changefeed=testcdc0] [error="[CDC:ErrKafkaNewSaramaProducer]new sarama producer: [CDC:ErrKafkaInvalidConfig]because TiCDC Kafka producer's request.required.acks defaults to -1, TiCDC cannot deliver messages when the replication-factor is less than min.insync.replicas: replication-factor cannot be smaller than the min.insync.replicas of topic"]
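The first line is the telling one: TiCDC's pre-flight check logs the replicationFactor and minInsyncReplicas values it actually uses. On another deployment the same line can be located by searching the owner's log (the log path below is illustrative; adjust it for your deployment):

grep "replication-factor cannot be smaller" /tidb-deploy/cdc-8300/log/cdc.log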

I noticed the log reports [replicationFactor=1] [minInsyncReplicas=2], even though the topic's replication factor was clearly set to 3.
That must be the root cause: the topic was created with a replication factor of 3, yet cdc somehow believes replicationFactor=1.

This would also explain why the changefeed fails with min.insync.replicas=2 / replication-factor=2, yet works with min.insync.replicas=1 / replication-factor=1:
presumably cdc always treats replicationFactor as 1.

Your sink-uri is written incorrectly. Query parameters are passed as xxx?a1=1&a2=2&a3=3: only the first ? starts the query string, and subsequent parameters are separated by &. With ? used as the separator, everything after the first ? is parsed as part of the value of message.max.bytes, so replication-factor is never read and TiCDC falls back to its default of 1, which matches the [replicationFactor=1] in your log.
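For reference, a sketch of the step-3 create command with correctly separated parameters (addresses and paths as in the earlier posts), plus how an existing changefeed's sink-uri can be fixed in place with pause/update/resume:

/usr/bin/cdc cli changefeed create --pd=XXX --start-ts=436935319064936449 --sink-uri="kafka://XXX:9092/test3?message.max.bytes=2147483648&partition-num=3&replication-factor=3" --changefeed-id="testcdc0" --config=/home/tidb/testcdc_yaml/testcdc0_testcdc_t0.yaml

# Fix an existing changefeed without recreating it (update requires a paused changefeed)
cdc cli changefeed pause --pd=http://XXX:2379 --changefeed-id=testcdc0
cdc cli changefeed update --pd=http://XXX:2379 --changefeed-id=testcdc0 --sink-uri="kafka://XXX:9092/test3?message.max.bytes=2147483648&partition-num=3&replication-factor=3"
cdc cli changefeed resume --pd=http://XXX:2379 --changefeed-id=testcdc0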
