Flink CDC throws DEADLINE_EXCEEDED when reading from TiDB

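[TiDB environment] Production
[TiDB version] v5.4.0
[Steps to reproduce] After configuring tidb-cdc in Flink SQL, the job runs normally for a while, then hits a DEADLINE_EXCEEDED exception and synchronization stops. Flink SQL configuration: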
SET execution.checkpointing.interval = 3s;
SET table.exec.sink.not-null-enforcer=DROP;
SET execution.runtime-mode=streaming;
SET pipeline.name=jobName;

CREATE TABLE table_name (
  db_name STRING METADATA FROM 'database_name' VIRTUAL,
  table_name STRING METADATA FROM 'table_name' VIRTUAL,
  operation_ts TIMESTAMP_LTZ(3) METADATA FROM 'op_ts' VIRTUAL,
  id bigint,
  name varchar,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'tidb-cdc',
  'tikv.grpc.timeout_in_ms' = '40000',
  'tikv.grpc.scan_timeout_in_ms' = '40000',
  'pd-addresses' = '10.xxx.xx.111:2370,10.xxx.xx.112:2370,10.xxx.xx.113:2370',
  'database-name' = 'db_name',
  'table-name' = 'table_name',
  'scan.startup.mode' = 'latest-offset'
);
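[Problem: symptoms and impact]
The network is fine and the TiDB cluster is healthy. So far, neither restarting the job nor changing various parameters has had any effect.
[ERROR log, copied and pasted]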
2025-01-14 10:22:48,457 WARN org.tikv.common.region.StoreHealthyChecker - store [10.xxx.xx.113:20160] is not reachable
2025-01-14 10:22:48,462 WARN org.tikv.common.region.StoreHealthyChecker - store [10.xxx.xx.113:20160] is not reachable
2025-01-14 10:22:48,462 WARN org.tikv.common.PDClient - failed to get member from pd server.
org.tikv.shade.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 0.199976517s. [remote_addr=/10.xxx.xx.113:2370]
at org.tikv.shade.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:287) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.shade.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:268) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.shade.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:175) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.kvproto.PDGrpc$PDBlockingStub.getMembers(PDGrpc.java:1868) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.common.PDClient.getMembers(PDClient.java:443) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.common.PDClient.tryUpdateLeader(PDClient.java:565) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.common.PDClient.lambda$initCluster$15(PDClient.java:730) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_131]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_131]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_131]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
org.tikv.shade.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 0.199912183s. [buffered_nanos=200103136, waiting_for_connection]
at org.tikv.shade.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:287) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.shade.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:268) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.shade.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:175) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.kvproto.PDGrpc$PDBlockingStub.getMembers(PDGrpc.java:1868) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.common.PDClient.getMembers(PDClient.java:443) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.common.PDClient.tryUpdateLeader(PDClient.java:565) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.common.PDClient.lambda$initCluster$15(PDClient.java:730) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_131]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_131]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_131]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
[Other attachments: screenshots / logs / monitoring]
image

2025-01-14 10:22:48,462 WARN org.tikv.common.PDClient - failed to get member from pd server.
org.tikv.shade.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 0.199976517s. [remote_addr=/10.xxx.xx.113:2370]

PD is deployed as a cluster, so this looks like a bug to me. For reference, your configuration:
'pd-addresses' = '10.xxx.xx.111:2370,10.xxx.xx.112:2370,10.xxx.xx.113:2370',

The Flink community will probably tell you to upgrade the CDC version, but I doubt that would solve your problem.

Oh, well...

Many TiCDC fixes only landed in 6.x; 5.x still has quite a few bugs.

To work around the problem, consider using TiCDC to publish the data to Kafka and then having Flink consume it from Kafka. It takes a bit more setup.
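
As a rough illustration of the Flink side of that pipeline, here is a minimal sketch of a Kafka source table. It assumes the TiCDC changefeed writes to Kafka in the canal-json protocol, which Flink can read with its `canal-json` format. The topic name, broker addresses, and consumer group below are hypothetical placeholders, not values from this thread.

```sql
-- Sketch only: assumes TiCDC publishes table_name changes to Kafka as canal-json.
-- Topic, bootstrap servers, and group id are hypothetical placeholders.
CREATE TABLE table_name_from_kafka (
  id BIGINT,
  name STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'kafka',
  'topic' = 'tidb_table_name_changes',
  'properties.bootstrap.servers' = 'kafka-host-1:9092,kafka-host-2:9092',
  'properties.group.id' = 'flink-tidb-sync',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'canal-json'
);
```

With this layout Flink no longer talks to PD or TiKV directly, so the gRPC deadline seen in the log above would not be on the Flink job's path. Whether TiCDC on 5.x is stable enough for the workload is a separate question.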

See whether that helps in your case.

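Understood, thank you. But right now there is no good way to sync TiDB data. When I tested TiCDC on 5.x, it also hung as the number of tasks grew. The DBAs will not upgrade the version, and I'm not sure how well the TiKV client approach would work.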

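Oh, you could try the binlog component. It is partially compatible with 5.x. With binlog, the data is captured on the TiDB side, so it puts some extra load on the TiDB nodes.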

If you are not going to upgrade, this is an option worth considering.