DEADLINE_EXCEEDED exception when Flink CDC reads from TiDB

【TiDB Environment】Production
【TiDB Version】v5.4.0
【Reproduction Steps】After configuring tidb-cdc in Flink SQL, the job ran normally for a while, then hit a DEADLINE_EXCEEDED exception and synchronization stopped. Flink SQL configuration:
SET execution.checkpointing.interval = 3s;
SET table.exec.sink.not-null-enforcer=DROP;
SET execution.runtime-mode=streaming;
SET pipeline.name=jobName;

CREATE TABLE table_name (
  db_name STRING METADATA FROM 'database_name' VIRTUAL,
  table_name STRING METADATA FROM 'table_name' VIRTUAL,
  operation_ts TIMESTAMP_LTZ(3) METADATA FROM 'op_ts' VIRTUAL,
  id BIGINT,
  name VARCHAR,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'tidb-cdc',
  'tikv.grpc.timeout_in_ms' = '40000',
  'tikv.grpc.scan_timeout_in_ms' = '40000',
  'pd-addresses' = '10.xxx.xx.111:2370,10.xxx.xx.112:2370,10.xxx.xx.113:2370',
  'database-name' = 'db_name',
  'table-name' = 'table_name',
  'scan.startup.mode' = 'latest-offset'
);
【Problem: Symptoms and Impact】
The network is fine and the TiDB cluster is healthy; restarting the job and various parameter changes have had no effect so far.
【Copy-Paste of ERROR Logs】
2025-01-14 10:22:48,457 WARN org.tikv.common.region.StoreHealthyChecker - store [10.xxx.xx.113:20160] is not reachable
2025-01-14 10:22:48,462 WARN org.tikv.common.region.StoreHealthyChecker - store [10.xxx.xx.113:20160] is not reachable
2025-01-14 10:22:48,462 WARN org.tikv.common.PDClient - failed to get member from pd server.
org.tikv.shade.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 0.199976517s. [remote_addr=/10.xxx.xx.113:2370]
at org.tikv.shade.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:287) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.shade.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:268) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.shade.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:175) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.kvproto.PDGrpc$PDBlockingStub.getMembers(PDGrpc.java:1868) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.common.PDClient.getMembers(PDClient.java:443) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.common.PDClient.tryUpdateLeader(PDClient.java:565) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.common.PDClient.lambda$initCluster$15(PDClient.java:730) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_131]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_131]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_131]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
org.tikv.shade.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 0.199912183s. [buffered_nanos=200103136, waiting_for_connection]
at org.tikv.shade.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:287) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.shade.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:268) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.shade.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:175) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.kvproto.PDGrpc$PDBlockingStub.getMembers(PDGrpc.java:1868) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.common.PDClient.getMembers(PDClient.java:443) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.common.PDClient.tryUpdateLeader(PDClient.java:565) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at org.tikv.common.PDClient.lambda$initCluster$15(PDClient.java:730) ~[flink-sql-connector-tidb-cdc-3.1.1.jar:3.1.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_131]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_131]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_131]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
【Other Attachments: Screenshots / Logs / Monitoring】

2025-01-14 10:22:48,462 WARN org.tikv.common.PDClient - failed to get member from pd server.
org.tikv.shade.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 0.199976517s. [remote_addr=/10.xxx.xx.113:2370]

PD is deployed as a cluster, so this looks like a bug. Referring to your config:
'pd-addresses' = '10.xxx.xx.111:2370,10.xxx.xx.112:2370,10.xxx.xx.113:2370',

The Flink community will probably ask you to upgrade the CDC version, but I suspect that still won't solve your problem.

Ah, well…

TiCDC only received many of its fixes in 6.x; the 5.x line still has quite a few bugs.

As a workaround, you could use TiCDC to distribute the data to Kafka, then have Flink consume from Kafka. It's a bit more involved.
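A possible shape for that Kafka route, assuming a TiCDC changefeed is already writing change events to a topic in canal-json format (the topic name, broker addresses, and group id below are placeholders, not values from this thread):

```sql
-- Sketch only: replace the tidb-cdc source with a Kafka source fed by TiCDC.
-- Assumes a changefeed emits canal-json to 'tidb-cdc-topic'; all addresses are placeholders.
CREATE TABLE table_name_from_kafka (
  id BIGINT,
  name VARCHAR,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'kafka',
  'topic' = 'tidb-cdc-topic',
  'properties.bootstrap.servers' = 'kafka1:9092,kafka2:9092',
  'properties.group.id' = 'flink-consumer',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'canal-json'
);
```

This decouples Flink from the TiKV/PD gRPC path entirely: if PD or TiKV is briefly unreachable, TiCDC absorbs the retry, and Flink only ever talks to Kafka.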

See whether that helps in your case.

Understood, thank you. But right now we have no good way to sync TiDB data: when I tested TiCDC on 5.x it also froze as tasks were added, and our DBAs won't upgrade the cluster. I'm wondering how the tikv client approach would fare.

Oh, you could try the binlog component (TiDB Binlog). It is partially compatible with 5.x; since binlog data is captured on the tidb nodes, it adds some load there.

If you don't upgrade, that is a viable option to consider.

I still think upgrading is the better choice; I also ran into some strange issues on 5.x.

When Flink pulls data from TiDB in any non-latest startup mode, it is quite memory-hungry. Increasing the memory made the problem go away for me.
I configured 30 GB of memory for my ingestion job.

If you can upgrade, upgrade.

Troubleshooting Suggestions
1. Check the TiDB cluster status
Use Grafana to monitor TiKV node CPU/memory, Region distribution, leader-transfer frequency, and PD TSO allocation latency to confirm the cluster is healthy.
2. Tune the Flink CDC configuration
Increase the grpc.timeout and connect.timeout parameters, enable a retry policy (e.g., retry.max-attempts), and set a reasonable checkpoint interval to relieve backpressure.
3. Analyze logs to locate the bottleneck
Collect Flink TaskManager logs (watch for gRPC exception stacks) and TiCDC node logs (check GC state and replication lag), and combine them with the TiDB slow-query log (tidb_slow_query.log) to find hot operations.
4. Run isolated tests and stress verification
Simulate the same data volume and concurrency in a test environment, adjusting resources and timeout parameters step by step to see whether the problem reproduces and to find the best configuration.
5. Upgrading the version is also an option.
6. Flink pulling data from TiDB is memory-intensive; try increasing memory and observing.
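As a concrete starting point for suggestion 2, the checkpoint interval and the tikv client timeouts can be raised in the SQL session and the connector's WITH clause. The values below are illustrative guesses to tune against, not verified fixes:

```sql
-- Illustrative values only: widen the checkpoint interval and the tikv gRPC timeouts.
SET execution.checkpointing.interval = 10s;

CREATE TABLE table_name (
  id BIGINT,
  name VARCHAR,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'tidb-cdc',
  -- raised from the 40000 ms already tried in this thread; a guess, not a verified fix
  'tikv.grpc.timeout_in_ms' = '60000',
  'tikv.grpc.scan_timeout_in_ms' = '60000',
  'pd-addresses' = '10.xxx.xx.111:2370,10.xxx.xx.112:2370,10.xxx.xx.113:2370',
  'database-name' = 'db_name',
  'table-name' = 'table_name',
  'scan.startup.mode' = 'latest-offset'
);
```

Note that the original error fired after roughly 0.2 s (`deadline exceeded after 0.199976517s`), far below the configured 40 s, which suggests the short deadline comes from an internal PD member-probe call rather than from these options; raising them may therefore help the scan path but not the PD health check.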