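After a cluster machine crashed and was force-removed, the CDC changefeed state is normal but checkpoint_time does not advance or advances very slowly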

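【TiDB environment】Production
【TiDB version】5.1
【Operating system】CentOS 7.9
【Reproduction path】A PD machine in the cluster went down unexpectedly, and the faulty machine was force-removed:
tiup cluster scale-in Acluster -N "xx.23:2379,xx.23:8300" --force
【Problem: symptoms and impact】
Symptoms:
1) The CDC replication state is normal, but checkpoint_time does not advance
2) The number of open file handles exceeds 200,000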
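
To narrow down the second symptom, it can help to see what kind of handles are piling up. Below is a minimal sketch, assuming a Linux host where `/proc/<pid>/fd` is readable and assuming you already know the PID of the process holding the handles (the post does not say which process that is). The function name `fd_summary` is made up for this illustration.

```python
import os
from collections import Counter


def fd_summary(pid):
    """Count the open file descriptors of a process, grouped by rough kind.

    Linux only: reads the symlinks under /proc/<pid>/fd.
    """
    fd_dir = f"/proc/{pid}/fd"
    kinds = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if target.startswith("socket:"):
            kinds["socket"] += 1
        elif target.startswith(("pipe:", "anon_inode:")):
            kinds["pipe/anon"] += 1
        else:
            kinds["file"] += 1
    return kinds


if __name__ == "__main__":
    # Demonstration on the current process; replace with the suspect PID.
    print(fd_summary(os.getpid()))
```

If most of the descriptors turn out to be sockets, that would be consistent with connections being opened repeatedly and not released, but this sketch alone cannot confirm the cause.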
【Logs】
There are many log messages, and I am not sure which ones are the key ones.
error="[CDC:ErrPendingRegionCancel]pending region cancelled due to stream disconnecting
【Other attachments: screenshots/logs/monitoring】

What is the cause, and how can it be resolved?

[2025/03/12 20:41:25.722 +08:00] [INFO] [region_cache.go:961] [“switch region peer to next due to NotLeader with NULL leader”] [currIdx=2] [regionID=306397639]
[2025/03/12 20:41:25.723 +08:00] [INFO] [region_range_lock.go:222] [“range locked”] [changefeed=jxs-to-jxsclusterlf] [lockID=1149] [regionID=306397639] [startKey=7480000000000058ffd75f728000000000ff2a6d270000000000fa] [endKey=7480000000000058ffd75f728000000000ff2a6d2b0000000000fa] [checkpointTs=456596116058669078]
[2025/03/12 20:41:25.723 +08:00] [INFO] [client.go:825] [“start new request”] [changefeed=jxs-to-jxsclusterlf] [request=“{"header":{"cluster_id":6868532206274776316,"ticdc_version":"5.1.4"},"region_id":306397639,"region_epoch":{"conf_ver":118280,"version":6167},"checkpoint_ts":456596116058669078,"start_key":"dIAAAAAAAFj/119ygAAAAAD/Km0nAAAAAAD6","end_key":"dIAAAAAAAFj/119ygAAAAAD/Km0rAAAAAAD6","request_id":92202,"extra_op":1,"Request":null}”] [addr=1x.x.38.48:20160]
[2025/03/12 20:41:25.758 +08:00] [INFO] [client.go:1158] [“stream to store closed”] [changefeed=jxs-to-jxsclusterlf] [addr=xx.xx.128.23:20162] [storeID=261261639]
[2025/03/12 20:41:25.758 +08:00] [INFO] [region_range_lock.go:383] [“unlocked range”] [changefeed=jxs-to-jxsclusterlf] [lockID=1149] [regionID=304490790] [startKey=7480000000000058ffd75f728000000000ff02c5810000000000fa] [endKey=7480000000000058ffd75f728000000000ff0f76f50000000000fa] [checkpointTs=456596116058669078]
[2025/03/12 20:41:25.758 +08:00] [INFO] [region_cache.go:711] [“mark store’s regions need be refill”] [store=xx.xx.128.23:20162]
[2025/03/12 20:41:25.758 +08:00] [INFO] [region_cache.go:736] [“switch region peer to next due to send request fail”] [current=“region ID: 304490790, meta: id:304490790 start_key:"t\200\000\000\000\000\000X\377\327_r\200\000\000\000\000\377\002\305\201\000\000\000\000\000\372" end_key:"t\200\000\000\000\000\000X\377\327_r\200\000\000\000\000\377\017v\365\000\000\000\000\000\372" region_epoch:<conf_ver:112214 version:6642 > peers:<id:306488684 store_id:261261639 > peers:<id:306602928 store_id:108517384 > peers:<id:306617335 store_id:25913577 > , peer: id:306488684 store_id:261261639 , addr: xx.xx.128.23:20162, idx: 0, reqStoreType: TiKvOnly, runStoreType: tikv”] [needReload=false] [error=“[CDC:ErrPendingRegionCancel]pending region cancelled due to stream disconnecting”]
[2025/03/12 20:41:25.758 +08:00] [INFO] [region_range_lock.go:222] [“range locked”] [changefeed=jxs-to-jxsclusterlf] [lockID=1149] [regionID=304490790] [startKey=7480000000000058ffd75f728000000000ff02c5810000000000fa] [endKey=7480000000000058ffd75f728000000000ff0f76f50000000000fa] [checkpointTs=456596116058669078]
[2025/03/12 20:41:25.758 +08:00] [INFO] [client.go:779] [“creating new stream to store to send request”] [changefeed=jxs-to-jxsclusterlf] [regionID=304490790] [requestID=92203] [storeID=108517384] [addr=1x.x.234.99:20161]
[2025/03/12 20:41:25.759 +08:00] [INFO] [client.go:825] [“start new request”] [changefeed=jxs-to-jxsclusterlf] [request=“{"header":{"cluster_id":6868532206274776316,"ticdc_version":"5.1.4"},"region_id":304490790,"region_epoch":{"conf_ver":112214,"version":6642},"checkpoint_ts":456596116058669078,"start_key":"dIAAAAAAAFj/119ygAAAAAD/AsWBAAAAAAD6","end_key":"dIAAAAAAAFj/119ygAAAAAD/D3b1AAAAAAD6","request_id":92203,"extra_op":1,"Request":null}”] [addr=1x.x.234.99:20161]
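
The `checkpointTs` values in these logs are TSO timestamps. In the TSO layout, the high bits hold a physical time in milliseconds since the Unix epoch and the low 18 bits hold a logical counter, so shifting right by 18 gives a wall-clock time that can be compared with the log timestamp. A minimal sketch of that decoding follows. The helper names are made up for this illustration, and the +08:00 offset is taken from the log lines.

```python
from datetime import datetime, timedelta, timezone

LOGICAL_BITS = 18  # the low 18 bits of a TSO are the logical counter
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tso_to_datetime(tso, tz=timezone.utc):
    """Convert a TSO value to the wall-clock time of its physical part."""
    physical_ms = tso >> LOGICAL_BITS
    return (EPOCH + timedelta(milliseconds=physical_ms)).astimezone(tz)


def checkpoint_lag(tso, log_time):
    """Return how far the checkpoint is behind the time the log line was written."""
    return log_time - tso_to_datetime(tso, log_time.tzinfo)


if __name__ == "__main__":
    cst = timezone(timedelta(hours=8))
    # Values copied from the first "range locked" log line above.
    log_time = datetime(2025, 3, 12, 20, 41, 25, 723000, tzinfo=cst)
    ts = 456596116058669078
    print(tso_to_datetime(ts, cst), checkpoint_lag(ts, log_time))
```

Decoded this way, the `checkpointTs` in the 20:41 log lines is roughly two hours behind the log time, which fits the reported symptom of a checkpoint that is not keeping up.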

==== Later ===
[2025/03/12 23:29:16.806 +08:00] [INFO] [region_range_lock.go:383] [“unlocked range”] [changefeed=jxs-to-jxsclusterlf] [lockID=2291] [regionID=304973130] [startKey=7480000000000058ffd75f728000000000ff2a6d2b0000000000fa] [endKey=7480000000000058ffd75f728000000000ff35c3410000000000fa] [checkpointTs=456600677560877060]
[2025/03/12 23:29:16.806 +08:00] [INFO] [region_cache.go:974] [“switch region leader to specific leader due to kv return NotLeader”] [regionID=304973130] [currIdx=0] [leaderStoreID=292354107]
[2025/03/12 23:29:16.806 +08:00] [INFO] [region_range_lock.go:222] [“range locked”] [changefeed=jxs-to-jxsclusterlf] [lockID=2291] [regionID=304973130] [startKey=7480000000000058ffd75f728000000000ff2a6d2b0000000000fa] [endKey=7480000000000058ffd75f728000000000ff35c3410000000000fa] [checkpointTs=456600677560877060]
[2025/03/12 23:29:16.806 +08:00] [INFO] [client.go:825] [“start new request”] [changefeed=jxs-to-jxsclusterlf] [request=“{"header":{"cluster_id":6868532206274776316,"ticdc_version":"5.1.4"},"region_id":304973130,"region_epoch":{"conf_ver":118369,"version":6167},"checkpoint_ts":456600677560877060,"start_key":"dIAAAAAAAFj/119ygAAAAAD/Km0rAAAAAAD6","end_key":"dIAAAAAAAFj/119ygAAAAAD/NcNBAAAAAAD6","request_id":316108,"extra_op":1,"Request":null}”] [addr=10.x.128.166:20162]
[2025/03/12 23:29:16.896 +08:00] [INFO] [client.go:1158] [“stream to store closed”] [changefeed=jxs-to-jxsclusterlf] [addr=x.x.234.98:20160] [storeID=103584169]
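
With this much log output, one way to find the key lines is to count how often the stream to each store is closed. A store that dominates the count, such as one on the removed machine, is a natural place to look first. The sketch below assumes the field layout seen in the excerpts above (`[addr=...]` followed by `[storeID=...]`). The function name and the log path in the usage note are made up for this illustration.

```python
import re
from collections import Counter

# Matches lines like:
# ... ["stream to store closed"] [changefeed=...] [addr=host:port] [storeID=123]
CLOSED_RE = re.compile(
    r"stream to store closed.*?\[addr=([^\]]+)\]\s*\[storeID=(\d+)\]"
)


def count_closed_streams(lines):
    """Count 'stream to store closed' events per (address, store ID)."""
    counts = Counter()
    for line in lines:
        match = CLOSED_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '[2025/03/12 20:41:25.758 +08:00] [INFO] [client.go:1158] '
        '["stream to store closed"] [changefeed=jxs-to-jxsclusterlf] '
        '[addr=xx.xx.128.23:20162] [storeID=261261639]',
    ]
    for (addr, store_id), n in count_closed_streams(sample).most_common():
        print(addr, store_id, n)
```

To run it over a real log, pass an open file object, for example `count_closed_streams(open("cdc.log", encoding="utf-8"))`, where `cdc.log` stands in for the actual TiCDC log path.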

v5.1.4 is too old; try upgrading.
If you use CDC, at least version 6.5.x is recommended. The older versions have too many bugs.