TiCDC: creating a new changefeed always fails with an etcd timeout

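【TiDB environment】Production / Test / PoC
Production environment. TiDB is deployed on k8s. The TiDB instance has roughly 5,000 to 10,000 tables, but only about 150 of them are configured for TiCDC replication. The total data volume is small, under 10 GB.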

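The whole environment is still under test, so the tables configured for TiCDC carry almost no traffic, and each table holds only a few hundred thousand rows. DDL operations are frequent across the environment, and truncate table runs often.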

【TiDB version】
TiDB 5.4
【Reproduction path】What operations led to the problem
Creating a changefeed through the TiCDC OpenAPI fails, always with an etcd timeout.

【Problem encountered: symptoms and impact】
curl -X POST http://127.0.0.1:8301/api/v1/changefeeds -d '{"changefeed_id":"k1","sink_uri":"kafka://broker-kafka-test-az1-0.jvessel-open-hb.jdcloud.com:9092/tidb_version_test?protocol=canal-json&kafka-version=2.4.0&max-message-bytes=1073741824", "filter_rules":["test.test1"]}'

It returns: [CDC:ErrPDEtcdAPIError]etcd api call error: context deadline exceeded

This is not intermittent; it happens almost 100% of the time. Whenever the error occurs, the HTTP request returns after about 12~14 seconds.
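To rule out quoting problems in the curl body and to measure the response time precisely, the same request can be sent from a small script. This is only a sketch: the endpoint and the three field names come from the curl command above, while the `Content-Type` header and the helper names `build_changefeed_request` and `timed_send` are assumptions introduced for illustration.

```python
import json
import time
import urllib.error
import urllib.request

# Endpoint and field names are taken from the curl command in the post.
CDC_API = "http://127.0.0.1:8301/api/v1/changefeeds"


def build_changefeed_request(changefeed_id, sink_uri, filter_rules, url=CDC_API):
    """Build the POST request with a JSON body that uses plain ASCII quotes."""
    body = json.dumps({
        "changefeed_id": changefeed_id,
        "sink_uri": sink_uri,
        "filter_rules": filter_rules,
    }).encode("utf-8")
    # The Content-Type header is an assumption; the original curl call did not set it.
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def timed_send(request, timeout=30.0):
    """Send the request and return (status, response_text, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            status, text = resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # HTTP error responses still carry a body, which is where the error text would be.
        status, text = err.code, err.read().decode("utf-8", "replace")
    return status, text, time.monotonic() - start
```

Run on the TiCDC host, `timed_send(build_changefeed_request("k1", sink_uri, ["test.test1"]))` should show whether the failure and the 12~14 second delay are independent of how curl quoted the body. Connection-level failures (refused, DNS, client timeout) are deliberately left to raise.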

Meanwhile, creating the changefeed through cdc cli works without any problem.

【Resource configuration】
TiCDC is configured with 8 CPU cores and 16 GB of memory;

/cdc server --addr=0.0.0.0:8301 --advertise-addr=tidb-test-ticdc-0tidb-test-ticdc-peer.tidb-test.svc:8301 --gc-ttl=86400 --log-file=/tmp/cdc_data/log/cdc.log --log-level=info --pd=http://tidb-test-pd:2379

【Attachments: screenshots / logs / monitoring】

When the creation fails, the log shows:
[2022/12/28 09:08:12.272 +00:00] [ERROR] [client.go:502] ["[pd] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2022/12/28 09:08:12.272 +00:00] [ERROR] [client.go:786] ["[pd] getTS error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]rpc error: code = Canceled desc = context canceled: rpc error: code = Canceled desc = context canceled"]
[2022/12/28 09:08:12.272 +00:00] [INFO] [client.go:730] ["[pd] tso stream is not ready"] [dc=global]
[2022/12/28 09:08:12.272 +00:00] [INFO] [acquirer.go:71] ["get time from pd failed, retry later"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(
[2022/12/28 09:08:16.273 +00:00] [ERROR] [client.go:502] ["[pd] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2022/12/28 09:08:16.274 +00:00] [ERROR] [client.go:786] ["[pd] getTS error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]rpc error: code = Canceled desc = context canceled: rpc error: code = Canceled desc = context canceled"]
[2022/12/28 09:08:16.274 +00:00] [INFO] [client.go:730] ["[pd] tso stream is not ready"] [dc=global]
[2022/12/28 09:08:16.274 +00:00] [INFO] [acquirer.go:71] ["get time from pd failed, retry later"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(
[2022/12/28 09:08:20.274 +00:00] [ERROR] [client.go:502] ["[pd] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2022/12/28 09:08:20.274 +00:00] [ERROR] [client.go:786] ["[pd] getTS error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]rpc error: code = Canceled desc = context canceled: rpc error: code = Canceled desc = context canceled"]
[2022/12/28 09:08:20.274 +00:00] [INFO] [client.go:730] ["[pd] tso stream is not ready"] [dc=global]
[2022/12/28 09:08:20.274 +00:00] [INFO] [acquirer.go:71] ["get time from pd failed, retry later"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(
[2022/12/28 09:08:24.275 +00:00] [ERROR] [client.go:502] ["[pd] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2022/12/28 09:08:24.276 +00:00] [ERROR] [client.go:786] ["[pd] getTS error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]rpc error: code = Canceled desc = context canceled: rpc error: code = Canceled desc = context canceled"]
[2022/12/28 09:08:24.276 +00:00] [INFO] [client.go:730] ["[pd] tso stream is not ready"] [dc=global]
[2022/12/28 09:08:24.276 +00:00] [INFO] [acquirer.go:71] ["get time from pd failed, retry later"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(
[2022/12/28 09:08:28.277 +00:00] [ERROR] [client.go:502] ["[pd] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2022/12/28 09:08:28.277 +00:00] [ERROR] [client.go:786] ["[pd] getTS error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]rpc error: code = Canceled desc = context canceled: rpc error: code = Canceled desc = context canceled"]
[2022/12/28 09:08:28.277 +00:00] [INFO] [client.go:730] ["[pd] tso stream is not ready"] [dc=global]
[2022/12/28 09:08:28.277 +00:00] [INFO] [acquirer.go:71] ["get time from pd failed, retry later"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(
[2022/12/28 09:08:32.279 +00:00] [ERROR] [client.go:502] ["[pd] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2022/12/28 09:08:32.279 +00:00] [ERROR] [client.go:786] ["[pd] getTS error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]rpc error: code = Canceled desc = context canceled: rpc error: code = Canceled desc = context canceled"]
[2022/12/28 09:08:32.279 +00:00] [INFO] [client.go:730] ["[pd] tso stream is not ready"] [dc=global]
[2022/12/28 09:08:32.279 +00:00] [INFO] [acquirer.go:71] ["get time from pd failed, retry later"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(
[2022/12/28 09:08:36.281 +00:00] [ERROR] [client.go:502] ["[pd] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2022/12/28 09:08:36.281 +00:00] [ERROR] [client.go:786] ["[pd] getTS error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]rpc error: code = Canceled desc = context canceled: rpc error: code = Canceled desc = context canceled"]
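
The spacing of the TSO timeout errors in this excerpt can be checked with a short script. The sketch below parses the timestamps of the `ErrClientGetTSOTimeout` lines and prints the gap between consecutive ones; the helper name `tso_timeout_gaps` is made up for illustration, and the sample is the first four timeout lines from the log above. For these lines the gaps come out at roughly 4 seconds each. One possible reading, which the post does not confirm, is that about three such attempts would add up to the 12~14 second response time seen on the HTTP side.

```python
import re
from datetime import datetime

# Matches the PD client's TSO timeout lines and captures the timestamp.
TSO_TIMEOUT = re.compile(
    r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) [+-]\d{2}:\d{2}\] "
    r"\[ERROR\] .*ErrClientGetTSOTimeout"
)


def tso_timeout_gaps(log_text):
    """Return the seconds between consecutive TSO timeout errors."""
    stamps = []
    for line in log_text.splitlines():
        match = TSO_TIMEOUT.match(line.strip())
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S.%f"))
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]


# First four timeout lines from the log excerpt in the post.
SAMPLE = '''
[2022/12/28 09:08:12.272 +00:00] [ERROR] [client.go:502] ["[pd] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2022/12/28 09:08:16.273 +00:00] [ERROR] [client.go:502] ["[pd] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2022/12/28 09:08:20.274 +00:00] [ERROR] [client.go:502] ["[pd] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2022/12/28 09:08:24.275 +00:00] [ERROR] [client.go:502] ["[pd] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
'''

print(tso_timeout_gaps(SAMPLE))
```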

It looks like the cdc node cannot reach PD; this is probably a network connectivity problem.

Networking on k8s can be fairly complex. Check whether the side handling the OpenAPI request can actually reach PD.
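
A first check along these lines is a plain TCP connect to the PD address, run from inside the TiCDC pod. This is a minimal sketch, not a full diagnosis: the function name `can_reach` is hypothetical, and the host `tidb-test-pd` and port `2379` used in the usage note are simply read off the `--pd` flag in the post.

```python
import socket
import time


def can_reach(host, port, timeout=3.0):
    """Return (reachable, elapsed_seconds) for a plain TCP connect to host:port."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            reachable = True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        reachable = False
    return reachable, time.monotonic() - start
```

For example, `can_reach("tidb-test-pd", 2379)` inside the TiCDC pod. A successful connect only rules out basic DNS and routing failures; it does not prove that TSO requests to PD succeed, so a slow or failing result is the more informative outcome.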