TiCDC error, error code CDC-owner-1001

【TiDB Deployment Environment】
Production
【TiDB Version】

TiCDC Version
Release Version: v4.0.11
Git Commit Hash: 52a6d9ea6da595b869a43e13ae2d3680354f89b8
Git Branch: heads/refs/tags/v4.0.11
UTC Build Time: 2021-02-25 16:40:37
Go Version: go version go1.13 linux/amd64

【Problem: Symptoms and Impact】
The tasks are cyclic (ring) replication tasks: clusters A and B replicate to each other. B is the backup cluster; there are currently no write operations on B and B cannot be tested for the time being, so I cannot tell whether its replication task is still working. The changefeed can be created successfully, but after running for a while its state stays "normal" while the following error is reported:

"state": "normal",
  "history": [
    1679004242947
  ],
  "error": {
    "addr": "10.241.200.238:8300",
    "code": "CDC-owner-1001",
    "message": "rpc error: code = Unknown desc = rpc error: code = Unavailable desc = not leader"
  },
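For reference, a status block like the one above can be re-checked at any time with the changefeed query subcommand; the PD address and changefeed ID below are placeholders for your own environment:

cdc cli changefeed query --pd=http://<pd-host>:2379 --changefeed-id=<changefeed-id>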

【Resource Configuration】
Two CDC nodes
【Attachments: Screenshots / Logs / Monitoring】
From the logs, this looks like it is caused by failed RPC connections to TiKV. How can this kind of problem be fixed?

[2023/03/25 18:09:34.923 +00:00] [INFO] [client.go:726] ["creating new stream to store to send request"] [regionID=126366642] [requestID=4042] [storeID=65151616] [addr=10.250.78.96:20160]
[2023/03/25 18:09:34.924 +00:00] [INFO] [client.go:398] ["establish stream to store failed, retry later"] [addr=10.250.78.96:20160] [error="[CDC:ErrTiKVEventFeed]rpc error: code = Unavailable desc = connection closed"] [errorVerbose="[CDC:ErrTiKVEventFeed]rpc error: code = Unavailable desc = connection closed\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByCause\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/normalize.go:279\ngithub.com/pingcap/ticdc/pkg/errors.WrapError\n\tgithub.com/pingcap/ticdc@/pkg/errors/helper.go:28\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).newStream.func1\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:397\ngithub.com/pingcap/ticdc/pkg/retry.Run.func1\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry.go:32\ngithub.com/cenkalti/backoff.RetryNotify\n\tgithub.com/cenkalti/backoff@v2.2.1+incompatible/retry.go:37\ngithub.com/cenkalti/backoff.Retry\n\tgithub.com/cenkalti/backoff@v2.2.1+incompatible/retry.go:24\ngithub.com/pingcap/ticdc/pkg/retry.Run\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry.go:31\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).newStream\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:375\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).dispatchRequest\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:731\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).eventFeed.func1\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:521\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357"]
[2023/03/25 18:09:34.924 +00:00] [INFO] [region_range_lock.go:217] ["range locked"] [lockID=150] [regionID=126366089] [startKey=7480000000000089ff1a5f728000000000ff0393680000000000fa] [endKey=7480000000000089ff1a5f728000000000ff03991c0000000000fa] [checkpointTs=440341039823454248]
[2023/03/25 18:09:34.925 +00:00] [INFO] [region_range_lock.go:217] ["range locked"] [lockID=150] [regionID=126245263] [startKey=7480000000000089ff1a5f698000000000ff0000030419ae9a3aff8100000003800000ff000002edd7000000fc] [endKey=7480000000000089ff1a5f698000000000ff0000040419ae98a3ffa900000003800000ff00000089b0000000fc] [checkpointTs=440341039823454248]
[2023/03/25 18:09:34.925 +00:00] [INFO] [region_range_lock.go:217] ["range locked"] [lockID=150] [regionID=126366891] [startKey=7480000000000089ff1a5f698000000000ff0000040419ae9928fff800000003800000ff000002dba9000000fc] [endKey=7480000000000089ff1a5f698000000000ff0000040419ae9a58ff0200000003800000ff0000001a17000000fc] [checkpointTs=440341039823454248]
[2023/03/25 18:09:34.926 +00:00] [INFO] [region_range_lock.go:217] ["range locked"] [lockID=150] [regionID=126352778] [startKey=7480000000000089ff1a5f728000000000ff03991c0000000000fa] [endKey=7480000000000089ff1a5f728000000000ff039d430000000000fa] [checkpointTs=440341039823454248]
[2023/03/25 18:09:34.927 +00:00] [INFO] [region_range_lock.go:217] ["range locked"] [lockID=150] [regionID=126366128] [startKey=7480000000000089ff1a5f728000000000ff039d430000000000fa] [endKey=7480000000000089ff1a5f728000000000ff03a4180000000000fa] [checkpointTs=440341039823454248]
[2023/03/25 18:09:34.927 +00:00] [INFO] [region_range_lock.go:217] ["range locked"] [lockID=150] [regionID=126245240] [startKey=7480000000000089ff1a5f728000000000ff03dd180000000000fa] [endKey=7480000000000089ff1a5f728000000000ff03e5a20000000000fa] [checkpointTs=440341039823454248]
[2023/03/25 18:09:34.929 +00:00] [INFO] [client.go:398] ["establish stream to store failed, retry later"] [addr=10.250.78.96:20160] [error="[CDC:ErrTiKVEventFeed]rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.250.78.96:20160: connect: connection refused\""] [errorVerbose="[CDC:ErrTiKVEventFeed]rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.250.78.96:20160: connect: connection refused\"\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByCause\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/normalize.go:279\ngithub.com/pingcap/ticdc/pkg/errors.WrapError\n\tgithub.com/pingcap/ticdc@/pkg/errors/helper.go:28\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).newStream.func1\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:397\ngithub.com/pingcap/ticdc/pkg/retry.Run.func1\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry.go:32\ngithub.com/cenkalti/backoff.RetryNotify\n\tgithub.com/cenkalti/backoff@v2.2.1+incompatible/retry.go:37\ngithub.com/cenkalti/backoff.Retry\n\tgithub.com/cenkalti/backoff@v2.2.1+incompatible/retry.go:24\ngithub.com/pingcap/ticdc/pkg/retry.Run\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry.go:31\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).newStream\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:375\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).dispatchRequest\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:731\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).eventFeed.func1\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:521\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357"]
[2023/03/25 18:09:34.929 +00:00] [WARN] [client.go:734] ["get grpc stream client failed"] [regionID=125931395] [requestID=4029] [storeID=65151616] [error="[CDC:ErrTiKVEventFeed]rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.250.78.96:20160: connect: connection refused\""]
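Since the logs show connections to the TiKV store at 10.250.78.96:20160 being refused, a reasonable first step is to confirm whether that store (storeID=65151616 in the log) is up and reachable from the CDC nodes. A minimal check, assuming the cluster is managed with TiUP and using a placeholder cluster name and PD address:

# overall topology and node status
tiup cluster display <cluster-name>

# state of the individual TiKV store as seen by PD
pd-ctl -u http://<pd-host>:2379 store 65151616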

Does the cyclic replication mean TiCDC is deployed on both clusters A and B?

Yes. Right now the replication from A to B is working, but the task on B may have a problem. Judging from the logs, it is caused by failed RPC connections to B's TiKV, so it should have nothing to do with the cyclic replication feature itself. However, I am not sure whether the task is still healthy, or how to recover it.
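One rough way to tell whether the task is still making progress despite showing "normal" is to check whether its checkpoint keeps advancing; query it twice a few minutes apart (assuming the failing changefeed is the one registered on cluster B's TiCDC, with placeholder addresses):

cdc cli changefeed query --pd=http://<pd-host-of-B>:2379 --changefeed-id=<B-to-A-changefeed-id>
# if the checkpoint TSO in the output keeps increasing between runs, the changefeed is still
# consuming data; if it stays flat, the task is stuck on the error above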

So A replicates to B, and B replicates back to A?

Why does the backup cluster need to replicate back to the primary cluster?

Yes, and the cyclic replication filter has been configured as described in the official documentation, so B will not replicate the data it received from A back to A.
The purpose of this design: under normal operation, data produced on A is replicated to B; if A fails, we switch over to B, B becomes the primary cluster, and the data it produces is replicated back to A, so the two clusters always stay consistent.
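For context, the cyclic changefeed with the replica-ID filter was presumably created along the lines of the v4.0 cyclic replication docs, roughly as sketched below; the hosts, ports, and replica IDs are placeholders (here A is assumed to use replica ID 1 and B replica ID 2):

# on cluster A's TiCDC: create the mark tables once, then the A -> B changefeed
cdc cli changefeed cyclic create-marktables \
    --cyclic-upstream-dsn="root@tcp(<tidb-A-host>:4000)/" \
    --pd=http://<pd-A-host>:2379

cdc cli changefeed create \
    --sink-uri="mysql://root@<tidb-B-host>:4000/" \
    --pd=http://<pd-A-host>:2379 \
    --cyclic-replica-id=1 \
    --cyclic-filter-replica-ids=2 \
    --cyclic-sync-ddl=true

The mirror-image changefeed on B swaps the roles and the two IDs, which is what keeps B from echoing A's rows back to A.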

Understood, but I still have the feeling that this setup could cause problems, and the official docs also say that cyclic replication is an experimental feature.

Yes, this has already been explained to the service team that uses these clusters, and they are aware of it. But the current error does not look like it is caused by cyclic replication; it looks more like a CDC owner / TiKV RPC error kind of issue.
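If the root cause is only that the TiKV instance at 10.250.78.96:20160 was down or unreachable, TiCDC keeps retrying internally and the changefeed should recover on its own once the store is healthy again and the CDC nodes can reach it. If the changefeed ends up paused or stuck afterwards, it can usually be resumed manually (placeholders as before):

cdc cli changefeed resume --pd=http://<pd-host>:2379 --changefeed-id=<changefeed-id>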