After upgrading TiDB from v4.0.14 to v5.1.4, some CDC nodes are in the Down state

【TiDB environment】Production
【TiDB version】v5.1.4
【Problem】After the TiDB upgrade, some CDC nodes are in the Down state

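【Symptoms and impact】
TiDB was upgraded from v4.0.14 to v5.1.4 with tiup, and the upgrade itself completed normally. However, the cluster status shows some CDC nodes as Down. Restarting the CDC nodes reports success, yet some nodes are still shown as Down, and the set of Down CDC nodes differs each time I check.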

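Restarting only the CDC nodes also succeeds. But according to tiup cluster display, the 3 CDC nodes keep restarting and then going down again, and the Down nodes are different each time I check.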

It doesn't work even after restarting everything?

Right. The restart reports success, but there are always CDC nodes in the Down state when I check.

What do the logs of the Down CDC nodes say?

Take a look at the CDC logs. What error do they report?

The nodes go down in turn, and their logs are much the same.
cdc_stderr.log contains the following:
goroutine 707 [running]:
github.com/pingcap/tiflow/cdc/model.(*ChangeFeedInfo).FixIncompatible(0x0)
github.com/pingcap/tiflow/cdc/model/changefeed.go:225 +0x37
github.com/pingcap/tiflow/cdc/owner.fixChangefeedInfos.func1(0x0, 0x203000, 0x203000, 0x203000, 0x90)
github.com/pingcap/tiflow/cdc/owner/owner.go:266 +0x2b
github.com/pingcap/tiflow/cdc/model.(*ChangefeedReactorState).PatchInfo.func1(0x0, 0x0, 0x413ec2, 0xc001a35038, 0xf52f6176, 0x53ba57631af9f9da, 0x30)
github.com/pingcap/tiflow/cdc/model/reactor_state.go:296 +0xa2
github.com/pingcap/tiflow/cdc/model.(*ChangefeedReactorState).patchAny.func1(0x0, 0x0, 0x0, 0x3e, 0x4c38360, 0x2af6820, 0x1, 0xc000b95d70, 0xc001a35088)
github.com/pingcap/tiflow/cdc/model/reactor_state.go:389 +0x13a
github.com/pingcap/tiflow/pkg/orchestrator.(*SingleDataPatch).Patch(0xc000995a88, 0xc000b94600, 0xc000b95d40, 0x25, 0xc002424088)
github.com/pingcap/tiflow/pkg/orchestrator/interfaces.go:55 +0x82
github.com/pingcap/tiflow/pkg/orchestrator.getChangedState(0xc000b94600, 0xc000daab40, 0x1, 0x1, 0xc0012f79c0, 0x451, 0x0, 0x0)
github.com/pingcap/tiflow/pkg/orchestrator/batch.go:77 +0xa5
github.com/pingcap/tiflow/pkg/orchestrator.getBatchChangedState(0xc000b94600, 0xc002263600, 0x7, 0x7, 0x4, 0x4, 0xc0006386c0, 0xc001a352e0, 0x2525d13)
github.com/pingcap/tiflow/pkg/orchestrator/batch.go:41 +0x17e
github.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).applyPatchGroups(0xc00102e480, 0x7f6a77689028, 0xc00050a0a0, 0xc002263600, 0x7, 0x7, 0x1, 0x1, 0x0, 0x2, …)
github.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:335 +0xc5
github.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).Run(0xc00102e480, 0x7f6a77689028, 0xc00050a0a0, 0xc000626270, 0xbebc200, 0x7fff8cb43e45, 0x13, 0x2c3f079, 0x5, 0x0, …)
github.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:207 +0xb87
github.com/pingcap/tiflow/cdc/capture.(*Capture).runEtcdWorker(0xc0007c4000, 0x31f9208, 0xc00050a0a0, 0x318cc80, 0xc001408b40, 0x31b8488, 0xc001407890, 0xbebc200, 0x2c3f079, 0x5, …)
github.com/pingcap/tiflow/cdc/capture/capture.go:291 +0x185
github.com/pingcap/tiflow/cdc/capture.(*Capture).campaignOwner(0xc0007c4000, 0x31f9208, 0xc00050a0a0, 0x40dc00, 0x318dd20)
github.com/pingcap/tiflow/cdc/capture/capture.go:263 +0x6ee
github.com/pingcap/tiflow/cdc/capture.(*Capture).run.func2(0xc000050140, 0xc0007c4000, 0x31f9208, 0xc00050a0a0, 0xc0007c8140)
github.com/pingcap/tiflow/cdc/capture/capture.go:184 +0xb5
created by github.com/pingcap/tiflow/cdc/capture.(*Capture).run
github.com/pingcap/tiflow/cdc/capture/capture.go:178 +0x2c8
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x98 pc=0x1586357]

cdc.log contains:
[2022/05/25 10:45:25.234 +08:00] [INFO] [changefeed.go:227] ["Start fixing incompatible changefeed state"] [changefeed="{"sink-uri":"","opts":{"max-message-bytes":"10000120"},"create-time":"2021-04-28T13:07:50.608596171+08:00","start-ts":424564875739529255,"target-ts":0,"admin-job-type":2,"sort-engine":"memory","sort-dir":".","config":{"case-sensitive":true,"enable-old-value":true,"force-replicate":false,"check-gc-safe-point":true,"filter":{"rules":["."],"ignore-txn-start-ts":null},"mounter":{"worker-num":16},"sink":{"dispatchers":[{"matcher":["cdmp."],"dispatcher":"table"}],"protocol":"canal-json"},"cyclic-replication":{"enable":false,"replica-id":0,"filter-replica-ids":null,"id-buckets":0,"sync-ddl":false},"scheduler":{"type":"table-number","polling-time":-1}},"state":"normal","error":null,"sync-point-enabled":false,"sync-point-interval":600000000000,"creator-version":""}"]

[2022/05/25 10:45:25.234 +08:00] [INFO] [changefeed.go:229] ["Fix incompatibility changefeed state completed"] [changefeed="{"sink-uri":"","opts":{"max-message-bytes":"10000120"},"create-time":"2021-04-28T13:07:50.608596171+08:00","start-ts":424564875739529255,"target-ts":0,"admin-job-type":2,"sort-engine":"memory","sort-dir":".","config":{"case-sensitive":true,"enable-old-value":true,"force-replicate":false,"check-gc-safe-point":true,"filter":{"rules":["."],"ignore-txn-start-ts":null},"mounter":{"worker-num":16},"sink":{"dispatchers":[{"matcher":["cdmp."],"dispatcher":"table"}],"protocol":"canal-json"},"cyclic-replication":{"enable":false,"replica-id":0,"filter-replica-ids":null,"id-buckets":0,"sync-ddl":false},"scheduler":{"type":"table-number","polling-time":-1}},"state":"normal","error":null,"sync-point-enabled":false,"sync-point-interval":600000000000,"creator-version":""}"]

[2022/05/25 10:45:25.234 +08:00] [INFO] [changefeed.go:227] ["Start fixing incompatible changefeed state"] [changefeed="{"sink-uri":"***","opts":{"max-message-bytes":"10000120"},"create-time":"2021-10-18T15:20:51.315262631+08:00","start-ts":428485286629212282,"target-ts":0,"admin-job-type":2,"sort-engine":"unified","sort-dir":"","config":{"case-sensitive":true,"enable-old-value":true,"force-replicate":false,"check-gc-safe-point":true,"filter":{"rules":["cdmp_press./.order./"],"ignore-txn-start-ts":null},"mounter":{"worker-num":16},"sink":{"dispatchers":[{"matcher":["cdmp_press./.order./"],"dispatcher":"table"}],"protocol":"canal-json"},"cyclic-replication":{"enable":false,"replica-id":0,"filter-replica-ids":null,"id-buckets":0,"sync-ddl":false},"scheduler":{"type":"table-number","polling-time":-1}},"state":"normal","error":null,"sync-point-enabled":false,"sync-point-interval":600000000000,"creator-version":"v4.0.14"}"]
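
The changefeed value in these log entries is a JSON document, so the fields that matter for an upgrade (for example creator-version, which is empty for the changefeed created in 2021-04 and v4.0.14 for the one created in 2021-10) can be pulled out programmatically. A minimal Go sketch, assuming a trimmed copy of the payload with plain double quotes; the changefeedSummary struct and the summarize helper are hypothetical names for illustration and are not tiflow's own types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// changefeedSummary is a hypothetical struct covering only a few of the
// keys visible in the log line. It is NOT tiflow's model.ChangeFeedInfo.
type changefeedSummary struct {
	CreateTime     string `json:"create-time"`
	SortEngine     string `json:"sort-engine"`
	State          string `json:"state"`
	CreatorVersion string `json:"creator-version"`
}

// sampleChangefeed is a trimmed copy of the last changefeed payload in
// the log, keeping only four keys and using plain double quotes.
const sampleChangefeed = `{"create-time":"2021-10-18T15:20:51.315262631+08:00","sort-engine":"unified","state":"normal","creator-version":"v4.0.14"}`

// summarize decodes the JSON payload; keys not declared in
// changefeedSummary are ignored by encoding/json.
func summarize(raw string) (changefeedSummary, error) {
	var s changefeedSummary
	err := json.Unmarshal([]byte(raw), &s)
	return s, err
}

func main() {
	s, err := summarize(sampleChangefeed)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("created=%s sort-engine=%s state=%s creator-version=%q\n",
		s.CreateTime, s.SortEngine, s.State, s.CreatorVersion)
}
```

An empty creator-version in the output would point at a changefeed created before that field was recorded, which is one thing worth noting when comparing changefeeds across an upgrade.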

Could this be a compatibility issue, such as a regular value being assigned to a pointer variable?
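
For context on the panic itself: in the trace, FixIncompatible is called with 0x0 as its first argument, which for a pointer-receiver method is the receiver, so the method ran on a nil *ChangeFeedInfo. Reading a field through that nil pointer would produce exactly this runtime error, and addr=0x98 is consistent with a field some distance into the struct. The Go sketch below reproduces the pattern with a hypothetical stand-in type; changeFeedInfo, fixIncompatible, and the safeFix guard are illustrative names only and do not show tiflow's actual code or its actual fix:

```go
package main

import "fmt"

// changeFeedInfo is a hypothetical stand-in for a struct such as
// tiflow's model.ChangeFeedInfo; it is not the real definition.
type changeFeedInfo struct {
	State          string
	CreatorVersion string
}

// fixIncompatible reads and writes fields through the receiver, so it
// panics with a nil pointer dereference when the receiver is nil.
func (info *changeFeedInfo) fixIncompatible() {
	if info.CreatorVersion == "" {
		info.State = "normal"
	}
}

// safeFix is a hypothetical guard: it skips the call when info is nil
// and reports whether the fix actually ran.
func safeFix(info *changeFeedInfo) bool {
	if info == nil {
		return false
	}
	info.fixIncompatible()
	return true
}

// callWithNil invokes the method on a nil pointer and returns the
// recovered panic message instead of crashing the process.
func callWithNil() (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	var info *changeFeedInfo // nil, like the 0x0 receiver in the trace
	info.fixIncompatible()
	return "no panic"
}

func main() {
	fmt.Println("unguarded:", callWithNil())
	fmt.Println("guarded call ran:", safeFix(nil))
}
```

Calling a method on a nil pointer is legal in Go; the crash only happens once the method touches a field. That would explain why the process gets as far as FixIncompatible before dying, and why every CDC node that campaigns for owner hits the same panic in turn.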

Is there a quick way to deal with this?

This needs one of the ctc experts to take a look. @小王同学plus

This is a known issue. See:
https://github.com/pingcap/tiflow/issues/5266
Similar issue:

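Because this was fairly urgent, the fix was to roll back the CDC version: I replaced CDC 5.1.4 with 4.1.16 and restarted the CDC nodes, and they came up.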

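You can use the manual workaround described at https://pingcap.feishu.cn/docs/doccnpPgWuyZu0sPNHtPM7Cg7Kf# to resolve this temporarily. We will fix the problem in a later release; the affected versions are listed in that document.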