All three PD nodes went down at the same time, with massive "invalid timestamp" errors

【TiDB Environment】Production

【TiDB Version】v4.0.15

【Problem】The PD nodes went down suddenly

【Symptoms and impact】All three PD nodes went down, and as a result the whole TiKV cluster became unavailable.

Cluster topology:

Connection counts before and after the incident:

PD log:

```
[2022/04/13 16:44:43.023 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=ticdc] [expire-at=1649925883] [safepoint=432495521068220442]
[2022/04/13 16:44:43.464 +08:00] [ERROR] [server.go:1203] ["failed to update timestamp"] [error="[PD:etcd:ErrEtcdTxn]etcd Txn failed"]
[2022/04/13 16:44:43.465 +08:00] [INFO] [server.go:108] ["region syncer has been stopped"]
[2022/04/13 16:44:43.465 +08:00] [INFO] [cluster.go:310] ["metrics are reset"]
[2022/04/13 16:44:43.465 +08:00] [INFO] [coordinator.go:103] ["patrol regions has been stopped"]
[2022/04/13 16:44:43.465 +08:00] [INFO] [cluster.go:312] ["background jobs has been stopped"]
[2022/04/13 16:44:43.465 +08:00] [INFO] [coordinator.go:652] ["scheduler has been stopped"] [scheduler-name=balance-region-scheduler] [error="context canceled"]
[2022/04/13 16:44:43.465 +08:00] [INFO] [coordinator.go:205] ["drive push operator has been stopped"]
[2022/04/13 16:44:43.465 +08:00] [INFO] [coordinator.go:652] ["scheduler has been stopped"] [scheduler-name=balance-hot-region-scheduler] [error="context canceled"]
[2022/04/13 16:44:43.465 +08:00] [INFO] [cluster.go:331] ["coordinator is stopping"]
[2022/04/13 16:44:43.465 +08:00] [INFO] [coordinator.go:652] ["scheduler has been stopped"] [scheduler-name=balance-leader-scheduler] [error="context canceled"]
[2022/04/13 16:44:43.465 +08:00] [INFO] [coordinator.go:652] ["scheduler has been stopped"] [scheduler-name=label-scheduler] [error="context canceled"]
[2022/04/13 16:44:43.465 +08:00] [INFO] [cluster.go:327] ["coordinator has been stopped"]
[2022/04/13 16:44:43.465 +08:00] [INFO] [cluster.go:360] ["raftcluster is stopped"]
[2022/04/13 16:44:43.466 +08:00] [ERROR] [leader.go:142] ["getting pd leader meets error"] [error="[PD:proto:ErrProtoUnmarshal]proto: Member: wiretype end group for non-group"]
[2022/04/13 16:44:43.468 +08:00] [ERROR] [tso.go:302] ["invalid timestamp"] [timestamp={}]
[2022/04/13 16:44:43.469 +08:00] [ERROR] [tso.go:302] ["invalid timestamp"] [timestamp={}]
[2022/04/13 16:44:43.470 +08:00] [ERROR] [tso.go:302] ["invalid timestamp"] [timestamp={}]
[2022/04/13 16:44:43.470 +08:00] [ERROR] [tso.go:302] ["invalid timestamp"] [timestamp={}]
[2022/04/13 16:44:43.471 +08:00] [ERROR] [tso.go:302] ["invalid timestamp"] [timestamp={}]
```
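For reference, the `safepoint` in the first log line is a TiDB TSO: the upper bits are a physical timestamp in milliseconds and the low 18 bits are a logical counter. A minimal decoder (pure Python, no cluster access needed) shows the safepoint was only slightly behind the log line's wall-clock time, and that `expire-at` is a plain Unix timestamp 24 hours out:

```python
from datetime import datetime, timezone

TSO_LOGICAL_BITS = 18  # TiDB TSO layout: (physical ms << 18) | logical

def decode_tso(tso: int):
    """Split a TiDB TSO into (physical ms since epoch, logical counter)."""
    physical_ms = tso >> TSO_LOGICAL_BITS
    logical = tso & ((1 << TSO_LOGICAL_BITS) - 1)
    return physical_ms, logical

# safepoint from the "update service GC safe point" log line
physical_ms, logical = decode_tso(432495521068220442)
print(datetime.fromtimestamp(physical_ms / 1000, tz=timezone.utc), logical)

# expire-at is a plain Unix timestamp; the log line itself was written at
# 2022/04/13 16:44:43 +08:00, i.e. epoch 1649839483, so the TTL is:
print(1649925883 - 1649839483)  # 86400 s = 24 h
```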

Restarting failed several times; then we shut down CDC, and a short while later the restart succeeded.

With no PD producing TSOs, the cluster is definitely down. The only real protection against this is a multi-datacenter deployment; all of them failing at once should be rare.

In our case PD went down first, and only then did the cluster go down.

Right. Once the PD cluster is down, no transaction timestamps can be generated for reads or writes, so the whole cluster can neither read nor write and is effectively unavailable.

But why would all the PD nodes go down together?

Check the PD status; most likely a PD problem caused it.

Take a look at PD's error logs. The chance of all of them going down at the same time is usually very small.

pd集群日志.txt (22.6 KB)

PD is down, and all three of them at that, so the cluster outage is expected.
But what does this have to do with CDC? Why did shutting down CDC make it work?

No idea. I'm not sure whether it's related to CDC; I'm just describing what happened.

You'll need to look at the monitoring and the logs.
The log shows: [PD:etcd:ErrEtcdTxn]etcd Txn failed

Looks like etcd had a problem.
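For context on why that error points at etcd (an assumption about PD internals, not something stated in this thread): PD is believed to persist its TSO window to etcd inside a transaction guarded by a compare on the leader key, so the Txn fails the moment leadership is lost or etcd loses quorum. A toy compare-then-put sketch of that pattern, using an in-memory dict instead of etcd (all names here are made up for illustration):

```python
# Toy illustration of a leader-guarded compare-then-put, the pattern PD's
# TSO persistence is believed to use against etcd.
class EtcdTxnError(Exception):
    pass

def save_timestamp(store: dict, leader_key: str, my_id: str, ts_key: str, ts: int):
    # "If(leader_key == my_id).Then(Put(ts_key, ts))" in etcd Txn terms
    if store.get(leader_key) != my_id:
        raise EtcdTxnError("etcd Txn failed")  # leadership lost mid-flight
    store[ts_key] = ts

store = {"/pd/leader": "pd-1"}
save_timestamp(store, "/pd/leader", "pd-1", "/pd/timestamp", 432495521068220442)

store["/pd/leader"] = "pd-2"  # leadership changes (or etcd quorum is lost)
try:
    save_timestamp(store, "/pd/leader", "pd-1", "/pd/timestamp", 432495521068220443)
except EtcdTxnError as e:
    print(e)  # same shape as the "failed to update timestamp" error above
```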

etcd looks fine from what I can see.

Anyone willing to take a look at this?

In early versions, CDC could affect the PD cluster. A limit was added in v4.0.16; see https://github.com/pingcap/tiflow/issues/3112

I'd suggest upgrading before enabling CDC again.
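To illustrate the kind of mitigation being referred to (a hedged sketch only: the actual v4.0.16 change is described in the linked issue, and this is just a generic minimum-interval throttle on safepoint updates, not PingCAP's code):

```python
import time

class SafePointUpdateThrottle:
    """Buffer service-GC-safepoint updates that arrive too soon after the last one."""
    def __init__(self, min_interval_s: float, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self.clock = clock
        self.last_sent = float("-inf")
        self.pending = None  # most recent safepoint not yet sent

    def offer(self, safepoint: int):
        """Return the safepoint to send to PD now, or None to skip this round."""
        now = self.clock()
        self.pending = safepoint
        if now - self.last_sent < self.min_interval_s:
            return None  # too soon: buffer it instead of hitting PD/etcd
        self.last_sent = now
        sp, self.pending = self.pending, None
        return sp

# Simulated clock so the example is deterministic
t = [0.0]
throttle = SafePointUpdateThrottle(min_interval_s=60, clock=lambda: t[0])
print(throttle.offer(100))  # sent
t[0] = 10.0
print(throttle.offer(101))  # None: within 60 s, buffered
t[0] = 70.0
print(throttle.offer(102))  # sent (latest buffered value wins)
```

The point of the sketch: each update the throttle lets through is one fewer etcd transaction on the PD leader, which is the resource the thread suspects CDC was exhausting.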

This topic was automatically closed 1 minute after the last reply. New replies are no longer allowed.