Both PD nodes are down; startup fails with an error

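RephaelLee · May 6, 2021, 02:20

To help us resolve this faster, please provide the following information. A clear problem description gets a quicker answer:

[TiDB version] v5.0.0

[Problem description]

The cluster went down because of network and memory problems, and both PD nodes went down with it. Restarting with tiup now fails to bring PD up: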
{"level":"warn","ts":"2021-05-06T10:17:18.086+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-bfd18d84-72f1-4aa6-9715-fbdc7ce7e64c/10.53.x.x:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.53.x.x:2379: connect: connection refused\""}
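
For anyone reading a line like this, here is a minimal Python sketch that pulls out the parts that matter: which endpoint the client tried and whether the failure was a plain "connection refused". It assumes the line is valid JSON with straight quotes and escaped inner quotes; the helper name `summarize_log_line` is made up for illustration. "Connection refused" usually means nothing was listening on that address and port at the time, which fits a PD process that never came up.

```python
import json

# The log line from the post, with straight quotes and the inner quotes escaped.
LOG_LINE = (
    r'{"level":"warn","ts":"2021-05-06T10:17:18.086+0800",'
    r'"caller":"clientv3/retry_interceptor.go:62",'
    r'"msg":"retrying of unary invoker failed",'
    r'"target":"endpoint://client-bfd18d84-72f1-4aa6-9715-fbdc7ce7e64c/10.53.x.x:2379",'
    r'"attempt":0,'
    r'"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: '
    r'all SubConns are in TransientFailure, latest connection error: '
    r'connection error: desc = \"transport: Error while dialing dial tcp '
    r'10.53.x.x:2379: connect: connection refused\""}'
)


def summarize_log_line(line):
    """Return the dialed address, the attempt number, and a refused flag."""
    entry = json.loads(line)
    error_text = entry.get("error", "")
    return {
        # The address is the last path segment of the "target" field.
        "target": entry.get("target", "").rsplit("/", 1)[-1],
        "attempt": entry.get("attempt"),
        "connection_refused": "connection refused" in error_text,
    }


print(summarize_log_line(LOG_LINE))
# {'target': '10.53.x.x:2379', 'attempt': 0, 'connection_refused': True}
```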

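If your question is about performance tuning or troubleshooting, please download and run the script. Select all of the terminal output, then copy, paste, and upload it.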

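来了老弟 · May 6, 2021, 02:27

Run display to check the current state of the cluster.
Is this a production cluster or a test cluster?
Try logging in to the PD server, run systemctl stop/start pdxxx by hand, and watch the output in pd.log to see whether PD starts normally.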
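
As a complement to watching pd.log, a quick way to tell whether a restart attempt actually brought PD up is to check whether anything accepts TCP connections on the PD client port. A minimal Python sketch, assuming the client port is 2379 as in the error above; the function name `pd_port_reachable` is hypothetical and the host is whatever your PD address is.

```python
import socket


def pd_port_reachable(host, port=2379, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


# Example call (replace the placeholder with the real PD address):
# pd_port_reachable("<pd-host>", 2379)
```

A True result only shows that something is listening on the port. It does not show that PD is healthy or has elected a leader, so pd.log is still the place to look.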

RephaelLee · May 6, 2021, 02:33

It's a test cluster.

RephaelLee · May 6, 2021, 02:37

Starting the service directly fails.

RephaelLee · May 6, 2021, 02:38

I manually scaled in one PD with --force.

RephaelLee · May 6, 2021, 02:49

The services in the original cluster

RephaelLee · May 6, 2021, 06:38

Is anyone there?

来了老弟 · May 6, 2021, 08:47

What errors do you see in pd.log?

RephaelLee · May 6, 2021, 09:00

RephaelLee · May 6, 2021, 09:02

来了老弟 · May 6, 2021, 09:26

Clean out the PD deploy directory completely, then run systemctl start and see whether it works.

RephaelLee · May 6, 2021, 09:39

Still not working.

懂的都懂 · May 6, 2021, 11:51

Back up the PD node first, then use pd-recover.
https://docs.pingcap.com/zh/search?lang=zh&type=tidb&version=stable&q=pd-recover
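
For reference, pd-recover needs a cluster ID and an alloc ID larger than the largest ID PD has already allocated. Below is a rough Python sketch of how those inputs could be gathered, not a definitive procedure. It assumes pd.log still exists and contains "idAllocator allocates a new id" lines in the shape shown in the PD Recover documentation; the sample lines, the helper names, and the safety margin are illustrative, and log formats may differ between versions. It only builds the command string and does not run anything.

```python
import re
import shlex

# Assumed log shape: ... ["idAllocator allocates a new id"] [alloc-id=1000]
ALLOC_RE = re.compile(r'idAllocator allocates a new id.*?=(\d+)\]')

# Illustrative sample lines modeled on the documentation's example.
SAMPLE_PD_LOG = [
    '[2019/10/15 03:15:05.824 +00:00] [INFO] [id.go:91] ["idAllocator allocates a new id"] [alloc-id=1000]',
    '[2019/10/15 08:55:38.102 +00:00] [INFO] [id.go:91] ["idAllocator allocates a new id"] [alloc-id=4000]',
    '[2019/10/15 08:55:40.000 +00:00] [INFO] [server.go:212] ["some other message"]',
]


def largest_alloc_id(lines):
    """Return the largest allocated ID found in the log lines, or None."""
    ids = [int(m.group(1)) for m in map(ALLOC_RE.search, lines) if m]
    return max(ids) if ids else None


def pd_recover_command(endpoint, cluster_id, largest_id, margin=100000):
    """Build a pd-recover command line; margin is an arbitrary safety gap."""
    argv = [
        "pd-recover",
        "-endpoints", endpoint,
        "-cluster-id", str(cluster_id),
        "-alloc-id", str(largest_id + margin),
    ]
    return " ".join(shlex.quote(arg) for arg in argv)


largest = largest_alloc_id(SAMPLE_PD_LOG)
print(largest)
print(pd_recover_command("http://10.0.1.13:2379", 6747551640615446306, largest))
```

The endpoint and cluster ID above are the placeholder values from the documentation's example, not values from this cluster. If the logs were deleted along with the deploy directory, the alloc ID has to be chosen some other way; the docs describe the options.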

RephaelLee · May 7, 2021, 01:41

I already deleted the deploy directory. Can the cluster still be rescued?

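RephaelLee · May 7, 2021, 02:31

Here is what happened: the leader PD went down because of a network problem, so I force scaled it in. After that, the other PD, a follower, could no longer start.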

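spc_monkey · May 7, 2021, 02:39

Try starting the whole cluster, and hold off on any other operations for now. From what I can see so far, the cause is that the cluster ID recorded by TiKV no longer matches the cluster ID recorded by PD.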
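
One way to confirm a mismatch like this is to compare the cluster ID each component logged. Here is a small Python sketch under stated assumptions: PD logs an "init cluster id" line carrying `cluster-id=...`, and TiKV logs "connect to PD cluster <id>", as in the examples in the PD Recover documentation. The sample lines and IDs are illustrative, the function names are made up, and the messages may differ between versions.

```python
import re

# Assumed message shapes, taken from documentation examples.
PD_ID_RE = re.compile(r'init cluster id.*?cluster-id=(\d+)')
TIKV_ID_RE = re.compile(r'connect to PD cluster (\d+)')

# Illustrative sample lines; the second ID is invented to show a mismatch.
SAMPLE_PD_LOG = [
    '[2019/10/14 10:35:38.880 +00:00] [INFO] [server.go:212] ["init cluster id"] [cluster-id=6747551640615446306]',
]
SAMPLE_TIKV_LOG = [
    '[2019/10/14 07:06:16.544 +00:00] [INFO] [tikv-server.rs:464] ["connect to PD cluster 6958881462062430517"]',
]


def cluster_ids(lines, pattern):
    """Collect every distinct cluster ID that the pattern finds in the lines."""
    found = set()
    for line in lines:
        match = pattern.search(line)
        if match:
            found.add(int(match.group(1)))
    return found


def ids_consistent(pd_lines, tikv_lines):
    """True only if both logs agree on exactly one cluster ID."""
    pd_ids = cluster_ids(pd_lines, PD_ID_RE)
    tikv_ids = cluster_ids(tikv_lines, TIKV_ID_RE)
    return len(pd_ids) == 1 and pd_ids == tikv_ids


print(cluster_ids(SAMPLE_PD_LOG, PD_ID_RE))
print(cluster_ids(SAMPLE_TIKV_LOG, TIKV_ID_RE))
print(ids_consistent(SAMPLE_PD_LOG, SAMPLE_TIKV_LOG))  # False for these samples
```

If a PD data directory was wiped and PD was then started again, it would be expected to log a new cluster ID, so seeing more than one ID across the PD logs is itself a useful hint.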

spc_monkey · May 7, 2021, 02:40

Also, is this a test environment or production? What workload is running on it?