Changing the IPs of a TiDB 5.0.1 cluster

liusx · May 20, 2021, 09:49

[TiDB version] v5.0.1

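[Problem description] My TiDB cluster was moved to a different data center, so every IP has to change. I followed https://asktug.com/t/topic/63294 to change the IPs, but it reports that PD cannot be reached. Could you help me find the cause?
The error is as follows: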
[tidb@xx-tidb-001 bin]$ tiup pd-recover -endpoints http://10.x.x.44:2379 -cluster-id 6951514460023118605 -alloc-id 99000
Starting component pd-recover: /home/tidb/.tiup/components/pd-recover/v5.0.1/pd-recover -endpoints http://10.x.x.44:2379 -cluster-id 6951514460023118605 -alloc-id 99000
{"level":"warn","ts":"2021-05-20T17:43:26.768+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-3b52d2de-eb49-42d2-bc92-009d38366b7d/10.x.x.44:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.x.x.44:2379: connect: connection refused\""}
context deadline exceeded
Error: run /home/tidb/.tiup/components/pd-recover/v5.0.1/pd-recover (wd:/home/tidb/.tiup/data/SXwWgEZ) failed: exit status 1

liusx · May 20, 2021, 09:51

The step "tiup cluster reload tidb-cluster-07 -R pd --force" also fails, with:
+ [ Serial ] - UpdateTopology: cluster=tidb-cluster-07
{"level":"warn","ts":"2021-05-20T17:40:00.604+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-ea9543bf-8b3d-41e0-9981-57515e123ee2/10.x.x.44:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.x.x.44:2379: connect: connection refused\""}

Error: context deadline exceeded

Verbose debug logs has been written to /home/tidb/.tiup/logs/tiup-cluster-debug-2021-05-20-17-40-00.log.
Error: run /home/tidb/.tiup/components/cluster/v1.4.1/tiup-cluster (wd:/home/tidb/.tiup/data/SXwVoZh) failed: exit status 1

spc_monkey · May 20, 2021, 10:09

You could first verify whether the network ports between the cluster nodes are open.
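
One way to run that check from any node, as a minimal sketch: it assumes bash (for its /dev/tcp redirection) and the coreutils timeout command. The function names are made up for illustration, the hosts are passed as arguments, and the port list (2379 and 2380 for PD, 20160 for TiKV) assumes the default ports, so adjust it to your topology.

```shell
#!/usr/bin/env bash
# check_port HOST PORT: return 0 if a TCP connection can be opened within 3 seconds.
check_port() {
  local host=$1 port=$2
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# report HOST PORT: print one result line for a single probe.
report() {
  if check_port "$1" "$2"; then
    echo "$1:$2 open"
  else
    echo "$1:$2 unreachable"
  fi
}

# Probe every host given on the command line (assumed default ports).
for host in "$@"; do
  for port in 2379 2380 20160; do
    report "$host" "$port"
  done
done
```

Saved under a name of your choice, for example check_ports.sh, it would be run as "bash check_ports.sh <new-ip-1> <new-ip-2>". A "connection refused" like the one in the log above shows up here as "unreachable", which only means nothing accepted the connection: either the process is not listening on that address or something is blocking the port.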

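liusx · May 20, 2021, 10:11

I shut the cluster down earlier. For the step that deploys the new PD cluster:
[tidb@localhost ~]$ tiup cluster reload tidb-test -R pd --force
is this supposed to bring the cluster up? Normally everything would still be stopped at that point, right?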

spc_monkey · May 20, 2021, 10:15

Yes. pd-recover has to be run while the PD process is running.

liusx · May 20, 2021, 10:20

After the old cluster was shut down and every IP in the cluster was changed, it will not start anymore...

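spc_monkey · May 20, 2021, 10:22

My guess first: you stopped the cluster, changed the servers' IPs, then tried to bring the cluster back up and found that it will not start, right?
1. Check whether a firewall or something similar is enabled on the servers; ideally, test connectivity directly.
2. Since it will not start, look at the log of the specific process that fails to see the reason for the error.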
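
For the second point, a small helper like the one below can pull the most recent error lines out of a component log. It is only a sketch: the function name is hypothetical, it assumes the usual TiDB log layout in which the level appears as [ERROR] or [FATAL], and the log path depends on the deploy directory in your own topology.

```shell
# recent_errors LOGFILE [N]: print the last N (default 20) ERROR or FATAL lines.
recent_errors() {
  local log=$1 n=${2:-20}
  grep -E '\[(ERROR|FATAL)\]' "$log" | tail -n "$n"
}
```

For example, "recent_errors /path/to/pd.log 50" on the PD node that refuses connections. If PD exits right after starting, the last FATAL line usually states why.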

liusx · May 20, 2021, 10:26

Do the PD nodes have to be added again manually?

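spc_monkey · May 20, 2021, 10:31

1. For a data center migration, you can either scale out and then scale in, or change the IPs directly.
2. The method in that post is essentially rebuilding a new PD cluster. However, PD records the cluster's cluster ID, and this cluster ID must match the one recorded in TiKV for the cluster to start normally. So once the new PD processes have been created and started, you need to run pd-recover to set the cluster ID recorded by the new PD to that value, and then restart for it to take effect.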

spc_monkey · May 20, 2021, 10:33

alloc-id is also just a value. You can set it on the large side; the point is to avoid conflicts with IDs that were allocated before.
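
If the old PD logs are still available, both values can be read from them instead of guessed. The sketch below assumes the log messages "init cluster id" and "idAllocator allocates a new id" that the PD Recover documentation greps for; their exact wording may differ between versions. The function names and the safety margin of 100000000 are arbitrary choices for illustration, not values from this thread.

```shell
# last_cluster_id LOG...: the most recent cluster ID that PD logged at startup.
last_cluster_id() {
  grep -h 'init cluster id' "$@" \
    | sed -n 's/.*\[cluster-id=\([0-9]*\)].*/\1/p' | tail -n 1
}

# max_alloc_id LOG...: the largest ID the old PD logged as allocated.
max_alloc_id() {
  grep -h 'idAllocator allocates a new id' "$@" \
    | sed -n 's/.*\[alloc-id=\([0-9]*\)].*/\1/p' | sort -n | tail -n 1
}

# safe_alloc_id LOG...: the largest logged ID plus an arbitrary margin
# (assumption: any value above the old maximum is acceptable).
safe_alloc_id() {
  local max
  max=$(max_alloc_id "$@")
  echo $(( ${max:-0} + 100000000 ))
}
```

The two results would then be passed as -cluster-id and -alloc-id to the pd-recover command shown at the top of the thread.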

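spc_monkey · May 20, 2021, 10:34

That said, since your cluster will not start at all, it is most likely a network problem. I suggest troubleshooting the network first.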

SCUT-Chan · November 22, 2021, 07:14

Has your problem been solved?