All PD data directories were deleted; when restoring the PD cluster, the PD data directories are not re-synced and the PD cluster cannot start

【TiDB Environment】Test environment, testing PD node recovery
【TiDB Version】v5.2.1 and v6.0.0; the TiDB cluster was deployed offline from a local mirror (no internet access)
【Problem】After all PD data directories are deleted, the PD nodes do not automatically re-sync their data

【Reproduction Steps】What operations were performed before the problem appeared
The steps below were performed on v6.0.0.

Method 1: find the cluster ID in the PD log
$ cat /data/tidb/pd-2379/log/pd.log | grep "init cluster id"

$ cat pd.log | grep "init cluster id"
[2021/04/06 17:02:13.223 +08:00] [INFO] [server.go:343] ["init cluster id"] [cluster-id=6947967450173627422]
[2021/04/06 17:10:11.066 +08:00] [INFO] [server.go:343] ["init cluster id"] [cluster-id=6947967450173627422]
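To pull out just the numeric ID for later use with pd-recover, a small convenience one-liner (any equivalent grep/awk pipeline works just as well):

$ grep -o 'cluster-id=[0-9]*' /data/tidb/pd-2379/log/pd.log | tail -n 1 | cut -d= -f2
6947967450173627422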

The alloc-id can only be found in the PD log:
$ cat pd.log | grep "idAllocator allocates a new id" | awk -F'=' '{print $2}' | awk -F']' '{print $1}' | sort -r | head -n 1
9000
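One caveat with the pipeline above: plain sort -r compares strings, so it would rank 9000 above 16000 once allocated IDs grow past four digits. A numerically-sorted variant is safer:

$ cat pd.log | grep "idAllocator allocates a new id" | awk -F'=' '{print $2}' | awk -F']' '{print $1}' | sort -rn | head -n 1

Whatever this prints, pass a strictly larger value to pd-recover's -alloc-id (here the log shows 9000, and 16000 is used below).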

After a tiup deployment, systemd will automatically pull pd-server back up once the PD instance is broken, which would spoil the experiment. To prevent that, make the following changes:

1. In /etc/systemd/system/pd-2379.service, remove Restart=always or change it to Restart=no.
2. Run systemctl daemon-reload to reload the unit configuration (see the sketch after this list).
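A minimal way to apply this non-interactively, assuming the unit file path used by this deployment:

$ sudo sed -i 's/^Restart=always/Restart=no/' /etc/systemd/system/pd-2379.service
$ sudo systemctl daemon-reload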

$ cat /etc/systemd/system/pd-2379.service
[Unit]
Description=pd service
After=syslog.target network.target remote-fs.target nss-lookup.target

[Service]
LimitNOFILE=1000000
LimitSTACK=10485760
User=tidb
ExecStart=/data/tidb/pd-2379/scripts/run_pd.sh
Restart=no

RestartSec=15s

[Install]
WantedBy=multi-user.target
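To confirm that systemd picked up the change:

$ systemctl show pd-2379.service --property=Restart
Restart=no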

Now simulate destroying the PD nodes. The cluster has three PD instances:
10.10.103.89:21379
10.10.103.89:22379
10.10.103.89:2379

1. Delete the directories of all three PD nodes:
rm -rf pd-21379 pd-22379 pd-2379
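If you would rather take the PD processes down cleanly before deleting their directories, tiup can stop just the PD component; a sketch:

$ tiup cluster stop test -R pd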

2. Check the cluster status.

3. Force scale-in to remove two of the nodes, 10.10.103.89:21379 and 10.10.103.89:22379:

tiup cluster scale-in test -N 10.10.103.89:21379 --force
tiup cluster scale-in test -N 10.10.103.89:22379 --force

$ tiup cluster scale-in test -N 10.10.103.89:22379 --force
tiup is checking updates for component cluster …
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.9.3/tiup-cluster /home/tidb/.tiup/components/cluster/v1.9.3/tiup-cluster scale-in test -N 10.10.103.89:22379 --force

(tiup ASCII art banner: WARNING)

Forcing scale in is unsafe and may result in data loss for stateful components.
DO NOT use --force if you have any component in Pending Offline status.
The process is irreversible and could NOT be cancelled.
Only use --force when some of the servers are already permanently offline.
Are you sure to continue?
(Type "Yes, I know my data might be lost." to continue)
: Yes, I know my data might be lost.
This operation will delete the 10.10.103.89:22379 nodes in test and all their data.
Do you want to continue? [y/N]:(default=N) y

4. Check the cluster status again:
$ tiup cluster display test
tiup is checking updates for component cluster …
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.9.3/tiup-cluster /home/tidb/.tiup/components/cluster/v1.9.3/tiup-cluster display test
Cluster type: tidb
Cluster name: test
Cluster version: v6.0.0
Deploy user: tidb
SSH type: builtin
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir


10.10.103.89:9093 alertmanager 10.10.103.89 9093/9094 linux/x86_64 Up /data/tidb/alertmanager-9093/data /data/tidb/alertmanager-9093
10.10.103.89:3000 grafana 10.10.103.89 3000 linux/x86_64 Up - /data/tidb/grafana-3000
10.10.103.89:2379 pd 10.10.103.89 2379/2380 linux/x86_64 Down /data/tidb/pd-2379/data /data/tidb/pd-2379
10.10.103.89:9090 prometheus 10.10.103.89 9090/12020 linux/x86_64 Up /data/tidb/prometheus-9090/data /data/tidb/prometheus-9090
10.10.103.89:4000 tidb 10.10.103.89 4000/10080 linux/x86_64 Up - /data/tidb/tidb-4000
10.10.103.89:20160 tikv 10.10.103.89 20160/20180 linux/x86_64 N/A /data/tidb/tikv-20160/data /data/tidb/tikv-20160
10.10.103.89:20161 tikv 10.10.103.89 20161/20181 linux/x86_64 N/A /data/tidb/tikv-20161/data /data/tidb/tikv-20161
10.10.103.89:20162 tikv 10.10.103.89 20162/20182 linux/x86_64 N/A /data/tidb/tikv-20162/data /data/tidb/tikv-20162
Total nodes: 8


5. Start the only remaining PD node:

tiup cluster start test -N 10.10.103.89:2379


This start does not succeed: the PD node fails to come up, and the PD data directory is not re-created or synced.
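To see why, check the service on the node. Since the rm above removed the entire pd-2379 directory rather than just its data subdirectory, the start script that the systemd unit points at is gone (assuming the rm was run from /data/tidb):

$ systemctl status pd-2379
$ ls /data/tidb/pd-2379/scripts/run_pd.sh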

Rewrite the PD cluster metadata with pd-recover:
$ tiup pd-recover -endpoints http://10.10.103.89:2379 -cluster-id 6947967450173627422 -alloc-id 16000
tiup is checking updates for component pd-recover …
A new version of pd-recover is available:
The latest version: v6.0.0
Local installed version:
Update current component: tiup update pd-recover
Update all components: tiup update --all

The component pd-recover version is not installed; downloading from repository.
Starting component pd-recover: /home/tidb/.tiup/components/pd-recover/v6.0.0/pd-recover /home/tidb/.tiup/components/pd-recover/v6.0.0/pd-recover -endpoints http://10.10.103.89:2379 -cluster-id 6947967450173627422 -alloc-id 12000
{"level":"warn","ts":"2022-05-03T11:29:09.522+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-6ab7221d-1310-4a6c-b033-e627126276a8/10.10.103.89:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.10.103.89:2379: connect: connection refused\""}
context deadline exceeded
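pd-recover does not bootstrap a PD server; it only rewrites the metadata inside one that is already up and listening, so the connection-refused error above simply confirms that nothing is serving on port 2379. A quick reachability check against PD's HTTP API:

$ curl http://10.10.103.89:2379/pd/api/v1/members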


Regarding step 5, "Start the only remaining PD node": at this point, check whether a PD process is still running on the node.
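A quick way to check for a leftover process, for example:

$ ps -ef | grep [p]d-server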

Have a look at the few threads below:

I have been through all of those, but the PD cluster just will not come up.

I had to restore the software (deploy) directories by hand myself; only then could the PD cluster start, after which the recovery steps could proceed.
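For reference, a sketch of the sequence that works once the deploy directory is back in place, using the values collected above:

$ tiup cluster start test -N 10.10.103.89:2379
$ tiup pd-recover -endpoints http://10.10.103.89:2379 -cluster-id 6947967450173627422 -alloc-id 16000
$ tiup cluster restart test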

So you deleted the deploy directory as well?