pd restore

Today I tested deleting and restoring pd-server data and ran into a question about cluster-id and alloc-id.
The steps were as follows:
1 With the cluster healthy, delete the data files of the PD leader node, 134 (pd-server has three nodes in total)
2 Delete the pd-server data files on the other two nodes (both remaining pd-server nodes were Up before the deletion)
3 With all pd-server nodes down, check the cluster-id and alloc-id
[root@132 pd-2379]# cat /tidb-deploy/pd-2379/log/pd.log |grep "init cluster id"
[2021/12/07 01:37:56.347 +08:00] [INFO] [server.go:357] ["init cluster id"] [cluster-id=7038591877197459593]
[2021/12/08 18:14:21.245 +08:00] [INFO] [server.go:357] ["init cluster id"] [cluster-id=7039272998777474496]
[root@133 pd-2379]# cat /tidb-deploy/pd-2379/log/pd.log |grep "init cluster id"
[2021/12/07 04:21:31.268 -05:00] [INFO] [server.go:357] ["init cluster id"] [cluster-id=7038591877197459593]
[2021/12/08 05:14:23.112 -05:00] [INFO] [server.go:357] ["init cluster id"] [cluster-id=7039272998777474496]
[root@134 pd-2379]# cat /tidb-deploy/pd-2379/log/pd.log |grep "init cluster id"
[2021/12/06 12:37:57.840 -05:00] [INFO] [server.go:357] ["init cluster id"] [cluster-id=7038591877197459593]
### only one id is visible in the tikv log here
[root@localhost pd-2379]# cat /tidb-deploy/tikv-20160/log/tikv.log |grep "connect to PD cluster"
[2021/12/08 18:14:38.031 +08:00] [INFO] [server.rs:347] ["connect to PD cluster"] [cluster_id=7039272998777474496]

[root@132 pd-2379]# cat /tidb-deploy/pd-2379/log/pd.log |grep "idAllocator allocates a new id" | awk -F'=' '{print $2}' | awk -F']' '{print $1}' | sort -r | head -n 1
6000
[root@133 pd-2379]# cat /tidb-deploy/pd-2379/log/pd.log |grep "idAllocator allocates a new id" | awk -F'=' '{print $2}' | awk -F']' '{print $1}' | sort -r | head -n 1
4000
[root@134 pd-2379]# cat /tidb-deploy/pd-2379/log/pd.log |grep "idAllocator allocates a new id" | awk -F'=' '{print $2}' | awk -F']' '{print $1}' | sort -r
5000

4 Restore node 132 and set cluster-id: 7039272998777474496, alloc-id: 6000 (see the pd-recover sketch after step 7)
5 With 132 running normally, add 133 and 134 back by scaling out
6 Restart the cluster; tikv reports an error
cluster ID mismatch, local 7038591877197459593 != remote 7039272998777474496
7 After changing PD's cluster id to 7038591877197459593, everything returned to normal
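
For step 4, a minimal sketch of the restore, assuming the pd-recover tool matching the cluster version is run against a freshly started single PD on 132 (the id values are the ones taken from the logs above):

./pd-recover -endpoints http://192.168.135.132:2379 \
    -cluster-id 7039272998777474496 \
    -alloc-id 6000
# after pd-recover succeeds, restart that PD; alloc-id should be no smaller than the
# largest id found in the logs (leaving some margin above 6000 is safer)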

A few questions here.
Step 3:
Are there two cluster-id values in the log because a new leader was elected?
Why does the tikv log show only one cluster-id?
Which of the two cluster ids should be used for the restore?
The alloc id is different on every node, so which one should be chosen for the restore?

When was this step done? Was it before [2021/12/08 05:14:23.112 -05:00]?

It was after the data files on 133 and 132 had been deleted, when all the PDs in the cluster were already down.

[cluster-id=7039272998777474496] was most likely generated after you started 134 as a single-node PD. Before starting that node, check whether the cluster-id is consistent. The cluster-id and alloc-id should be taken from the previous leader's log or from tikv; find them before doing the operation.

The first leader was 134, which was also the node whose data files were deleted first.
The second leader was 132; in tikv, cluster_id=7039272998777474496 and the alloc-id is 6000.
But I never started 134's pd-server again after killing its PD, yet 7039272998777474496 still appeared.
I did notice one thing, though: after I killed 134, the first leader, all three PDs in the cluster went down and then came back up on their own. Could some kind of split-brain or similar have happened there and generated the new id?

Below is the information from node 132, in case it is useful:
[root@localhost pd-2379]# cat /tidb-deploy/pd-2379/log/pd.log |grep "init cluster id"
[2021/12/06 22:11:17.340 +08:00] [INFO] [server.go:357] ["init cluster id"] [cluster-id=7038591877197459593]
[2021/12/07 01:37:56.347 +08:00] [INFO] [server.go:357] ["init cluster id"] [cluster-id=7038591877197459593]
[2021/12/08 18:14:21.245 +08:00] [INFO] [server.go:357] ["init cluster id"] [cluster-id=7039272998777474496]
[root@localhost pd-2379]# cat /tidb-deploy/tidb-4000/log/tidb.log |grep "init cluster id"
[2021/12/06 22:11:25.676 +08:00] [INFO] [base_client.go:104] ["[pd] init cluster id"] [cluster-id=7038591877197459593]
[2021/12/07 01:38:02.696 +08:00] [INFO] [base_client.go:104] ["[pd] init cluster id"] [cluster-id=7038591877197459593]
[2021/12/07 18:46:09.153 +08:00] [INFO] [base_client.go:104] ["[pd] init cluster id"] [cluster-id=7038591877197459593]
[2021/12/08 18:14:37.779 +08:00] [INFO] [base_client.go:104] ["[pd] init cluster id"] [cluster-id=7039272998777474496]
[root@localhost pd-2379]# cat /tidb-deploy/tikv-20160/log/tikv.log |grep "init cluster id"
[root@localhost pd-2379]# cat /tidb-deploy/tikv-20160/log/tikv.log |grep "connect to PD cluster"
[2021/12/08 18:14:38.031 +08:00] [INFO] [server.rs:347] ["connect to PD cluster"] [cluster_id=7039272998777474496]
[2021/12/08 18:14:55.934 +08:00] [INFO] [server.rs:347] ["connect to PD cluster"] [cluster_id=7039272998777474496]
[2021/12/08 18:15:11.477 +08:00] [INFO] [server.rs:347] ["connect to PD cluster"] [cluster_id=7039272998777474496]
[2021/12/08 18:15:26.948 +08:00] [INFO] [server.rs:347] ["connect to PD cluster"] [cluster_id=7039272998777474496]
[2021/12/08 18:15:42.207 +08:00] [INFO] [server.rs:347] ["connect to PD cluster"] [cluster_id=7039272998777474496]
[2021/12/08 18:15:57.950 +08:00] [INFO] [server.rs:347] ["connect to PD cluster"] [cluster_id=7039272998777474496]
[2021/12/08 18:16:13.186 +08:00] [INFO] [server.rs:347] ["connect to PD cluster"] [cluster_id=7039272998777474496]
[2021/12/08 18:16:28.782 +08:00] [INFO] [server.rs:347] ["connect to PD cluster"] [cluster_id=7039272998777474496]
[2021/12/08 18:16:44.188 +08:00] [INFO] [server.rs:347] ["connect to PD cluster"] [cluster_id=7039272998777474496]
[2021/12/08 18:16:59.950 +08:00] [INFO] [server.rs:347] ["connect to PD cluster"] [cluster_id=7039272998777474496]
[2021/12/08 18:17:15.444 +08:00] [INFO] [server.rs:347] ["connect to PD cluster"] [cluster_id=7039272998777474496]
[2021/12/08 18:17:30.949 +08:00] [INFO] [server.rs:347] ["connect to PD cluster"] [cluster_id=7039272998777474496]
[2021/12/08 18:17:46.475 +08:00] [INFO] [server.rs:347] ["connect to PD cluster"] [cluster_id=7039272998777474496]
[2021/12/08 18:18:01.948 +08:00] [INFO] [server.rs:347] ["connect to PD cluster"] [cluster_id=7039272998777474496]
[2021/12/08 18:18:17.469 +08:00] [INFO] [server.rs:347] ["connect to PD cluster"] [cluster_id=7039272998777474496]
[2021/12/08 18:18:32.948 +08:00] [INFO] [server.rs:347] ["connect to PD cluster"] [cluster_id=7039272998777474496]
[2021/12/08 18:18:48.478 +08:00] [INFO] [server.rs:347] ["connect to PD cluster"] [cluster_id=7039272998777474496]
[2021/12/08 18:19:03.936 +08:00] [INFO] [server.rs:347] ["connect to PD cluster"] [cluster_id=7039272998777474496]
[root@localhost pd-2379]# cat /tidb-deploy/pd-2379/log/pd.log |grep "idAllocator allocates a new id" | awk -F'=' '{print $2}' | awk -F']' '{print $1}' | sort -r | head -n 1
6000


After you killed 134 and the other two PDs went down and came back up automatically, were the cluster's TiKV nodes all still normal? The cluster-id should not change just because all the PDs were stopped.
Starting PD again after its data files have been deleted will allocate a new cluster-id, as if for a brand-new cluster.
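
A quick way to confirm which cluster-id the running PDs are actually serving is pd-ctl's cluster command (a hedged sketch; the endpoint and tiup ctl version are just taken from this deployment):

tiup ctl:v5.3.0 pd -u http://192.168.135.132:2379 cluster
# the "id" field in the output is the cluster id reported by the current PD members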


Also, your system time looks very different from the actual current time. Are the clocks on the PD nodes synchronized?
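
A quick check on each PD host (assuming a standard systemd setup with an NTP client such as chrony; adjust for whatever the hosts actually run):

timedatectl status    # shows whether the system clock is synchronized and NTP is active
chronyc tracking      # if chrony is the NTP client, shows the current offset from the time source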

Below are the PD status changes after killing 134. The TiKVs also seem to show signs of interruption, and one of the PDs' clocks does indeed look a little off.
Judging by the status changes below, could the new id have been caused by the PD restart?
[tidb@tidb1 ~]$ tiup cluster display tidb-jiantest
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.7.0/tiup-cluster display tidb-jiantest
Cluster type: tidb
Cluster name: tidb-jiantest
Cluster version: v5.3.0
Deploy user: tidb
SSH type: builtin
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir


192.168.135.135:9093 alertmanager 192.168.135.135 9093/9094 linux/x86_64 Up /tidb-data/alertmanager-9093 /tidb-deploy/alertmanager-9093
192.168.135.135:3000 grafana 192.168.135.135 3000 linux/x86_64 Up - /tidb-deploy/grafana-3000
192.168.135.132:2379 pd 192.168.135.132 2379/2380 linux/x86_64 Down /tidb-data/pd-2379 /tidb-deploy/pd-2379
192.168.135.133:2379 pd 192.168.135.133 2379/2380 linux/x86_64 Down /tidb-data/pd-2379 /tidb-deploy/pd-2379
192.168.135.134:2379 pd 192.168.135.134 2379/2380 linux/x86_64 Down /tidb-data/pd-2379 /tidb-deploy/pd-2379
192.168.135.135:9090 prometheus 192.168.135.135 9090 linux/x86_64 Up /tidb-data/prometheus-9090 /tidb-deploy/prometheus-9090
192.168.135.132:4000 tidb 192.168.135.132 4000/10080 linux/x86_64 Up - /tidb-deploy/tidb-4000
192.168.135.133:4000 tidb 192.168.135.133 4000/10080 linux/x86_64 Up - /tidb-deploy/tidb-4000
192.168.135.134:4000 tidb 192.168.135.134 4000/10080 linux/x86_64 Up - /tidb-deploy/tidb-4000
192.168.135.135:9000 tiflash 192.168.135.135 9000/8123/3930/20170/20292/8234 linux/x86_64 N/A /tidb-data/tiflash-9000 /tidb-deploy/tiflash-9000
192.168.135.132:20160 tikv 192.168.135.132 20160/20180 linux/x86_64 N/A /tidb-data/tikv-20160 /tidb-deploy/tikv-20160
192.168.135.133:20160 tikv 192.168.135.133 20160/20180 linux/x86_64 N/A /tidb-data/tikv-20160 /tidb-deploy/tikv-20160
192.168.135.134:20160 tikv 192.168.135.134 20160/20180 linux/x86_64 N/A /tidb-data/tikv-20160 /tidb-deploy/tikv-20160
Total nodes: 13
[tidb@tidb1 ~]$ tiup cluster display tidb-jiantest
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.7.0/tiup-cluster display tidb-jiantest
Cluster type: tidb
Cluster name: tidb-jiantest
Cluster version: v5.3.0
Deploy user: tidb
SSH type: builtin
Dashboard URL: http://192.168.135.134:2379/dashboard
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir


192.168.135.135:9093 alertmanager 192.168.135.135 9093/9094 linux/x86_64 Up /tidb-data/alertmanager-9093 /tidb-deploy/alertmanager-9093
192.168.135.135:3000 grafana 192.168.135.135 3000 linux/x86_64 Up - /tidb-deploy/grafana-3000
192.168.135.132:2379 pd 192.168.135.132 2379/2380 linux/x86_64 Up|L /tidb-data/pd-2379 /tidb-deploy/pd-2379
192.168.135.133:2379 pd 192.168.135.133 2379/2380 linux/x86_64 Up /tidb-data/pd-2379 /tidb-deploy/pd-2379
192.168.135.134:2379 pd 192.168.135.134 2379/2380 linux/x86_64 Down|UI /tidb-data/pd-2379 /tidb-deploy/pd-2379
192.168.135.135:9090 prometheus 192.168.135.135 9090 linux/x86_64 Up /tidb-data/prometheus-9090 /tidb-deploy/prometheus-9090
192.168.135.132:4000 tidb 192.168.135.132 4000/10080 linux/x86_64 Up - /tidb-deploy/tidb-4000
192.168.135.133:4000 tidb 192.168.135.133 4000/10080 linux/x86_64 Up - /tidb-deploy/tidb-4000
192.168.135.134:4000 tidb 192.168.135.134 4000/10080 linux/x86_64 Up - /tidb-deploy/tidb-4000
192.168.135.135:9000 tiflash 192.168.135.135 9000/8123/3930/20170/20292/8234 linux/x86_64 Up /tidb-data/tiflash-9000 /tidb-deploy/tiflash-9000
192.168.135.132:20160 tikv 192.168.135.132 20160/20180 linux/x86_64 Up /tidb-data/tikv-20160 /tidb-deploy/tikv-20160
192.168.135.133:20160 tikv 192.168.135.133 20160/20180 linux/x86_64 Up /tidb-data/tikv-20160 /tidb-deploy/tikv-20160
192.168.135.134:20160 tikv 192.168.135.134 20160/20180 linux/x86_64 Up /tidb-data/tikv-20160 /tidb-deploy/tikv-20160
Total nodes: 13

Check the startup records in the tikv log: search for the keyword Welcome and see whether there are any entries after the two PDs came back Up.
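
For example (the log path is the one used earlier in this thread; "Welcome to TiKV" is the banner tikv-server prints at startup):

cat /tidb-deploy/tikv-20160/log/tikv.log | grep "Welcome"
# each match marks a tikv-server start; compare its timestamp with when the two PDs came back Up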

I'm really sorry, I can't match these logs up any more. I'll redo the test. I have already synchronized the clocks.

Thank you for your help.
I redid the test and figured out why the new id was generated:
when I deleted the data files on the remaining two nodes, the PD service was configured with restart=always,
so after the deletion the two of them restarted themselves, and since there were no data files any more, a new id was generated.
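
For reference, the restart policy can be checked on a node like this (the unit name here just follows the deploy directory naming above; treat the exact name as an assumption for your deployment):

systemctl cat pd-2379.service | grep -i restart
# Restart=always means systemd brings pd-server back up automatically whenever it exits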
