A node in a TiDB cluster went down; having problems getting it to rejoin the cluster

Um, experts, I have an awkward question: I didn't touch this during the 7-day Spring Festival holiday, and now that I've opened it up again it is still in this state: Pending Offline:



Errors in the log:

1. You can use --force to take the node offline forcibly. After that, check pd-ctl store to confirm whether the node is still listed; if it is, you need to use the prune command to remove nodes in Tombstone state (see the command sketch after this list).
2. On the server that was taken offline, check whether the original process ports are still in use. Normally they should not be; confirm they are gone before scaling out again.
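A minimal sketch of the steps above. The cluster name tidb-test is taken from later in the thread; the TiKV node address 172.20.146.145:20160 and the PD endpoint 172.20.146.144:2379 are assumptions, substitute your own:

# Force the stuck node offline (node address is an assumption)
tiup cluster scale-in tidb-test --node 172.20.146.145:20160 --force

# Check whether PD still lists the store, and in which state
pd-ctl -u http://172.20.146.144:2379 store

# If the store shows up as Tombstone, prune it from the topology
tiup cluster prune tidb-test

# On the offlined server, the TiKV ports should no longer be listening
ss -lntp | grep -E '20160|20180'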

OK, I'll give it a try. Thanks in advance.

OK :ok_hand:

Er, after forcing the node offline with --force, prune still can't clear the Offline node:
tiup cluster prune tidb-test
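For context, and consistent with the earlier reply: tiup cluster prune only cleans up instances that PD already reports as Tombstone, so a store still stuck in Offline state (one that has not finished moving its regions away) is left untouched. A quick way to see the per-store state, assuming the same PD endpoint as above and that jq is installed:

# List only id, address and state_name for each store
pd-ctl -u http://172.20.146.144:2379 store | jq '.stores[].store | {id, address, state_name}'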


  1. Check the current status shown by tiup cluster display.
  2. Check whether the cluster has any down regions; you can look with pd-ctl (a command sketch follows this list).
    https://docs.pingcap.com/zh/tidb/stable/pd-control#region-check-miss-peer--extra-peer--down-peer--pending-peer--offline-peer--empty-region--hist-size--hist-keys
  3. Also use pd-ctl to share the information for all stores, to see whether the store from the earlier scale-in was never actually removed.
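A rough sketch of those checks, again assuming the cluster name tidb-test and the PD endpoint used above:

# Overall topology and per-node status
tiup cluster display tidb-test

# Regions with down or missing peers (the checks from the linked doc)
pd-ctl -u http://172.20.146.144:2379 region check down-peer
pd-ctl -u http://172.20.146.144:2379 region check miss-peer

# All stores known to PD, including any left over from the scale-in
pd-ctl -u http://172.20.146.144:2379 store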

1. The tiup cluster display status:


2. Running the region check miss-peer command returns results, so there should indeed be down regions.
For example:
{
  "count": 23,
  "regions": [
    {
      "id": 10001,
      "start_key": "7480000000000000FF2F00000000000000F8",
      "end_key": "7480000000000000FF3200000000000000F8",
      "epoch": {
        "conf_ver": 5,
        "version": 23
      },
      "peers": [
        {
          "id": 10002,
          "store_id": 1
        },
        {
          "id": 10003,
          "store_id": 4
        },
        {
          "id": 10004,
          "store_id": 5
        }
      ],
      "leader": {
        "id": 10002,
        "store_id": 1
      },
      "down_peers": [
        {
          "peer": {
            "id": 10003,
            "store_id": 4
          },
          "down_seconds": 14040
        }
      ],
      "pending_peers": [
        {
          "id": 10003,
          "store_id": 4
        }
      ],
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 1,
      "approximate_keys": 0
    },
3. Information for all stores:
{
  "count": 3,
  "stores": [
    {
      "store": {
        "id": 1,
        "address": "172.20.146.144:20160",
        "state": 1,
        "version": "4.0.10",
        "status_address": "172.20.146.144:20180",
        "git_hash": "2ea4e608509150f8110b16d6e8af39284ca6c93a",
        "start_timestamp": 1613610928,
        "deploy_path": "/tidata/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1613613814164172934,
        "state_name": "Offline"
      },
      "status": {
        "capacity": "10.3GiB",
        "available": "7.445GiB",
        "used_size": "31.85MiB",
        "leader_count": 11,
        "leader_weight": 1,
        "leader_score": 11,
        "leader_size": 11,
        "region_count": 23,
        "region_weight": 1,
        "region_score": 23,
        "region_size": 23,
        "start_ts": "2021-02-18T09:15:28+08:00",
        "last_heartbeat_ts": "2021-02-18T10:03:34.164172934+08:00",
        "uptime": "48m6.164172934s"
      }
    },
    {
      "store": {
        "id": 4,
        "address": "172.20.146.145:20160",
        "state": 1,
        "version": "4.0.10",
        "status_address": "172.20.146.145:20180",
        "git_hash": "2ea4e608509150f8110b16d6e8af39284ca6c93a",
        "start_timestamp": 1612853601,
        "deploy_path": "/tidata/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1612856011555390181,
        "state_name": "Offline"
      },
      "status": {
        "capacity": "0B",
        "available": "0B",
        "used_size": "0B",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 23,
        "region_weight": 1,
        "region_score": 23,
        "region_size": 23,
        "start_ts": "2021-02-09T14:53:21+08:00",
        "last_heartbeat_ts": "2021-02-09T15:33:31.555390181+08:00",
        "uptime": "40m10.555390181s"
      }
    },
    {
      "store": {
        "id": 5,
        "address": "172.20.146.143:20160",
        "version": "4.0.10",
        "status_address": "172.20.146.143:20180",
        "git_hash": "2ea4e608509150f8110b16d6e8af39284ca6c93a",
        "start_timestamp": 1613610925,
        "deploy_path": "/tidata/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1613632038142316460,
        "state_name": "Up"
      },
      "status": {
        "capacity": "10.3GiB",
        "available": "6.616GiB",
        "used_size": "32.19MiB",
        "leader_count": 12,
        "leader_weight": 1,
        "leader_score": 12,
        "leader_size": 12,
        "region_count": 23,
        "region_weight": 1,
        "region_score": 23,
        "region_size": 23,
        "start_ts": "2021-02-18T09:15:25+08:00",
        "last_heartbeat_ts": "2021-02-18T15:07:18.14231646+08:00",
        "uptime": "5h51m53.14231646s"
      }
    }
  ]
}
Do I need to delete all of the down regions? Er, I've never done that before...

Related post: "How to recover after manually deleting the pd and tikv directories under tidb-data". It is the same cluster; recover by following the method described in that post.