TiKV node cannot be taken offline even though leader_count and region_count are already 0

【TiDB environment】
【Summary】: A TiKV node cannot be taken offline
【Background】: While on TiDB v4.0.6 we scaled in one TiKV node. Its leader_count and region_count dropped to 0, but it has stayed in the Offline (pending-offline) state ever since. During that time we ran "tikv-ctl --db /data1/tidb-deploy/data/tikv-20160/db/ unsafe-recover remove-fail-stores -s 89241 --all-regions" on the other nodes, but the store still would not go offline. Upgrading to v5.1.1 did not resolve it, and a further upgrade to v5.2.0 plus online/offline repair attempts made no difference either.
【Symptoms】: Business traffic is normal, but access to information_schema is slow
【Problem】: One TiKV node can never finish going offline
【Business impact】: None
【TiDB version】: v5.2.0
【TiDB Operator version】:
【Kubernetes version】:
【Attachments】:

Attachment 1:
TiKV error logs seen in the Dashboard:
2021-09-03 11:05:52 ERROR TiKV 172.21.20.87:20160
[raft_client.rs:407] ["connection aborted"] [addr=172.21.20.98:20160] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] })))"] [store_id=89241]

2021-09-03 11:05:52 ERROR TiKV 172.21.20.87:20160 [raft_client.rs:707] ["connection abort"] [addr=172.21.20.98:20160] [store_id=89241]

After upgrading to v5.2.0, remove-fail-stores reports an error:
./tikv-ctl --data-dir /data1/tidb-deploy/data/tikv-20160/db/ --config /data1/tidb-deploy/tikv-20160/conf/tikv.toml unsafe-recover remove-fail-stores -s 89241 --all-regions
[2021/09/03 03:33:06.791 +00:00] [INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[2021/09/03 03:33:06.792 +00:00] [INFO] [mod.rs:479] ["encryption is disabled."]
[2021/09/03 03:33:06.796 +00:00] [WARN] [config.rs:587] ["compaction guard is disabled due to region info provider not available"]
[2021/09/03 03:33:06.796 +00:00] [WARN] [config.rs:682] ["compaction guard is disabled due to region info provider not available"]
thread 'main' panicked at 'called Result::unwrap() on an Err value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', cmd/tikv-ctl/src/main.rs:121:57
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

Store information:
Starting component ctl: /home/tidb/.tiup/components/ctl/v5.2.0/ctl pd -u 172.21.11.59:2379 store
{
  "count": 5,
  "stores": [
    {
      "store": {
        "id": 89241,
        "address": "172.21.20.98:20160",
        "state": 1,
        "version": "4.0.6",
        "status_address": "172.21.20.98:20180",
        "git_hash": "ca2475bfbcb49a7c34cf783596acb3edd05fc88f",
        "start_timestamp": 1611283801,
        "deploy_path": "/data1/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1611382452859916876,
        "state_name": "Offline"
      },
      "status": {
        "capacity": "0B",
        "available": "0B",
        "used_size": "0B",
        "leader_count": 0,
        "leader_weight": 0,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 0,
        "region_weight": 0,
        "region_score": 0,
        "region_size": 0,
        "slow_score": 0,
        "start_ts": "2021-01-22T02:50:01Z",
        "last_heartbeat_ts": "2021-01-23T06:14:12.859916876Z",
        "uptime": "27h24m11.859916876s"
      }
    },
    {
      "store": {
        "id": 103455,
        "address": "172.21.20.87:20160",
        "version": "5.2.0",
        "status_address": "172.21.20.87:20180",
        "git_hash": "556783c314a9bfca36c818256182eeef364120d7",
        "start_timestamp": 1630478941,
        "deploy_path": "/data1/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1630649848209204836,
        "state_name": "Up"
      },
      "status": {
        "capacity": "3.401TiB",
        "available": "1.107TiB",
        "used_size": "1.889TiB",
        "leader_count": 43497,
        "leader_weight": 1,
        "leader_score": 43497,
        "leader_size": 3628871,
        "region_count": 129897,
        "region_weight": 1,
        "region_score": 405987906.9967766,
        "region_size": 10827076,
        "slow_score": 1,
        "start_ts": "2021-09-01T06:49:01Z",
        "last_heartbeat_ts": "2021-09-03T06:17:28.209204836Z",
        "uptime": "47h28m27.209204836s"
      }
    },
    {
      "store": {
        "id": 135592,
        "address": "172.21.10.201:20160",
        "version": "5.2.0",
        "status_address": "172.21.10.201:20180",
        "git_hash": "556783c314a9bfca36c818256182eeef364120d7",
        "start_timestamp": 1630490575,
        "deploy_path": "/data1/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1630649852080571655,
        "state_name": "Up"
      },
      "status": {
        "capacity": "3.401TiB",
        "available": "1.106TiB",
        "used_size": "1.897TiB",
        "leader_count": 43494,
        "leader_weight": 1,
        "leader_score": 43494,
        "leader_size": 3551219,
        "region_count": 135567,
        "region_weight": 1,
        "region_score": 407231097.17386436,
        "region_size": 11097255,
        "slow_score": 1,
        "start_ts": "2021-09-01T10:02:55Z",
        "last_heartbeat_ts": "2021-09-03T06:17:32.080571655Z",
        "uptime": "44h14m37.080571655s"
      }
    },
    {
      "store": {
        "id": 151301,
        "address": "172.21.30.237:20160",
        "version": "5.2.0",
        "status_address": "172.21.30.237:20180",
        "git_hash": "556783c314a9bfca36c818256182eeef364120d7",
        "start_timestamp": 1630479587,
        "deploy_path": "/data1/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1630649848812707098,
        "state_name": "Up"
      },
      "status": {
        "capacity": "3.401TiB",
        "available": "1.109TiB",
        "used_size": "1.898TiB",
        "leader_count": 43497,
        "leader_weight": 1,
        "leader_score": 43497,
        "leader_size": 3622096,
        "region_count": 129571,
        "region_weight": 1,
        "region_score": 402675171.5219846,
        "region_size": 10767655,
        "slow_score": 1,
        "start_ts": "2021-09-01T06:59:47Z",
        "last_heartbeat_ts": "2021-09-03T06:17:28.812707098Z",
        "uptime": "47h17m41.812707098s"
      }
    },
    {
      "store": {
        "id": 918785,
        "address": "172.21.11.22:20160",
        "version": "5.2.0",
        "status_address": "172.21.11.22:20180",
        "git_hash": "556783c314a9bfca36c818256182eeef364120d7",
        "start_timestamp": 1630481421,
        "deploy_path": "/data1/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1630649857317595538,
        "state_name": "Up"
      },
      "status": {
        "capacity": "3.401TiB",
        "available": "1.095TiB",
        "used_size": "1.945TiB",
        "leader_count": 43500,
        "leader_weight": 1,
        "leader_score": 43500,
        "leader_size": 3827843,
        "region_count": 126935,
        "region_weight": 1,
        "region_score": 424140721.9742522,
        "region_size": 11198308,
        "slow_score": 1,
        "start_ts": "2021-09-01T07:30:21Z",
        "last_heartbeat_ts": "2021-09-03T06:17:37.317595538Z",
        "uptime": "46h47m16.317595538s"
      }
    }
  ]
}


1. I don't quite understand why you ran unsafe-recover when only one of your five TiKV nodes went down. Under normal circumstances this does not cause multi-replica loss, so that operation should not be needed.
2. Please share the output of tiup cluster display {cluster-name}.

You can try forcing the offline node out again with: tiup cluster scale-in {cluster-name} -N 172.21.20.98:20160 --force

No use, I tried that trick long ago.

tiup cluster scale-in bigdata-tidb -N 172.21.20.98:20160 --force
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.5.6/tiup-cluster scale-in bigdata-tidb -N 172.21.20.98:20160 --force
Forcing scale in is unsafe and may result in data loss for stateful components.
The process is irreversible and could NOT be cancelled.
Only use --force when some of the servers are already permanently offline.
Are you sure to continue? [y/N]:(default=N) y
This operation will delete the 172.21.20.98:20160 nodes in bigdata-tidb and all their data.
Do you want to continue? [y/N]:(default=N) y
Scale-in nodes…

  • [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/bigdata-tidb/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/bigdata-tidb/ssh/id_rsa.pub
  • [Parallel] - UserSSH: user=tidb, host=172.21.11.70
  • [Parallel] - UserSSH: user=tidb, host=172.21.31.27
  • [Parallel] - UserSSH: user=tidb, host=172.21.21.82
  • [Parallel] - UserSSH: user=tidb, host=172.21.11.59
  • [Parallel] - UserSSH: user=tidb, host=172.21.20.87
  • [Parallel] - UserSSH: user=tidb, host=172.21.10.201
  • [Parallel] - UserSSH: user=tidb, host=172.21.30.237
  • [Parallel] - UserSSH: user=tidb, host=172.21.11.22
  • [Parallel] - UserSSH: user=tidb, host=172.21.11.85
  • [Parallel] - UserSSH: user=tidb, host=172.21.31.112
  • [Parallel] - UserSSH: user=tidb, host=172.21.11.70
  • [Parallel] - UserSSH: user=tidb, host=172.21.11.70
  • [ Serial ] - ClusterOperate: operation=ScaleInOperation, options={Roles: Nodes:[172.21.20.98:20160] Force:true SSHTimeout:5 OptTimeout:120 APITimeout:300 IgnoreConfigCheck:false NativeSSH:false SSHType: CleanupData:false CleanupLog:false RetainDataRoles: RetainDataNodes: ShowUptime:false JSON:false Operation:StartOperation}

Error: failed to scale in: cannot find node id '172.21.20.98:20160' in topology

Verbose debug logs has been written to /home/tidb/.tiup/logs/tiup-cluster-debug-2021-10-08-03-05-17.log.
Error: run /home/tidb/.tiup/components/cluster/v1.5.6/tiup-cluster (wd:/home/tidb/.tiup/data/SlDMZ2z) failed: exit status 1

So right now you can still see the 172.21.20.98:20160 node in Offline state through pd-ctl, but tiup cluster display no longer shows this node?

Yes, exactly: tiup ctl:v5.2.0 pd -u 172.21.11.59:2379 store can still see it, and it is in Offline state:
{
  "store": {
    "id": 89241,
    "address": "172.21.20.98:20160",
    "state": 1,
    "version": "4.0.6",
    "status_address": "172.21.20.98:20180",
    "git_hash": "ca2475bfbcb49a7c34cf783596acb3edd05fc88f",
    "start_timestamp": 1611283801,
    "deploy_path": "/data1/tidb-deploy/tikv-20160/bin",
    "last_heartbeat": 1611382452859916876,
    "state_name": "Offline"
  },
  "status": {
    "capacity": "0B",
    "available": "0B",
    "used_size": "0B",
    "leader_count": 0,
    "leader_weight": 0,
    "leader_score": 0,
    "leader_size": 0,
    "region_count": 0,
    "region_weight": 0,
    "region_score": 0,
    "region_size": 0,
    "slow_score": 0,
    "start_ts": "2021-01-22T02:50:01Z",
    "last_heartbeat_ts": "2021-01-23T06:14:12.859916876Z",
    "uptime": "27h24m11.859916876s"
  }
}

But tiup cluster display cannot see it.

From the information above, this store's regions and leaders are all down to 0 and its version is v4.0.6. I'd like to confirm whether the other TiKV nodes are still communicating with it: you can search the TiKV logs for that node's IP. If nothing shows up, try running store delete 89241 through pd-ctl and see whether it has any effect.
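A rough sketch of that check, reusing the pd-ctl invocation already used in this thread (the log path below is an assumption based on the deploy path shown above):

grep "172.21.20.98" /data1/tidb-deploy/tikv-20160/log/tikv.log | tail   # any remaining traffic to the offline store? (log path is a guess)
tiup ctl:v5.2.0 pd -u 172.21.11.59:2379 store delete 89241              # ask PD to delete the store again
tiup ctl:v5.2.0 pd -u 172.21.11.59:2379 store 89241                     # then re-check state_name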

store delete 89241 — I've already run that plenty of times, it doesn't help.

There is still communication. If there were no communication and no error logs, I wouldn't even bother with it.

Try the following API endpoint and see whether the node can be forced into the Tombstone state:
curl -X POST 'http://{pd_ip}:{pd_port}/pd/api/v1/store/{store_id}/state?state=Tombstone'
Substitute the actual PD leader IP and port; the store_id is the 89241 from above.
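For reference, the store's current state can be read back through the corresponding GET endpoint of the same HTTP API, so you can check whether the POST took effect (same placeholders as above):

curl 'http://{pd_ip}:{pd_port}/pd/api/v1/store/89241'   # inspect "state_name" in the response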

curl -X POST 'http://172.21.21.82:2379/pd/api/v1/store/89241/state?state=Offline'
"The store's state is updated."
curl -X POST 'http://172.21.21.82:2379/pd/api/v1/store/89241/state?state=Tombstone'
"invalid state Tombstone"

Hi, it seems the Tombstone state value is not allowed to be set :rofl:

Hmm, it looks like that API endpoint has been disabled in this version. I suspect this is related to the earlier unsafe recover not going through, which left the node stuck in Offline. If it's convenient, you could try restarting the cluster.

It has been stuck in Offline for a very long time, since back when we were on 4.0.6.

I've restarted it many times. Just a few days ago we upgraded from 5.2.0 to 5.2.1, and the cluster was restarted during that upgrade.

Is there anything else that can be done now?

Right, I overlooked that point just now: this store's last heartbeat is already 2021-01-23T06:14:12.859916876Z. We will check on our side whether there is another way to take this node down.

OK, OK. Thanks for your help.

Please also confirm whether all 4 of the other TiKV nodes have communication-failure logs against the offline node, or only some of them, and share the specific error logs.

Question: about when one needs to run
tikv-ctl --db /data1/tidb-deploy/data/tikv-20160/db/ unsafe-recover remove-fail-stores -s 89241 --all-regions
1. In what situations is this command needed?
2. What preconditions must be met when running it?
3. Are any other steps required afterwards?

Answer:
1. It is used when two or more TiKV nodes in the cluster go down at the same time, some regions lose two peers, and the cluster reports errors on startup and cannot be brought back up.
2. It has to be run on the remaining healthy TiKV nodes; those healthy TiKV instances must be shut down while it runs, and automatic scheduling should be disabled beforehand.
3. Afterwards the PD cluster needs to be restarted to clear out the old metadata, including that of the dead TiKV node, and then the healthy TiKV nodes are restarted so that they report their heartbeats to PD again (see the sketch after this list).

Do not run this command casually unless you know exactly what it does.
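A minimal sketch of that sequence, assuming the store id and paths from this thread (the scheduling limits are only set to 0 temporarily; record the current values first and restore them at the end):

# pause automatic scheduling first
tiup ctl:v5.2.0 pd -u 172.21.11.59:2379 config set region-schedule-limit 0
tiup ctl:v5.2.0 pd -u 172.21.11.59:2379 config set replica-schedule-limit 0
# stop the healthy TiKV instances, then on EACH of them run
# (v4.0.x syntax; on v5.x tikv-ctl takes --data-dir/--config as shown earlier in this thread):
tikv-ctl --db /data1/tidb-deploy/data/tikv-20160/db/ unsafe-recover remove-fail-stores -s 89241 --all-regions
# restart PD to drop the stale metadata (including the dead store's), restart the TiKV nodes
# so they re-report heartbeats, then restore the scheduling limits to their previous values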

Hi — it turns out only one TiKV node actually has communication-failure logs:

Please also provide the output of pd-ctl region store 89241, thanks.
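With the PD endpoint used earlier in this thread, that would be something like (substitute your actual PD address):

tiup ctl:v5.2.0 pd -u 172.21.11.59:2379 region store 89241   # list the regions PD still records on store 89241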