TiKV node cannot be taken offline even though leader_count and region_count are already 0

【TiDB environment】
【Summary】: A TiKV node cannot be taken offline
【Background】: While on TiDB v4.0.6 we scaled in one TiKV node. Its leader_count and region_count dropped to 0, but it has stayed in the Offline (pending-offline) state ever since. During that time we ran "tikv-ctl --db /data1/tidb-deploy/data/tikv-20160/db/ unsafe-recover remove-fail-stores -s 89241 --all-regions" on the other nodes, but the store still would not go offline. Upgrading to v5.1.1 did not resolve it, and a further upgrade to v5.2.0 plus online/offline repair attempts made no difference either.
【Symptoms】: Business traffic is normal, but access to information_schema is slow
【Problem】: One TiKV node can never finish going offline
【Business impact】: None
【TiDB version】: v5.2.0
【TiDB Operator version】:
【Kubernetes version】:
【Attachments】:

Attachment 1:
TiKV error logs seen in the Dashboard:
2021-09-03 11:05:52 ERROR TiKV 172.21.20.87:20160
[raft_client.rs:407] ["connection aborted"] [addr=172.21.20.98:20160] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] })))"] [store_id=89241]

2021-09-03 11:05:52 ERROR TiKV 172.21.20.87:20160 [raft_client.rs:707] ["connection abort"] [addr=172.21.20.98:20160] [store_id=89241]

After upgrading to v5.2.0, remove-fail-stores reports an error:
./tikv-ctl --data-dir /data1/tidb-deploy/data/tikv-20160/db/ --config /data1/tidb-deploy/tikv-20160/conf/tikv.toml unsafe-recover remove-fail-stores -s 89241 --all-regions
[2021/09/03 03:33:06.791 +00:00] [INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[2021/09/03 03:33:06.792 +00:00] [INFO] [mod.rs:479] ["encryption is disabled."]
[2021/09/03 03:33:06.796 +00:00] [WARN] [config.rs:587] ["compaction guard is disabled due to region info provider not available"]
[2021/09/03 03:33:06.796 +00:00] [WARN] [config.rs:682] ["compaction guard is disabled due to region info provider not available"]
thread 'main' panicked at 'called Result::unwrap() on an Err value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', cmd/tikv-ctl/src/main.rs:121:57
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

Store information:
Starting component ctl: /home/tidb/.tiup/components/ctl/v5.2.0/ctl pd -u 172.21.11.59:2379 store
{
  "count": 5,
  "stores": [
    {
      "store": {
        "id": 89241,
        "address": "172.21.20.98:20160",
        "state": 1,
        "version": "4.0.6",
        "status_address": "172.21.20.98:20180",
        "git_hash": "ca2475bfbcb49a7c34cf783596acb3edd05fc88f",
        "start_timestamp": 1611283801,
        "deploy_path": "/data1/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1611382452859916876,
        "state_name": "Offline"
      },
      "status": {
        "capacity": "0B",
        "available": "0B",
        "used_size": "0B",
        "leader_count": 0,
        "leader_weight": 0,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 0,
        "region_weight": 0,
        "region_score": 0,
        "region_size": 0,
        "slow_score": 0,
        "start_ts": "2021-01-22T02:50:01Z",
        "last_heartbeat_ts": "2021-01-23T06:14:12.859916876Z",
        "uptime": "27h24m11.859916876s"
      }
    },
    {
      "store": {
        "id": 103455,
        "address": "172.21.20.87:20160",
        "version": "5.2.0",
        "status_address": "172.21.20.87:20180",
        "git_hash": "556783c314a9bfca36c818256182eeef364120d7",
        "start_timestamp": 1630478941,
        "deploy_path": "/data1/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1630649848209204836,
        "state_name": "Up"
      },
      "status": {
        "capacity": "3.401TiB",
        "available": "1.107TiB",
        "used_size": "1.889TiB",
        "leader_count": 43497,
        "leader_weight": 1,
        "leader_score": 43497,
        "leader_size": 3628871,
        "region_count": 129897,
        "region_weight": 1,
        "region_score": 405987906.9967766,
        "region_size": 10827076,
        "slow_score": 1,
        "start_ts": "2021-09-01T06:49:01Z",
        "last_heartbeat_ts": "2021-09-03T06:17:28.209204836Z",
        "uptime": "47h28m27.209204836s"
      }
    },
    {
      "store": {
        "id": 135592,
        "address": "172.21.10.201:20160",
        "version": "5.2.0",
        "status_address": "172.21.10.201:20180",
        "git_hash": "556783c314a9bfca36c818256182eeef364120d7",
        "start_timestamp": 1630490575,
        "deploy_path": "/data1/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1630649852080571655,
        "state_name": "Up"
      },
      "status": {
        "capacity": "3.401TiB",
        "available": "1.106TiB",
        "used_size": "1.897TiB",
        "leader_count": 43494,
        "leader_weight": 1,
        "leader_score": 43494,
        "leader_size": 3551219,
        "region_count": 135567,
        "region_weight": 1,
        "region_score": 407231097.17386436,
        "region_size": 11097255,
        "slow_score": 1,
        "start_ts": "2021-09-01T10:02:55Z",
        "last_heartbeat_ts": "2021-09-03T06:17:32.080571655Z",
        "uptime": "44h14m37.080571655s"
      }
    },
    {
      "store": {
        "id": 151301,
        "address": "172.21.30.237:20160",
        "version": "5.2.0",
        "status_address": "172.21.30.237:20180",
        "git_hash": "556783c314a9bfca36c818256182eeef364120d7",
        "start_timestamp": 1630479587,
        "deploy_path": "/data1/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1630649848812707098,
        "state_name": "Up"
      },
      "status": {
        "capacity": "3.401TiB",
        "available": "1.109TiB",
        "used_size": "1.898TiB",
        "leader_count": 43497,
        "leader_weight": 1,
        "leader_score": 43497,
        "leader_size": 3622096,
        "region_count": 129571,
        "region_weight": 1,
        "region_score": 402675171.5219846,
        "region_size": 10767655,
        "slow_score": 1,
        "start_ts": "2021-09-01T06:59:47Z",
        "last_heartbeat_ts": "2021-09-03T06:17:28.812707098Z",
        "uptime": "47h17m41.812707098s"
      }
    },
    {
      "store": {
        "id": 918785,
        "address": "172.21.11.22:20160",
        "version": "5.2.0",
        "status_address": "172.21.11.22:20180",
        "git_hash": "556783c314a9bfca36c818256182eeef364120d7",
        "start_timestamp": 1630481421,
        "deploy_path": "/data1/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1630649857317595538,
        "state_name": "Up"
      },
      "status": {
        "capacity": "3.401TiB",
        "available": "1.095TiB",
        "used_size": "1.945TiB",
        "leader_count": 43500,
        "leader_weight": 1,
        "leader_score": 43500,
        "leader_size": 3827843,
        "region_count": 126935,
        "region_weight": 1,
        "region_score": 424140721.9742522,
        "region_size": 11198308,
        "slow_score": 1,
        "start_ts": "2021-09-01T07:30:21Z",
        "last_heartbeat_ts": "2021-09-03T06:17:37.317595538Z",
        "uptime": "46h47m16.317595538s"
      }
    }
  ]
}


1. I don't quite understand why you ran unsafe-recover when only one of your five TiKV nodes went down. Under normal circumstances this does not cause multi-replica loss, so that operation should not be needed.
2. Please share the output of tiup cluster display {cluster-name}.

You can try forcing the offline node out again with: tiup cluster scale-in {cluster-name} -N 172.21.20.98:20160 --force

No use, I tried that trick long ago.

tiup cluster scale-in bigdata-tidb -N 172.21.20.98:20160 --force
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.5.6/tiup-cluster scale-in bigdata-tidb -N 172.21.20.98:20160 --force
Forcing scale in is unsafe and may result in data loss for stateful components.
The process is irreversible and could NOT be cancelled.
Only use --force when some of the servers are already permanently offline.
Are you sure to continue? [y/N]:(default=N) y
This operation will delete the 172.21.20.98:20160 nodes in bigdata-tidb and all their data.
Do you want to continue? [y/N]:(default=N) y
Scale-in nodes…

  • [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/bigdata-tidb/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/bigdata-tidb/ssh/id_rsa.pub
  • [Parallel] - UserSSH: user=tidb, host=172.21.11.70
  • [Parallel] - UserSSH: user=tidb, host=172.21.31.27
  • [Parallel] - UserSSH: user=tidb, host=172.21.21.82
  • [Parallel] - UserSSH: user=tidb, host=172.21.11.59
  • [Parallel] - UserSSH: user=tidb, host=172.21.20.87
  • [Parallel] - UserSSH: user=tidb, host=172.21.10.201
  • [Parallel] - UserSSH: user=tidb, host=172.21.30.237
  • [Parallel] - UserSSH: user=tidb, host=172.21.11.22
  • [Parallel] - UserSSH: user=tidb, host=172.21.11.85
  • [Parallel] - UserSSH: user=tidb, host=172.21.31.112
  • [Parallel] - UserSSH: user=tidb, host=172.21.11.70
  • [Parallel] - UserSSH: user=tidb, host=172.21.11.70
  • [ Serial ] - ClusterOperate: operation=ScaleInOperation, options={Roles: Nodes:[172.21.20.98:20160] Force:true SSHTimeout:5 OptTimeout:120 APITimeout:300 IgnoreConfigCheck:false NativeSSH:false SSHType: CleanupData:false CleanupLog:false RetainDataRoles: RetainDataNodes: ShowUptime:false JSON:false Operation:StartOperation}

Error: failed to scale in: cannot find node id '172.21.20.98:20160' in topology

Verbose debug logs has been written to /home/tidb/.tiup/logs/tiup-cluster-debug-2021-10-08-03-05-17.log.
Error: run /home/tidb/.tiup/components/cluster/v1.5.6/tiup-cluster (wd:/home/tidb/.tiup/data/SlDMZ2z) failed: exit status 1

So right now you can still see the 172.21.20.98:20160 node in Offline state through pd-ctl, but tiup cluster display no longer shows this node?

Yes, exactly: tiup ctl:v5.2.0 pd -u 172.21.11.59:2379 store can still see it, and it is in Offline state:
{
  "store": {
    "id": 89241,
    "address": "172.21.20.98:20160",
    "state": 1,
    "version": "4.0.6",
    "status_address": "172.21.20.98:20180",
    "git_hash": "ca2475bfbcb49a7c34cf783596acb3edd05fc88f",
    "start_timestamp": 1611283801,
    "deploy_path": "/data1/tidb-deploy/tikv-20160/bin",
    "last_heartbeat": 1611382452859916876,
    "state_name": "Offline"
  },
  "status": {
    "capacity": "0B",
    "available": "0B",
    "used_size": "0B",
    "leader_count": 0,
    "leader_weight": 0,
    "leader_score": 0,
    "leader_size": 0,
    "region_count": 0,
    "region_weight": 0,
    "region_score": 0,
    "region_size": 0,
    "slow_score": 0,
    "start_ts": "2021-01-22T02:50:01Z",
    "last_heartbeat_ts": "2021-01-23T06:14:12.859916876Z",
    "uptime": "27h24m11.859916876s"
  }
}

But tiup cluster display cannot see it.

From the information above, this store's regions and leaders are all down to 0 and its version is v4.0.6. I'd like to confirm whether the other TiKV nodes are still communicating with it: you can search the TiKV logs for that node's IP. If nothing shows up, try running store delete 89241 through pd-ctl and see whether it has any effect.
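A rough sketch of that check, reusing the pd-ctl invocation already used in this thread (the log path below is an assumption based on the deploy path shown above):

grep "172.21.20.98" /data1/tidb-deploy/tikv-20160/log/tikv.log | tail   # any remaining traffic to the offline store? (log path is a guess)
tiup ctl:v5.2.0 pd -u 172.21.11.59:2379 store delete 89241              # ask PD to delete the store again
tiup ctl:v5.2.0 pd -u 172.21.11.59:2379 store 89241                     # then re-check state_name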

store delete 89241 — I've already run that plenty of times, it doesn't help.

There is still communication. If there were no communication and no error logs, I wouldn't even bother with it.

Try the following API endpoint and see whether the node can be forced into the Tombstone state:
curl -X POST 'http://{pd_ip}:{pd_port}/pd/api/v1/store/{store_id}/state?state=Tombstone'
Substitute the actual PD leader IP and port; the store_id is the 89241 from above.
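For reference, the store's current state can be read back through the corresponding GET endpoint of the same HTTP API, so you can check whether the POST took effect (same placeholders as above):

curl 'http://{pd_ip}:{pd_port}/pd/api/v1/store/89241'   # inspect "state_name" in the response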

curl -X POST 'http://172.21.21.82:2379/pd/api/v1/store/89241/state?state=Offline'
"The store's state is updated."
curl -X POST 'http://172.21.21.82:2379/pd/api/v1/store/89241/state?state=Tombstone'
"invalid state Tombstone"

Hi, it seems the Tombstone state value is not allowed to be set :rofl:

Hmm, it looks like that API endpoint has been disabled in this version. I suspect this is related to the earlier unsafe recover not going through, which left the node stuck in Offline. If it's convenient, you could try restarting the cluster.

It has been stuck in Offline for a very long time, since back when we were on 4.0.6.

I've restarted it many times. Just a few days ago we upgraded from 5.2.0 to 5.2.1, and the cluster was restarted during that upgrade.

Is there anything else that can be done now?

Right, I overlooked that point just now: this store's last heartbeat is already 2021-01-23T06:14:12.859916876Z. We will check on our side whether there is another way to take this node down.

OK, OK. Thanks for your help.

Please also confirm whether all 4 of the other TiKV nodes have communication-failure logs against the offline node, or only some of them, and share the specific error logs.

Question: about when one needs to run
tikv-ctl --db /data1/tidb-deploy/data/tikv-20160/db/ unsafe-recover remove-fail-stores -s 89241 --all-regions
1. In what situations is this command needed?
2. What preconditions must be met when running it?
3. Are any other steps required afterwards?

Answer:
1. It is used when two or more TiKV nodes in the cluster go down at the same time, some regions lose two peers, and the cluster reports errors on startup and cannot be brought back up.
2. It has to be run on the remaining healthy TiKV nodes; those healthy TiKV instances must be shut down while it runs, and automatic scheduling should be disabled beforehand.
3. Afterwards the PD cluster needs to be restarted to clear out the old metadata, including that of the dead TiKV node, and then the healthy TiKV nodes are restarted so that they report their heartbeats to PD again (see the sketch after this list).

Do not run this command casually unless you know exactly what it does.
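A minimal sketch of that sequence, assuming the store id and paths from this thread (the scheduling limits are only set to 0 temporarily; record the current values first and restore them at the end):

# pause automatic scheduling first
tiup ctl:v5.2.0 pd -u 172.21.11.59:2379 config set region-schedule-limit 0
tiup ctl:v5.2.0 pd -u 172.21.11.59:2379 config set replica-schedule-limit 0
# stop the healthy TiKV instances, then on EACH of them run
# (v4.0.x syntax; on v5.x tikv-ctl takes --data-dir/--config as shown earlier in this thread):
tikv-ctl --db /data1/tidb-deploy/data/tikv-20160/db/ unsafe-recover remove-fail-stores -s 89241 --all-regions
# restart PD to drop the stale metadata (including the dead store's), restart the TiKV nodes
# so they re-report heartbeats, then restore the scheduling limits to their previous values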

Hi — it turns out only one TiKV node actually has communication-failure logs:

Please also provide the output of pd-ctl region store 89241, thanks.
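With the PD endpoint used earlier in this thread, that would be something like (substitute your actual PD address):

tiup ctl:v5.2.0 pd -u 172.21.11.59:2379 region store 89241   # list the regions PD still records on store 89241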