TiKV scale-in stuck in Pending Offline; after forced removal, PD still keeps the store's info

【TiDB Environment】Production / Test / PoC
【TiDB Version】v5.0.3
【Reproduction Path】What operations led to the problem
【Problem: symptoms and impact】
【Resource Configuration】
【Attachments: screenshots/logs/monitoring】

Yesterday I scaled in two TiKV nodes (out of the original 7). They have been stuck in Pending Offline ever since, so I handled it as follows:
1. Checked the region and leader counts on the two TiKV nodes being scaled in: leader = 0, region = 1
2. Confirmed the single remaining region on them, region 9783357, and found it is an empty region
3. Tried adding a replica for Region 9783357 as well as remove-peer 9783357; neither had any effect
4. Forcibly removed one of the TiKV nodes: tiup cluster scale-in xxxx -N 10.30.xx.xx:20160 --force
After that, the node is indeed gone from tiup, but pd-ctl still shows the store.

Now I'm stuck on what to do next. The store info is still recorded in PD. Any guidance would be appreciated, thanks.
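To see exactly what state PD still records for the stores, `pd-ctl store` prints the same JSON as PD's `/pd/api/v1/stores` endpoint. A small sketch for filtering out stores that PD does not consider Up (the sample data below is made up, trimmed to the fields that matter for this thread):

```python
import json

def stuck_stores(stores_json: str):
    """Return (store_id, state_name) for every store PD does not consider Up.

    Expects the JSON printed by `pd-ctl store` (same shape as PD's
    /pd/api/v1/stores response): {"count": N, "stores": [{"store": {...},
    "status": {...}}, ...]}.
    """
    doc = json.loads(stores_json)
    return [
        (s["store"]["id"], s["store"].get("state_name", "Up"))
        for s in doc.get("stores", [])
        if s["store"].get("state_name", "Up") != "Up"
    ]

# Hypothetical sample mimicking this cluster: one store is stuck in
# Offline state ("Pending Offline" in the dashboard) after scale-in.
sample = json.dumps({
    "count": 2,
    "stores": [
        {"store": {"id": 6, "state_name": "Up"}, "status": {}},
        {"store": {"id": 8009882, "state_name": "Offline"}, "status": {}},
    ],
})
print(stuck_stores(sample))
```

A store stuck in Offline usually still holds peers that PD cannot move away, which is exactly the situation described above.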

Recreate this empty region; once the store's state becomes Tombstone, run remove-tombstone.


./pd-ctl -u http://XXX.XXX.XXX.XXX:2379 -d store remove-tombstone

But recreating the empty region reports an error:
./tikv-ctl --db /data/tidb-data/tikv-20160/db recreate-region -p 10.30.xx.xx:2379 -r 9783357
error while open kvdb: Storage Engine IO error: While lock file: /data/tidb-data/tikv-20160/db/LOCK: Resource temporarily unavailable
LOCK file conflict indicates TiKV process is running. Do NOT delete the LOCK file and force the command to run. Doing so could cause data corruption.

You need to stop the TiKV instance hosting the region; pd-ctl region xxxx shows where its peers are.

./tikv-ctl --db /data/tidb-data/tikv-20160/db recreate-region -p '10.30.xx.xx:2379' -r 9783357

initing empty region 10113953 with peer_id 10113954...
Debugger::recreate_region: "[src/server/debug.rs:639]: \"[src/server/debug.rs:664]: region still exists id: 10113953 start_key: 7480000000000000FF375F698000000000FF0000040380000000FF0D2F659003800000FF0000000002038000FF00009043FDAD0000FD end_key: 7480000000000008FF875F72FC00000019FF18E0020000000000FA region_epoch { conf_ver: 1 version: 15792 } peers { id: 10113954 store_id: 8009882 }\""

Here is the region info:
» region 9783357
{
  "id": 9783357,
  "start_key": "7480000000000000FF375F698000000000FF0000040380000000FF0D2F659003800000FF0000000002038000FF00009043FDAD0000FD",
  "end_key": "7480000000000008FF875F72FC00000019FF18E0020000000000FA",
  "epoch": {
    "conf_ver": 8012,
    "version": 15791
  },
  "peers": [
    {
      "id": 9783358,
      "store_id": 8009882,
      "role_name": "Voter"
    },
    {
      "id": 9783359,
      "store_id": 6,
      "role_name": "Voter"
    },
    {
      "id": 9783360,
      "store_id": 8009881,
      "role_name": "Voter"
    },
    {
      "id": 10113880,
      "store_id": 1,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    }
  ],
  "leader": {
    "id": 9783359,
    "store_id": 6,
    "role_name": "Voter"
  },
  "down_peers": [
    {
      "down_seconds": 4967,
      "peer": {
        "id": 9783360,
        "store_id": 8009881,
        "role_name": "Voter"
      }
    },
    {
      "down_seconds": 317,
      "peer": {
        "id": 10113880,
        "store_id": 1,
        "role": 1,
        "role_name": "Learner",
        "is_learner": true
      }
    }
  ],
  "pending_peers": [
    {
      "id": 10113880,
      "store_id": 1,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    }
  ],
  "written_bytes": 0,
  "read_bytes": 0,
  "written_keys": 0,
  "read_keys": 0,
  "approximate_size": 1,
  "approximate_keys": 0
}
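The down/pending peers in that pd-ctl output can be pulled out mechanically. A sketch parsing a trimmed copy of the region JSON above (only the peer-related fields are kept):

```python
import json

def peer_summary(region_json: str):
    """Classify a region's peers by health, from `pd-ctl region <id>` output."""
    r = json.loads(region_json)
    all_stores = {p["store_id"] for p in r.get("peers", [])}
    down = {d["peer"]["store_id"] for d in r.get("down_peers", [])}
    pending = {p["store_id"] for p in r.get("pending_peers", [])}
    # "healthy" here just means neither down nor pending from PD's view;
    # an Offline store can still host a peer PD considers alive.
    return {"all": all_stores, "down": down,
            "pending": pending, "healthy": all_stores - down - pending}

# Trimmed copy of the pd-ctl output pasted above.
region = json.dumps({
    "id": 9783357,
    "peers": [
        {"id": 9783358, "store_id": 8009882},
        {"id": 9783359, "store_id": 6},
        {"id": 9783360, "store_id": 8009881},
        {"id": 10113880, "store_id": 1, "is_learner": True},
    ],
    "leader": {"id": 9783359, "store_id": 6},
    "down_peers": [
        {"down_seconds": 4967, "peer": {"id": 9783360, "store_id": 8009881}},
        {"down_seconds": 317, "peer": {"id": 10113880, "store_id": 1}},
    ],
    "pending_peers": [{"id": 10113880, "store_id": 1}],
})
print(peer_summary(region))
```

Here two of the four peers are down and the learner is also pending, which explains why PD's scheduler cannot make normal progress on this region.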

Still doesn't seem to work.
store_id 8009881 is the one already gone from tiup but still visible in pd-ctl; store_id 8009882 is the one stuck in Pending Offline.

https://docs.pingcap.com/zh/tidb/stable/pd-control#unsafe-remove-failed-stores-store-ids--show
After delete, the store may still show up. If you are sure it is no longer needed, use the unsafe command to force-remove it; it will then be gone from the metadata.

Try remove-peer again to delete the peer from store 8009882; if that still fails, try the next approach.

Still failing:
» operator add remove-peer 9783357 8009881
Failed! [500] "failed to add operator, maybe already have one"
» operator add remove-peer 9783357 8009882
Failed! [500] "failed to add operator, maybe already have one"
» operator add remove-peer 9783357 6
Failed! [500] "fail to build operator: plan is empty, maybe no valid leader"

./tikv-ctl --db /data/tidb-data/tikv-20160/db tombstone -p '10.30.xx.xx:2379' -r 9783357 --force

region: 9783357, error: "[src/server/debug.rs:1190]: invalid conf_ver: please make sure you have removed the peer by PD"

What does pd-ctl region show now?

» region 9783357
{
  "id": 9783357,
  "start_key": "7480000000000000FF375F698000000000FF0000040380000000FF0D2F659003800000FF0000000002038000FF00009043FDAD0000FD",
  "end_key": "7480000000000008FF875F72FC00000019FF18E0020000000000FA",
  "epoch": {
    "conf_ver": 8018,
    "version": 15791
  },
  "peers": [
    {
      "id": 9783358,
      "store_id": 8009882,
      "role_name": "Voter"
    },
    {
      "id": 9783359,
      "store_id": 6,
      "role_name": "Voter"
    },
    {
      "id": 9783360,
      "store_id": 8009881,
      "role_name": "Voter"
    },
    {
      "id": 10114097,
      "store_id": 1,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    }
  ],
  "leader": {
    "id": 9783359,
    "store_id": 6,
    "role_name": "Voter"
  },
  "down_peers": [
    {
      "down_seconds": 300,
      "peer": {
        "id": 9783360,
        "store_id": 8009881,
        "role_name": "Voter"
      }
    }
  ],
  "pending_peers": [
    {
      "id": 9783360,
      "store_id": 8009881,
      "role_name": "Voter"
    },
    {
      "id": 10114097,
      "store_id": 1,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    }
  ],
  "written_bytes": 0,
  "read_bytes": 0,
  "written_keys": 0,
  "read_keys": 0,
  "approximate_size": 1,
  "approximate_keys": 0
}

It is an empty region:
$ curl http://10.30.xx.xx:10080/regions/9783357
{
  "start_key": "dIAAAAAAAAA3X2mAAAAAAAAABAOAAAAADS9lkAOAAAAAAAAAAgOAAAAAkEP9rQ==",
  "end_key": "dIAAAAAAAAiHX3L8AAAAGRjgAg==",
  "start_key_hex": "7480000000000000375f69800000000000000403800000000d2f659003800000000000000203800000009043fdad",
  "end_key_hex": "7480000000000008875f72fc0000001918e002",
  "region_id": 9783357,
  "frames": null
}
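The two key encodings in that response are the same bytes twice: `start_key`/`end_key` are base64 of the raw key bytes, and the `*_hex` fields are the hex form. They can be cross-checked directly:

```python
import base64

# Keys copied verbatim from the TiDB status API response above.
start_key_b64 = "dIAAAAAAAAA3X2mAAAAAAAAABAOAAAAADS9lkAOAAAAAAAAAAgOAAAAAkEP9rQ=="
end_key_b64 = "dIAAAAAAAAiHX3L8AAAAGRjgAg=="

# Decoding the base64 form reproduces the *_hex fields byte for byte.
start_key_hex = base64.b64decode(start_key_b64).hex()
end_key_hex = base64.b64decode(end_key_b64).hex()

print(start_key_hex)  # begins with 0x74 ('t'), the table-data key prefix
print(end_key_hex)
```

This is just a consistency check; `frames: null` in the response is what indicates the range holds no rows, i.e. an empty region.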

Only this one is left: use -s to specify those store_ids and -r the region_id.

Does this require stopping all four stores 8009882, 6, 8009881, and 1? Stores 1 and 6 are healthy TiKV nodes; stopping them would affect normal service, wouldn't it?

./tikv-ctl --db /data/tidb-data/tikv-20160/db unsafe-recover remove-fail-stores -s 8009882,6,8009881,1 -r 9783357

error while open kvdb: Storage Engine IO error: While lock file: /data/tidb-data/tikv-20160/db/LOCK: Resource temporarily unavailable
LOCK file conflict indicates TiKV process is running. Do NOT delete the LOCK file and force the command to run. Doing so could cause data corruption.

Did you not stop the involved TiKV instances for the earlier operations? All of these operations require stopping the TiKV nodes where the region's peers live.

For unsafe recover, if only a few regions are involved you can stop the involved TiKVs one at a time; in your case there is just one region. If many regions are involved, the usual practice is to stop all TiKVs. Stopping a TiKV migrates its leaders, which causes some jitter. Once a store has been down longer than max-store-down-time, replicas are automatically re-created on other nodes; you can temporarily raise that parameter with pd-ctl config set.
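`pd-ctl config set` is a thin wrapper over PD's HTTP config API. A sketch of the request it issues to temporarily raise max-store-down-time before stopping the TiKVs (the endpoint path follows PD's documented `/pd/api/v1/config` API; the address is the placeholder from this thread, and you should verify the behavior against your PD version before relying on it):

```python
import json
from urllib import request

PD = "http://10.30.xx.xx:2379"  # placeholder PD address from this thread

def set_pd_config(item: str, value: str) -> request.Request:
    """Build the POST that `pd-ctl config set <item> <value>` issues."""
    body = json.dumps({item: value}).encode("utf-8")
    return request.Request(
        f"{PD}/pd/api/v1/config",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Raise max-store-down-time so PD does not start re-replicating regions
# while the TiKVs are intentionally down; remember to restore it afterwards.
req = set_pd_config("max-store-down-time", "1h")
print(req.get_full_url(), req.data)
```

Sending the request is then `request.urlopen(req)`; the same call with the original value undoes the change once the TiKVs are back up.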

Do the four stores really have to be stopped at the same time? Can't I stop one, run ./tikv-ctl --db /data/tidb-data/tikv-20160/db unsafe-recover remove-fail-stores -s 6 -r 9783357 on it, bring it back up, and then stop the next one?

Haven't tried it that way; you can give it a shot. My guess is it won't work, otherwise 6.1 wouldn't have needed to introduce online unsafe-recover.

It indeed doesn't seem to work, but stopping all of them will impact the business.

With 5 usable nodes now, stopping the 2 stores 1 and 6 won't hurt availability once the leaders migrate; it will only affect performance.

After stopping all of them and running ./tikv-ctl --db /data/tidb-data/tikv-20160/db unsafe-recover remove-fail-stores -s 6 -r 9783357, it still seems not to work.

Handle every involved store this way; -s can take multiple ids, e.g. -s 1,2,3,4.
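To avoid typos while repeating the command on every involved store, the invocation can be assembled mechanically. A sketch using the paths and ids from this thread (the --db path differs per TiKV instance, and each command must be run with that instance's TiKV process stopped):

```python
import shlex

TIKV_CTL = "./tikv-ctl"
DB = "/data/tidb-data/tikv-20160/db"  # adjust per TiKV instance

def unsafe_recover_cmd(fail_stores, region_id):
    """Assemble the unsafe-recover invocation to run on a stopped TiKV.

    -s takes a comma-separated list of the failed/involved store ids,
    -r the region id; the same command is repeated on every node that
    holds a peer of the region.
    """
    stores = ",".join(str(s) for s in fail_stores)
    return shlex.join([
        TIKV_CTL, "--db", DB,
        "unsafe-recover", "remove-fail-stores",
        "-s", stores, "-r", str(region_id),
    ])

print(unsafe_recover_cmd([8009882, 6, 8009881, 1], 9783357))
```

shlex.join (Python 3.8+) quotes each argument safely, so the generated line can be pasted into a shell as-is.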