TiKV scale-in and scale-out issue

tikv-ctl --db /path/to/tikv/db size -r 84
thread 'main' panicked at 'called Result::unwrap() on an Err value: RocksDb("IO error: While lock file: /data/xxx/db/LOCK: Resource temporarily unavailable")', src/libcore/result.rs:1188:5
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace.
This was run on store 5; even after shutting down store 5, it still reports the same error.
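For reference, that LOCK error normally means another process (typically the running tikv-server) still holds the RocksDB lock file, because tikv-ctl in local --db mode needs exclusive access to the data directory. Assuming lsof is available on the node, a quick check is:

lsof /data/xxx/db/LOCK                       # shows the PID that still holds the lock, if any
ps -ef | grep tikv-server | grep -v grep     # confirm the tikv-server process is really gone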

What are the store_ids of the two TiKV nodes being scaled in?

Store IDs 4 and 5.

Can none of the pending peers of the remaining 3 regions be removed manually? Do they all fail with the same "operator already exists" error?

Yes, the same error as with remove-peer on region 84.

>> operator add add-peer 1 2                            // Add a replica of Region 1 on store 2

On nodes whose store_id is not 4 or 5, can a peer be added normally?

» operator add add-peer 84 5888
Failed! [500] "failed to add operator, maybe already have one"
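The "maybe already have one" message means PD already has an operator scheduled for that region. A sketch of inspecting and clearing it with pd-ctl before retrying (operator show and operator remove are standard pd-ctl subcommands):

» operator show                  // list the operators PD is currently running
» operator remove 84             // cancel whatever operator is still attached to region 84
» operator add remove-peer 84 5  // then retry the removal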

Can you check the logs on the corresponding TiKV? Is it also not_leader?

curl http://{TiDBIP}:10080/regions/{regionID}

Also, check the information for these regions to see which table they belong to.

There are very few logs on the TiKV nodes; it seems that after store delete <store_id> was run, there have been basically no logs at all.

{
  "end_key": "dIAAAAAAAAAr",
  "end_key_hex": "74800000000000002b",
  "frames": [
    {
      "db_name": "mysql",
      "is_record": true,
      "table_id": 41,
      "table_name": "expr_pushdown_blacklist"
    }
  ],
  "region_id": 84,
  "start_key": "dIAAAAAAAAAp",
  "start_key_hex": "748000000000000029"
}
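Side note: end_key in this API output is just the base64 form of end_key_hex, and the 0x29 at the end of start_key_hex is table_id 41, i.e. mysql.expr_pushdown_blacklist. The keys can be decoded on any Linux host, for example:

echo "dIAAAAAAAAAr" | base64 -d | xxd -p    # prints 74800000000000002b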

mysql> select * from expr_pushdown_blacklist limit 10;
+----------+------------+--------------------------------------------------------------------+
| name     | store_type | reason                                                             |
+----------+------------+--------------------------------------------------------------------+
| date_add | tiflash    | DST(daylight saving time) does not take effect in TiFlash date_add |
+----------+------------+--------------------------------------------------------------------+

» region 84
{
  "id": 84,
  "start_key": "7480000000000000FF2900000000000000F8",
  "end_key": "7480000000000000FF2B00000000000000F8",
  "epoch": {
    "conf_ver": 5,
    "version": 21
  },
  "peers": [
    {
      "id": 85,
      "store_id": 1
    },
    {
      "id": 86,
      "store_id": 4
    },
    {
      "id": 87,
      "store_id": 5
    }
  ],
  "leader": {
    "id": 85,
    "store_id": 1
  },
  "pending_peers": [
    {
      "id": 87,
      "store_id": 5
    }
  ],
  "written_bytes": 0,
  "read_bytes": 0,
  "written_keys": 0,
  "read_keys": 0,
  "approximate_size": 1,
  "approximate_keys": 7166
}

The output shows that region 84's leader peer (id=85) sits on the TiKV with store_id 1, but:

#tikv-ctl --host ${store_id_1_host}:${store_id_1_port} consistency-check -r 84
DebugClient::check_region_consistency: RpcFailure: 2-UNKNOWN "Leader is on store 4"

#tikv-ctl --host ${store_id_4_host}:${store_id_4_port} consistency-check -r 84
success!

Then the leader peer information in the output of the region 84 command may be incorrect.
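pd-ctl and PD's HTTP API both serve the same heartbeat-based view, so a stale leader here usually just means the latest region heartbeat has not been reflected yet. For scripting, the same region record can be pulled over HTTP (assuming the default PD client port 2379):

curl http://{PDIP}:2379/pd/api/v1/region/id/84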

Run on store 5:
#tikv-ctl --db /path/to/tikv/db bad-regions
all regions are healthy

Has unsafe-recover ever been run on this cluster via tikv-ctl?

No, never. The background is as follows:
1. A 3-node TiKV cluster, with store IDs 1, 4, and 5.
2. Ran store delete 4 and store delete 5.
3. After realizing the operation was a mistake, we scaled out two new TiKV nodes.
4. One week after the scale-out, the regions on the original offline store 4 and store 5 gradually decreased, but some regions are still in pending state (how to check the remaining counts is sketched below).
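The remaining counts mentioned in step 4 can be watched from pd-ctl; the store command shows each store's state together with its region and leader counts:

» store 4            // state_name should stay Offline until all of its regions are moved away
» store 5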

Right now, operator add remove-peer 84 5 fails, and the consistency check of region 84's peer on store 5 also passes, meaning the peer is not corrupted. So how should this be handled?

Is a region in this pending_peers state caused by TiKV hitting a problem while sending or receiving a snapshot?
Also:
» config set enable-replace-offline-replica false

» operator add remove-peer 84 5
Success!

» region check 84
admin-remove-peer {rm peer: store [5]} (kind:region,admin, region:84(21,5), createAt:2021-04-18 10:05:31.000701127 +0800 CST m=+1377450.425976932, startAt:2021-04-18 10:05:31.000877357 +0800 CST m=+1377450.426153175, currentStep:0, steps:[remove peer on store 5])

» region 84
{
  "id": 84,
  "start_key": "7480000000000000FF2900000000000000F8",
  "end_key": "7480000000000000FF2B00000000000000F8",
  "epoch": {
    "conf_ver": 5,
    "version": 21
  },
  "peers": [
    {
      "id": 85,
      "store_id": 1
    },
    {
      "id": 86,
      "store_id": 4
    },
    {
      "id": 87,
      "store_id": 5
    }
  ],
  "leader": {
    "id": 85,
    "store_id": 1
  },
  "pending_peers": [
    {
      "id": 87,
      "store_id": 5
    }
  ],
  "written_bytes": 0,
  "read_bytes": 0,
  "written_keys": 0,
  "read_keys": 0,
  "approximate_size": 1,
  "approximate_keys": 7166
}

The pending peer still has not been removed.
After setting "enable-replace-offline-replica" back to "true", everything reverted to the way it was:

"replace-offline-replica {mv peer: store [4] to [2117]} (kind:region,replica, region:84(21,5), createAt:2021-04-18 10:15:35.141556649 +0800 CST m=+1378054.566832439, startAt:2021-04-18 10:15:35.141678938 +0800 CST m=+1378054.566954737, currentStep:0, steps:[add learner peer 9784 on store 2117, promote learner peer 9784 on store 2117 to voter, remove peer on store 4])"

@GangShen How should this be handled?

Try running tiup cluster scale-in --force to scale in the store 5 node and see what happens.
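For reference, the tiup form of that command looks roughly like this; the cluster name and node address are placeholders, and 20160 is only the default TiKV port:

tiup cluster scale-in <cluster-name> --node <store5-ip>:20160 --force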

Is there a curl command for this? I have never used tiup.