What are the correct parameters for the command that forces a store (by StoreId) offline?

To improve efficiency, please provide the following information; a clearly described problem gets resolved faster:

【Summary】Running curl -X POST http://192.168.100.94:2379/pd/api/v1/store/70059/state?state=Tombstone returns "invalid state Tombstone".
With state=Up the call succeeds. According to the documentation the parameter should also accept Tombstone, but in practice it fails.
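For reference, in PD 5.0 and later the state-setting API apparently no longer accepts Tombstone (the replies below confirm this); decommissioning a store over HTTP goes through a DELETE on the same store resource instead. A minimal sketch, reusing the PD address and store id from this post:

curl -X DELETE http://192.168.100.94:2379/pd/api/v1/store/70059   # marks the store Offline and begins region eviction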

【Background】A TiKV store node is in an abnormal state.

[root@server ~]# tiup ctl:v5.0.1 pd -u http://192.168.100.94:2379 store 70059
Starting component ctl: /root/.tiup/components/ctl/v5.0.1/ctl pd -u http://192.168.100.94:2379 store 70059
{
  "store": {
    "id": 70059,
    "address": "192.168.100.187:20160",
    "version": "5.0.1",
    "status_address": "192.168.100.187:20180",
    "git_hash": "e26389a278116b2f61addfa9f15ca25ecf38bc80",
    "start_timestamp": 1623136089,
    "deploy_path": "/home/tidb-deploy/tikv-20160/bin",
    "last_heartbeat": 1623751882105124348,
    "state_name": "Down"
  },
  "status": {
    "capacity": "0B",
    "available": "0B",
    "used_size": "0B",
    "leader_count": 0,
    "leader_weight": 1,
    "leader_score": 0,
    "leader_size": 0,
    "region_count": 949,
    "region_weight": 1,
    "region_score": 0,
    "region_size": 0,
    "start_ts": "2021-06-08T15:08:09+08:00",
    "last_heartbeat_ts": "2021-06-15T18:11:22.105124348+08:00",
    "uptime": "171h3m13.105124348s"
  }
}

After running delete, the store stayed in Offline state; "region_count": 949 held that value the whole time and never dropped to 0.

【Symptom】The abnormal node cannot be taken offline.

【Business impact】

【TiDB version】5.0.1

【Attachments】

  1. TiUP Cluster Display output

  2. TiUP Cluster Edit Config output

  3. TiDB-Overview monitoring

  • Logs of the relevant components (covering one hour before and after the issue)

For performance-tuning or troubleshooting questions, please download and run the diagnostic script, then be sure to select all of the terminal output and copy-paste it when uploading.


Since the forced tombstone and the delete have already been executed, region migration will no longer happen for this store. This situation should not affect usage; the three-replica data should already have been replenished on the other stores.
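One way to double-check that claim (a sketch, using the same tiup ctl wrapper as above): pd-ctl's region check miss-peer subcommand lists regions that are short of a peer, so an empty result suggests all regions are back to full replication.

tiup ctl:v5.0.1 pd -u http://192.168.100.94:2379 region check miss-peer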

Right, usage is unaffected, but this node still shows up in the dashboard. How do I remove it from the dashboard?
Moreover, after the OS on that machine was reinstalled, it has been unable to rejoin the cluster.

When the node tries to rejoin the cluster, it reports the following:

[2021/06/17 16:56:03.566 +08:00] [INFO] [util.rs:536] ["connecting to PD endpoint"] [endpoints=http://192.168.100.94:2379]
[2021/06/17 16:56:03.568 +08:00] [INFO] [util.rs:536] ["connecting to PD endpoint"] [endpoints=http://192.168.100.94:2379]
[2021/06/17 16:56:03.569 +08:00] [INFO] [util.rs:650] ["connected to PD member"] [endpoints=http://192.168.100.94:2379]
[2021/06/17 16:56:03.569 +08:00] [INFO] [util.rs:196] ["heartbeat sender and receiver are stale, refreshing ..."]
[2021/06/17 16:56:03.570 +08:00] [INFO] [util.rs:223] ["update pd client"] [via=] [leader=http://192.168.100.94:2379] [prev_via=] [prev_leader=http://192.168.100.94:2379]
[2021/06/17 16:56:03.570 +08:00] [INFO] [util.rs:354] ["trying to update PD client done"] [spend=3.8344ms]
[2021/06/17 16:56:03.572 +08:00] [ERROR] [util.rs:457] ["request failed"] [err_code=KV:PD:gRPC] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"duplicated store address: id:9172667 address:\"192.168.100.187:20160\" version:\"5.0.1\" status_address:\"192.168.100.187:20180\" git_hash:\"e26389a278116b2f61addfa9f15ca25ecf38bc80\" start_timestamp:1623920159 deploy_path:\"/home/tidb-deploy/tikv-20160/bin\" , already registered by id:70059 address:\"192.168.100.187:20160\" state:Offline version:\"5.0.1\" status_address:\"192.168.100.187:20180\" git_hash:\"e26389a278116b2f61addfa9f15ca25ecf38bc80\" start_timestamp:1623136089 deploy_path:\"/home/tidb-deploy/tikv-20160/bin\" last_heartbeat:1623751882105124348 \") }))"]
[2021/06/17 16:56:03.572 +08:00] [ERROR] [util.rs:466] ["reconnect failed"] [err_code=KV:PD:Unknown] [err="Other(\"[components/pd_client/src/util.rs:301]: cancel reconnection due to too small interval\")"]
[2021/06/17 16:56:04.574 +08:00] [ERROR] [util.rs:457] ["request failed"] [err_code=KV:PD:gRPC] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"duplicated store address: id:9172667 address:\"192.168.100.187:20160\" version:\"5.0.1\" status_address:\"192.168.100.187:20180\" git_hash:\"e26389a278116b2f61addfa9f15ca25ecf38bc80\" start_timestamp:1623920159 deploy_path:\"/home/tidb-deploy/tikv-20160/bin\" , already registered by id:70059 address:\"192.168.100.187:20160\" state:Offline version:\"5.0.1\" status_address:\"192.168.100.187:20180\" git_hash:\"e26389a278116b2f61addfa9f15ca25ecf38bc80\" start_timestamp:1623136089 deploy_path:\"/home/tidb-deploy/tikv-20160/bin\" last_heartbeat:1623751882105124348 \") }))"]
[2021/06/17 16:56:04.574 +08:00] [FATAL] [server.rs:698] ["failed to start node: Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"duplicated store address: id:9172667 address:\"192.168.100.187:20160\" version:\"5.0.1\" status_address:\"192.168.100.187:20180\" git_hash:\"e26389a278116b2f61addfa9f15ca25ecf38bc80\" start_timestamp:1623920159 deploy_path:\"/home/tidb-deploy/tikv-20160/bin\" , already registered by id:70059 address:\"192.168.100.187:20160\" state:Offline version:\"5.0.1\" status_address:\"192.168.100.187:20180\" git_hash:\"e26389a278116b2f61addfa9f15ca25ecf38bc80\" start_timestamp:1623136089 deploy_path:\"/home/tidb-deploy/tikv-20160/bin\" last_heartbeat:1623751882105124348 \") }))"]
^C

Check with tiup cluster display ${cluster_name} first; then try ./pd-ctl stores remove-tombstone -u http://${pd_id}:${pd_port} to clean up the metadata of tombstone stores. After the cleanup it should no longer appear in the dashboard.
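For reference, the same cleanup can be run through the tiup ctl wrapper used elsewhere in this thread (a sketch, assuming the PD endpoint above):

tiup ctl:v5.0.1 pd -u http://192.168.100.94:2379 stores remove-tombstone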

tiup cluster display ${cluster_name} stopped showing any information about this node long ago.

dashboard: (screenshot)

After running
./pd-ctl stores remove-tombstone

it had no effect.

You can take a look at this: how to remove a decommissioned instance's information from information_schema
1. The force-tombstone API should no longer be usable in 5.0 and later versions.
2. The error when starting a new TiKV server occurs because the old port registration was never cleaned up; you can try bringing up TiKV on a different port (see the sketch after this list).
3. Does the dashboard problem still exist if it is not cleaned up?
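A sketch of point 2, starting TiKV on a different port via tiup cluster scale-out. All values below (ports, directories, file name) are illustrative and would need adjusting to the actual cluster:

cat > scale-out-187.yaml <<'EOF'
tikv_servers:
  - host: 192.168.100.187
    port: 20161               # anything other than 20160, which PD still has registered
    status_port: 20181
    deploy_dir: /home/tidb-deploy/tikv-20161
    data_dir: /home/tidb-data/tikv-20161
EOF
tiup cluster scale-out ${cluster_name} scale-out-187.yaml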

That solution doesn't work for me. I already have three TiKV nodes,
but the metadata for the abnormal node is still there and cannot be deleted no matter what I try.

The dashboard simply doesn't display correctly…

Sigh. Can anyone help take a look?

  1. How did this node get into the Down state? Was it a node failure, or had a scale-in command been run?
  2. If you run scale-in --force now, does it report that the node does not exist?
  3. Is this cluster a test environment? Could you try restarting it?

1. The scale-in did not succeed. I first ran it without --force and it stayed stuck in Pending Offline; after several days I couldn't wait any longer and re-ran the scale-in command with --force. (See the sketch below for roughly what was run.)
2. Yes.
3. It is not a test environment. I have already restarted the cluster several times; the problem persists.
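For context, the sequence described in point 1 would have looked roughly like this (a sketch; the node address comes from this thread, the cluster name is a placeholder):

tiup cluster scale-in ${cluster_name} --node 192.168.100.187:20160           # stayed in Pending Offline for days
tiup cluster scale-in ${cluster_name} --node 192.168.100.187:20160 --force   # forced removal, skipping region migration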

The cluster is currently working normally.
As shown in the screenshot above, this node keeps appearing in the dashboard; and after the machine was reinstalled, trying to rejoin it to the cluster reports that the node already exists.

Please help by running the tiup ctl:v5.0.1 pd command to look up the complete current store information. Thanks.

[root@server ~]# tiup ctl:v5.0.1 pd -u http://192.168.100.94:2379 store
Starting component ctl: /root/.tiup/components/ctl/v5.0.1/ctl pd -u http://192.168.100.94:2379 store
{
  "count": 7,
  "stores": [
    {
      "store": {
        "id": 3651246,
        "address": "192.168.100.187:3930",
        "state": 1,
        "labels": [
          {
            "key": "engine",
            "value": "tiflash"
          }
        ],
        "version": "v5.0.1",
        "peer_address": "192.168.100.187:20170",
        "status_address": "192.168.100.187:20292",
        "git_hash": "1821cf655bc90e1fab6e6154cfe994c19c75d377",
        "start_timestamp": 1623142668,
        "deploy_path": "/home/tidb-deploy/tiflash-9000/bin/tiflash",
        "last_heartbeat": 1623751836900362022,
        "state_name": "Offline"
      },
      "status": {
        "capacity": "0B",
        "available": "0B",
        "used_size": "0B",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 0,
        "region_weight": 1,
        "region_score": 0,
        "region_size": 0,
        "start_ts": "2021-06-08T16:57:48+08:00",
        "last_heartbeat_ts": "2021-06-15T18:10:36.900362022+08:00",
        "uptime": "169h12m48.900362022s"
      }
    },
    {
      "store": {
        "id": 3691328,
        "address": "192.168.100.186:3930",
        "labels": [
          {
            "key": "engine",
            "value": "tiflash"
          }
        ],
        "version": "v5.0.1",
        "peer_address": "192.168.100.186:20170",
        "status_address": "192.168.100.186:20292",
        "git_hash": "1821cf655bc90e1fab6e6154cfe994c19c75d377",
        "start_timestamp": 1624323601,
        "deploy_path": "/home/tidb-deploy/tiflash-9000/bin/tiflash",
        "last_heartbeat": 1624334734049585406,
        "state_name": "Up"
      },
      "status": {
        "capacity": "44.16GiB",
        "available": "41.24GiB",
        "used_size": "2.921GiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 131,
        "region_weight": 1,
        "region_score": 7752,
        "region_size": 7752,
        "start_ts": "2021-06-22T09:00:01+08:00",
        "last_heartbeat_ts": "2021-06-22T12:05:34.049585406+08:00",
        "uptime": "3h5m33.049585406s"
      }
    },
    {
      "store": {
        "id": 7332002,
        "address": "192.168.100.126:20160",
        "version": "5.0.1",
        "status_address": "192.168.100.126:20180",
        "git_hash": "e26389a278116b2f61addfa9f15ca25ecf38bc80",
        "start_timestamp": 1624323577,
        "deploy_path": "/home/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1624334730925288853,
        "state_name": "Up"
      },
      "status": {
        "capacity": "440.9GiB",
        "available": "405GiB",
        "used_size": "12.71GiB",
        "leader_count": 3625,
        "leader_weight": 1,
        "leader_score": 3625,
        "leader_size": 23609,
        "region_count": 10883,
        "region_weight": 1,
        "region_score": 78120,
        "region_size": 78120,
        "start_ts": "2021-06-22T08:59:37+08:00",
        "last_heartbeat_ts": "2021-06-22T12:05:30.925288853+08:00",
        "uptime": "3h5m53.925288853s"
      }
    },
    {
      "store": {
        "id": 7982622,
        "address": "192.168.100.127:20160",
        "version": "5.0.1",
        "status_address": "192.168.100.127:20180",
        "git_hash": "e26389a278116b2f61addfa9f15ca25ecf38bc80",
        "start_timestamp": 1624323577,
        "deploy_path": "/home/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1624334730334522569,
        "state_name": "Up"
      },
      "status": {
        "capacity": "440.9GiB",
        "available": "405.1GiB",
        "used_size": "12.63GiB",
        "leader_count": 3625,
        "leader_weight": 1,
        "leader_score": 3625,
        "leader_size": 26096,
        "region_count": 10883,
        "region_weight": 1,
        "region_score": 78120,
        "region_size": 78120,
        "start_ts": "2021-06-22T08:59:37+08:00",
        "last_heartbeat_ts": "2021-06-22T12:05:30.334522569+08:00",
        "uptime": "3h5m53.334522569s"
      }
    },
    {
      "store": {
        "id": 9235861,
        "address": "192.168.100.94:20160",
        "version": "5.0.1",
        "status_address": "192.168.100.94:20180",
        "git_hash": "e26389a278116b2f61addfa9f15ca25ecf38bc80",
        "start_timestamp": 1624323576,
        "deploy_path": "/home/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1624334728907333613,
        "state_name": "Up"
      },
      "status": {
        "capacity": "494.7GiB",
        "available": "385.1GiB",
        "used_size": "12.64GiB",
        "leader_count": 3633,
        "leader_weight": 1,
        "leader_score": 3633,
        "leader_size": 28415,
        "region_count": 10883,
        "region_weight": 1,
        "region_score": 78120,
        "region_size": 78120,
        "start_ts": "2021-06-22T08:59:36+08:00",
        "last_heartbeat_ts": "2021-06-22T12:05:28.907333613+08:00",
        "uptime": "3h5m52.907333613s"
      }
    },
    {
      "store": {
        "id": 70002,
        "address": "192.168.100.186:20160",
        "state": 1,
        "version": "5.0.1",
        "status_address": "192.168.100.186:20180",
        "git_hash": "e26389a278116b2f61addfa9f15ca25ecf38bc80",
        "start_timestamp": 1624323576,
        "deploy_path": "/home/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1624334728528215947,
        "state_name": "Offline"
      },
      "status": {
        "capacity": "44.16GiB",
        "available": "19.91GiB",
        "used_size": "8.49GiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 0,
        "region_weight": 1,
        "region_score": 0,
        "region_size": 0,
        "start_ts": "2021-06-22T08:59:36+08:00",
        "last_heartbeat_ts": "2021-06-22T12:05:28.528215947+08:00",
        "uptime": "3h5m52.528215947s"
      }
    },
    {
      "store": {
        "id": 70059,
        "address": "192.168.100.187:20160",
        "version": "5.0.1",
        "status_address": "192.168.100.187:20180",
        "git_hash": "e26389a278116b2f61addfa9f15ca25ecf38bc80",
        "start_timestamp": 1623136089,
        "deploy_path": "/home/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1623751882105124348,
        "state_name": "Down"
      },
      "status": {
        "capacity": "0B",
        "available": "0B",
        "used_size": "0B",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 0,
        "region_weight": 1,
        "region_score": 0,
        "region_size": 0,
        "start_ts": "2021-06-08T15:08:09+08:00",
        "last_heartbeat_ts": "2021-06-15T18:11:22.105124348+08:00",
        "uptime": "171h3m13.105124348s"
      }
    }
  ]
}

The problematic store id is 70059.
I actually posted this at the very top already.
Back then it was: after running delete it stayed in Offline state, and "region_count": 949 never changed, never dropping to 0.

Now region_count is 0, but the store is still there and cannot be deleted.

1. On the target server, just stop the process manually (systemctl stop tikv-xxx.service), then watch it with the pd-ctl store 70059 command.
2. Once pd-ctl store 70059 shows the node has turned into Tombstone state, run the tombstone-cleanup command from above (./pd-ctl stores remove-tombstone). A combined sketch of both steps follows.
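Putting the two steps together (a sketch; the systemd unit name tikv-20160.service is an assumption based on the deploy path in this thread):

systemctl stop tikv-20160.service                                          # step 1: stop TiKV on the target host (unit name assumed)
tiup ctl:v5.0.1 pd -u http://192.168.100.94:2379 store 70059               # step 2: poll until state_name becomes "Tombstone"
tiup ctl:v5.0.1 pd -u http://192.168.100.94:2379 stores remove-tombstone   # then purge the tombstone record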

(screenshot)
1. The process is no longer on the target server; the corresponding TiKV component was removed last week.
2. That's exactly the problem: it has been several days, and the store stays Offline and never turns into Tombstone.

Could you manually run store delete 70059 with pd-ctl? Also, does PD print any logs like "bury store failed" or "store may not turn into Tombstone"?
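A sketch of that check, using the tiup ctl wrapper from earlier; the PD log path is an assumption and depends on the actual deploy directory:

tiup ctl:v5.0.1 pd -u http://192.168.100.94:2379 store delete 70059
grep -E 'bury store failed|store may not turn into Tombstone' /home/tidb-deploy/pd-2379/log/pd.log   # path assumed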

@johnwa-CD Please report back with the information above and how the delete went. Thanks.