How can I delete a region on TiKV? Losing part of the data is acceptable.
Why do you want to delete a region? Did you run into some error?
If you want to force the region to serve from a single replica, you can refer to the official documentation:
https://docs.pingcap.com/zh/tidb/stable/tikv-control#强制-region-从多副本失败状态恢复服务
The region is broken; all of its replicas are damaged.
It is a region with no leader.
» region 306940
{
"id": 306940,
"start_key": "7480000000000001FF985F728000000011FF0EB4510000000000FA",
"end_key": "7480000000000001FFBC00000000000000F8",
"epoch": {
"conf_ver": 22493,
"version": 884
},
"peers": [
{
"id": 495193,
"store_id": 11
},
{
"id": 495216,
"store_id": 6
},
{
"id": 495376,
"store_id": 9
}
],
"written_bytes": 60187,
"read_bytes": 0,
"written_keys": 666,
"read_keys": 0,
"approximate_size": 0,
"approximate_keys": 0
}
Please confirm a few things first:
- What is the cluster version?
- Run region --jq='.regions[] | select(has("leader")|not) | {id: .id, peer_stores: [.peers[].store_id]}' in pd-ctl to see which regions currently have no leader (a full command sketch follows this list)
- grep "306940" in the TiKV logs to see why this region has not elected a leader, e.g. whether leader elections keep failing
- Did you perform any operations before this problem appeared?
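For reference, full invocations of those two checks might look like this; the PD endpoint and TiKV log path are placeholders, adjust them to your deployment:

# list regions that currently have no leader
pd-ctl -u http://<pd-address>:2379 region --jq='.regions[] | select(has("leader")|not) | {id: .id, peer_stores: [.peers[].store_id]}'
# search each store's TiKV log for election activity of this region
grep "306940" /path/to/tikv-deploy/log/tikv.log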
1. The version is 4.1.0-alpha.
2. Only region 306940 has no leader.
3. It probably cannot get enough votes, because I have already taken the other two instances offline.
4. I did a lot of operations, because the machine lost power. Using the command below did not work either:
[ 18:16:17-root@sea3:bin ]#./tikv-ctl --db /data/disk4/tikv/store/db unsafe-recover remove-fail-stores -s 9 -r 306940
removing stores [9] from configurations...
Debugger::remove_fail_stores: "Store 9 in the failed list"
Right now every attempt to remove this region fails with an error. The other two instances also hold replicas of it; I have stopped them, but they cannot be removed from the cluster precisely because this region still exists. All other regions have been migrated away, except this one. Is there any way to force-remove this region?
What does the store command in pd-ctl show right now? How many healthy nodes are left?
unsafe-recover must be executed on all remaining healthy TiKV instances, the TiKV instance must be stopped while the command runs, and -s specifies the store IDs of the failed nodes.
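The per-node procedure is roughly the following sketch (the data directory and the stop/start method are placeholders that depend on your deployment; the actual store and region IDs go in once we know which nodes are failed):

# 1. Stop the TiKV process on this healthy node (using whatever tooling manages your deployment)
# 2. Remove the failed stores from the Raft configuration of the affected regions
./tikv-ctl --db <tikv-data-dir>/db unsafe-recover remove-fail-stores -s <failed-store-ids> -r <region-ids>
# 3. Start the TiKV process again, then repeat on the next healthy instance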
» store
{
"count": 6,
"stores": [
{
"store": {
"id": 1,
"address": "10.59.108.120:20160",
"version": "4.1.0-alpha",
"status_address": "10.59.108.120:20180",
"git_hash": "36dab75da84ec57374d364a4a4af9146ec31df07",
"start_timestamp": 1600183753,
"deploy_path": "/usr/local/tidb/bin/tikv-server",
"last_heartbeat": 1603103398099173265,
"state_name": "Up"
},
"status": {
"capacity": "2.727TiB",
"available": "2.566TiB",
"used_size": "88.47GiB",
"leader_count": 1707,
"leader_weight": 1,
"leader_score": 1707,
"leader_size": 113079,
"region_count": 5249,
"region_weight": 1,
"region_score": 350966,
"region_size": 350966,
"start_ts": "2020-09-15T23:29:13+08:00",
"last_heartbeat_ts": "2020-10-19T18:29:58.099173265+08:00",
"uptime": "811h0m45.099173265s"
}
},
{
"store": {
"id": 4,
"address": "10.59.108.120:20161",
"version": "4.1.0-alpha",
"status_address": "10.59.108.120:20181",
"git_hash": "36dab75da84ec57374d364a4a4af9146ec31df07",
"start_timestamp": 1600184473,
"deploy_path": "/usr/local/tidb/bin/tikv-server",
"last_heartbeat": 1603103397934070474,
"state_name": "Up"
},
"status": {
"capacity": "2.727TiB",
"available": "2.57TiB",
"used_size": "86.22GiB",
"leader_count": 1712,
"leader_weight": 1,
"leader_score": 1712,
"leader_size": 123631,
"region_count": 5003,
"region_weight": 1,
"region_score": 351395,
"region_size": 351395,
"start_ts": "2020-09-15T23:41:13+08:00",
"last_heartbeat_ts": "2020-10-19T18:29:57.934070474+08:00",
"uptime": "810h48m44.934070474s"
}
},
{
"store": {
"id": 6,
"address": "10.59.108.121:20160",
"state": 1,
"version": "4.1.0-alpha",
"status_address": "10.59.108.121:20180",
"git_hash": "36dab75da84ec57374d364a4a4af9146ec31df07",
"start_timestamp": 1603092338,
"deploy_path": "/usr/local/tidb/bin/tikv-server",
"last_heartbeat": 1603092322405633810,
"state_name": "Offline"
},
"status": {
"capacity": "0B",
"available": "0B",
"used_size": "0B",
"leader_count": 0,
"leader_weight": 1,
"leader_score": 0,
"leader_size": 0,
"region_count": 0,
"region_weight": 1,
"region_score": 0,
"region_size": 0,
"start_ts": "2020-10-19T15:25:38+08:00",
"last_heartbeat_ts": "2020-10-19T15:25:22.40563381+08:00"
}
},
{
"store": {
"id": 9,
"address": "10.59.108.121:20161",
"version": "4.1.0-alpha",
"status_address": "10.59.108.121:20181",
"git_hash": "36dab75da84ec57374d364a4a4af9146ec31df07",
"start_timestamp": 1603077067,
"deploy_path": "/usr/local/tidb/bin/tikv-server",
"last_heartbeat": 1603077009186794384,
"state_name": "Down"
},
"status": {
"capacity": "0B",
"available": "0B",
"used_size": "0B",
"leader_count": 0,
"leader_weight": 1,
"leader_score": 0,
"leader_size": 0,
"region_count": 0,
"region_weight": 1,
"region_score": 0,
"region_size": 0,
"start_ts": "2020-10-19T11:11:07+08:00",
"last_heartbeat_ts": "2020-10-19T11:10:09.186794384+08:00"
}
},
{
"store": {
"id": 10,
"address": "10.59.108.122:20160",
"labels": [
{
"key": "host",
"value": "10.59.108.122"
}
],
"version": "4.1.0-alpha",
"status_address": "10.59.108.122:20180",
"git_hash": "36dab75da84ec57374d364a4a4af9146ec31df07",
"start_timestamp": 1600180326,
"deploy_path": "/usr/local/tidb/bin/tikv-server",
"last_heartbeat": 1603103405065431782,
"state_name": "Up"
},
"status": {
"capacity": "2.727TiB",
"available": "2.576TiB",
"used_size": "80.61GiB",
"leader_count": 1710,
"leader_weight": 1,
"leader_score": 1710,
"leader_size": 111794,
"region_count": 5170,
"region_weight": 1,
"region_score": 351163,
"region_size": 351163,
"start_ts": "2020-09-15T22:32:06+08:00",
"last_heartbeat_ts": "2020-10-19T18:30:05.065431782+08:00",
"uptime": "811h57m59.065431782s"
}
},
{
"store": {
"id": 11,
"address": "10.59.108.122:20161",
"version": "4.1.0-alpha",
"status_address": "10.59.108.122:20181",
"git_hash": "36dab75da84ec57374d364a4a4af9146ec31df07",
"start_timestamp": 1600224819,
"deploy_path": "/usr/local/tidb/bin/tikv-server",
"last_heartbeat": 1603103407170117792,
"state_name": "Up"
},
"status": {
"capacity": "2.727TiB",
"available": "2.574TiB",
"used_size": "83.71GiB",
"leader_count": 1712,
"leader_weight": 1,
"leader_score": 1712,
"leader_size": 119822,
"region_count": 5102,
"region_weight": 1,
"region_score": 351454,
"region_size": 351454,
"start_ts": "2020-09-16T10:53:39+08:00",
"last_heartbeat_ts": "2020-10-19T18:30:07.170117792+08:00",
"uptime": "799h36m28.170117792s"
}
}
]
}
Region 306940's replicas are on store 11, store 9, and store 6. In this case, should I run this command on store 11? And when running it, do I need to stop store 11, store 9, or both?
./tikv-ctl --db /data/disk4/tikv/store/db unsafe-recover remove-fail-stores -s 9 -r 306940
From the output, stores 9 and 6 are the nodes that are down/offline, so you need to run the unsafe-recover operation on the remaining nodes whose state is Up, and specify both 9 and 6 with -s. The TiKV instance you operate on must be stopped while the command runs. You can either stop all the healthy nodes and run the command on them together, or stop one healthy node, run the command, bring that node back up, and then move on to the next one. Both approaches work; it's up to you.
Also, make sure the TiKV processes on store 9 and store 6 will never be started again afterwards, otherwise there will be problems.
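Concretely, on each Up node (while its TiKV process is stopped) the command would look roughly like this; note that the --db path here is simply copied from your earlier attempt and may differ per host, and -r accepts a comma-separated list if more regions turn out to be affected:

./tikv-ctl --db /data/disk4/tikv/store/db unsafe-recover remove-fail-stores -s 6,9 -r 306940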
I don't quite follow. Why does this have to be done on all the healthy nodes? Region 306940 only exists on store 11, store 9, and store 6. Do I still need to operate on the other healthy nodes?
Doing it on all healthy nodes handles every region that has peers on store 9 or store 6.
If you can confirm that region 306940 is the only region with peers on store 9 or store 6, then you only need to handle this one region, and you only need to run the command on store 11.
OK, thanks a lot.
So that means I only need to stop the store 11 instance, run the unsafe-recover command on store 11, and then start the store 11 instance again.
Yes, that's right.
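After store 11 is started again, you could verify the recovery in pd-ctl (the PD endpoint below is a placeholder); the region should report a leader again, and PD will then replenish the missing replicas on the healthy stores:

pd-ctl -u http://<pd-address>:2379 region 306940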