How do I forcibly delete an unavailable region?

How can I delete a region on TiKV? Losing part of its data is acceptable.

Why do you want to delete a region? Did you run into some error?
If the goal is to force the region to serve from a single surviving replica, see the official documentation:

https://docs.pingcap.com/zh/tidb/stable/tikv-control#强制-region-从多副本失败状态恢复服务

The region is broken; all of its replicas are broken.

It is a leaderless region.

» region 306940
{
  "id": 306940,
  "start_key": "7480000000000001FF985F728000000011FF0EB4510000000000FA",
  "end_key": "7480000000000001FFBC00000000000000F8",
  "epoch": {
    "conf_ver": 22493,
    "version": 884
  },
  "peers": [
    {
      "id": 495193,
      "store_id": 11
    },
    {
      "id": 495216,
      "store_id": 6
    },
    {
      "id": 495376,
      "store_id": 9
    }
  ],
  "written_bytes": 60187,
  "read_bytes": 0,
  "written_keys": 666,
  "read_keys": 0,
  "approximate_size": 0,
  "approximate_keys": 0
}

Please confirm a few things:

  1. What version is the cluster running?
  2. In pd-ctl, run region --jq='.regions[] | select(has("leader")|not) | {id: .id, peer_stores: [.peers[].store_id]}' to list the regions that currently have no leader (see the sketch after this list).
  3. grep "306940" in the TiKV logs to see why this region has not elected a leader; did the election fail?
  4. Were any operations performed before this problem appeared?
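
For reference, steps 2 and 3 might look like the following as concrete commands (the PD address and log path are placeholders to adapt to your deployment):

# step 2: list regions that have no leader and the stores holding their peers
./pd-ctl -u http://<pd-address>:2379 region --jq='.regions[] | select(has("leader")|not) | {id: .id, peer_stores: [.peers[].store_id]}'

# step 3: search this instance's TiKV log for the region's election activity
grep "306940" /path/to/deploy/log/tikv.log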

1. The version is 4.1.0-alpha.

2. Only 306940 has no leader.

3. It probably cannot gather enough votes, because I took the other two instances offline.

4. Quite a few operations, because the machine lost power. The following command does not work either:

[ 18:16:17-root@sea3:bin ]#./tikv-ctl --db /data/disk4/tikv/store/db unsafe-recover remove-fail-stores -s 9 -r 306940
removing stores [9] from configurations...
Debugger::remove_fail_stores: "Store 9 in the failed list"

Right now every attempt to delete this region fails with an error. The other two instances also hold replicas of it; I stopped them, but they can never be erased from the cluster, precisely because this region still exists. Every other region has been migrated away, except this one. Is there any way to forcibly purge the region?

What does the store command in pd-ctl show at the moment? How many healthy nodes are left?
unsafe-recover needs to be run on all remaining healthy TiKV instances, and the TiKV instance must be stopped while you run it; -s specifies the store IDs of the failed nodes. (The error above, "Store 9 in the failed list", suggests the command was pointed at the data directory of store 9 itself, i.e. at one of the stores being removed.)
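
On each healthy instance, one round of the operation looks roughly like this (a sketch only: how you stop and start tikv-server depends on your deployment, --db must point at that instance's own data directory, and the IDs below are placeholders):

# 1. stop the tikv-server process on this node
# 2. run tikv-ctl against this instance's data directory
./tikv-ctl --db <data-dir-of-this-instance>/db unsafe-recover remove-fail-stores -s <failed-store-ids> -r <region-id>
# 3. start the tikv-server process on this node again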

» store
{
  "count": 6,
  "stores": [
    {
      "store": {
        "id": 1,
        "address": "10.59.108.120:20160",
        "version": "4.1.0-alpha",
        "status_address": "10.59.108.120:20180",
        "git_hash": "36dab75da84ec57374d364a4a4af9146ec31df07",
        "start_timestamp": 1600183753,
        "deploy_path": "/usr/local/tidb/bin/tikv-server",
        "last_heartbeat": 1603103398099173265,
        "state_name": "Up"
      },
      "status": {
        "capacity": "2.727TiB",
        "available": "2.566TiB",
        "used_size": "88.47GiB",
        "leader_count": 1707,
        "leader_weight": 1,
        "leader_score": 1707,
        "leader_size": 113079,
        "region_count": 5249,
        "region_weight": 1,
        "region_score": 350966,
        "region_size": 350966,
        "start_ts": "2020-09-15T23:29:13+08:00",
        "last_heartbeat_ts": "2020-10-19T18:29:58.099173265+08:00",
        "uptime": "811h0m45.099173265s"
      }
    },
    {
      "store": {
        "id": 4,
        "address": "10.59.108.120:20161",
        "version": "4.1.0-alpha",
        "status_address": "10.59.108.120:20181",
        "git_hash": "36dab75da84ec57374d364a4a4af9146ec31df07",
        "start_timestamp": 1600184473,
        "deploy_path": "/usr/local/tidb/bin/tikv-server",
        "last_heartbeat": 1603103397934070474,
        "state_name": "Up"
      },
      "status": {
        "capacity": "2.727TiB",
        "available": "2.57TiB",
        "used_size": "86.22GiB",
        "leader_count": 1712,
        "leader_weight": 1,
        "leader_score": 1712,
        "leader_size": 123631,
        "region_count": 5003,
        "region_weight": 1,
        "region_score": 351395,
        "region_size": 351395,
        "start_ts": "2020-09-15T23:41:13+08:00",
        "last_heartbeat_ts": "2020-10-19T18:29:57.934070474+08:00",
        "uptime": "810h48m44.934070474s"
      }
    },
    {
      "store": {
        "id": 6,
        "address": "10.59.108.121:20160",
        "state": 1,
        "version": "4.1.0-alpha",
        "status_address": "10.59.108.121:20180",
        "git_hash": "36dab75da84ec57374d364a4a4af9146ec31df07",
        "start_timestamp": 1603092338,
        "deploy_path": "/usr/local/tidb/bin/tikv-server",
        "last_heartbeat": 1603092322405633810,
        "state_name": "Offline"
      },
      "status": {
        "capacity": "0B",
        "available": "0B",
        "used_size": "0B",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 0,
        "region_weight": 1,
        "region_score": 0,
        "region_size": 0,
        "start_ts": "2020-10-19T15:25:38+08:00",
        "last_heartbeat_ts": "2020-10-19T15:25:22.40563381+08:00"
      }
    },
    {
      "store": {
        "id": 9,
        "address": "10.59.108.121:20161",
        "version": "4.1.0-alpha",
        "status_address": "10.59.108.121:20181",
        "git_hash": "36dab75da84ec57374d364a4a4af9146ec31df07",
        "start_timestamp": 1603077067,
        "deploy_path": "/usr/local/tidb/bin/tikv-server",
        "last_heartbeat": 1603077009186794384,
        "state_name": "Down"
      },
      "status": {
        "capacity": "0B",
        "available": "0B",
        "used_size": "0B",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 0,
        "region_weight": 1,
        "region_score": 0,
        "region_size": 0,
        "start_ts": "2020-10-19T11:11:07+08:00",
        "last_heartbeat_ts": "2020-10-19T11:10:09.186794384+08:00"
      }
    },
    {
      "store": {
        "id": 10,
        "address": "10.59.108.122:20160",
        "labels": [
          {
            "key": "host",
            "value": "10.59.108.122"
          }
        ],
        "version": "4.1.0-alpha",
        "status_address": "10.59.108.122:20180",
        "git_hash": "36dab75da84ec57374d364a4a4af9146ec31df07",
        "start_timestamp": 1600180326,
        "deploy_path": "/usr/local/tidb/bin/tikv-server",
        "last_heartbeat": 1603103405065431782,
        "state_name": "Up"
      },
      "status": {
        "capacity": "2.727TiB",
        "available": "2.576TiB",
        "used_size": "80.61GiB",
        "leader_count": 1710,
        "leader_weight": 1,
        "leader_score": 1710,
        "leader_size": 111794,
        "region_count": 5170,
        "region_weight": 1,
        "region_score": 351163,
        "region_size": 351163,
        "start_ts": "2020-09-15T22:32:06+08:00",
        "last_heartbeat_ts": "2020-10-19T18:30:05.065431782+08:00",
        "uptime": "811h57m59.065431782s"
      }
    },
    {
      "store": {
        "id": 11,
        "address": "10.59.108.122:20161",
        "version": "4.1.0-alpha",
        "status_address": "10.59.108.122:20181",
        "git_hash": "36dab75da84ec57374d364a4a4af9146ec31df07",
        "start_timestamp": 1600224819,
        "deploy_path": "/usr/local/tidb/bin/tikv-server",
        "last_heartbeat": 1603103407170117792,
        "state_name": "Up"
      },
      "status": {
        "capacity": "2.727TiB",
        "available": "2.574TiB",
        "used_size": "83.71GiB",
        "leader_count": 1712,
        "leader_weight": 1,
        "leader_score": 1712,
        "leader_size": 119822,
        "region_count": 5102,
        "region_weight": 1,
        "region_score": 351454,
        "region_size": 351454,
        "start_ts": "2020-09-16T10:53:39+08:00",
        "last_heartbeat_ts": "2020-10-19T18:30:07.170117792+08:00",
        "uptime": "799h36m28.170117792s"
      }
    }
  ]
}

Region 306940 has its replicas on store 11, store 9, and store 6. In this case, should I run the command on store 11? And to run it, do I need to stop store 11, store 9, or both?

./tikv-ctl --db /data/disk4/tikv/store/db unsafe-recover remove-fail-stores -s 9 -r 306940

From the output, stores 9 and 6 are the nodes that are out of service, so you need to run the unsafe-recover operation on all remaining stores in the Up state, and -s in the command must specify both 9 and 6. The TiKV instance being operated on has to be stopped while the command runs. As for whether you stop all the healthy nodes and run the command on them together, or stop one healthy node, run the command, bring that node back up, and then move on to the next, both approaches work; it is your choice.

You also need to make sure the tikv processes on store 9 and store 6 are never started again afterwards, otherwise there will be problems.
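
For this cluster, the rolling variant sketches out as follows (stores 1, 4, 10 and 11 are the Up stores in the output above; stop and start each instance however your deployment manages it, and point --db at each instance's own data directory):

# repeat on each Up store (1, 4, 10, 11), one node at a time:
# 1. stop this tikv-server instance
./tikv-ctl --db <data-dir-of-this-instance>/db unsafe-recover remove-fail-stores -s 6,9 -r 306940
# 2. start this tikv-server instance again, then move on to the next node

If regions other than 306940 also had peers on the failed stores, they would all be listed in -r (newer tikv-ctl builds also offer an --all-regions flag for this, though I am not sure it is available in 4.1.0-alpha).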


I don't quite understand why the operation has to be done on all the healthy nodes. Region 306940 only exists on store 11, store 9, and store 6; do I still need to run it on the other healthy nodes?

» region 306940
{
  "id": 306940,
  "start_key": "7480000000000001FF985F728000000011FF0EB4510000000000FA",
  "end_key": "7480000000000001FFBC00000000000000F8",
  "epoch": {
    "conf_ver": 22493,
    "version": 884
  },
  "peers": [
    {
      "id": 495193,
      "store_id": 11
    },
    {
      "id": 495216,
      "store_id": 6
    },
    {
      "id": 495376,
      "store_id": 9
    }
  ],
  "written_bytes": 60187,
  "read_bytes": 0,
  "written_keys": 666,
  "read_keys": 0,
  "approximate_size": 0,
  "approximate_keys": 0
}

Running it on every healthy node handles all regions that have a peer on store 9 or store 6.
If you can confirm that 306940 is the only region with a peer on store 9 or store 6, then handling this one region is enough, and you only need to run the command on store 11 (see the sketch below).
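
Concretely, for this one region the whole operation reduces to something like this (a sketch; stop and start store 11 however your deployment manages it, and point --db at store 11's actual data directory):

# with the store 11 tikv-server instance stopped:
./tikv-ctl --db <store-11-data-dir>/db unsafe-recover remove-fail-stores -s 6,9 -r 306940
# then start the store 11 instance again

Note that -s lists both failed stores: removing only store 9 would leave the peer on store 6 in the region's membership, and with store 6 gone the region still could not elect a leader.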

OK, thanks a lot.

So that means I just need to stop the store 11 instance, run the unsafe-recover command on store 11, and then start the store 11 instance again.

Yes, that's right.

This topic was automatically closed 1 minute after the last reply. New replies are no longer allowed.