TiKV Tombstone Stores Cannot Be Removed Properly

【TiDB Environment】Test
【TiDB Version】v7.1.0
【Problem Encountered: Symptoms and Impact】

I have a TiKV cluster that has ended up in the very strange state described below.
There are 5 Regions in total, 3 stores in Up state (1, 4, and 1295), and several stores in Tombstone state. However, Region 2 appears to be missing a replica for some reason.

Running pd-ctl region on PD returns the payload below. Region 2 has two peers on store IDs 13 and 468, and pd-ctl store shows that both of these stores are in Tombstone state.

I would like Region 2 to have 3 replicas distributed across stores 1, 4, and 1295, just like the other Regions, but I cannot find the right procedure.

I tried pd-ctl store remove-tombstone, but got the error: Failed to remove tombstone store [500] "failed stores: 13, 468".

How can I restore Region 2 to the correct 3-replica state?

{
  "count": 5,
  "regions": [
    {
      "id": 11,
      "start_key": "7800000000000000FB",
      "end_key": "7800000100000000FB",
      "epoch": {
        "conf_ver": 65,
        "version": 5
      },
      "peers": [
        {
          "id": 12,
          "store_id": 1,
          "role_name": "Voter"
        },
        {
          "id": 1383,
          "store_id": 1295,
          "role_name": "Voter"
        },
        {
          "id": 1390,
          "store_id": 4,
          "role_name": "Voter"
        }
      ],
      "leader": {
        "id": 12,
        "store_id": 1,
        "role_name": "Voter"
      },
      "cpu_usage": 0,
      "written_bytes": 117,
      "read_bytes": 0,
      "written_keys": 2,
      "read_keys": 0,
      "approximate_size": 1,
      "approximate_keys": 0
    },
    {
      "id": 7,
      "start_key": "7200000000000000FB",
      "end_key": "7200000100000000FB",
      "epoch": {
        "conf_ver": 41,
        "version": 5
      },
      "peers": [
        {
          "id": 16,
          "store_id": 4,
          "role_name": "Voter"
        },
        {
          "id": 1299,
          "store_id": 1295,
          "role_name": "Voter"
        },
        {
          "id": 1388,
          "store_id": 1,
          "role_name": "Voter"
        }
      ],
      "leader": {
        "id": 16,
        "store_id": 4,
        "role_name": "Voter"
      },
      "cpu_usage": 0,
      "written_bytes": 117,
      "read_bytes": 0,
      "written_keys": 2,
      "read_keys": 0,
      "approximate_size": 1,
      "approximate_keys": 0
    },
    {
      "id": 5,
      "start_key": "",
      "end_key": "7200000000000000FB",
      "epoch": {
        "conf_ver": 65,
        "version": 5
      },
      "peers": [
        {
          "id": 6,
          "store_id": 1,
          "role_name": "Voter"
        },
        {
          "id": 514,
          "store_id": 4,
          "role_name": "Voter"
        },
        {
          "id": 1298,
          "store_id": 1295,
          "role_name": "Voter"
        }
      ],
      "leader": {
        "id": 514,
        "store_id": 4,
        "role_name": "Voter"
      },
      "cpu_usage": 0,
      "written_bytes": 106,
      "read_bytes": 0,
      "written_keys": 2,
      "read_keys": 0,
      "approximate_size": 1,
      "approximate_keys": 0
    },
    {
      "id": 9,
      "start_key": "7200000100000000FB",
      "end_key": "7800000000000000FB",
      "epoch": {
        "conf_ver": 83,
        "version": 5
      },
      "peers": [
        {
          "id": 43,
          "store_id": 4,
          "role_name": "Voter"
        },
        {
          "id": 297,
          "store_id": 1,
          "role_name": "Voter"
        },
        {
          "id": 1389,
          "store_id": 1295,
          "role_name": "Voter"
        }
      ],
      "leader": {
        "id": 297,
        "store_id": 1,
        "role_name": "Voter"
      },
      "cpu_usage": 0,
      "written_bytes": 117,
      "read_bytes": 0,
      "written_keys": 2,
      "read_keys": 0,
      "approximate_size": 1,
      "approximate_keys": 0
    },
    {
      "id": 2,
      "start_key": "7800000100000000FB",
      "end_key": "",
      "epoch": {
        "conf_ver": 46,
        "version": 5
      },
      "peers": [
        {
          "id": 14,
          "store_id": 4,
          "role_name": "Voter"
        },
        {
          "id": 44,
          "store_id": 13,
          "role_name": "Voter"
        },
        {
          "id": 469,
          "store_id": 468,
          "role_name": "Voter"
        },
        {
          "id": 479,
          "store_id": 1,
          "role": 1,
          "role_name": "Learner",
          "is_learner": true
        }
      ],
      "leader": {
        "id": 14,
        "store_id": 4,
        "role_name": "Voter"
      },
      "pending_peers": [
        {
          "id": 469,
          "store_id": 468,
          "role_name": "Voter"
        }
      ],
      "cpu_usage": 0,
      "written_bytes": 39,
      "read_bytes": 0,
      "written_keys": 1,
      "read_keys": 0,
      "approximate_size": 1,
      "approximate_keys": 0
    }
  ]
}
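
For reference, here is roughly how the store states were checked; this is a minimal sketch, the PD address is a placeholder for the real endpoint, and pd-ctl can also be invoked via tiup ctl:v7.1.0 pd:

pd-ctl -u http://<pd-address>:2379 store 13     # reports the Tombstone state mentioned above
pd-ctl -u http://<pd-address>:2379 store 468    # reports the Tombstone state mentioned above
pd-ctl -u http://<pd-address>:2379 region 2     # shows the peers still pointing at stores 13 and 468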

Try executing it like this:

operator add transfer-peer 2 13 1
operator add transfer-peer 2 468 1295
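
The syntax here is operator add transfer-peer <region_id> <from_store_id> <to_store_id>, i.e. move Region 2's peer from store 13 to store 1 and from store 468 to store 1295. As a sketch, you can then watch the operators and re-check the Region (the PD address is a placeholder):

pd-ctl -u http://<pd-address>:2379 operator show    # list the operators currently queued or running
pd-ctl -u http://<pd-address>:2379 region 2         # confirm the new peer distribution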

1. First, try PD's unsafe operation to remove the failed stores:
pd-ctl unsafe remove-failed-stores 13,468
This command force-removes the failed stores and tries to rebuild the replicas on the remaining healthy stores.
2. If the command succeeds, wait a while for PD to schedule automatically, then check the status of Region 2:
pd-ctl region 2
3. If Region 2 still does not have the correct replica distribution, try adding peers manually:
pd-ctl operator add add-peer 2 1
pd-ctl operator add add-peer 2 4
pd-ctl operator add add-peer 2 1295
4. If adding the peers succeeds but the old Tombstone peers are still present, try removing them:
pd-ctl operator add remove-peer 2 13
pd-ctl operator add remove-peer 2 468
Proceed with caution: ideally run this during off-peak hours and back up important data first. A consolidated sketch of the sequence is shown below.
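
A minimal consolidated sketch of the sequence above, assuming pd-ctl points at your PD endpoint (the address is a placeholder):

# 1. Force-remove the failed stores (Online Unsafe Recovery) and watch its progress
pd-ctl -u http://<pd-address>:2379 unsafe remove-failed-stores 13,468
pd-ctl -u http://<pd-address>:2379 unsafe remove-failed-stores show

# 2. After giving PD time to schedule, inspect Region 2 again
pd-ctl -u http://<pd-address>:2379 region 2

# 3. If a replica is still missing on one of the healthy stores, add it manually, for example:
pd-ctl -u http://<pd-address>:2379 operator add add-peer 2 1295

# 4. Remove any peer still left on the tombstone stores
pd-ctl -u http://<pd-address>:2379 operator add remove-peer 2 13
pd-ctl -u http://<pd-address>:2379 operator add remove-peer 2 468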

Run tiup cluster display and check whether the scaled-in TiKV nodes have all become Tombstone, and whether any TiKV is still in Pending Offline state.
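
For example (the cluster name is a placeholder):

tiup cluster display <cluster-name>    # check the Status column of every TiKV node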

Thanks. Following the method you described, I successfully restored the cluster state.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.