TiKV cannot find a store

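【TiDB environment】Production
【TiDB version】
【Steps to reproduce】
【Problem encountered: symptoms and impact】
【Resource configuration】
【Attachments: screenshots / logs / monitoring】
A power outage in the machine room left TiKV reporting these errors: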
[2023/11/21 13:59:44.969 +08:00] [WARN] [endpoint.rs:780] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 9942146, leader may None" not_leader { region_id: 9942146 }”]
[2023/11/21 13:59:45.447 +08:00] [ERROR] [raft_client.rs:796] [“resolve store address failed”] [err_code=KV:Unknown] [err=“Other("[src/server/resolve.rs:100]: unknown error \"[components/pd_client/src/util.rs:878]: invalid store ID 49005, not found\"")”] [store_id=49005]
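To see which IDs the errors are about, the two log lines above can be scanned mechanically. This is only a rough sketch: the regular expressions are inferred from these two quoted lines and are not a general TiKV log parser, and the `summarize` helper is a made-up name.

```python
import re

# Patterns assumed from the two log lines quoted in this post only.
STORE_ID = re.compile(r"\[store_id=(\d+)\]")
REGION_ID = re.compile(r"region_id: (\d+)")

def summarize(lines):
    """Return (store IDs that failed to resolve, region IDs reporting not_leader)."""
    unresolved_stores, not_leader_regions = set(), set()
    for line in lines:
        if "resolve store address failed" in line:
            m = STORE_ID.search(line)
            if m:
                unresolved_stores.add(int(m.group(1)))
        if "not_leader" in line:
            m = REGION_ID.search(line)
            if m:
                not_leader_regions.add(int(m.group(1)))
    return unresolved_stores, not_leader_regions

# The two lines exactly as pasted in the post.
SAMPLE = [
    r'''[2023/11/21 13:59:44.969 +08:00] [WARN] [endpoint.rs:780] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 9942146, leader may None" not_leader { region_id: 9942146 }”]''',
    r'''[2023/11/21 13:59:45.447 +08:00] [ERROR] [raft_client.rs:796] [“resolve store address failed”] [err_code=KV:Unknown] [err=“Other("[src/server/resolve.rs:100]: unknown error \"[components/pd_client/src/util.rs:878]: invalid store ID 49005, not found\"")”] [store_id=49005]''',
]

if __name__ == "__main__":
    stores, regions = summarize(SAMPLE)
    print("unresolved stores:", sorted(stores))
    print("not_leader regions:", sorted(regions))
```

Run over a full log file, the same loop would show whether 49005 is the only store that fails to resolve or whether other store IDs appear as well.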
The TiDB service keeps failing to start, so the database cannot come up either.

What state is the cluster in now? Run tiup display and have a look. It looks like a disk problem.

Run display and check how many nodes are up.

This looks like it is caused by stale TiKV (region) information in PD.

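I tried unsafe remove-failed-stores with the IDs of the 2 stores; the runs were interrupted at 5/6 and 6/7 respectively. I also tried unsafe-recover remove-fail-stores and recreate-region, and it still would not start :face_with_head_bandage: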

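There were originally 5 nodes, 2 of them Offline, and that is exactly when the power went out. I have since scaled the cluster in to 3 nodes.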

kubectl get -n advanced-tidb tidbcluster advanced-tidb -ojson | jq '.status.tikv.stores'
{
  "1": {
    "id": "1",
    "ip": "advanced-tidb-tikv-0.advanced-tidb-tikv-peer.advanced-tidb.svc",
    "lastTransitionTime": "2023-11-21T05:45:05Z",
    "leaderCount": 113,
    "podName": "advanced-tidb-tikv-0",
    "state": "Up"
  },
  "4": {
    "id": "4",
    "ip": "advanced-tidb-tikv-1.advanced-tidb-tikv-peer.advanced-tidb.svc",
    "lastTransitionTime": "2023-11-21T05:48:04Z",
    "leaderCount": 93,
    "podName": "advanced-tidb-tikv-1",
    "state": "Up"
  },
  "5": {
    "id": "5",
    "ip": "advanced-tidb-tikv-2.advanced-tidb-tikv-peer.advanced-tidb.svc",
    "lastTransitionTime": "2023-11-21T05:46:50Z",
    "leaderCount": 112,
    "podName": "advanced-tidb-tikv-2",
    "state": "Up"
  }
}
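As a quick cross-check, the store ID from the error (49005) can be compared against the stores in this status output. The sketch below assumes the JSON has the shape shown above; `missing_store_ids` and `pods_not_up` are made-up helper names, and the embedded JSON is trimmed to the fields the helpers read.

```python
import json

# Trimmed copy of the status output above (only the fields used here).
STATUS = """
{
  "1": {"id": "1", "leaderCount": 113, "podName": "advanced-tidb-tikv-0", "state": "Up"},
  "4": {"id": "4", "leaderCount": 93,  "podName": "advanced-tidb-tikv-1", "state": "Up"},
  "5": {"id": "5", "leaderCount": 112, "podName": "advanced-tidb-tikv-2", "state": "Up"}
}
"""

def missing_store_ids(stores_json, wanted_ids):
    """Return the wanted store IDs that do not appear in the status output."""
    stores = json.loads(stores_json)
    known = {int(s["id"]) for s in stores.values()}
    return sorted(set(wanted_ids) - known)

def pods_not_up(stores_json):
    """Return the pod names whose state is anything other than "Up"."""
    stores = json.loads(stores_json)
    return sorted(s["podName"] for s in stores.values() if s["state"] != "Up")

if __name__ == "__main__":
    print("store IDs absent from status:", missing_store_ids(STATUS, [49005]))
    print("pods not Up:", pods_not_up(STATUS))
```

On this output, store 49005 is absent while all three listed stores are Up. That fits the guess that something still refers to a store the cluster no longer knows about, but it does not prove it.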

Run display and post the output.

Has the scale-in operation finished? Also check the system time on each node.

What did the final check show?

:thinking: I strongly recommend a UPS for any machine room. We have been burned twice on this ourselves~