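【TiDB environment】Production
【TiDB version】
【Steps to reproduce】
【Problem encountered: symptoms and impact】
【Resource configuration】
【Attachments: screenshots / logs / monitoring】
A power outage in the data center caused TiKV to report errors: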
[2023/11/21 13:59:44.969 +08:00] [WARN] [endpoint.rs:780] [error-response] [err="Region error (will back off and retry) message: "peer is not leader for region 9942146, leader may None" not_leader { region_id: 9942146 }"]
[2023/11/21 13:59:45.447 +08:00] [ERROR] [raft_client.rs:796] ["resolve store address failed"] [err_code=KV:Unknown] [err="Other("[src/server/resolve.rs:100]: unknown error \"[components/pd_client/src/util.rs:878]: invalid store ID 49005, not found\"")"] [store_id=49005]
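When there are many such lines, it can help to pull out the region and store IDs they mention so they can be compared against what PD reports. Below is a minimal Python sketch for that. The two regular expressions are written only against the two messages quoted above and are an assumption about the log format; the function name `extract_ids` is hypothetical and not part of any TiDB tool.

```python
import re

# Patterns derived from the two messages quoted in this thread only;
# other TiKV log formats are not covered.
REGION_RE = re.compile(r"not_leader \{ region_id: (\d+) \}")
STORE_RE = re.compile(r"invalid store ID (\d+), not found")


def extract_ids(lines):
    """Return (region_ids, store_ids) mentioned in the given log lines."""
    regions, stores = set(), set()
    for line in lines:
        regions.update(int(m) for m in REGION_RE.findall(line))
        stores.update(int(m) for m in STORE_RE.findall(line))
    return regions, stores


# Shortened copies of the two log lines from the post.
sample = [
    "[WARN] [endpoint.rs:780] not_leader { region_id: 9942146 }",
    "[ERROR] [raft_client.rs:796] invalid store ID 49005, not found",
]
regions, stores = extract_ids(sample)
print(sorted(regions), sorted(stores))
```

In practice the input would be the lines of a TiKV log file rather than the hard-coded `sample` list.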
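The TiDB service keeps failing to start, so the database cannot come up either.
What is the state of the cluster? Check it with tiup display. It looks like a disk problem.
Run display and see how many nodes are up.
It looks as if the TiKV (region) information held by PD is stale.
I tried unsafe remove-failed-stores with the IDs of the 2 stores; the runs were interrupted at 5/6 and 6/7 respectively. I also tried unsafe-recover remove-fail-stores and recreate-region, and the cluster still cannot start.
There were originally 5 nodes, including 2 in the offline state, when the power went out. I have since scaled the cluster in to 3.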
kubectl get -n advanced-tidb tidbcluster advanced-tidb -ojson | jq '.status.tikv.stores'
{
  "1": {
    "id": "1",
    "ip": "advanced-tidb-tikv-0.advanced-tidb-tikv-peer.advanced-tidb.svc",
    "lastTransitionTime": "2023-11-21T05:45:05Z",
    "leaderCount": 113,
    "podName": "advanced-tidb-tikv-0",
    "state": "Up"
  },
  "4": {
    "id": "4",
    "ip": "advanced-tidb-tikv-1.advanced-tidb-tikv-peer.advanced-tidb.svc",
    "lastTransitionTime": "2023-11-21T05:48:04Z",
    "leaderCount": 93,
    "podName": "advanced-tidb-tikv-1",
    "state": "Up"
  },
  "5": {
    "id": "5",
    "ip": "advanced-tidb-tikv-2.advanced-tidb-tikv-peer.advanced-tidb.svc",
    "lastTransitionTime": "2023-11-21T05:46:50Z",
    "leaderCount": 112,
    "podName": "advanced-tidb-tikv-2",
    "state": "Up"
  }
}
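One thing worth checking in this output is whether the store ID from the error log (49005) appears among the listed stores at all. The following is a rough Python sketch of that cross-check, assuming the input is the JSON object printed by the kubectl/jq command above. The function name `summarize_stores` and the returned field names are hypothetical, and the result only reflects what this status field reports, which may not be everything PD knows about.

```python
import json


def summarize_stores(stores_json, suspect_store_id):
    """Summarize the `.status.tikv.stores` object and check one store ID."""
    stores = json.loads(stores_json)
    return {
        # Store IDs are JSON keys (strings); sort them numerically.
        "store_ids": sorted(stores, key=int),
        # Any store whose state is not "Up".
        "not_up": [sid for sid, s in stores.items() if s.get("state") != "Up"],
        "total_leaders": sum(s.get("leaderCount", 0) for s in stores.values()),
        # True if the ID from the error log is among the listed stores.
        "suspect_listed": str(suspect_store_id) in stores,
    }


# Trimmed copy of the output above (only the fields used here).
sample = """
{
  "1": {"id": "1", "leaderCount": 113, "state": "Up"},
  "4": {"id": "4", "leaderCount": 93, "state": "Up"},
  "5": {"id": "5", "leaderCount": 112, "state": "Up"}
}
"""
summary = summarize_stores(sample, 49005)
print(summary)
```

With the data from this thread, all three listed stores are Up and 49005 is not among them, which fits the "invalid store ID 49005, not found" message: a TiKV node is still trying to reach a store that is no longer registered.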
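Take a look with display.
Has the scale-in operation finished? Also check the system time on each node.
What did the final check show?
I strongly recommend that every data center have a UPS. We have been burned twice on our side.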