TiKV fails to start again after going down

【TiDB deployment environment】Production
【TiDB version】v2.0.5
【Problem encountered: symptoms and impact】

  1. A TiKV node went down suddenly. After we start it, it fails to stay up: it reports a tombstone region error and panics roughly 3 minutes after starting.
  2. We ran operator add transfer-peer 99465046 55845281 11562387 to move the peer on this node to another node, but after restarting it still reports this region as tombstone (the pd-ctl lookups for this region and store are sketched below).
  3. We then manually ran ./bin/tikv-ctl --db ./data/db tombstone -p 10.3.240.23:2379 -r 99465045 on this node to mark the region as tombstone.
  4. The process now starts successfully, but it keeps logging ERROR] send raft msg to raft store fail: Transport(Discard("Failed to send Raft Message due to full")), and queries fail with (9005, 'Region is unavailable[try again later]').
【Attachments: screenshots / logs / monitoring】
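For reference, a minimal sketch of the pd-ctl lookups that go with steps 2–3 above (assuming pd-ctl is pointed at the same PD endpoint 10.3.240.23:2379 used in the tikv-ctl command; exact flags and output may differ on a v2.0.x pd-ctl):

  # Which stores hold the peers of the problem region, and where is its leader?
  ./pd-ctl -u http://10.3.240.23:2379 -d region 99465046

  # What state does PD record for the affected store?
  ./pd-ctl -u http://10.3.240.23:2379 -d store 55845281

  # Is the transfer-peer operator still pending, or has it finished?
  ./pd-ctl -u http://10.3.240.23:2379 -d operator show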

v2 is already past EOL; please upgrade as soon as you can.

With a node unavailable, first scale out until at least 3 nodes are in a usable state, then scale in to remove the broken node.
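A minimal sketch of the checks before scaling in (assuming pd-ctl; <pd-host> is a placeholder for your PD address, and the config subcommand output may differ slightly on v2.0.x):

  # How many replicas per Region does PD require (max-replicas)?
  ./pd-ctl -u http://<pd-host>:2379 -d config show replication

  # Count the stores whose state_name is Up: there must be at least max-replicas
  # of them before you take the broken store offline.
  ./pd-ctl -u http://<pd-host>:2379 -d store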

When you say it cannot start, do you mean the node cannot start or the whole cluster cannot start?

Is the disk full?

Could you post the full log of the TiKV node that went down? Without it, it is hard to tell whether this is the NIC being saturated or a TiKV panic.
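If it helps, a quick sketch for pulling the relevant lines before uploading (the log path is a placeholder and depends on your deployment layout):

  # Panic backtraces, recent errors, and restart markers ("Welcome" is logged at every TiKV start)
  grep -i -E 'panic|ERRO|Welcome' /path/to/deploy/log/tikv.log | tail -n 200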

» store
{
  "count": 6,
  "stores": [
    {
      "store": {
        "id": 11562387,
        "address": "10.3.240.41:20160",
        "state_name": "Up"
      },
      "status": {
        "capacity": "2.9 TiB",
        "available": "2.6 TiB",
        "leader_count": 119,
        "leader_weight": 1,
        "leader_score": 8484,
        "leader_size": 8484,
        "region_count": 292,
        "region_weight": 1,
        "region_score": 21669,
        "region_size": 21669,
        "start_ts": "2024-10-10T23:25:31+08:00",
        "last_heartbeat_ts": "2024-10-11T20:19:29.287205073+08:00",
        "uptime": "20h53m58.287205073s"
      }
    },
    {
      "store": {
        "id": 11562389,
        "address": "10.3.240.42:20160",
        "state_name": "Up"
      },
      "status": {
        "capacity": "2.9 TiB",
        "available": "535 GiB",
        "leader_count": 69000,
        "leader_weight": 1,
        "leader_score": 6359410,
        "leader_size": 6359410,
        "region_count": 142629,
        "region_weight": 1,
        "region_score": 11650944,
        "region_size": 11650944,
        "start_ts": "2024-10-10T21:51:50+08:00",
        "last_heartbeat_ts": "2024-10-11T20:19:33.739264382+08:00",
        "uptime": "22h27m43.739264382s"
      }
    },
    {
      "store": {
        "id": 55845281,
        "address": "10.3.240.40:20160",
        "state": 1,
        "state_name": "Offline"
      },
      "status": {
        "capacity": "2.9 TiB",
        "available": "513 GiB",
        "leader_count": 5143,
        "leader_weight": 1,
        "leader_score": 461389,
        "leader_size": 461389,
        "region_count": 146554,
        "region_weight": 1,
        "region_score": 11870790,
        "region_size": 11870790,
        "start_ts": "2024-10-11T18:55:41+08:00",
        "last_heartbeat_ts": "2024-10-11T20:19:29.784650151+08:00",
        "uptime": "1h23m48.784650151s"
      }
    },
    {
      "store": {
        "id": 1248143,
        "address": "10.3.240.39:20160",
        "state": 1,
        "state_name": "Offline"
      },
      "status": {
        "capacity": "1.5 TiB",
        "available": "896 GiB",
        "leader_count": 12862,
        "leader_weight": 1,
        "leader_score": 956919,
        "leader_size": 956919,
        "region_count": 13604,
        "region_weight": 1,
        "region_score": 1012910,
        "region_size": 1012910,
        "start_ts": "2024-10-11T14:48:50+08:00",
        "last_heartbeat_ts": "2024-10-11T20:19:32.396749377+08:00",
        "uptime": "5h30m42.396749377s"
      }
    },
    {
      "store": {
        "id": 11879321,
        "address": "10.3.240.43:20160",
        "state": 1,
        "state_name": "Offline"
      },
      "status": {
        "capacity": "1.5 TiB",
        "available": "608 GiB",
        "leader_count": 13720,
        "leader_weight": 1,
        "leader_score": 1025219,
        "leader_size": 1025219,
        "region_count": 14450,
        "region_weight": 1,
        "region_score": 1081581,
        "region_size": 1081581,
        "start_ts": "2024-10-10T16:41:21+08:00",
        "last_heartbeat_ts": "2024-10-11T20:19:28.628857931+08:00",
        "uptime": "27h38m7.628857931s"
      }
    },
    {
      "store": {
        "id": 11879322,
        "address": "10.3.240.45:20160",
        "state_name": "Up"
      },
      "status": {
        "capacity": "1.5 TiB",
        "available": "538 GiB",
        "leader_count": 46806,
        "leader_weight": 1,
        "leader_score": 3129308,
        "leader_size": 3129308,
        "region_count": 48008,
        "region_weight": 1,
        "region_score": 3234663,
        "region_size": 3234663,
        "start_ts": "2024-10-11T11:46:23+08:00",
        "last_heartbeat_ts": "2024-10-11T20:19:36.359835825+08:00",
        "uptime": "8h33m13.359835825s"
      }
    }
  ]
}
We have now scaled back out to 3 usable nodes. The current plan is to go through the offline procedure and take store 55845281 down.
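A sketch of the usual offline sequence in pd-ctl (assuming the newly added stores are already Up; double-check the commands against your v2.0.x pd-ctl):

  # Ask PD to migrate all Regions off the broken store
  ./pd-ctl -u http://10.3.240.23:2379 -d store delete 55845281

  # Watch leader_count / region_count drain; the store becomes Tombstone once they reach 0
  ./pd-ctl -u http://10.3.240.23:2379 -d store 55845281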

The node is up now and we are going through the offline procedure, but TiDB queries still fail.

The disk is not full.

Two things to confirm:

  1. How many TiKV nodes went down at once? From the store states recorded by PD, 3 TiKV stores are Offline at the same time.
  2. Check which stores host the Region id from the Region is unavailable error; you can look it up in PD by region id (see the sketch below).
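A sketch of that lookup (the region id is a placeholder for the one reported by TiDB, and the region check subcommands may not exist on a very old pd-ctl):

  # List the peers, and their store ids, of the unavailable Region
  ./pd-ctl -u http://10.3.240.23:2379 -d region <region-id>

  # Optionally list Regions that currently have missing or down peers
  ./pd-ctl -u http://10.3.240.23:2379 -d region check miss-peer
  ./pd-ctl -u http://10.3.240.23:2379 -d region check down-peer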

Go to TiDB Dashboard -> Cluster Info -> Hosts and post a screenshot of that page so we can look at your configuration.

You have 6 TiKV nodes in total and you are taking 3 of them offline at once? What did your original topology look like?

  • Check network connectivity: make sure all nodes can reach each other without packet loss or high latency.
  • Check the logs: go through the TiKV log files for any other error messages or warnings.

It is probably because there are no longer enough usable nodes (or enough storage) to keep 3 replicas.