A faulty Region prevents two of the cluster's TiKV nodes from starting; they are stuck in Offline

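晓峰008 · October 21, 2024, 01:24

【TiDB environment】Production
【TiDB version】v4.0.7
【Steps to reproduce】The cluster's datanode3 server went down because of a hardware fault; the outage lasted about 8 hours.
After it came back up, I did the following:
1) Checked the node logs and found "failed to start"; the cause was confirmed to be inconsistent metadata for region 786763068.
2) Removed the peers of the affected region, as follows:
– Succeeded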
tiup ctl:v4.0.7 pd -u http://xx.2:2379 operator add remove-peer 786763068 147
tiup ctl:v4.0.7 pd -u http://xx.2:2379 operator add remove-peer 786763068 146
tiup ctl:v4.0.7 pd -u http://xx.2:2379 operator add remove-peer 786763068 155

– Failed
tiup ctl:v4.0.7 pd -u http://xx.2:2379 operator add remove-peer 786763068 8
– Failed: reported "operator not found"
tiup ctl:v4.0.7 pd -u http://xx.2:2379 operator remove 786763068

– 3. Restarted TiKV, following: https://ztn.feishu.cn/wiki/wikcnSEPzpX1PrZRyBLeFtVOwfb
tiup cluster restart tidb -R tikv

– 3. Verified the startup status
tiup cluster display tidb
– After the above:
1) The TiKV node xx.6:20162, previously Up, changed to Offline instead.
2) The TiKV node xx.8:20164 was still Down.

– 4. Since xx.8 was still Down, ran a scale-in to take it offline
tiup cluster scale-in tidb -N xx.8:20164
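– After the above:
1) The TiKV node xx.6:20162 stayed Offline.
2) The TiKV node xx.8:20164 changed from Down to Pending Offline.
3) Although both are now in an offline-type state, the underlying data is not being migrated as quickly as expected.

【Problem encountered】
Symptom: after the server's long outage, one node came back with inconsistent metadata for a region, which in turn caused abnormal reads in the cluster.
Impact: after the nodes were handled with the offline operations, the data is accessible, but the underlying data shows serious inconsistency and incompleteness.
【Resource configuration】Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: screenshots/logs/monitoring】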

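Lucien-卢西恩 · October 21, 2024, 02:45

Regions do not break on their own; that only happens after an abnormal change in the environment or an improper operation. To summarize the situation: it looks like, while the 2 TiKV nodes are being taken offline manually, failures in other regions may be blocking the offline process? Check the region IDs reported in the TiKV log errors and verify that their region groups are healthy and that no replicas have been lost.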

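有猫万事足 · October 21, 2024, 02:58

TiKV node xx.6:20162

What is the store ID of this node?

One TiKV was already broken, and then your very first step was to remove peers of region 786763068, which may have removed the peer on xx.6:20162 directly. With two of three replicas lost, the Raft group becomes unavailable, so xx.6:20162 went Offline as well.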
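
The "two of three replicas lost" point follows from Raft's majority rule. A minimal sketch of that arithmetic, using hypothetical helper names (`majority`, `has_quorum`) that are not part of any TiDB tooling:

```python
def majority(replicas: int) -> int:
    """Smallest number of voters that forms a Raft majority."""
    return replicas // 2 + 1


def has_quorum(replicas: int, healthy: int) -> bool:
    """True if the healthy voters can still elect a leader and commit writes."""
    return healthy >= majority(replicas)


if __name__ == "__main__":
    # A 3-replica region tolerates one lost peer, but not two.
    for healthy in (3, 2, 1):
        print(f"replicas=3 healthy={healthy} quorum={has_quorum(3, healthy)}")
```

With three replicas the majority is two, so a region that keeps only one healthy peer can neither elect a leader nor serve writes until replicas are restored.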

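小龙虾爱大龙虾 · October 21, 2024, 03:33

No matter how long the outage, TiDB would not end up like this; as long as you do not force-delete things, there is no problem. Underlying mechanisms guarantee that Region metadata stays correct: the TiDB cluster can determine automatically which Region is newer, and stale, useless Regions are deleted by TiDB automatically. Reference: "TiKV Source Code Analysis Series (20): Region Split Source Code Analysis" (TiKV 源码解析系列文章（二十）Region Split 源码解析) | PingCAP 平凯星辰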

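晓峰008 · October 21, 2024, 03:35

1. As for store IDs, the three I removed were 145, 146, and 147. How do I find out which region or TiKV each of them belongs to?
2. region check miss-peer returns nothing, while tiup ctl:v4.0.7 pd -u http://xx.2:2379 region check down-peer and region check pending-peer return a lot of data.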

晓峰008 · October 21, 2024, 03:37

The failing region ID is confirmed. It is currently 786763068, as shown below:
Starting component ctl: /root/.tiup/components/ctl/v4.0.7/ctl pd -u http://xx.2:2379 region 786763068
{
"id": 786763068,
"start_key": "7480000000000169FF815F698000000000FF0000050419B4A884FFC90000000399A3B9FF6E02E04002000000FC",
"end_key": "7480000000000169FF815F698000000000FF00000604199844CBFF9000000003800000FF000001AC4F000000FC",
"epoch": {
"conf_ver": 425259,
"version": 87113
},
"peers": [
{
"id": 1322070660,
"store_id": 8
}
],
"leader": {
"id": 1322070660,
"store_id": 8
},
"written_bytes": 1709890,
"read_bytes": 0,
"written_keys": 13755,
"read_keys": 0,
"approximate_size": 56,
"approximate_keys": 853036
}

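Lucien-卢西恩 · October 21, 2024, 03:38

@晓峰008 Take a look at the suggestions in the replies above. Also, if scheduling is slow, you can speed it up by raising PD's region and leader scheduling limits.
Use pd-ctl's jq filtering to count regions by number of replicas and check whether any region is already down to a single replica.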

https://docs.pingcap.com/zh/tidb/stable/pd-control#根据副本数过滤-region

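晓峰008 · October 21, 2024, 03:43

OK. I also sped up scheduling yesterday; I am not sure whether that is enough.
Our environment: 36 TiKV nodes, of which one is currently Offline and one is Pending Offline; each TiKV has 750 GB of space.
These nodes are deployed across 6 servers, each with 64 cores and 560 GB of memory.
– Raised leader-schedule-limit, region-schedule-limit, and replica-schedule-limit
– From 40 to 240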
tiup ctl:v4.0.7 pd -u http://xx.2:2379 config set leader-schedule-limit 240
– From 120 to 720
tiup ctl:v4.0.7 pd -u http://xx.2:2379 config set region-schedule-limit 720
– From 64 to 384
tiup ctl:v4.0.7 pd -u http://xx.2:2379 config set replica-schedule-limit 384

Lucien-卢西恩 · October 21, 2024, 03:50

After the adjustment, has region scheduling sped up? How many regions are still left on the offline stores? And what is the impact on reads and writes?

晓峰008 · October 21, 2024, 03:56

1. It is somewhat faster, but overall usage has not dropped much: it was 750 GB and is now around 660 GB. The read/write impact is small; the disks are all SSDs. Should I keep raising the parameters?
2. What is the difference between Pending Offline and Offline? And how do I count how many of the affected regions remain?
3. tiup ctl:v4.0.7 pd -u http://xx.2:2379 region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length != 3)}" filters by replica count and returns the following:
{"id":89431513,"peer_stores":[165,146,169,151]}
{"id":62742555,"peer_stores":[166,4,148,156]}
{"id":341410570,"peer_stores":[159,1,169,156]}
{"id":785817163,"peer_stores":[169,1,6,161]}
{"id":1168013973,"peer_stores":[157,141,169,159]}
{"id":62745616,"peer_stores":[158,151,150,7]}
{"id":627056241,"peer_stores":[141,151,159,156]}
{"id":852130366,"peer_stores":[165,6,169,144]}
{"id":849948423,"peer_stores":[6,159,170,148]}
{"id":369228200,"peer_stores":[158,152,148,144]}
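
Each line of the filter output is one region: `id` is the region ID and `peer_stores` lists the IDs of the stores that hold a peer of that region. The sketch below is only an illustration of how such output could be summarized; the function name `summarize` is hypothetical, and the data is copied from the post above. It splits regions into under-replicated and over-replicated and counts how often each store appears.

```python
import json
from collections import Counter

# Filter output copied from the post above, one JSON object per line.
JQ_OUTPUT = """\
{"id":89431513,"peer_stores":[165,146,169,151]}
{"id":62742555,"peer_stores":[166,4,148,156]}
{"id":341410570,"peer_stores":[159,1,169,156]}
{"id":785817163,"peer_stores":[169,1,6,161]}
{"id":1168013973,"peer_stores":[157,141,169,159]}
{"id":62745616,"peer_stores":[158,151,150,7]}
{"id":627056241,"peer_stores":[141,151,159,156]}
{"id":852130366,"peer_stores":[165,6,169,144]}
{"id":849948423,"peer_stores":[6,159,170,148]}
{"id":369228200,"peer_stores":[158,152,148,144]}
"""


def summarize(text: str, expected: int = 3):
    """Return (under-replicated ids, over-replicated ids, store frequency)."""
    under, over, stores = [], [], Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        region = json.loads(line)
        peers = region["peer_stores"]
        if len(peers) < expected:
            under.append(region["id"])
        elif len(peers) > expected:
            over.append(region["id"])
        stores.update(peers)
    return under, over, stores


if __name__ == "__main__":
    under, over, stores = summarize(JQ_OUTPUT)
    print("fewer than 3 peers:", under)
    print("more than 3 peers:", over)
    print("stores seen most often:", stores.most_common(3))
```

On this data every listed region has four peers and none has fewer than three. Four peers is what one would expect to see, at least transiently, while a replica is being moved, since the new peer is typically added before the old one is removed.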

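Lucien-卢西恩 · October 21, 2024, 03:59

Mainly watch the region count; the disk usage of an Offline TiKV does not need attention for now.
Offline means the store is being taken offline; Pending Offline means it is suspended, and you need to check the PD leader log to find out why. The region count comes from querying the store by its store ID with pd-ctl store, which shows that store's current region count. Note that unless the physical machine is unrecoverable, a TiKV node in Offline state must be kept running while it is being taken offline.
You need to install the jq tool yourself; it does not ship with pd-ctl.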

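晓峰008 · October 21, 2024, 04:03

1. How do I invoke the command to calculate the region count?
2. In the replica-count filter output, do id and peer_stores stand for the region ID and the store ID respectively? How are region IDs and store IDs related?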

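Lucien-卢西恩 · October 21, 2024, 04:04

No calculation is needed; a pd-ctl store query shows the number of regions remaining.
From the results, the regions look safe: no region has fewer than 3 replicas. First watch how PD's own scheduling progresses, then look again at the region groups whose peer count is not 3.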
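
As a rough illustration of reading the remaining region count per store, the sketch below parses JSON shaped like the output of the pd-ctl store command. The sample data is hypothetical, and the field names (`stores`, `store.id`, `store.address`, `store.state_name`, `status.region_count`) are assumptions about that output that should be checked against a real cluster; `stores_being_drained` is an illustrative helper, not part of pd-ctl.

```python
import json

# Hypothetical sample; IDs, addresses and counts are made up for illustration.
# The field layout is an assumption about `pd-ctl store` output.
SAMPLE = """
{"count": 3, "stores": [
  {"store": {"id": 101, "address": "host-a:20160", "state_name": "Up"},
   "status": {"leader_count": 900, "region_count": 2700}},
  {"store": {"id": 102, "address": "host-b:20162", "state_name": "Offline"},
   "status": {"leader_count": 0, "region_count": 1850}},
  {"store": {"id": 103, "address": "host-c:20164", "state_name": "Down"},
   "status": {"leader_count": 0, "region_count": 420}}
]}
"""


def stores_being_drained(raw: str):
    """List (store id, address, state, region_count) for every non-Up store."""
    rows = []
    for item in json.loads(raw)["stores"]:
        store = item["store"]
        status = item.get("status", {})
        if store.get("state_name") != "Up":
            rows.append((store["id"], store.get("address"),
                         store.get("state_name"), status.get("region_count", 0)))
    # Stores with the most regions left come first.
    return sorted(rows, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    for store_id, address, state, regions in stores_being_drained(SAMPLE):
        print(f"store {store_id} ({address}) {state}: {regions} regions left")
```

The same listing also ties a store ID to its TiKV address, which helps when a region's `store_id` has to be mapped back to a node.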

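晓峰008 · October 21, 2024, 04:10

1. In the result above, filtering for a replica count other than 3, there are still 9 region IDs. The abnormal region 786763068 investigated yesterday has gone from 1 replica to 3 replicas.
2. The PD information is as follows:

In the screenshot, Up|L in the Status column means the leader, right?
3. Raised the scheduling limits by another 10x
– From 240 to 2400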
tiup ctl:v4.0.7 pd -u http://xx.2:2379 config set leader-schedule-limit 2400
– From 720 to 7200
tiup ctl:v4.0.7 pd -u http://xx.2:2379 config set region-schedule-limit 7200
– From 384 to 3840
tiup ctl:v4.0.7 pd -u http://xx.2:2379 config set replica-schedule-limit 3840

晓峰008 · October 21, 2024, 05:36

Which keywords should I search for in the PD leader log to find the Pending Offline information?

system · October 28, 2024, 05:37

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.