TiKV node fails to start after a crash, reporting “[region 32] 33 to_commit 405937 is out of range [last_index 405933]”

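wfxxh · November 29, 2019 02:11

To help us respond efficiently, please provide the following information when asking a question; clearly described problems are answered first.

[TiDB version]: v3.0.2
[Problem description]: After a TiKV node went down, it can no longer be started: [region 32] 33 to_commit 405937 is out of range [last_index 405933]

Relevant part of the error log: tikv异常日志.txt (5.1 KB)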

GangShen · November 29, 2019 02:24

Is the sync-log parameter enabled?

https://pingcap.com/docs-cn/stable/reference/configuration/tikv-server/configuration-file/#sync-log

wfxxh · November 29, 2019 02:25

No, it is not.

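GangShen · November 29, 2019 02:29

Could you provide a more complete TiKV log so that I can confirm the problem? So far it looks like sync-log was not enabled, so Raft log entries were lost; a peer then tried to commit a log entry that does not exist, which produced the error.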

wfxxh · November 29, 2019 02:34

The full log is too large (1.2 GB), so I exported the last 1000 lines.

mylog.log (148.7 KB)

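GangShen · November 29, 2019 02:42

Run the command below on the failed TiKV node to look for bad regions. The db directory is the data directory of the damaged TiKV node: ./tikv-ctl --db /home/tidb/deploy/data/db bad-regions

It prints every region that has a problem. If it prints nothing, find the damaged region with grep panic tikv.log | grep region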
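When the panic log is large, the grep step can be narrowed to a list of distinct region IDs. The following is only a sketch: the function name is hypothetical, and it assumes the panic lines contain a `[region N]` tag like the one in the error message quoted in this thread.

```shell
# Hypothetical helper: print the distinct region IDs that appear on
# lines containing "panic" in the given TiKV log file.
panic_region_ids() {
  grep panic "$1" \
    | grep -o '\[region [0-9][0-9]*]' \
    | grep -o '[0-9][0-9]*' \
    | sort -n | uniq
}
```

Called as `panic_region_ids tikv.log`, it prints one region ID per line, which can then be handled one at a time.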

Also get the store information through pd-ctl.

wfxxh · November 29, 2019 03:01

The output of tikv-ctl was:

Running the command again printed: all regions are healthy

pd-ctl store.log (1.7 KB)

grep panic log: panic.log (4.1 MB)

GangShen · November 29, 2019 03:28

Was this run on the failed TiKV node? And TiKV was shut down when you ran the command, right?

wfxxh · November 29, 2019 03:34

It cannot be started.

GangShen · November 29, 2019 03:38

Then please check for regions whose replica count is not 3:

https://pingcap.com/docs-cn/stable/reference/configuration/tikv-server/configuration-file/#compression-per-level
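The post itself does not spell out the query. As a sketch of one way to list such regions, the PD Control documentation describes a jq filter over the `region` command. The wrapper function name, the PD address, and the `-d` single-command flag below are assumptions to adapt to your deployment and pd-ctl version; jq must be installed for `--jq` to work.

```shell
# jq filter: keep only regions whose peer list does not have exactly 3 entries.
NOT_THREE_FILTER='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length != 3)}'

# Placeholder PD endpoint; override PD_ADDR for your cluster.
PD_ADDR="${PD_ADDR:-http://127.0.0.1:2379}"

# Hypothetical wrapper around pd-ctl's region command.
list_regions_without_three_replicas() {
  ./pd-ctl -d -u "$PD_ADDR" region --jq="$NOT_THREE_FILTER"
}
```

Each output line names a region ID and the stores that hold its peers, so regions that lost a replica on the failed store stand out.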

GangShen · November 29, 2019 04:53

Please upload the result.

wfxxh · November 29, 2019 04:56

neThree.out (520.9 KB)

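GangShen · November 29, 2019 05:10

Set the region to tombstone.

Requirements: run this on the failed TiKV node, and TiKV must be shut down: tikv-ctl --db /path/to/tikv-data/db tombstone -r <region-id> --force

The panic log shows that the damaged region is 32, so use 32 as the region-id here.

Once the command succeeds, try restarting the TiKV instance and see whether it starts.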

wfxxh · November 29, 2019 05:41

Thanks, that worked. Was this caused by sync-log not being enabled?

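GangShen · November 29, 2019 05:47

Yes. Because sync-log was not enabled, Raft log entries were lost when the TiKV node crashed. When the node restarted, it tried to commit a log entry that no longer existed, which caused the error.

I recommend enabling the sync-log parameter.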

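fredchenbj · February 29, 2020 01:15

@gangshen-PingCAP I ran into the same problem. Why does ./tikv-ctl --db /home/tidb/deploy/data/db bad-regions not return the problematic regions? Every start reports an error for one region, and I then have to set it to tombstone, which is quite tedious.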

GangShen · February 29, 2020 17:13

Use pd-ctl to adjust the PD configuration and disable the related scheduling; restore the parameters after the fault is resolved.

pd-ctl>> config show
pd-ctl>> config set region-schedule-limit 0
pd-ctl>> config set replica-schedule-limit 0
pd-ctl>> config set leader-schedule-limit 0
pd-ctl>> config set merge-schedule-limit 0
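Since the limits have to be restored afterwards, it can help to save the `config show` output before setting them to 0. The sketch below assumes that output was saved to a file and that each limit appears in it as a JSON-style `"key": number` pair; the function name is hypothetical. It only prints the `config set` commands so they can be reviewed before being run in pd-ctl.

```shell
# Hypothetical helper: read a saved copy of `config show` output and
# print the `config set` commands that restore the four limits.
restore_commands() {
  for key in region-schedule-limit replica-schedule-limit \
             leader-schedule-limit merge-schedule-limit; do
    # Take the first numeric value recorded for this exact key.
    value=$(sed -n "s/.*\"$key\": *\([0-9][0-9]*\).*/\1/p" "$1" | head -n 1)
    if [ -n "$value" ]; then
      echo "config set $key $value"
    fi
  done
}
```

Keys that are missing from the saved file are skipped rather than guessed, so nothing is restored to a made-up value.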

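Then perform a rolling restart of every TiKV that can start normally; this can recover from the commit index is out of range error first.
After that, try to start the TiKV that previously failed to start and watch its log to determine which kind of data corruption it is. If the log mentions last index, commit index, and similar terms, the Raft state machine is damaged; if you see errors such as "Sst file size mismatch", RocksDB was damaged while applying a snapshot.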

fredchenbj · March 2, 2020 02:33

Does step 2 (the rolling restart) mean that the commit index is out of range error can be resolved by a rolling restart?

fredchenbj · March 2, 2020 02:35

What are the differences and relationships among applied index, commit index, and last index? Is it required that applied index <= commit index <= last index?
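On the ordering itself: in Raft, an entry is first appended to the local log (last index), then committed (commit index), then applied to the state machine (applied index), so a healthy peer is expected to satisfy applied index <= commit index <= last index. The error in this thread reports to_commit 405937 against last_index 405933, meaning the commit index points past the end of the local log. A minimal sketch of that check follows; the function name is hypothetical and the values are passed in by hand.

```shell
# Hypothetical helper: check applied <= commit <= last for one peer.
# Arguments: applied index, commit index, last index.
check_raft_indexes() {
  applied=$1
  commit=$2
  last=$3
  if [ "$applied" -le "$commit" ] && [ "$commit" -le "$last" ]; then
    echo "ok"
  else
    echo "violated: expected applied ($applied) <= commit ($commit) <= last ($last)"
    return 1
  fi
}
```

For example, `check_raft_indexes 405933 405937 405933` (the applied value here is a made-up placeholder) reports a violation, because 405937 is greater than 405933.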

飞与非-PingCAP · March 2, 2020 04:09

Delete the problematic region and then restart; this works around the problem (the error is no longer reported).