TiKV crashed unexpectedly and fails to restart: panic_mark_file

最强王者 · January 25, 2024, 12:46

[TiDB environment] Production / Test / PoC
[TiDB version]
5.2.3
[Reproduction path] What operations were performed before the problem appeared

[2024/01/25 20:46:37.618 +08:00] [FATAL] [server.rs:405] ["panic_mark_file tidb/tidb-data/tikv-20160/panic_mark_file exists, there must be something wrong with the db. Do not remove the panic_mark_file and force the TiKV node to restart. Please contact TiKV maintainers to investigate the issue. If needed, use scale in and scale out to replace the TiKV node. https://docs.pingcap.com/tidb/stable/scale-tidb-using-tiup"]
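[Problem encountered: symptoms and impact]
TiKV went down. It keeps restarting on its own, and a manual restart fails too.
[Resource configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of that page
[Attachments: screenshots / logs / monitoring]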

像风一样的男子 · January 25, 2024, 12:48

Looks like it hit some bug.

tidb狂热爱好者 · January 25, 2024, 12:48

Chasing points, are we? Replying that fast.

WalterWj · January 25, 2024, 12:49

If you want to be brute-force about it, you can mv this panic file out of the way.

最强王者 · January 25, 2024, 12:51

Doesn't the log say it must not be removed?

像风一样的男子 · January 25, 2024, 12:51

While you're at it, also check whether any SST files are corrupted:
https://docs.pingcap.com/zh/tidb/stable/tikv-control#打印损坏的-sst-文件信息

最强王者 · January 25, 2024, 12:52

OK, give me a moment to check.

最强王者 · January 25, 2024, 12:53

Is this supposed to be run on the control machine? tikv-ctl --data-dir </path/to/tikv> bad-ssts --pd

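WalterWj · January 25, 2024, 12:55

That's why I said: brute-force.

Try not to delete it. Consider scaling in and scaling out to replace the node instead.
Then search the TiKV log for the keyword panic and post the stack trace. It may be a known bug.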
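
A rough sketch of that search in Python, in case plain grep output is too noisy. It assumes the real panic was logged before the repeated "panic_mark_file ... exists" startup message, so it skips those lines as match triggers. `find_panic_lines` and its `context` parameter are names made up for illustration, and the log path in the usage comment is a placeholder, not something from this thread.

```python
from typing import Iterable, List, Tuple

# Substring of the startup refusal message seen in this thread's log.
STARTUP_CHECK = "panic_mark_file"


def find_panic_lines(lines: Iterable[str], context: int = 5) -> List[Tuple[int, str]]:
    """Return (1-based line number, text) for lines mentioning "panic",
    plus `context` lines after each hit (where a stack trace would follow).

    Lines containing the startup-check message do not count as hits,
    because they only report that the marker file exists.
    """
    all_lines = list(lines)
    keep = set()
    for i, line in enumerate(all_lines):
        if "panic" in line.lower() and STARTUP_CHECK not in line:
            last = min(i + context, len(all_lines) - 1)
            keep.update(range(i, last + 1))
    return [(i + 1, all_lines[i].rstrip("\n")) for i in sorted(keep)]


# Usage (hypothetical path, adjust to your deployment):
# with open("/path/to/tikv.log", encoding="utf-8", errors="replace") as f:
#     for number, text in find_panic_lines(f, context=20):
#         print(number, text)
```

If this returns nothing for the current log file, the original panic may sit in an older, rotated log file.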

像风一样的男子 · January 25, 2024, 12:57

Run it on the TiKV node that has the problem.

像风一样的男子 · January 25, 2024, 12:59

Plan for the worst case: having to repair the data.

最强王者 · January 25, 2024, 13:00

m","max-resource-groups":2000,"precision":"1s"}}"]
[2024/01/25 18:06:43.640 +08:00] [FATAL] [server.rs:405] ["panic_mark_file /data1/tidb/tidb-data/tikv-20160/panic_mark_file exists, there must be something wrong with the db. Do not remove the panic_mark_file and force the TiKV node to restart. Please contact TiKV maintainers to investigate the issue. If needed, use scale in and scale out to replace the TiKV node. https://docs.pingcap.com/tidb/stable/scale-tidb-using-tiup"]

Searching for the keyword panic, this message shows up again every so often.
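
To put numbers on "every so often", here is a minimal Python sketch that pulls the timestamps of the FATAL panic_mark_file lines and prints the gaps between them. The regular expression is written against the line format quoted in this thread (`[timestamp] [LEVEL] [file:line] [...]`); the function names `fatal_times` and `gaps_seconds` are hypothetical helpers, not part of any TiKV tooling.

```python
import re
from datetime import datetime
from typing import Iterable, List

# Matches the leading "[2024/01/25 18:06:43.640 +08:00] [FATAL]" part of a line.
HEADER = re.compile(
    r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{2}:\d{2})\] \[(\w+)\]"
)


def fatal_times(lines: Iterable[str], needle: str = "panic_mark_file") -> List[datetime]:
    """Collect timestamps of FATAL lines that mention `needle`."""
    times = []
    for line in lines:
        match = HEADER.match(line)
        if match and match.group(2) == "FATAL" and needle in line:
            stamp = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S.%f %z")
            times.append(stamp)
    return times


def gaps_seconds(times: List[datetime]) -> List[float]:
    """Seconds between consecutive timestamps."""
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


# Usage (hypothetical path):
# with open("/path/to/tikv.log", encoding="utf-8", errors="replace") as f:
#     print(gaps_seconds(fatal_times(f)))
```

Evenly spaced gaps would suggest the process manager is restarting the node on a timer and hitting the same startup check each time, rather than a fresh panic occurring on each attempt.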

最强王者 · January 25, 2024, 13:01

OK, thanks.

tidb狂热爱好者 · January 25, 2024, 13:08

The nodes that failed are all on HDDs, aren't they?

最强王者 · January 25, 2024, 13:11

/root/.tiup/components/ctl/v5.2.3/ctl tikv --data-dir /data1/tidb/tidb-data/tikv-20160 bad-ssts --pd ip:2379 printed nothing when I ran it.

最强王者 · January 25, 2024, 13:12

They're SSDs.

最强王者 · January 25, 2024, 13:12

The command returned nothing, so the disk is probably fine.

tidb狂热爱好者 · January 25, 2024, 13:15

Then try removing it.

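江湖故人 · January 25, 2024, 14:27

Is there anything useful inside panic_mark_file? Judging from other threads, once this file shows up you should be getting ready for disaster recovery.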
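
One way to check without touching the file is a read-only look at its size, modification time, and first bytes. The sketch below is only that: `describe_marker` is a hypothetical helper, and I am not certain the file holds any content at all. If it turns out to be empty, the modification time is still a hint at when the original panic happened, which narrows down where to look in the log.

```python
import os
from datetime import datetime, timezone


def describe_marker(path: str, max_bytes: int = 4096) -> dict:
    """Read-only summary of a file: size, mtime (UTC, ISO 8601), leading bytes.

    The file is opened in binary read mode only, so nothing is modified.
    """
    info = os.stat(path)
    with open(path, "rb") as handle:
        head = handle.read(max_bytes)
    return {
        "size": info.st_size,
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
        "head": head.decode("utf-8", errors="replace"),
    }


# Usage, with the path taken from the log line in this thread:
# print(describe_marker("/data1/tidb/tidb-data/tikv-20160/panic_mark_file"))
```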

dba远航 · January 27, 2024, 07:05

Feels like the original process had not been stopped cleanly.