1 of 6 instances on 3 virtual machines is down: how should this be handled?

[TiDB version]
5.0
[Problem description]
One TiKV instance suddenly went Down. I don't know exactly when, and I haven't gone through the monitoring.

I tried tiup cluster reload tidb-test --node=192.168.1.229:20161, which failed with the following error:

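I looked through the TiDB user manual but could not find a procedure for handling a single instance that has gone down, so I am posting here to ask what to do.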



Try restart -N tikvip:tikvport.

For the error you got, try reload -R prometheus,grafana and see whether it helps.

The restart succeeded, but that instance is still in the Down state.

reload -R prometheus,grafana probably won't help, will it? Aren't those two the monitoring nodes?

I went through the logs. The earliest errors are as follows:

[2021/04/30 05:50:20.258 +08:00] [ERROR] [server.rs:862] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]
[2021/04/30 05:50:22.043 +08:00] [ERROR] [region.rs:412] ["failed to apply snap!!!"] [err_code=KV:Raftstore:SnapUnknown] [err="Other(\"[components/raftstore/src/store/snap.rs:826]: \\\"[components/raftstore/src/store/snap.rs:297]: invalid checksum 3542642250 for snapshot cf file /data/tikv-20161/snap/rev_12381_32_54_default.sst, expected 321221928\\\"\")"]
[2021/04/30 05:50:38.502 +08:00] [ERROR] [server.rs:862] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]
[2021/04/30 05:50:40.241 +08:00] [ERROR] [region.rs:412] ["failed to apply snap!!!"] [err_code=KV:Raftstore:SnapUnknown] [err="Other(\"[components/raftstore/src/store/snap.rs:826]: \\\"[components/raftstore/src/store/snap.rs:297]: invalid checksum 3542642250 for snapshot cf file /data/tikv-20161/snap/rev_12381_32_54_default.sst, expected 321221928\\\"\")"]
[2021/04/30 05:50:58.042 +08:00] [ERROR] [server.rs:862] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]
[2021/04/30 05:50:58.395 +08:00] [ERROR] [region.rs:412] ["failed to apply snap!!!"] [err_code=KV:Raftstore:SnapUnknown] [err="Other(\"[components/raftstore/src/store/snap.rs:826]: \\\"[components/raftstore/src/store/snap.rs:297]: invalid checksum 3542642250 for snapshot cf file /data/tikv-20161/snap/rev_12381_32_54_default.sst, expected 321221928\\\"\")"]
[2021/04/30 05:51:14.775 +08:00] [ERROR] [server.rs:862] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]
[2021/04/30 05:51:16.405 +08:00] [ERROR] [region.rs:412] ["failed to apply snap!!!"] [err_code=KV:Raftstore:SnapUnknown] [err="Other(\"[components/raftstore/src/store/snap.rs:826]: \\\"[components/raftstore/src/store/snap.rs:297]: invalid checksum 3542642250 for snapshot cf file /data/tikv-20161/snap/rev_12381_32_54_default.sst, expected 321221928\\\"\")"]
[2021/04/30 05:51:34.059 +08:00] [ERROR] [server.rs:862] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]
[2021/04/30 05:51:34.529 +08:00] [ERROR] [region.rs:412] ["failed to apply snap!!!"] [err_code=KV:Raftstore:SnapUnknown] [err="Other(\"[components/raftstore/src/store/snap.rs:826]: \\\"[components/raftstore/src/store/snap.rs:297]: invalid checksum 3542642250 for snapshot cf file /data/tikv-20161/snap/rev_12381_32_54_default.sst, expected 321221928\\\"\")"]
[2021/04/30 05:51:51.019 +08:00] [ERROR] [server.rs:862] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]

Today's errors are the same:

[2021/05/06 14:28:09.554 +08:00] [ERROR] [region.rs:412] ["failed to apply snap!!!"] [err_code=KV:Raftstore:SnapUnknown] [err="Other(\"[components/raftstore/src/store/snap.rs:826]: \\\"[components/raftstore/src/store/snap.rs:297]: invalid checksum 3542642250 for snapshot cf file /data/tikv-20161/snap/rev_12381_32_54_default.sst, expected 321221928\\\"\")"]
[2021/05/06 14:28:25.909 +08:00] [ERROR] [server.rs:862] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]
[2021/05/06 14:28:27.547 +08:00] [ERROR] [region.rs:412] ["failed to apply snap!!!"] [err_code=KV:Raftstore:SnapUnknown] [err="Other(\"[components/raftstore/src/store/snap.rs:826]: \\\"[components/raftstore/src/store/snap.rs:297]: invalid checksum 3542642250 for snapshot cf file /data/tikv-20161/snap/rev_12381_32_54_default.sst, expected 321221928\\\"\")"]
[2021/05/06 14:28:43.920 +08:00] [ERROR] [server.rs:862] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]
[2021/05/06 14:28:45.570 +08:00] [ERROR] [region.rs:412] ["failed to apply snap!!!"] [err_code=KV:Raftstore:SnapUnknown] [err="Other(\"[components/raftstore/src/store/snap.rs:826]: \\\"[components/raftstore/src/store/snap.rs:297]: invalid checksum 3542642250 for snapshot cf file /data/tikv-20161/snap/rev_12381_32_54_default.sst, expected 321221928\\\"\")"]
[2021/05/06 14:29:02.040 +08:00] [ERROR] [server.rs:862] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]
[2021/05/06 14:29:03.697 +08:00] [ERROR] [region.rs:412] ["failed to apply snap!!!"] [err_code=KV:Raftstore:SnapUnknown] [err="Other(\"[components/raftstore/src/store/snap.rs:826]: \\\"[components/raftstore/src/store/snap.rs:297]: invalid checksum 3542642250 for snapshot cf file /data/tikv-20161/snap/rev_12381_32_54_default.sst, expected 321221928\\\"\")"]
[2021/05/06 14:29:20.195 +08:00] [ERROR] [server.rs:862] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]
[2021/05/06 14:29:21.811 +08:00] [ERROR] [region.rs:412] ["failed to apply snap!!!"] [err_code=KV:Raftstore:SnapUnknown] [err="Other(\"[components/raftstore/src/store/snap.rs:826]: \\\"[components/raftstore/src/store/snap.rs:297]: invalid checksum 3542642250 for snapshot cf file /data/tikv-20161/snap/rev_12381_32_54_default.sst, expected 321221928\\\"\")"]
[2021/05/06 14:29:38.208 +08:00] [ERROR] [server.rs:862] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]
[2021/05/06 14:29:39.815 +08:00] [ERROR] [region.rs:412] ["failed to apply snap!!!"] [err_code=KV:Raftstore:SnapUnknown] [err="Other(\"[components/raftstore/src/store/snap.rs:826]: \\\"[components/raftstore/src/store/snap.rs:297]: invalid checksum 3542642250 for snapshot cf file /data/tikv-20161/snap/rev_12381_32_54_default.sst, expected 321221928\\\"\")"]

After I took 192.168.1.229:20161 offline, 192.168.1.230:20161 went Down as well.

The error log is as follows:

[2021/05/06 18:33:46.374 +08:00] [ERROR] [server.rs:862] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]
[2021/05/06 18:33:46.702 +08:00] [ERROR] [region.rs:412] ["failed to apply snap!!!"] [err_code=KV:Raftstore:SnapUnknown] [err="Other(\"[components/raftstore/src/store/snap.rs:826]: \\\"[components/raftstore/src/store/snap.rs:297]: invalid checksum 2834656378 for snapshot cf file /data/tikv-20161/snap/rev_14665_35_56_default.sst, expected 651506583\\\"\")"]
[2021/05/06 18:34:03.336 +08:00] [ERROR] [server.rs:862] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]
[2021/05/06 18:34:04.875 +08:00] [ERROR] [region.rs:412] ["failed to apply snap!!!"] [err_code=KV:Raftstore:SnapUnknown] [err="Other(\"[components/raftstore/src/store/snap.rs:826]: \\\"[components/raftstore/src/store/snap.rs:297]: invalid checksum 2834656378 for snapshot cf file /data/tikv-20161/snap/rev_14665_35_56_default.sst, expected 651506583\\\"\")"]
[2021/05/06 18:34:21.338 +08:00] [ERROR] [server.rs:862] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]
[2021/05/06 18:34:22.949 +08:00] [ERROR] [region.rs:412] ["failed to apply snap!!!"] [err_code=KV:Raftstore:SnapUnknown] [err="Other(\"[components/raftstore/src/store/snap.rs:826]: \\\"[components/raftstore/src/store/snap.rs:297]: invalid checksum 2834656378 for snapshot cf file /data/tikv-20161/snap/rev_14665_35_56_default.sst, expected 651506583\\\"\")"]
[2021/05/06 18:34:40.860 +08:00] [ERROR] [server.rs:862] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]
[2021/05/06 18:34:41.216 +08:00] [ERROR] [region.rs:412] ["failed to apply snap!!!"] [err_code=KV:Raftstore:SnapUnknown] [err="Other(\"[components/raftstore/src/store/snap.rs:826]: \\\"[components/raftstore/src/store/snap.rs:297]: invalid checksum 2834656378 for snapshot cf file /data/tikv-20161/snap/rev_14665_35_56_default.sst, expected 651506583\\\"\")"]
[2021/05/06 18:34:57.847 +08:00] [ERROR] [server.rs:862] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]
[2021/05/06 18:34:59.607 +08:00] [ERROR] [region.rs:412] ["failed to apply snap!!!"] [err_code=KV:Raftstore:SnapUnknown] [err="Other(\"[components/raftstore/src/store/snap.rs:826]: \\\"[components/raftstore/src/store/snap.rs:297]: invalid checksum 2834656378 for snapshot cf file /data/tikv-20161/snap/rev_14665_35_56_default.sst, expected 651506583\\\"\")"]
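
For anyone triaging a similar log, here is a minimal Python sketch that counts ERROR messages and collects the distinct checksum mismatches. It assumes the log lines follow the layout pasted above; the function name summarize and both regular expressions are illustrative and are not part of TiKV or TiUP.

```python
import re
from collections import Counter

# Header of a TiKV log line as pasted in this thread:
# [timestamp] [LEVEL] [source:line] ["message"] ...
LINE_RE = re.compile(
    r'^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<src>[^\]]+)\] \["(?P<msg>[^"]*)"\]'
)

# Checksum details inside the "failed to apply snap!!!" error text.
CHECKSUM_RE = re.compile(
    r'invalid checksum (?P<actual>\d+) for snapshot cf file (?P<path>\S+), '
    r'expected (?P<expected>\d+)'
)


def summarize(lines):
    """Count ERROR messages and collect distinct (path, actual, expected) mismatches."""
    counts = Counter()
    mismatches = set()
    for line in lines:
        header = LINE_RE.match(line)
        if header is None or header.group("level") != "ERROR":
            continue  # skip non-ERROR lines and lines in another layout
        counts[header.group("msg")] += 1
        detail = CHECKSUM_RE.search(line)
        if detail is not None:
            mismatches.add(
                (
                    detail.group("path"),
                    int(detail.group("actual")),
                    int(detail.group("expected")),
                )
            )
    return counts, mismatches
```

Passing an open log file to summarize makes it clear whether the same snapshot file fails on every retry, as it does in the excerpts above.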

Please check the disk space on 192.168.1.230:20161.
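
One rough way to do that check from Python, as a sketch only: it must run on the TiKV host itself, and it reports usage for whichever filesystem holds the given path. The helper name disk_usage_percent is made up for illustration. The /data/tikv-20161 directory is taken from the snapshot path in the logs above and is assumed to be the data directory.

```python
import shutil


def disk_usage_percent(path):
    """Return the used percentage of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total
```

For example, disk_usage_percent("/data/tikv-20161") on the affected host shows how full that volume is. This covers free space only and says nothing about whether the disk is returning corrupted data.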

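Sorry, I really couldn't wait any longer. I took the 3 nodes from the earlier scale-out offline and then scaled out again, upgrading to 5.0.1 before the scale-out. Everything is normal now, and I will keep watching it.
Disk space should not be the problem. Judging from the chart, it looks like 192.168.1.229:20161 never successfully came online after the last scale-out.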

Has 1.229:20161 already been scaled in?
May I ask whether the logs were kept?

Everything was wiped after the scale-in. Could an option be added to keep the logs after scaling in?

https://github.com/pingcap/tiup/issues/1352
Keep an eye on this PR.

Got it :grinning:


:smiling_face_with_three_hearts:
[2021/05/06 18:33:46.374 +08:00] [ERROR] [server.rs:862] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]
This log entry does not affect startup.

[2021/05/06 18:33:46.702 +08:00] [ERROR] [region.rs:412] ["failed to apply snap!!!"] [err_code=KV:Raftstore:SnapUnknown] [err="Other(\"[components/raftstore/src/store/snap.rs:826]: \\\"[components/raftstore/src/store/snap.rs:297]: invalid checksum 2834656378 for snapshot cf file /data/tikv-20161/snap/rev_14665_35_56_default.sst, expected 651506583\\\"\")"]

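For this one, I suggest running a disk checking tool to see whether the disk is healthy. Since I understand this is a VM environment, it is worth checking.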

invalid checksum 2834656378 for snapshot cf file /data/tikv-20161/snap/rev_14665_35_56_default.sst, expected 651506583

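This means the checksum the follower computed when applying the snapshot differs from the checksum the leader computed when generating it, so most likely the disk has a problem. I suggest investigating the disk first. If you find nothing and still want to recover this TiKV instance, you can follow https://docs.pingcap.com/zh/tidb/stable/tikv-control#设置一个-region-副本为-tombstone-状态 to tombstone the affected region. The file name rev_14665_35_56_default.sst follows the format rev_{region_id}_{term}_{index}_{cf}.sst.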
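
To read the region ID out of such a file name, here is a small Python sketch based only on the rev_{region_id}_{term}_{index}_{cf}.sst format described above. The names SnapFile and parse_snap_name are illustrative and are not part of any TiKV tooling.

```python
import os
import re
from typing import NamedTuple, Optional


class SnapFile(NamedTuple):
    prefix: str
    region_id: int
    term: int
    index: int
    cf: str


# {prefix}_{region_id}_{term}_{index}_{cf}.sst, e.g. rev_14665_35_56_default.sst
SNAP_RE = re.compile(
    r'^(?P<prefix>[a-z]+)_(?P<region_id>\d+)_(?P<term>\d+)_(?P<index>\d+)_(?P<cf>[a-z]+)\.sst$'
)


def parse_snap_name(path: str) -> Optional[SnapFile]:
    """Parse a snapshot file path; return None if the name does not match."""
    m = SNAP_RE.match(os.path.basename(path))
    if m is None:
        return None
    return SnapFile(
        prefix=m.group("prefix"),
        region_id=int(m.group("region_id")),
        term=int(m.group("term")),
        index=int(m.group("index")),
        cf=m.group("cf"),
    )
```

For the file in the error above, parse_snap_name("/data/tikv-20161/snap/rev_14665_35_56_default.sst").region_id is 14665, which is the region to tombstone on this instance.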


Got it. Unfortunately the files were all deleted during the scale-in.

Next time, please copy the logs to another directory before scaling in so they are available for troubleshooting. Thanks.
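
A minimal Python sketch of that copy step, to run on the TiKV host before scaling in. The function name backup_logs and both directory arguments are hypothetical: the thread does not say where the logs live, so substitute the instance's real log directory and a backup location outside the deploy and data directories.

```python
import shutil
import time
from pathlib import Path


def backup_logs(log_dir, backup_root):
    """Copy `log_dir` into a new timestamped directory under `backup_root`."""
    src = Path(log_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / (src.name + "-" + stamp)
    # copytree creates missing parent directories and fails if dest already exists,
    # so an earlier backup is never overwritten.
    shutil.copytree(src, dest)
    return dest
```

The function returns the destination path so it can be recorded alongside the scale-in operation.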