TiKV panics with a RocksDB background error and cannot restart

【TiDB Environment】Production
【TiDB Version】TiKV / PD 5.2.0
【Problem】TiKV hits a RocksDB background error, panics, and cannot restart
【Reproduction Path】From the logs: unsafe destroy range -> compaction -> panic
【Symptoms and Impact】TiKV cannot restart; the only workaround is to scale in/out to discard the node with the corrupted file
【Attachments】

Please provide version information for each component, e.g. cdc/tikv; it can be obtained by running cdc version / tikv-server --version.
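For completeness, the usual commands for collecting the component versions mentioned above (assuming the binaries are on the host's PATH; actual deployment paths vary):

```shell
# Version of the TiKV binary (run on each TiKV node)
tikv-server --version

# Version of the PD binary
pd-server --version

# Version of TiCDC, if deployed
cdc version
```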

[2022/09/26 23:47:30.393 +08:00] [INFO] [gc_worker.rs:389] ["unsafe destroy range started"] [end_key=6D757369632D6B76FF2D7072645F333635FF33363A3030303A44FF3AC9074B08C73145FFD5A393C5D48E442AFFCE00000000000000F8] [start_key=6D757369632D6B76FF2D7072645F333635FF33363A3030303A44FF3AC9074B08C73145FFD5A393C5D48E442AFFCD00000000000000F8]
[2022/09/26 23:47:30.396 +08:00] [INFO] [gc_worker.rs:420] ["unsafe destroy range finished deleting files in range"] [cost_time=2.414008ms] [end_key=6D757369632D6B76FF2D7072645F333635FF33363A3030303A44FF3AC9074B08C73145FFD5A393C5D48E442AFFCE00000000000000F8] [start_key=6D757369632D6B76FF2D7072645F333635FF33363A3030303A44FF3AC9074B08C73145FFD5A393C5D48E442AFFCD00000000000000F8]
[2022/09/26 23:47:30.400 +08:00] [INFO] [gc_worker.rs:454] ["unsafe destroy range finished cleaning up all"] [cost_time=4.394487ms] [end_key=6D757369632D6B76FF2D7072645F333635FF33363A3030303A44FF3AC9074B08C73145FFD5A393C5D48E442AFFCE00000000000000F8] [start_key=6D757369632D6B76FF2D7072645F333635FF33363A3030303A44FF3AC9074B08C73145FFD5A393C5D48E442AFFCD00000000000000F8]
[2022/09/26 23:47:32.247 +08:00] [INFO] [compaction_filter.rs:483] ["Compaction filter reports"] [filtered=204674] [total=1376419]
[2022/09/26 23:47:38.931 +08:00] [INFO] [compaction_filter.rs:483] ["Compaction filter reports"] [filtered=438497] [total=1990713]
[2022/09/26 23:47:43.356 +08:00] [INFO] [compaction_filter.rs:483] ["Compaction filter reports"] [filtered=417840] [total=1878614]
[2022/09/26 23:48:14.374 +08:00] [FATAL] [lib.rs:465] ["rocksdb background error. db: kv, reason: compaction, error: Corruption: block checksum mismatch: expected 1704905625, got 1445134835 in /data/tidb/tikv/11161/tikv/data/db/5989462.sst offset 3196554 size 18449"] [backtrace="stack backtrace:
0: tikv_util::set_panic_hook::{{closure}}
at components/tikv_util/src/lib.rs:464
1: std::panicking::rust_panic_with_hook
at library/std/src/panicking.rs:626
2: std::panicking::begin_panic_handler::{{closure}}
at library/std/src/panicking.rs:519
3: std::sys_common::backtrace::__rust_end_short_backtrace
at library/std/src/sys_common/backtrace.rs:141
4: rust_begin_unwind
at library/std/src/panicking.rs:515
5: std::panicking::begin_panic_fmt
at library/std/src/panicking.rs:457
6: <engine_rocks::event_listener::RocksEventListener as rocksdb::event_listener::EventListener>::on_background_error
7: rocksdb::event_listener::on_background_error
at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/4e912a8/src/event_listener.rs:340
8: _ZN24crocksdb_eventlistener_t17OnBackgroundErrorEN7rocksdb21BackgroundErrorReasonEPNS0_6StatusE
at crocksdb/c.cc:2352
9: _ZN7rocksdb7titandb11TitanDBImpl10SetBGErrorERKNS_6StatusE
at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/4e912a8/librocksdb_sys/libtitan_sys/titan/src/db_impl.cc:1447
10: _ZN7rocksdb7titandb11TitanDBImpl12BackgroundGCEPNS_9LogBufferEj
at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/4e912a8/librocksdb_sys/libtitan_sys/titan/src/db_impl_gc.cc:236
11: _ZN7rocksdb7titandb11TitanDBImpl16BackgroundCallGCEv
at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/4e912a8/librocksdb_sys/libtitan_sys/titan/src/db_impl_gc.cc:136
12: _ZNKSt8functionIFvvEEclEv
at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/std_function.h:687
_ZN7rocksdb14ThreadPoolImpl4Impl8BGThreadEm
at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/4e912a8/librocksdb_sys/rocksdb/util/threadpool_imp.cc:266
13: _ZN7rocksdb14ThreadPoolImpl4Impl15BGThreadWrapperEPv
at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/4e912a8/librocksdb_sys/rocksdb/util/threadpool_imp.cc:307
14: execute_native_thread_routine
15: start_thread
16: __clone
"] [location=components/engine_rocks/src/event_listener.rs:108] [thread_name=]
[2022/09/26 23:48:30.808 +08:00] [INFO] [lib.rs:80] ["Welcome to TiKV"]
[2022/09/26 23:48:30.808 +08:00] [INFO] [lib.rs:85] ["Release Version: 5.2.0"]
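As an aside, the start_key/end_key values in the unsafe destroy range log lines above are hex-encoded TiKV keys, so the printable prefix can be recovered with a quick decode (a sketch using xxd; the interleaved FF bytes appear to be group markers from TiKV's memcomparable key encoding and are not printable):

```shell
# The logged start_key begins with the hex bytes 6D757369632D6B76,
# which are plain ASCII; decoding them reveals the key prefix.
printf '6D757369632D6B76' | xxd -r -p
# prints: music-kv
```

This can help identify which table or keyspace the destroy-range operation was targeting when correlating the GC activity with the corrupted SST.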

Besides scaling in/out, is there any other way to handle this? And how can it be avoided? The root cause of the trigger is currently unclear.

The error indicates SST file corruption; the cause of the corruption is unclear.
Based on the description, the basic recovery approach is:
1. Run tikv-ctl bad-ssts to scan the TiKV instance and locate the corrupted SST files
2. Use the scan output to identify the affected Regions, then delete the corrupted SST files
3. Remove the Region peers covered by the deleted SST files
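The steps above can be sketched as commands. The cluster name, host, ports, and Region/Store IDs below are placeholders; the SST path is the one from the panic log. bad-ssts needs exclusive access to the data directory, and removing peers is destructive, so treat this as an outline rather than a recipe:

```shell
# 1. Stop the affected TiKV instance first; bad-ssts must not run
#    against a live instance. With tiup:
tiup cluster stop <cluster-name> -N <tikv-host>:<tikv-port>

# 2. Scan the stopped instance for corrupted SST files.
#    (The docs use --data-dir; as noted later in this thread,
#    the v5.2 binary still accepts --db.)
tikv-ctl --data-dir /data/tidb/tikv/11161/tikv/data bad-ssts --pd <pd-addr>

# 3. Move the corrupted SST reported by the scan out of the way
#    (keep a copy rather than deleting outright).
mv /data/tidb/tikv/11161/tikv/data/db/5989462.sst /tmp/sst-backup/

# 4. Remove the Region peer that the SST covered, e.g. via pd-ctl:
pd-ctl -u http://<pd-addr> operator add remove-peer <region-id> <store-id>
```

Step 4 relies on the other replicas of the Region being healthy; if a majority of replicas are lost, unsafe recovery would be needed instead.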



In the documentation, the tikv-ctl bad-ssts parameter --db has been replaced by --data-dir.


In practice, what works is still --db; the output is:

./tikv-ctl bad-ssts --pd <pd> --db ../data/db/
[2022/09/27 14:17:28.303 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=10.193.204.145:12389]
[2022/09/27 14:17:28.303 +08:00] [INFO] [<unknown>] ["Disabling AF_INET6 sockets because ::1 is not available."]
[2022/09/27 14:17:28.303 +08:00] [INFO] [<unknown>] ["TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter"]
[2022/09/27 14:17:28.304 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fcf0182e1b0 for subchannel 0x7fcf04612ec0"]
[2022/09/27 14:17:28.304 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://10.193.73.145:12387]
[2022/09/27 14:17:28.305 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fcf0182e2d0 for subchannel 0x7fcf04613080"]
[2022/09/27 14:17:28.305 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://10.193.204.145:12389]
[2022/09/27 14:17:28.306 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fcf0182e3f0 for subchannel 0x7fcf04612ec0"]
[2022/09/27 14:17:28.306 +08:00] [INFO] [util.rs:668] ["connected to PD member"] [endpoints=http://10.193.204.145:12389]
[2022/09/27 14:17:28.306 +08:00] [INFO] [util.rs:536] ["all PD endpoints are consistent"] [endpoints="[\"10.193.204.145:12389\"]"]
--------------------------------------------------------
corruption analysis has completed

It did not return any useful information.

Haven't you already scaled in the corrupted TiKV node?

The files have not been deleted, so I can still take a look. But doesn't the bad-ssts check also require the running TiKV instance to be shut down first?

Also, on the release-5.2 branch, the tikv-ctl code is inconsistent with the documentation (TiKV Control User Guide | PingCAP Docs).

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.