The TiKV log shows that the first panic was caused by a corrupted SST file:
[2021/04/21 14:25:10.151 +08:00] [FATAL] [lib.rs:482] ["rocksdb background error. db: kv, reason: compaction, error: Corruption: block checksum mismatch: expected 3552756717, got 2675153938 in /tidb-data/tikv-20160/db/091787.sst offset 29994297 size 29942"] [backtrace="stack backtrace:\
0: tikv_util::set_panic_hook::{{closure}}\
at components/tikv_util/src/lib.rs:481\
1: std::panicking::rust_panic_with_hook\
at src/libstd/panicking.rs:475\
2: rust_begin_unwind\
at src/libstd/panicking.rs:375\
3: std::panicking::begin_panic_fmt\
at src/libstd/panicking.rs:326\
4: <engine_rocks::event_listener::RocksEventListener as rocksdb::event_listener::EventListener>::on_background_error\
at components/engine_rocks/src/event_listener.rs:66\
5: rocksdb::event_listener::on_background_error\
at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/5345344/src/event_listener.rs:254\
6: _ZN24crocksdb_eventlistener_t17OnBackgroundErrorEN7rocksdb21BackgroundErrorReasonEPNS0_6StatusE\
at crocksdb/c.cc:2140\
7: _ZN7rocksdb12EventHelpers23NotifyOnBackgroundErrorERKSt6vectorISt10shared_ptrINS_13EventListenerEESaIS4_EENS_21BackgroundErrorReasonEPNS_6StatusEPNS_17InstrumentedMutexEPb\
at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/5345344/librocksdb_sys/rocksdb/db/event_helpers.cc:53\
8: _ZN7rocksdb12ErrorHandler10SetBGErrorERKNS_6StatusENS_21BackgroundErrorReasonE\
at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/5345344/librocksdb_sys/rocksdb/db/error_handler.cc:220\
9: _ZN7rocksdb6DBImpl20BackgroundCompactionEPbPNS_10JobContextEPNS_9LogBufferEPNS0_19PrepickedCompactionENS_3Env8PriorityE\
at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/5345344/librocksdb_sys/rocksdb/db/db_impl/db_impl_compaction_flush.cc:2797\
10: _ZN7rocksdb6DBImpl24BackgroundCallCompactionEPNS0_19PrepickedCompactionENS_3Env8PriorityE\
at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/5345344/librocksdb_sys/rocksdb/db/db_impl/db_impl_compaction_flush.cc:2317\
11: _ZN7rocksdb6DBImpl16BGWorkCompactionEPv\
at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/5345344/librocksdb_sys/rocksdb/db/db_impl/db_impl_compaction_flush.cc:2092\
12: _ZN7rocksdb14ThreadPoolImpl4Impl8BGThreadEm\
at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/5345344/librocksdb_sys/rocksdb/util/threadpool_imp.cc:266\
13: _ZN7rocksdb14ThreadPoolImpl4Impl15BGThreadWrapperEPv\
at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/5345344/librocksdb_sys/rocksdb/util/threadpool_imp.cc:307\
14: execute_native_thread_routine\
15: start_thread\
16: clone\
"] [location=components/engine_rocks/src/event_listener.rs:66] [thread_name=<unnamed>]
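If the log is long, it can help to pull out just the file, offset and checksums from each corruption message. Below is a minimal Python sketch that does this for the "block checksum mismatch" format seen in the FATAL line above. The function name `find_corrupted_sst` and the regular expression are my own illustration, not part of TiKV or RocksDB, and other RocksDB corruption messages may be worded differently.

```python
import re

# Matches the RocksDB "block checksum mismatch" message in the form seen in
# the FATAL log line above. Other corruption messages may need other patterns.
CORRUPTION_RE = re.compile(
    r"Corruption: block checksum mismatch: expected (?P<expected>\d+), "
    r"got (?P<got>\d+) in (?P<path>\S+\.sst) "
    r"offset (?P<offset>\d+) size (?P<size>\d+)"
)


def find_corrupted_sst(log_text):
    """Return one dict per checksum-mismatch message found in log_text."""
    results = []
    for m in CORRUPTION_RE.finditer(log_text):
        results.append({
            "path": m.group("path"),
            "expected": int(m.group("expected")),
            "got": int(m.group("got")),
            "offset": int(m.group("offset")),
            "size": int(m.group("size")),
        })
    return results


# The message portion of the FATAL line, without the backtrace.
sample = (
    '[2021/04/21 14:25:10.151 +08:00] [FATAL] [lib.rs:482] '
    '["rocksdb background error. db: kv, reason: compaction, error: '
    'Corruption: block checksum mismatch: expected 3552756717, got 2675153938 '
    'in /tidb-data/tikv-20160/db/091787.sst offset 29994297 size 29942"]'
)

hits = find_corrupted_sst(sample)
for hit in hits:
    print(hit["path"], hit["offset"], hit["size"])
```

To scan a real log, read the file contents into a string and pass it to the function in place of `sample`.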
The system log also shows disk corruption:
Apr 21 13:39:58 tidb-cluster-tidb kernel: ffff8c2cf6bbac00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Apr 21 13:39:58 tidb-cluster-tidb kernel: ffff8c2cf6bbac10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Apr 21 13:39:58 tidb-cluster-tidb kernel: ffff8c2cf6bbac20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Apr 21 13:39:58 tidb-cluster-tidb kernel: ffff8c2cf6bbac30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Apr 21 13:39:58 tidb-cluster-tidb kernel: ffff8c2cf6bbac40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Apr 21 13:39:58 tidb-cluster-tidb kernel: ffff8c2cf6bbac50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Apr 21 13:39:58 tidb-cluster-tidb kernel: ffff8c2cf6bbac60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Apr 21 13:39:58 tidb-cluster-tidb kernel: ffff8c2cf6bbac70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Apr 21 13:39:58 tidb-cluster-tidb kernel: XFS (vda2): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x12cdfc01 len 1 error 74
Apr 21 13:39:58 tidb-cluster-tidb kernel: XFS (vda2): page discard on page ffffd2e48e334380, inode 0x1501d40d, offset 202346496.
Apr 21 13:40:28 tidb-cluster-tidb kernel: XFS (vda2): Metadata CRC error detected at xfs_agf_read_verify+0xde/0x100 [xfs], xfs_agf block 0x12cdfc01
Apr 21 13:40:28 tidb-cluster-tidb kernel: XFS (vda2): Unmount and run xfs_repair
Apr 21 13:40:28 tidb-cluster-tidb kernel: XFS (vda2): First 128 bytes of corrupted metadata buffer:
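To check quickly whether other devices on the host report the same kind of errors, the kernel log can be filtered for XFS messages. Below is a rough Python sketch. The keyword list and the function name `xfs_errors` are my own assumptions, based only on the messages shown above, so it is not an exhaustive list of XFS error strings.

```python
import re
from collections import Counter

# Kernel XFS messages look like: "... kernel: XFS (vda2): <message>"
XFS_RE = re.compile(r"kernel: XFS \((?P<dev>[^)]+)\): (?P<msg>.*)")

# Substrings treated as signs of corruption. Assumption: taken only from the
# messages in the log excerpt above, not from a complete XFS message catalog.
KEYWORDS = ("I/O error", "CRC error", "corrupted", "xfs_repair")


def xfs_errors(lines):
    """Count suspicious XFS kernel messages per device."""
    counts = Counter()
    for line in lines:
        m = XFS_RE.search(line)
        if m and any(k in m.group("msg") for k in KEYWORDS):
            counts[m.group("dev")] += 1
    return counts


sample = [
    'Apr 21 13:39:58 tidb-cluster-tidb kernel: XFS (vda2): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x12cdfc01 len 1 error 74',
    'Apr 21 13:39:58 tidb-cluster-tidb kernel: XFS (vda2): page discard on page ffffd2e48e334380, inode 0x1501d40d, offset 202346496.',
    'Apr 21 13:40:28 tidb-cluster-tidb kernel: XFS (vda2): Metadata CRC error detected at xfs_agf_read_verify+0xde/0x100 [xfs], xfs_agf block 0x12cdfc01',
    'Apr 21 13:40:28 tidb-cluster-tidb kernel: XFS (vda2): Unmount and run xfs_repair',
    'Apr 21 13:40:28 tidb-cluster-tidb kernel: XFS (vda2): First 128 bytes of corrupted metadata buffer:',
]

for dev, n in xfs_errors(sample).items():
    print(dev, n)
```

On a real host, the lines could come from the system log file or from saved `dmesg` output instead of `sample`.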
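So the problem is most likely caused by disk corruption: the XFS metadata errors on vda2 start at 13:39, before the TiKV panic at 14:25. Try repairing the file system first (the kernel log itself says to unmount and run xfs_repair) and see whether the node recovers. If it does not, scale this TiKV node in and then scale it out again.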
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_file_systems/checking-and-repairing-a-file-system_managing-file-systems