TiKV restarted unexpectedly in production

[TiDB environment] Production
[TiDB version] 3.0.8

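[Problem: symptoms and impact]
Today memory usage on two TiKV nodes dropped sharply. On checking, I found that the TiKV service on both servers had restarted, at 22:02:00 on November 15.
[Resource configuration] 5 TiKV, 2 TiDB server, 3 PD (5kv-2server-3pd)
[Attachments: screenshots/logs/monitoring]
tidb.log on the TiDB node: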


The affected TiKV server (103):

The FATAL entry in it is:

[2022/11/15 22:01:36.060 +08:00] [FATAL] [lib.rs:499] ["index out of bounds: the len is 6 but the index is 6"] [backtrace="stack backtrace:\n   0:     0x56119bd6978d - backtrace::backtrace::libunwind::trace::h958f5f3eb75b2917\n                        at /rust/registry/src/github.com-1ecc6299db9ec823/backtrace-0.2.3/src/backtrace/libunwind.rs:54\n                         - backtrace::backtrace::trace::hdf994f7eb3c12b81\n                        at /rust/registry/src/github.com-1ecc6299db9ec823/backtrace-0.2.3/src/backtrace/mod.rs:70\n   1:     0x56119bd5f490 - tikv_util::set_panic_hook::{{closure}}::hf6c0260b2e4aea39\n                        at /home/jenkins/.target/release/build/backtrace-e20a32a05fd0b8fe/out/capture.rs:79\n   2:     0x56119bf090ff - std::panicking::rust_panic_with_hook::h8d2408723e9a2bd4\n                        at src/libstd/panicking.rs:479\n   3:     0x56119bf08edd - std::panicking::continue_panic_fmt::hb2aaa9386c4e5e80\n                        at src/libstd/panicking.rs:382\n   4:     0x56119bf18745 - rust_begin_unwind\n                        at src/libstd/panicking.rs:309\n   5:     0x56119bf232eb - core::panicking::panic_fmt::h79e840586f23493b\n                        at src/libcore/panicking.rs:85\n   6:     0x56119bf22df3 - core::panicking::panic_bounds_check::h9293ee7846bbb139\n                        at src/libcore/panicking.rs:61\n   7:     0x56119bd675fe - <usize as core::slice::SliceIndex<[T]>>::index_mut::h8595bb10a522f6ec\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libcore/slice/mod.rs:2700\n                         - core::slice::<impl core::ops::index::IndexMut<I> for [T]>::index_mut::h72f80a2aa264d71e\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libcore/slice/mod.rs:2561\n                         - <alloc::vec::Vec<T> as core::ops::index::IndexMut<I>>::index_mut::hf0b55e5b5c64c8b5\n                        at 
/rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/liballoc/vec.rs:1768\n                         - tokio_timer::wheel::Wheel<T>::insert::ha0811b20f18170ea\n                        at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.12/src/wheel/mod.rs:114\n                         - tokio_timer::timer::Timer<T,N>::add_entry::h417301f6c9308779\n                        at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.12/src/timer/mod.rs:324\n   8:     0x56119bd666bf - tokio_timer::timer::Timer<T,N>::process_queue::h39928e46093b552d\n                        at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.12/src/timer/mod.rs:0\n   9:     0x56119bd65e0b - <tokio_timer::timer::Timer<T,N> as tokio_executor::park::Park>::park::hdf0d553584b8cb22\n                        at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.12/src/timer/mod.rs:361\n                         - tokio_timer::timer::Timer<T,N>::turn::h67305fd216915f0e\n                        at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.12/src/timer/mod.rs:256\n  10:     0x56119bd654b3 - tikv_util::timer::start_global_timer::{{closure}}::h418812a1f8076ef5\n                        at components/tikv_util/src/timer.rs:94\n  11:     0x56119bd651e5 - std::sys_common::backtrace::__rust_begin_short_backtrace::h7a15993f24510b40\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/sys_common/backtrace.rs:77\n  12:     0x56119bd651d5 - std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}::hc908468389262d3d\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/thread/mod.rs:470\n  13:     0x56119bd651c5 - <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::hea2226c234dcd674\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/panic.rs:309\n  14:     0x56119bd651b8 - 
std::panicking::try::do_call::h5ae108cfa09964f0\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/panicking.rs:294\n                         - std::panicking::try::hb747e5c89d6c749b\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250//src/libpanic_abort/lib.rs:29\n                         - std::panic::catch_unwind::h9d095dd8e5bc9701\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/panic.rs:388\n                         - std::thread::Builder::spawn_unchecked::{{closure}}::hd8b8a97aa6180ba0\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/thread/mod.rs:469\n                         - core::ops::function::FnOnce::call_once{{vtable.shim}}::hc4c2684293abcd20\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libcore/ops/function.rs:231\n  15:     0x56119bf1774e - <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once::he71721d2d956d451\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/liballoc/boxed.rs:746\n  16:     0x56119bf19a7b - <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once::he520045b8d28ce5c\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/liballoc/boxed.rs:746\n                         - std::sys_common::thread::start_thread::h2e98d1272dc6d74b\n                        at src/libstd/sys_common/thread.rs:13\n                         - std::sys::unix::thread::Thread::new::thread_start::h18485805666ccd3c\n                        at src/libstd/sys/unix/thread.rs:79\n  17:     0x7f42e149add4 - start_thread\n  18:     0x7f42e0ba202c - __clone\n  19:                0x0 - <unknown>"] [location=/rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libcore/slice/mod.rs:2700] [thread_name=timer]
[2022/11/15 22:02:01.278 +08:00] [INFO] [mod.rs:26] ["Welcome to TiKV."]
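To confirm the crash and restart times on each node, the log can be scanned for FATAL entries and for the startup banner. The following is a minimal sketch, assuming the log lines follow the `[timestamp] [LEVEL] [source] ["message"]` layout seen in the excerpt above; the `scan` helper and the `LINE_RE` pattern are hypothetical names written for this illustration, not part of any TiKV tooling.

```python
import re

# Leading fields of a log line as seen in the excerpt:
# [timestamp] [LEVEL] [source] ["message"] ...
LINE_RE = re.compile(
    r'^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[[^\]]+\] \["(?P<msg>(?:[^"\\]|\\.)*)"\]'
)
STARTUP_MSG = "Welcome to TiKV."  # banner logged when the process starts


def scan(lines):
    """Return (kind, timestamp, message) tuples for FATAL entries and process starts."""
    events = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # continuation lines or lines in another format
        if m["level"] == "FATAL":
            events.append(("FATAL", m["ts"], m["msg"]))
        elif m["msg"] == STARTUP_MSG:
            events.append(("START", m["ts"], m["msg"]))
    return events


if __name__ == "__main__":
    sample = [
        '[2022/11/15 22:01:36.060 +08:00] [FATAL] [lib.rs:499] '
        '["index out of bounds: the len is 6 but the index is 6"] [backtrace="..."]',
        '[2022/11/15 22:02:01.278 +08:00] [INFO] [mod.rs:26] ["Welcome to TiKV."]',
    ]
    for kind, ts, msg in scan(sample):
        print(kind, ts, msg)
```

On a real node, pass an open file object instead of the sample list, for example `scan(open("tikv.log"))`, where the path is a placeholder for wherever the TiKV log lives on that server.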

The messages log on node 103:

Both restarted TiKV nodes show the same problem. Monitoring is attached:
image
image
image
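The anomalies in the graphs are the two affected servers.
What is the cause of this? One node failing would be tolerable, but two failing at the same moment is rather alarming.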

That version is rather old.

This is a known bug:
index out of bounds: the len is 6 but the index is 6

It was only fixed in later 4.x releases…
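For readers wondering how an index can equal the length here: the backtrace points at `tokio_timer::wheel::Wheel<T>::insert`, which indexes a vector of wheel levels. The sketch below is a simplified Python model of a hierarchical timer wheel, not the tokio-timer source. The 6 levels match "the len is 6" in the panic message; the 64 slots per level and the `level_for` and `insert` helpers are assumptions made for illustration. Under those assumptions, a deadline that crosses the 2**36 tick boundary selects level 6, one past the last valid index. Whether this is what happened on these nodes is not confirmed in this thread.

```python
NUM_LEVELS = 6   # matches "the len is 6" in the panic message
SLOT_BITS = 6    # assumption: 64 slots per level
SLOT_MASK = (1 << SLOT_BITS) - 1


def level_for(elapsed: int, when: int) -> int:
    """Pick a level from the highest bit in which the two tick counts differ."""
    masked = (elapsed ^ when) | SLOT_MASK
    significant = masked.bit_length() - 1
    return significant // SLOT_BITS


def insert(levels: list, elapsed: int, when: int) -> int:
    """Store `when` in its level; raises IndexError if the level is out of range."""
    level = level_for(elapsed, when)
    levels[level].append(when)
    return level


if __name__ == "__main__":
    levels = [[] for _ in range(NUM_LEVELS)]
    print("near deadline goes to level", insert(levels, 0, 100))

    boundary = 1 << (SLOT_BITS * NUM_LEVELS)  # 2**36 ticks
    try:
        insert(levels, boundary - 1, boundary + 5)
    except IndexError:
        print("level", level_for(boundary - 1, boundary + 5), "is out of range")
```

If one tick were one millisecond (also an assumption), 2**36 ticks would be roughly 795 days of timer uptime, so nodes started at the same time would reach the boundary together.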

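If memory consumption was high, that can be analyzed as well, so there is no need to panic.
3.x also has a slow query log, which should be enough to help you pinpoint it.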

I checked the slow SQL on both TiDB server nodes. There were no slow queries with high memory consumption at that time.
image

That said, if this is a bug, the cause is settled. Could you share a link to the bug? Thanks.

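You can search for it on GitHub. You should also upgrade to a newer version:
many problems in older versions have already been fixed in newer ones.
From the community's side, the minimum is to upgrade to 4.x or later.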

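A production upgrade will have to wait until next year, so for now it would be best to see whether this can be avoided.
I just searched GitHub and could not find a case matching mine, so I can only hope we do not hit this again.
Thanks.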
