TiKV node restarts unexpectedly: index out of bounds: the len is 6 but the index is 6

Production environment, database version v5.1.2. A TiKV node restarted unexpectedly. Error log:

[2024/03/04 05:55:16.243 +08:00] [WARN] [endpoint.rs:633] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 47439535, leader may Some(id: 47439890 store_id: 4680469)\" not_leader { region_id: 47439535 leader { id: 47439890 store_id: 4680469 } }"]
[2024/03/04 05:55:37.772 +08:00] [WARN] [endpoint.rs:633] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 47439535, leader may Some(id: 47439890 store_id: 4680469)\" not_leader { region_id: 47439535 leader { id: 47439890 store_id: 4680469 } }"]
[2024/03/04 05:55:47.186 +08:00] [WARN] [endpoint.rs:633] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 47439535, leader may Some(id: 47439890 store_id: 4680469)\" not_leader { region_id: 47439535 leader { id: 47439890 store_id: 4680469 } }"]
[2024/03/04 05:55:51.506 +08:00] [FATAL] [lib.rs:463] ["index out of bounds: the len is 6 but the index is 6"] [backtrace="stack backtrace:\n   0: tikv_util::set_panic_hook::{{closure}}\n             at components/tikv_util/src/lib.rs:462\n   1: std::panicking::rust_panic_with_hook\n             at library/std/src/panicking.rs:595\n   2: std::panicking::begin_panic_handler::{{closure}}\n             at library/std/src/panicking.rs:497\n   3: std::sys_common::backtrace::__rust_end_short_backtrace\n             at library/std/src/sys_common/backtrace.rs:141\n   4: rust_begin_unwind\n             at library/std/src/panicking.rs:493\n   5: core::panicking::panic_fmt\n             at library/core/src/panicking.rs:92\n   6: core::panicking::panic_bounds_check\n             at library/core/src/panicking.rs:69\n   7: <usize as core::slice::index::SliceIndex<[T]>>::index_mut\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/index.rs:188\n      core::slice::index::<impl core::ops::index::IndexMut<I> for [T]>::index_mut\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/index.rs:26\n      <alloc::vec::Vec<T,A> as core::ops::index::IndexMut<I>>::index_mut\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/alloc/src/vec/mod.rs:2398\n      tokio_timer::wheel::Wheel<T>::insert\n             at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.13/src/wheel/mod.rs:114\n      tokio_timer::timer::Timer<T,N>::add_entry\n             at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.13/src/timer/mod.rs:324\n   8: tokio_timer::timer::Timer<T,N>::process_queue\n             at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.13/src/timer/mod.rs:301\n   9: <tokio_timer::timer::Timer<T,N> as tokio_executor::park::Park>::park\n             at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.13/src/timer/mod.rs:361\n      
tokio_timer::timer::Timer<T,N>::turn\n             at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.13/src/timer/mod.rs:256\n  10: tikv_util::timer::start_global_timer::{{closure}}\n             at components/tikv_util/src/timer.rs:98\n  11: std::sys_common::backtrace::__rust_begin_short_backtrace\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/std/src/sys_common/backtrace.rs:125\n  12: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/std/src/thread/mod.rs:474\n  13: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/std/src/panic.rs:344\n  14: std::panicking::try::do_call\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/std/src/panicking.rs:379\n      std::panicking::try\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/std/src/panicking.rs:343\n      std::panic::catch_unwind\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/std/src/panic.rs:431\n      std::thread::Builder::spawn_unchecked::{{closure}}\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/std/src/thread/mod.rs:473\n      core::ops::function::FnOnce::call_once{{vtable.shim}}\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/ops/function.rs:227\n  15: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/alloc/src/boxed.rs:1546\n      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/alloc/src/boxed.rs:1546\n      std::sys::unix::thread::Thread::new::thread_start\n             at library/std/src/sys/unix/thread.rs:71\n  16: start_thread\n  17: clone\n"] 
[location=/rust/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.13/src/wheel/mod.rs:114] [thread_name=timer]
[2024/03/04 05:56:21.053 +08:00] [INFO] [lib.rs:81] ["Welcome to TiKV"]
[2024/03/04 05:56:21.053 +08:00] [INFO] [lib.rs:86] ["Release Version:   5.1.2"]
[2024/03/04 05:56:21.053 +08:00] [INFO] [lib.rs:86] ["Edition:           Enterprise"]

System log:

Mar  4 05:53:30  systemd-logind: Removed session 1543332.
Mar  4 05:54:01  systemd: Started Session 1543334 of user root.
Mar  4 05:54:01  systemd: Started Session 1543333 of user root.
Mar  4 05:54:51  systemd-logind: New session 1543335 of user shsnc.
Mar  4 05:54:51  systemd: Started Session 1543335 of user shsnc.
Mar  4 05:54:51  systemd-logind: Removed session 1543335.
Mar  4 05:56:01  systemd: Started Session 1543336 of user root.
Mar  4 05:56:01  systemd: Started Session 1543337 of user root.
Mar  4 05:56:05  systemd: tikv-20163.service: main process exited, code=exited, status=1/FAILURE
Mar  4 05:56:05  systemd: Unit tikv-20163.service entered failed state.
Mar  4 05:56:05  systemd: tikv-20163.service failed.
Mar  4 05:56:16  systemd-logind: New session 1543338 of user shsnc.
Mar  4 05:56:16  systemd: Started Session 1543338 of user shsnc.
Mar  4 05:56:16  systemd-logind: Removed session 1543338.
Mar  4 05:56:20  systemd: tikv-20163.service holdoff time over, scheduling restart.
Mar  4 05:56:20  systemd: Stopped tikv service.
Mar  4 05:56:20  systemd: Started tikv service.
Mar  4 05:56:21  run_tikv.sh: sync ...
Mar  4 05:56:21  run_tikv.sh: real#0110m0.037s
Mar  4 05:56:21  run_tikv.sh: user#0110m0.001s
Mar  4 05:56:21  run_tikv.sh: sys#0110m0.035s
Mar  4 05:56:21  run_tikv.sh: ok
Mar  4 05:57:38  systemd-logind: New session 1543339 of user shsnc.

This looks like a known bug. I recommend upgrading the database version.
https://github.com/pingcap/tiflash/issues/2705

See the AskTUG thread "TiKV crashes and restarts abnormally: index out of bounds: the len is 6 but the index is 6" (asktug.com). It really is a bug.

This is an old bug. Your TiKV nodes haven't been restarted for 2 years, have they?

The official notes say this was already fixed in 4.0, but this cluster is on 5.1, which is a bit strange.

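It's true that they haven't been restarted for a long time, because they have been running stably. If this is the bug, can I just wait for the remaining nodes to restart on their own? This cluster has 6 instances deployed; 3 have restarted so far and 3 have not. Is manual intervention needed? What worries me now is that an instance might not come back up after a manual restart...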

4.0, 5.1, 5.2, and 5.3 all have this bug.

TiDB is still very stable. There are plenty of TiKV nodes out there that haven't been restarted for two years :rofl:

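After we found this problem, we set up an inspection across all our production clusters. It flags affected nodes a month in advance, and we then proactively agree on a restart window with the business teams. If a cluster has never been restarted since it was built, there is a good chance that all of its nodes will restart abnormally at around the same time. To reduce the impact on the business, we added the inspection and restart the nodes proactively.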

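Appendix: on each cluster, you can log in to tidb-server and run the following SQL to find the TiKV nodes that have not been restarted for 2 years.
select INSTANCE, START_TIME, UPTIME, TIMESTAMPDIFF(day, START_TIME, now()) from information_schema.CLUSTER_INFO where type = 'tikv' and TIMESTAMPDIFF(day, START_TIME, now()) > 365*2;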
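The "flag a month ahead" inspection described above could be sketched roughly as follows. This is only an illustration, not the poster's actual tooling: the function names, the 30-day warning window, and the sample instances are hypothetical, and the 2-year limit is simply the `365*2` threshold from the query.

```python
from datetime import datetime

LIMIT_DAYS = 365 * 2  # same threshold as the SQL query in this thread
WARN_DAYS = 30        # hypothetical window: flag nodes about a month ahead


def days_until_limit(start_time: datetime, now: datetime) -> int:
    """Days left before a node reaches LIMIT_DAYS of uptime (negative if past it)."""
    return LIMIT_DAYS - (now - start_time).days


def nodes_to_schedule(nodes, now):
    """Return (instance, days_left) for nodes inside the warning window.

    `nodes` is a list of (instance, start_time) pairs, e.g. rows fetched
    from information_schema.CLUSTER_INFO. Most urgent nodes come first.
    """
    due = []
    for instance, start_time in nodes:
        remaining = days_until_limit(start_time, now)
        if remaining <= WARN_DAYS:
            due.append((instance, remaining))
    return sorted(due, key=lambda item: item[1])


# Hypothetical sample data, for illustration only.
sample = [
    ("tikv-a:20160", datetime(2022, 3, 10)),
    ("tikv-b:20160", datetime(2023, 6, 1)),
    ("tikv-c:20160", datetime(2022, 1, 1)),
]
for instance, days_left in nodes_to_schedule(sample, datetime(2024, 3, 4)):
    print(instance, days_left)
```

Sorting by days left puts nodes that are already past the limit at the top, which matches restarting them one at a time in order of urgency.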

I see this one is filed as a TiFlash bug, but our cluster doesn't use TiFlash, only TiKV.

It's covered in this thread.

OK, I'll take a look. I wonder whether it's because the nodes went so long without a restart.

It should be the same kind of bug. Upgrading to a newer version will definitely fix it.


Upgrade when you get the chance.

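Check the last start time of every node. If any others are about to reach 2 years without a restart, restart them manually one at a time. This bug doesn't affect anything else; it is purely the process miscalculating time.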

So you worked out "2 years without a restart" from when the bug showed up? Haha :joy:

TiKV running over 2 years may panic · Issue #11940 · tikv/tikv · GitHub


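The cause is described here. Roughly, versions before 5.3 all use tokio-timer, and once a TiKV process has been running for about 2 years, a calculation error in it makes the process restart abnormally.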
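To make the "about 2 years" figure concrete, here is a small sketch of the arithmetic. It is based on my own reading of how a hierarchical timer wheel like the one in tokio-timer 0.2 picks a level, and is not something confirmed in this thread: I am assuming 6 levels (which matches "the len is 6" in the panic), 64 slots per level, and 1 ms ticks. The names below are illustrative, not the crate's API.

```python
NUM_LEVELS = 6  # assumption; matches "the len is 6" in the panic message
SLOT_BITS = 6   # assumption: 64 slots per level
TICK_MS = 1     # assumption: the wheel counts in milliseconds


def level_for(elapsed_ms: int, when_ms: int) -> int:
    """Pick a wheel level from the highest bit where the two times differ."""
    masked = elapsed_ms ^ when_ms
    assert masked != 0, "deadline must differ from the current time"
    significant = masked.bit_length() - 1
    return significant // SLOT_BITS


# Largest span the wheel can represent: 64**6 ticks = 2**36 ms.
max_ms = (1 << (NUM_LEVELS * SLOT_BITS)) * TICK_MS
print(max_ms / 1000 / 86400)  # roughly 795 days, a bit over 2 years

# A timer whose deadline falls just past that span maps to level 6,
# which is out of range for a list holding only levels 0..5.
print(level_for(max_ms - 1, max_ms))  # 6
```

Under these assumptions the limit is reached after roughly 795 days of uptime, which is consistent with the reports here of nodes panicking after a little more than 2 years without a restart.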


Learned something, thanks :+1:

Learned something.


The good old restart fix.