All TiKV nodes restart abnormally

[TiDB version]
V4.0.9
[Problem description]
All TiKV nodes restarted abnormally. The logs are as follows:

Feb 25 10:17:27 tidb2 systemd: tikv-20160.service: main process exited, code=exited, status=1/FAILURE
Feb 25 10:17:27 tidb2 systemd: Unit tikv-20160.service entered failed state.
Feb 25 10:17:27 tidb2 systemd: tikv-20160.service failed.
Feb 25 10:17:27 tidb2 systemd: tikv-20162.service: main process exited, code=exited, status=1/FAILURE
Feb 25 10:17:28 tidb2 systemd: Unit tikv-20162.service entered failed state.
Feb 25 10:17:28 tidb2 systemd: tikv-20162.service failed.
Feb 25 10:17:28 tidb2 systemd: tikv-20161.service: main process exited, code=exited, status=1/FAILURE
Feb 25 10:17:29 tidb2 systemd: Unit tikv-20161.service entered failed state.
Feb 25 10:17:29 tidb2 systemd: tikv-20161.service failed.

[2021/02/25 10:17:25.563 +08:00] [FATAL] [lib.rs:482] ["called `Option::unwrap()` on a `None` value"] [backtrace="stack backtrace:
0: tikv_util::set_panic_hook::{{closure}}
at components/tikv_util/src/lib.rs:481
1: std::panicking::rust_panic_with_hook
at src/libstd/panicking.rs:475
2: rust_begin_unwind
at src/libstd/panicking.rs:375
3: core::panicking::panic_fmt
at src/libcore/panicking.rs:84
4: core::panicking::panic
at src/libcore/panicking.rs:51
5: tikv::server::service::diagnostics::sys::nic_hardware_info
at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libcore/macros/mod.rs:15
tikv::server::service::diagnostics::sys::hardware_info
at src/server/service/diagnostics/sys.rs:417
6: <tikv::server::service::diagnostics::Service as kvproto::protos::diagnosticspb_grpc::Diagnostics>::server_info::{{closure}}
at src/server/service/diagnostics/mod.rs:143
<futures::future::and_then::AndThen<A,B,F> as futures::future::Future>::poll::{{closure}}::{{closure}}
at /rust/registry/src/github.com-1ecc6299db9ec823/futures-0.1.29/src/future/and_then.rs:34
core::result::Result<T,E>::map
at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libcore/result.rs:512
<futures::future::and_then::AndThen<A,B,F> as futures::future::Future>::poll::{{closure}}
at /rust/registry/src/github.com-1ecc6299db9ec823/futures-0.1.29/src/future/and_then.rs:33
futures::future::chain::Chain<A,B,C>::poll
at /rust/registry/src/github.com-1ecc6299db9ec823/futures-0.1.29/src/future/chain.rs:39
<futures::future::and_then::AndThen<A,B,F> as futures::future::Future>::poll
at /rust/registry/src/github.com-1ecc6299db9ec823/futures-0.1.29/src/future/and_then.rs:32
futures::future::catch_unwind::<impl futures::future::Future for std::panic::AssertUnwindSafe>::poll
at /rust/registry/src/github.com-1ecc6299db9ec823/futures-0.1.29/src/future/catch_unwind.rs:49
<futures::future::catch_unwind::CatchUnwind as futures::future::Future>::poll::{{closure}}
at /rust/registry/src/github.com-1ecc6299db9ec823/futures-0.1.29/src/future/catch_unwind.rs:32
std::panicking::try::do_call
at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/panicking.rs:292
std::panicking::try
at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8//src/libpanic_unwind/lib.rs:78
std::panic::catch_unwind
at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/panic.rs:394
<futures::future::catch_unwind::CatchUnwind as futures::future::Future>::poll
at /rust/registry/src/github.com-1ecc6299db9ec823/futures-0.1.29/src/future/catch_unwind.rs:32
<futures_cpupool::MySender<F,core::result::Result<::Item,::Error>> as futures::future::Future>::poll
at /rust/registry/src/github.com-1ecc6299db9ec823/futures-cpupool-0.1.8/src/lib.rs:325
7: futures_cpupool::Inner::work
at /rust/registry/src/github.com-1ecc6299db9ec823/futures-0.1.29/src/future/mod.rs:113
futures_cpupool::Builder::create::{{closure}}
at /rust/registry/src/github.com-1ecc6299db9ec823/futures-cpupool-0.1.8/src/lib.rs:427
std::sys_common::backtrace::__rust_begin_short_backtrace
at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/sys_common/backtrace.rs:136
8: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}
at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/thread/mod.rs:469
<std::panic::AssertUnwindSafe as core::ops::function::FnOnce<()>>::call_once
at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/panic.rs:318
std::panicking::try::do_call
at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/panicking.rs:292
std::panicking::try
at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8//src/libpanic_unwind/lib.rs:78
std::panic::catch_unwind
at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/panic.rs:394
std::thread::Builder::spawn_unchecked::{{closure}}
at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/thread/mod.rs:468
core::ops::function::FnOnce::call_once{{vtable.shim}}
at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libcore/ops/function.rs:232
9: <alloc::boxed::Box as core::ops::function::FnOnce>::call_once
at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/liballoc/boxed.rs:1022
10: <alloc::boxed::Box as core::ops::function::FnOnce>::call_once
at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/liballoc/boxed.rs:1022
std::sys_common::thread::start_thread
at src/libstd/sys_common/thread.rs:13
std::sys::unix::thread::thread::new::thread_start
at src/libstd/sys/unix/thread.rs:80
11: start_thread
12: clone
"] [location=/rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libcore/macros/mod.rs:15] [thread_name=debugger0]


Judging by the stack trace, this looks like the same issue as in this thread: "TiKV restarts at a fixed time every day".

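Our situation is different. TiKV restarts abnormally after a job starts, not at a fixed time. We disabled telemetry as suggested in that thread, but the abnormal restarts still occur.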

Two restarts occurred in a row. The logs are as follows:
Feb 25 16:40:20 tidb2 systemd: tikv-20162.service failed.
Feb 25 16:40:21 tidb2 systemd: tikv-20161.service: main process exited, code=exited, status=1/FAILURE
Feb 25 16:40:21 tidb2 systemd: Unit tikv-20161.service entered failed state.
Feb 25 16:40:21 tidb2 systemd: tikv-20161.service failed.
Feb 25 16:40:21 tidb2 systemd: tikv-20160.service: main process exited, code=exited, status=1/FAILURE
Feb 25 16:40:21 tidb2 systemd: Unit tikv-20160.service entered failed state.
Feb 25 16:40:21 tidb2 systemd: tikv-20160.service failed.
Feb 25 16:40:36 tidb2 run_tikv.sh: sync …
Feb 25 16:40:36 tidb2 run_tikv.sh: real#0110m0.050s
Feb 25 16:40:36 tidb2 run_tikv.sh: user#0110m0.000s
Feb 25 16:40:36 tidb2 run_tikv.sh: sys#0110m0.008s
Feb 25 16:40:36 tidb2 run_tikv.sh: ok
Feb 25 16:40:36 tidb2 run_tikv.sh: sync …
Feb 25 16:40:36 tidb2 run_tikv.sh: real#0110m0.014s
Feb 25 16:40:36 tidb2 run_tikv.sh: user#0110m0.000s
Feb 25 16:40:36 tidb2 run_tikv.sh: sys#0110m0.006s
Feb 25 16:40:36 tidb2 run_tikv.sh: ok
Feb 25 16:40:36 tidb2 run_tikv.sh: sync …
Feb 25 16:40:36 tidb2 run_tikv.sh: real#0110m0.006s
Feb 25 16:40:36 tidb2 run_tikv.sh: user#0110m0.000s
Feb 25 16:40:36 tidb2 run_tikv.sh: sys#0110m0.005s
Feb 25 16:40:36 tidb2 run_tikv.sh: ok
Feb 25 16:46:20 tidb2 systemd: tikv-20162.service: main process exited, code=exited, status=1/FAILURE
Feb 25 16:46:20 tidb2 systemd: Unit tikv-20162.service entered failed state.
Feb 25 16:46:20 tidb2 systemd: tikv-20162.service failed.
Feb 25 16:46:20 tidb2 systemd: tikv-20160.service: main process exited, code=exited, status=1/FAILURE
Feb 25 16:46:20 tidb2 systemd: Unit tikv-20160.service entered failed state.
Feb 25 16:46:21 tidb2 systemd: tikv-20160.service failed.
Feb 25 16:46:21 tidb2 systemd: tikv-20161.service: main process exited, code=exited, status=1/FAILURE
Feb 25 16:46:21 tidb2 systemd: Unit tikv-20161.service entered failed state.
Feb 25 16:46:21 tidb2 systemd: tikv-20161.service failed.
Feb 25 16:46:36 tidb2 run_tikv.sh: sync …
Feb 25 16:46:36 tidb2 run_tikv.sh: real#0110m0.056s

In that case, please check whether any network interface on the servers has no MAC address: https://github.com/tikv/tikv/pull/7889

Every time we open the Dashboard, TiKV restarts. This still happens after telemetry is disabled.


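The panic happens while TiKV is collecting system hardware information. Opening the Dashboard queries the information_schema.CLUSTER_HARDWARE table, and querying that table triggers hardware information collection.
Could you confirm whether any network interface has no MAC address?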
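
To make the failure mode above concrete, here is a minimal Python analogy (not TiKV's actual code). It assumes a hypothetical mapping from interface name to raw MAC bytes, where `None` stands for an interface that reports no MAC. The names `collect_strict`, `collect_tolerant`, and the sample interfaces are invented for illustration.

```python
from typing import Dict, List, Optional, Tuple


def format_mac(raw: bytes) -> str:
    """Render raw MAC bytes as colon-separated lowercase hex."""
    return ":".join(f"{b:02x}" for b in raw)


def collect_strict(nics: Dict[str, Optional[bytes]]) -> List[Tuple[str, str]]:
    # Assumes every interface has a MAC. A None entry raises TypeError,
    # loosely analogous to calling unwrap() on a None value in Rust.
    return [(name, format_mac(mac)) for name, mac in nics.items()]


def collect_tolerant(nics: Dict[str, Optional[bytes]]) -> List[Tuple[str, str]]:
    # Reports an empty string when an interface has no MAC,
    # so one unusual interface cannot abort the whole collection.
    return [
        (name, format_mac(mac) if mac is not None else "")
        for name, mac in nics.items()
    ]


# Hypothetical sample data: one Ethernet NIC and one interface without a MAC.
nics = {"eth0": bytes.fromhex("525400123456"), "tun0": None}

if __name__ == "__main__":
    print(collect_tolerant(nics))
```

In this sketch the strict collector fails on the first interface that has no MAC, while the tolerant one still returns a row for it with an empty value.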

Only lo has no MAC address.

MySQL [information_schema]> SELECT TYPE,DEVICE_NAME,NAME,VALUE FROM cluster_hardware WHERE device_type='net' and name='mac' and VALUE='';
+------+-------------+------+-------+
| TYPE | DEVICE_NAME | NAME | VALUE |
+------+-------------+------+-------+
| tidb | lo          | mac  |       |
| tidb | lo          | mac  |       |
| tidb | lo          | mac  |       |
| pd   | lo          | mac  |       |
| pd   | lo          | mac  |       |
| pd   | lo          | mac  |       |
+------+-------------+------+-------+
6 rows in set, 9 warnings (0.43 sec)

Please post the complete output of the ip addr command from every TiKV node.
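
Once that output is saved, a small script can help scan it for interfaces whose link line carries no hardware address. The sketch below is only an illustration: `SAMPLE` is invented text in the usual `ip addr` layout, not output from this cluster, and `parse_ip_addr` and `without_mac` are hypothetical helper names. Note that `ip addr` normally shows an all-zero address for lo, so whether lo counts as "no MAC" for the collector is not settled by this check.

```python
import re
from typing import Dict, List, Optional

# "2: eth0: <...>" starts a new interface block.
HEADER = re.compile(r"^\d+:\s+([^:\s]+):")
# "    link/ether 52:54:... brd ..." carries the hardware address, if any.
LINK = re.compile(r"^\s+link/(\S+)(?:\s+((?:[0-9a-fA-F]{2}:)+[0-9a-fA-F]{2}))?")

# Invented sample in the usual `ip addr` layout (an assumption, not real data).
SAMPLE = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
3: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc fq state UNKNOWN
    link/none
"""


def parse_ip_addr(text: str) -> Dict[str, Optional[str]]:
    """Map each interface name to its hardware address, or None if absent."""
    result: Dict[str, Optional[str]] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            # Drop a peer suffix such as "eth0@if5".
            current = header.group(1).split("@")[0]
            result[current] = None
            continue
        link = LINK.match(line)
        if link and current is not None:
            result[current] = link.group(2)
    return result


def without_mac(text: str) -> List[str]:
    """List interfaces whose link line shows no hardware address."""
    return [name for name, mac in parse_ip_addr(text).items() if mac is None]


if __name__ == "__main__":
    print(without_mac(SAMPLE))
```

To use it on real data, replace `SAMPLE` with the saved output from each TiKV node and compare the reported interfaces across nodes.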