Troubleshooting tikv-server restarts

tikv.zip (9.2 MB)

To get the issue resolved more quickly, please provide the following information; a clear problem description helps:

[TiDB version]
Cluster type: tidb
Cluster name: -TiDB
Cluster version: v4.0.11
SSH type: builtin
v1.3.2 tiup
Go Version: go1.13
Git Branch: release-1.3
GitHash: 2d88460
[Problem description]
A few days ago the cluster was upgraded from v4.0.7 to v4.0.11, and since then tikv-server nodes have been restarting every day.

[2021/03/06 11:03:10.144 +08:00] [INFO] [lib.rs:92] ["Welcome to TiKV"]

dmesg | grep -i kill likewise shows no OOM kill records.


Cluster topology

tiup cluster display -TiDB
Starting component `cluster`: /root/.tiup/components/cluster/v1.3.4/tiup-cluster display -TiDB
Cluster type:       tidb
Cluster name:       -TiDB
Cluster version:    v4.0.11
SSH type:           builtin
Dashboard URL:      http://192.168.157.42:2379/dashboard
ID                    Role          Host            Ports                            OS/Arch       Status   Data Dir                           Deploy Dir
--                    ----          ----            -----                            -------       ------   --------                           ----------
192.168.157.49:9093   alertmanager  192.168.157.49  9093/9094                        linux/x86_64  Up       /data/tidb-data/alertmanager-9093  /data/tidb-deploy/alertmanager-9093
192.168.157.49:3000   grafana       192.168.157.49  3000                             linux/x86_64  Up       -                                  /data/tidb-deploy/grafana-3000
192.168.157.41:2379   pd            192.168.157.41  2379/2380                        linux/x86_64  Up       /data/tidb-data/pd-2379            /data/tidb-deploy/pd-2379
192.168.157.42:2379   pd            192.168.157.42  2379/2380                        linux/x86_64  Up|L|UI  /data/tidb-data/pd-2379            /data/tidb-deploy/pd-2379
192.168.157.43:2379   pd            192.168.157.43  2379/2380                        linux/x86_64  Up       /data/tidb-data/pd-2379            /data/tidb-deploy/pd-2379
192.168.157.49:9090   prometheus    192.168.157.49  9090                             linux/x86_64  Up       /data/tidb-data/prometheus-8249    /data/tidb-deploy/prometheus-8249
192.168.157.45:4000   tidb          192.168.157.45  4000/10080                       linux/x86_64  Up       -                                  /data1/tidb-deploy/tidb-4000
192.168.157.46:4001   tidb          192.168.157.46  4001/10081                       linux/x86_64  Up       -                                  /data/tidb-deploy/tidb-4001
192.168.157.47:4000   tidb          192.168.157.47  4000/10080                       linux/x86_64  Up       -                                  /data/tidb-deploy/tidb-4000
192.168.157.48:4000   tidb          192.168.157.48  4000/10080                       linux/x86_64  Up       -                                  /data/tidb-deploy/tidb-4000
192.168.157.45:9000   tiflash       192.168.157.45  9000/8123/3930/20170/20292/8234  linux/x86_64  Up       /data1/tiflash-9000                /data/tidb-deploy/tiflash-9000
192.168.157.41:20160  tikv          192.168.157.41  20160/20180                      linux/x86_64  Up       /data1/tidb-data/tikv-20160        /data1/tidb-deploy/tikv-20160
192.168.157.41:20161  tikv          192.168.157.41  20161/20181                      linux/x86_64  Up       /data2/tidb-data/tikv-20161        /data2/tidb-deploy/tikv-20161
192.168.157.41:20162  tikv          192.168.157.41  20162/20182                      linux/x86_64  Up       /data3/tidb-data/tikv-20162        /data3/tidb-deploy/tikv-20162
192.168.157.41:20163  tikv          192.168.157.41  20163/20183                      linux/x86_64  Up       /data4/tidb-data/tikv-20163        /data4/tidb-deploy/tikv-20163
192.168.157.42:20160  tikv          192.168.157.42  20160/20180                      linux/x86_64  Up       /data1/tidb-data/tikv-20160        /data1/tidb-deploy/tikv-20160
192.168.157.42:20161  tikv          192.168.157.42  20161/20181                      linux/x86_64  Up       /data2/tidb-data/tikv-20161        /data2/tidb-deploy/tikv-20161
192.168.157.42:20162  tikv          192.168.157.42  20162/20182                      linux/x86_64  Up       /data3/tidb-data/tikv-20162        /data3/tidb-deploy/tikv-20162
192.168.157.42:20163  tikv          192.168.157.42  20163/20183                      linux/x86_64  Up       /data4/tidb-data/tikv-20163        /data4/tidb-deploy/tikv-20163
192.168.157.43:20160  tikv          192.168.157.43  20160/20180                      linux/x86_64  Up       /data1/tidb-data/tikv-20160        /data1/tidb-deploy/tikv-20160
192.168.157.43:20161  tikv          192.168.157.43  20161/20181                      linux/x86_64  Up       /data2/tidb-data/tikv-20161        /data2/tidb-deploy/tikv-20161
192.168.157.43:20162  tikv          192.168.157.43  20162/20182                      linux/x86_64  Up       /data3/tidb-data/tikv-20162        /data3/tidb-deploy/tikv-20162
192.168.157.43:20163  tikv          192.168.157.43  20163/20183                      linux/x86_64  Up       /data4/tidb-data/tikv-20163        /data4/tidb-deploy/tikv-20163
192.168.157.44:20160  tikv          192.168.157.44  20160/20180                      linux/x86_64  Up       /data1/tidb-data/tikv-20160        /data1/tidb-deploy/tikv-20160
192.168.157.44:20161  tikv          192.168.157.44  20161/20181                      linux/x86_64  Up       /data2/tidb-data/tikv-20161        /data2/tidb-deploy/tikv-20161
192.168.157.44:20162  tikv          192.168.157.44  20162/20182                      linux/x86_64  Up       /data3/tidb-data/tikv-20162        /data3/tidb-deploy/tikv-20162
192.168.157.44:20163  tikv          192.168.157.44  20163/20183                      linux/x86_64  Up       /data4/tidb-data/tikv-20163        /data4/tidb-deploy/tikv-20163
Total nodes: 27

A FATAL log entry appears right before the Welcome log:

[2021/03/06 11:02:42.967 +08:00] [FATAL] [lib.rs:482] ["Uniform::sample_single called with low >= high"] [backtrace="stack backtrace:\
   0: tikv_util::set_panic_hook::{{closure}}\
             at components/tikv_util/src/lib.rs:481\
   1: std::panicking::rust_panic_with_hook\
             at src/libstd/panicking.rs:475\
   2: std::panicking::begin_panic\
             at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/panicking.rs:404\
   3: <rand::distributions::uniform::UniformInt<usize> as rand::distributions::uniform::UniformSampler>::sample_single\
             at /home/jenkins/agent/workspace/build_tikv_multi_branch_v4.0.11/tikv/<::std::macros::panic macros>:3\
      rand::Rng::gen_range\
             at /rust/registry/src/github.com-1ecc6299db9ec823/rand-0.6.5/src/lib.rs:245\
   4: raftstore::store::worker::split_controller::sample\
             at components/raftstore/src/store/worker/split_controller.rs:86\
      raftstore::store::worker::split_controller::AutoSplitController::flush\
             at components/raftstore/src/store/worker/split_controller.rs:375\
   5: raftstore::store::worker::pd::StatsMonitor::start::{{closure}}\
             at components/raftstore/src/store/worker/pd.rs:342\
      std::sys_common::backtrace::__rust_begin_short_backtrace\
             at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/sys_common/backtrace.rs:136\
   6: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}\
             at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/thread/mod.rs:469\
      <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once\
             at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/panic.rs:318\
      std::panicking::try::do_call\
             at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/panicking.rs:292\
      std::panicking::try\
             at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8//src/libpanic_unwind/lib.rs:78\
      std::panic::catch_unwind\
             at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/panic.rs:394\
      std::thread::Builder::spawn_unchecked::{{closure}}\
             at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/thread/mod.rs:468\
      core::ops::function::FnOnce::call_once{{vtable.shim}}\
             at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libcore/ops/function.rs:232\
   7: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once\
             at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/liballoc/boxed.rs:1022\
   8: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once\
             at /rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/liballoc/boxed.rs:1022\
      std::sys_common::thread::start_thread\
             at src/libstd/sys_common/thread.rs:13\
      std::sys::unix::thread::Thread::new::thread_start\
             at src/libstd/sys/unix/thread.rs:80\
   9: start_thread\
  10: __clone\
"] [location=/rust/registry/src/github.com-1ecc6299db9ec823/rand-0.6.5/src/distributions/uniform.rs:473] [thread_name=stats-monitor]
[2021/03/06 11:03:10.144 +08:00] [INFO] [lib.rs:92] ["Welcome to TiKV"]

TiKV panicked, and the panic backtrace contains an error similar to the following:
UniformSampler::sample_single: low >= high"] [backtrace="stack backtrace:

This is a known bug: when the workload has a single Region whose read-request QPS exceeds split.qps-threshold (default 3000), the bug is triggered and TiKV panics.
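The backtrace shows the panic originates in split_controller::sample calling rand's gen_range on the stats-monitor thread. Below is a minimal sketch of that failure mode, assuming the rand 0.6 crate named in the backtrace; pick_sample and candidates are illustrative names only, not TiKV's actual code. In rand 0.6, gen_range(low, high) requires low < high, so calling it over an empty range, e.g. gen_range(0, 0), aborts with exactly the message in the FATAL entry.

// Minimal sketch, assuming rand 0.6 (the version shown in the backtrace).
// pick_sample and candidates are hypothetical names used only to
// illustrate the rand behavior, not TiKV's split_controller internals.
use rand::Rng; // Cargo.toml: rand = "0.6"

fn pick_sample(candidates: &[u64]) -> u64 {
    let mut rng = rand::thread_rng();
    // rand 0.6's gen_range(low, high) asserts low < high; with an empty
    // slice this becomes gen_range(0, 0) and panics with
    // "Uniform::sample_single called with low >= high".
    let idx = rng.gen_range(0, candidates.len());
    candidates[idx]
}

fn main() {
    let candidates: Vec<u64> = Vec::new();
    // This call panics. TiKV treats every panic as fatal, so the same
    // assertion firing in the stats-monitor thread aborts the whole
    // tikv-server process; the service is then restarted, which is why a
    // fresh "Welcome to TiKV" line appears shortly after the FATAL entry.
    println!("{}", pick_sample(&candidates));
}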

A temporary workaround:
Raise the QPS threshold. The statement below uses dynamic configuration; you can also change the corresponding item in the TiKV configuration file:
mysql> SET CONFIG tikv `split.qps-threshold` = 3000000;

Related issue: https://github.com/tikv/tikv/issues/9733


Hello, what TiDB version are you on now? Have you upgraded, and did that resolve the split.qps-threshold issue? My environment has small tables with hot reads; after lowering split.qps-threshold I also hit this bug, on v4.0.11.
