TiKV node panic

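【TiDB environment】Production
【TiDB version】v5.4.3
【Steps to reproduce】None
【Problem: symptoms and impact】
A TiKV node restarted. dmesg shows no OOM messages, but the TiKV log contains a panic. How can I track down the cause, or prevent this from happening again?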

[2024/04/07 21:26:25.431 +08:00] [FATAL] [lib.rs:465] ["commit_ts: TimeStamp(448920645354651662), resolved_ts: TimeStamp(448920645996904752)"] [backtrace="   0: tikv_util::set_panic_hook::{{closure}}\n             at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/lib.rs:464:18\n   1: std::panicking::rust_panic_with_hook\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:626:17\n   2: std::panicking::begin_panic_handler::{{closure}}\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:519:13\n   3: std::sys_common::backtrace::__rust_end_short_backtrace\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:141:18\n   4: rust_begin_unwind\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:515:5\n   5: std::panicking::begin_panic_fmt\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:457:5\n   6: cdc::delegate::Delegate::sink_put\n             at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/delegate.rs:630:21\n      cdc::delegate::Delegate::sink_data\n             at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/delegate.rs:544:21\n   7: cdc::delegate::Delegate::on_batch\n             at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/delegate.rs:416:17\n   8: cdc::endpoint::Endpoint<T,E>::on_multi_batch\n             at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/endpoint.rs:740:33\n      <cdc::endpoint::Endpoint<T,E> as tikv_util::worker::pool::Runnable>::run\n             at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/endpoint.rs:1548:18\n   9: 
tikv_util::worker::pool::Worker::start_with_timer_impl::{{closure}}\n             at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/worker/pool.rs:454:25\n      <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/future/mod.rs:80:19\n      yatp::task::future::RawTask<F>::poll\n             at /rust/git/checkouts/yatp-e704b73c3ee279b6/d564d19/src/task/future.rs:59:9\n  10: yatp::task::future::TaskCell::poll\n             at /rust/git/checkouts/yatp-e704b73c3ee279b6/d564d19/src/task/future.rs:103:9\n      <yatp::task::future::Runner as yatp::pool::runner::Runner>::handle\n             at /rust/git/checkouts/yatp-e704b73c3ee279b6/d564d19/src/task/future.rs:387:20\n  11: <tikv_util::yatp_pool::YatpPoolRunner<T> as yatp::pool::runner::Runner>::handle\n             at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/yatp_pool/mod.rs:104:24\n      yatp::pool::worker::WorkerThread<T,R>::run\n             at /rust/git/checkouts/yatp-e704b73c3ee279b6/d564d19/src/pool/worker.rs:48:13\n      yatp::pool::builder::LazyBuilder<T>::build::{{closure}}\n             at /rust/git/checkouts/yatp-e704b73c3ee279b6/d564d19/src/pool/builder.rs:91:25\n      std::sys_common::backtrace::__rust_begin_short_backtrace\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:125:18\n  12: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/thread/mod.rs:476:17\n      <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panic.rs:347:9\n      std::panicking::try::do_call\n             at 
/rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:401:40\n      std::panicking::try\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:365:19\n      std::panic::catch_unwind\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panic.rs:434:14\n      std::thread::Builder::spawn_unchecked::{{closure}}\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/thread/mod.rs:475:30\n      core::ops::function::FnOnce::call_once{{vtable.shim}}\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/ops/function.rs:227:5\n  13: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/alloc/src/boxed.rs:1572:9\n      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/alloc/src/boxed.rs:1572:9\n      std::sys::unix::thread::Thread::new::thread_start\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys/unix/thread.rs:91:17\n  14: start_thread\n  15: clone\n"] [location=components/cdc/src/delegate.rs:630] [thread_name=cdc-0]
[2024/04/07 21:26:45.070 +08:00] [INFO] [lib.rs:81] ["Welcome to TiKV"]
[2024/04/07 21:26:45.070 +08:00] [INFO] [lib.rs:86] ["Release Version:   5.4.3"]
[2024/04/07 21:26:45.071 +08:00] [INFO] [lib.rs:86] ["Edition:           Community"]
[2024/04/07 21:26:45.071 +08:00] [INFO] [lib.rs:86] ["Git Commit Hash:   deb149e42d97743349277ff8741f5cb9ae1c027d"]
[2024/04/07 21:26:45.071 +08:00] [INFO] [lib.rs:86] ["Git Commit Branch: heads/refs/tags/v5.4.3"]
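
When a node restarts like this, it can help to pull every FATAL entry out of tikv.log together with its location and thread_name fields, so that repeated panics are easy to compare. The following is a minimal sketch, assuming each entry sits on a single line in the bracketed format shown above. The names fatal_entries and FatalEntry are made up for illustration, and the sample backtrace is abbreviated.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Entry header as seen in the log above: [time] [LEVEL] [file:line] ["message"]
HEADER = re.compile(
    r'^\[(?P<time>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<source>[^\]]+)\] '
    r'\["(?P<message>(?:[^"\\]|\\.)*)"\]'
)
# Unquoted trailing fields such as [location=...] and [thread_name=cdc-0].
# Quoted fields like [backtrace="..."] are deliberately skipped.
FIELD = re.compile(r'\[(?P<key>\w+)=(?P<value>[^\]"]*)\]')


class FatalEntry(NamedTuple):
    time: str
    message: str
    location: Optional[str]
    thread_name: Optional[str]


def fatal_entries(lines: Iterable[str]) -> Iterator[FatalEntry]:
    """Yield one FatalEntry per FATAL-level log line."""
    for line in lines:
        header = HEADER.match(line)
        if header is None or header.group("level") != "FATAL":
            continue
        # Only look for key=value fields after the message.
        fields = dict(FIELD.findall(line[header.end():]))
        yield FatalEntry(
            header.group("time"),
            header.group("message"),
            fields.get("location"),
            fields.get("thread_name"),
        )


# Two entries modelled on the log above (backtrace shortened by hand).
sample = [
    '[2024/04/07 21:26:25.431 +08:00] [FATAL] [lib.rs:465] '
    '["commit_ts: TimeStamp(448920645354651662), '
    'resolved_ts: TimeStamp(448920645996904752)"] '
    '[backtrace="(abbreviated)"] '
    '[location=components/cdc/src/delegate.rs:630] [thread_name=cdc-0]',
    '[2024/04/07 21:26:45.070 +08:00] [INFO] [lib.rs:81] ["Welcome to TiKV"]',
]

for entry in fatal_entries(sample):
    print(entry.time, entry.thread_name, entry.location, entry.message)
```

To run it against a real log, pass an open file object instead of the sample list. Here the location and thread name point at the CDC worker rather than at the storage or Coprocessor paths.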

【Resource configuration】Go to TiDB Dashboard > Cluster Info > Hosts and attach a screenshot of that page


【Attachments: screenshots/logs/monitoring】

Does dmesg -T | grep "error" print anything?

No, nothing.

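Were there a lot of queries during that period? If the result sets are very large and gRPC cannot send data as fast as the Coprocessor produces it, that can also lead to running out of memory.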

The code does contain this assert, but I'm not familiar with the CDC code.
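
To see what the assert is complaining about, it may help to decode the two values in the panic message. A timestamp issued by PD packs a physical time in milliseconds into the high bits and an 18-bit logical counter into the low bits. The following is a minimal sketch under that assumption. The names Tso and decode_tso are made up for illustration, and reading the assert as "commit_ts must be greater than resolved_ts" is inferred from the panic message, not checked against the v5.4.3 source.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple

LOGICAL_BITS = 18                    # low bits: logical counter (assumed layout)
CST = timezone(timedelta(hours=8))   # the TiKV log above is in +08:00


class Tso(NamedTuple):
    physical_ms: int  # milliseconds since the Unix epoch
    logical: int      # counter within the same millisecond

    def wall_clock(self) -> datetime:
        # Integer arithmetic keeps the millisecond part exact.
        seconds, millis = divmod(self.physical_ms, 1000)
        base = datetime.fromtimestamp(seconds, tz=CST)
        return base + timedelta(milliseconds=millis)


def decode_tso(ts: int) -> Tso:
    """Split a raw timestamp into its physical and logical parts."""
    return Tso(ts >> LOGICAL_BITS, ts & ((1 << LOGICAL_BITS) - 1))


# The two values from the FATAL line in the original post.
commit = decode_tso(448920645354651662)
resolved = decode_tso(448920645996904752)

for name, tso in (("commit_ts", commit), ("resolved_ts", resolved)):
    stamp = tso.wall_clock().isoformat(timespec="milliseconds")
    print(f"{name:12} {stamp} logical={tso.logical}")

gap_ms = resolved.physical_ms - commit.physical_ms
print(f"commit_ts is {gap_ms} ms behind resolved_ts")
```

For these two values, commit_ts decodes to 2450 ms earlier than resolved_ts. CDC's resolved_ts is meant to be a watermark below which no further commits appear, so a commit behind it is consistent with the assert firing. That points toward a CDC-side bug rather than memory pressure, and it is worth including in an issue report.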

I've seen TiDB restart, but never TiKV. Judging from the resource usage, the CPU looks like it hit a bottleneck.

Following https://docs.pingcap.com/zh/tidb/v5.4/troubleshoot-tidb-cluster, file an issue so the developers can take a look.

OK, I'll take a look.

A panic probably means you hit a bug; try upgrading to fix it.

Try it on another version and see.

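Do you have monitoring graphs for the TiKV host? Check whether memory or CPU was saturated. Were there any large queries? The TiDB logs are worth checking too.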


CPU and memory don't appear to have been saturated.

I/O utilization is fairly high, though, and that is a bottleneck.

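For this kind of problem, filing an issue is the safer route; it is a production problem, after all. It may well have been fixed upstream already.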

It's possible that slow queries drove a CPU spike and the node then crashed.

The 5.x releases have reached the end of their life cycle. I'd recommend upgrading to a newer version, such as 6.5.x.

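But don't upgrade in place. It's best to run the old and new clusters side by side on separate resources and switch over only after compatibility testing is done.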

Check the system logs for anything abnormal.

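Did only one TiKV node restart? Is there a system setting that limits how much memory TiKV may use? I ask because system memory was not exhausted.
For TiKV panics, see the documentation:
https://docs.pingcap.com/zh/tidb/v5.4/tidb-troubleshooting-map#44-某些-tikv-大量掉-leader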

Just upgrade.