TiKV went down for no apparent reason


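[TiDB version]
[Problem encountered]
[Steps to reproduce] What operations led to the problem
[Symptoms and impact]
A TiKV node went down for no apparent reason and will not start.
[Attachments]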

Please provide the version of each component, such as cdc/tikv. You can get it by running cdc version or tikv-server --version.


Judging by the first line of the log, the node cannot register with PD. Can you confirm that PD is healthy?


Try tiup cluster restart.

tiup cluster display shows PD as normal.

Can you ping the PD machines from the host where TiKV went down?

This is a production environment, so we cannot restart the whole cluster.

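In my experience, if the errors in the TiKV log at a given point in time are related to PD, it is also worth checking the PD log for that same point in time.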

Both ping and telnet get through.

We only keep one day of logs, so the earlier logs are gone and we do not know when the node went down.

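My guess is that the original fault has already cleared. Try starting the failed TiKV node first; if that does not work, go on to check the PD log around the time of the failure.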

I tried starting this TiKV node manually with tiup cluster start clustername -N <TiKV node>, but it will not start.

What does Up|UI mean for this PD? On our other clusters, Up|L and UI are shown on the same PD node.

It relates to the Dashboard URL: UI means that the Dashboard is hosted on that PD node.

L marks the PD leader.

You can try opening the Dashboard in a browser through another PD node's address and port; you will see the IP and port redirect to the PD node marked UI.

Would restarting just that one machine be an option?

Thanks, understood. Starting the TiKV node manually also fails, with this error:
[FATAL] [lib.rs:465] ["[region 128472] 128475 data is corrupted at 260926: WireError(TruncatedMessage)"] [backtrace=" 0: tikv_util::set_panic_hook::{{closure}}
at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/tikv_util/src/lib.rs:464:18
1: std::panicking::rust_panic_with_hook
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:626:17
2: std::panicking::begin_panic_handler::{{closure}}
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:519:13
3: std::sys_common::backtrace::__rust_end_short_backtrace
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:141:18
4: rust_begin_unwind
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:515:5
5: std::panicking::begin_panic_fmt
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:457:5
6: raftstore::store::util::parse_data_at::{{closure}}
at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/util.rs:666:9
7: core::result::Result<T,E>::unwrap_or_else
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/result.rs:1065:23
raftstore::store::util::parse_data_at
at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/util.rs:665:5
8: raftstore::store::fsm::apply::ApplyDelegate::handle_raft_entry_normal
at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/apply.rs:1041:23
raftstore::store::fsm::apply::ApplyDelegate::handle_raft_committed_entries
at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/apply.rs:969:43
9: raftstore::store::fsm::apply::ApplyFsm::handle_apply
at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/apply.rs:3290:9
10: raftstore::store::fsm::apply::ApplyFsm::handle_tasks
at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/apply.rs:3584:25
11: <raftstore::store::fsm::apply::ApplyPoller<EK,W> as batch_system::batch::PollHandler<raftstore::store::fsm::apply::ApplyFsm,raftstore::store::fsm::apply::ControlFsm>>::handle_normal
at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/apply.rs:3831:9
12: batch_system::batch::Poller<N,C,Handler>::poll
at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/batch-system/src/batch.rs:408:27
batch_system::batch::BatchSystem<N,C>::start_poller::{{closure}}
at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/batch-system/src/batch.rs:536:17
std::sys_common::backtrace::__rust_begin_short_backtrace
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:125:18
13: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/thread/mod.rs:476:17
<std::panic::AssertUnwindSafe as core::ops::function::FnOnce<()>>::call_once
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panic.rs:347:9
std::panicking::try::do_call
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:401:40
std::panicking::try
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:365:19
std::panic::catch_unwind
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panic.rs:434:14
std::thread::Builder::spawn_unchecked::{{closure}}
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/thread/mod.rs:475:30
core::ops::function::FnOnce::call_once{{vtable.shim}}
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/ops/function.rs:227:5
14: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/alloc/src/boxed.rs:1572:9
<alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/alloc/src/boxed.rs:1572:9
std::sys::unix::thread::thread::new::thread_start
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys/unix/thread.rs:91:17
15: start_thread
16: __clone
"] [location=/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/util.rs:666] [thread_name=apply-1]

According to the log, the data is corrupted.
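When only a short window of logs is kept, it can help to pull the panic lines out of the TiKV log quickly and see which regions they name. The following is a minimal sketch, assuming the log uses the same line format as the FATAL entry quoted above; the function name corrupted_regions and the log path in the usage line are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list the region IDs named in "data is corrupted" FATAL lines.
# Assumes the log line format quoted in this thread, for example:
#   [FATAL] [lib.rs:465] ["[region 128472] 128475 data is corrupted at ..."]
corrupted_regions() {
  local log_file=$1
  grep -F '[FATAL]' "$log_file" \
    | grep -F 'data is corrupted' \
    | sed -n 's/.*\[region \([0-9][0-9]*\)\].*/\1/p' \
    | sort -u
}

# Usage (the path is a placeholder, not taken from the thread):
#   corrupted_regions /path/to/tikv.log
```

For the entry in this thread, the function would print the region ID 128472. It only reads the log and does not touch the store's data.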

That does not work either. Starting it through tiup or directly with tikv-server both fail with the error in the log above.

tiup cluster scale-in clustername -N ip:port
tiup cluster scale-out clustername <topology file>
Scaling the node in and then back out solved it, but I still do not know the root cause.
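The replacement procedure can be sketched as an ordered command list. This is a hedged sketch, not the poster's exact session: the cluster name, node address and topology file name are placeholders, and the display and prune steps are additions that assume a tiup version that provides tiup cluster prune. The function only prints the commands so they can be reviewed before anything is run against a production cluster.

```shell
#!/usr/bin/env bash
# Sketch: print the commands for replacing a broken TiKV store.
# All three arguments are placeholders supplied by the operator.
replace_tikv_commands() {
  local cluster=$1 node=$2 topology=$3
  # 1. Record the current state of every component.
  printf 'tiup cluster display %s\n' "$cluster"
  # 2. Take the broken store offline (for TiKV this is asynchronous).
  printf 'tiup cluster scale-in %s -N %s\n' "$cluster" "$node"
  # 3. Check again; wait until the store is reported as Tombstone.
  printf 'tiup cluster display %s\n' "$cluster"
  # 4. Clean up Tombstone nodes (assumes this subcommand is available).
  printf 'tiup cluster prune %s\n' "$cluster"
  # 5. Add the store back using a topology file that lists it.
  printf 'tiup cluster scale-out %s %s\n' "$cluster" "$topology"
}

# Example with made-up values:
replace_tikv_commands "${CLUSTER:-clustername}" "${NODE:-10.0.0.3:20160}" "${TOPOLOGY:-scale-out.yaml}"
```

The new store starts empty and receives its regions from the remaining replicas, which is likely why this cleared the error, but it does not explain how the original entry was corrupted.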