A single TiKV node going down made the whole cluster unavailable

【TiDB Environment】Production
【TiDB Version】
v4.0.14
【Reproduction Steps】What operations led to the issue
TiDB v4.0.14 deployed on cloud servers with 3 TiKV nodes. One TiKV went down and, after a restart, would not come back up and kept panicking, which left the entire cluster unavailable.
【Problem Encountered: Symptoms and Impact】

We eventually resolved it by scaling out a new TiKV node, stopping the broken one, and restarting the remaining two TiKV nodes.
【Resource Configuration】Go to TiDB Dashboard - Cluster Info - Hosts and screenshot that page
【Attachments: Screenshots/Logs/Monitoring】
Here is the TiKV error log:

{"log":"[2023/07/20 12:49:05.709 +08:00] [FATAL] [lib.rs:481] [\"to_commit 1238767 is out of range [last_index 1238765], raft_id: 893827, region_id: 893825\"] [backtrace=\"stack backtrace:\\n   0: tikv_util::set_panic_hook::{{closure}}\\n             at components/tikv_util/src/lib.rs:480\\n   1: std::panicking::rust_panic_with_hook\\n             at src/libstd/panicking.rs:475\\n   2: rust_begin_unwind\\n             at src/libstd/panicking.rs:375\\n   3: std::panicking::begin_panic_fmt\\n             at src/libstd/panicking.rs:326\\n   4: raft::raft_log::RaftLog\u003cT\u003e::commit_to\\n             at home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/\u003c::std::macros::panic macros\u003e:9\\n   5: raft::raft::Raft\u003cT\u003e::handle_heartbeat\\n             at rust/git/checkouts/raft-rs-841f8a6db665c5c0/b5f5830/src/raft.rs:1877\\n   6: raft::raft::Raft\u003cT\u003e::step_follower\\n             at rust/git/checkouts/raft-rs-841f8a6db665c5c0/b5f5830/src/raft.rs:1718\\n      raft::raft::Raft\u003cT\u003e::step\\n             at rust/git/checkouts/raft-rs-841f8a6db665c5c0/b5f5830/src/raft.rs:1129\\n   7: raft::raw_node::RawNode\u003cT\u003e::step\\n             at rust/git/checkouts/raft-rs-841f8a6db665c5c0/b5f5830/src/raw_node.rs:339\\n      raftstore::store::peer::Peer::step\\n             at home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/peer.rs:941\\n      raftstore::store::fsm::peer::PeerFsmDelegate\u003cT,C\u003e::on_raft_message\\n             at home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/peer.rs:1206\\n   8: raftstore::store::fsm::peer::PeerFsmDelegate\u003cT,C\u003e::handle_msgs\\n             at home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/peer.rs:455\\n   9: \u003craftstore::store::fsm::store::RaftPoller\u003cT,C\u003e as batch_system::batch::PollHandler\u003craftstore::store::fsm::peer::PeerFsm\u003cengine_rocks::engine::RocksEngine\u003e,raftstore::store::fsm::store::StoreFsm\u003e\u003e::handle_normal\\n             at home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/store.rs:785\\n  10: batch_system::batch::Poller\u003cN,C,Handler\u003e::poll\\n             at home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/batch-system/src/batch.rs:325\\n  11: batch_system::batch::BatchSystem\u003cN,C\u003e::spawn::{{closure}}\\n             at home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/batch-system/src/batch.rs:402\\n      std::sys_common::backtrace::__rust_begin_short_backtrace\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/sys_common/backtrace.rs:136\\n  12: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/thread/mod.rs:469\\n      \u003cstd::panic::AssertUnwindSafe\u003cF\u003e as core::ops::function::FnOnce\u003c()\u003e\u003e::call_once\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/panic.rs:318\\n      std::panicking::try::do_call\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/panicking.rs:292\\n      
std::panicking::try\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8//src/libpanic_unwind/lib.rs:78\\n      std::panic::catch_unwind\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/panic.rs:394\\n      std::thread::Builder::spawn_unchecked::{{closure}}\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/thread/mod.rs:468\\n      core::ops::function::FnOnce::call_once{{vtable.shim}}\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libcore/ops/function.rs:232\\n  13: \u003calloc::boxed::Box\u003cF\u003e as core::ops::function::FnOnce\u003cA\u003e\u003e::call_once\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/liballoc/boxed.rs:1022\\n  14: \u003calloc::boxed::Box\u003cF\u003e as core::ops::function::FnOnce\u003cA\u003e\u003e::call_once\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/liballoc/boxed.rs:1022\\n      std::sys_common::thread::start_thread\\n             at src/libstd/sys_common/thread.rs:13\\n      std::sys::unix::thread::Thread::new::thread_start\\n             at src/libstd/sys/unix/thread.rs:80\\n  15: \u003cunknown\u003e\\n  16: clone\\n\"] [location=/rust/git/checkouts/raft-rs-841f8a6db665c5c0/b5f5830/src/raft_log.rs:237] [thread_name=raftstore-22185-0]\n","stream":"stderr","time":"2023-07-20T04:49:05.709255578Z"}

SHOW config WHERE NAME LIKE '%max-replicas%';
What is your max-replicas setting? Why would losing one of three TiKV nodes make the whole cluster unavailable...
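If the SQL interface is unreachable (the cluster was down at the time), the same value can also be read from PD directly. A minimal sketch, where the PD address is a placeholder and the exact invocation may differ slightly between pd-ctl versions:

pd-ctl -u http://<pd-ip>:2379 config show replication
# Look for "max-replicas" in the output; 3 means every Region keeps three copies.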

I'm wondering about that too. It is clearly configured with 3 replicas, yet when one TiKV broke, the whole cluster became unreachable.

Please post the output of tiup display.
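For reference, a minimal sketch of the command being asked for, assuming a tiup-managed cluster (the cluster name mycluster is a placeholder; it does not apply to Docker deployments):

tiup cluster display mycluster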

It's deployed with Docker.

Then check what errors the other TiKV and TiDB nodes are logging.

The other TiKV nodes all kept reporting this error during that period:

[2023/07/20 12:47:26.961 +08:00] [INFO] [<unknown>] ["Connect failed: {\"created\":\"@1689828446.961442851\",\"description\":\"Failed to connect to remote host: Connection timed out\",\"errno\":110,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.5.3/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":205,\"os_error\":\"Connection timed out\",\"syscall\":\"getsockopt(SO_ERROR)\",\"target_address\":\"ipv4:10.0.1.11:10000\"}"]
[2023/07/20 12:47:26.961 +08:00] [INFO] [<unknown>] ["Subchannel 0x7f5daeeb9b80: Retry immediately"]
[2023/07/20 12:47:26.961 +08:00] [INFO] [<unknown>] ["Failed to connect to channel, retrying"]
[2023/07/20 12:47:26.961 +08:00] [WARN] [raft_client.rs:296] ["RPC batch_raft fail"] [err="Some(RpcFailure(RpcStatus { status: 14-UNAVAILABLE, details: Some(\"failed to connect to all addresses\") }))"] [sink_err="Some(RpcFinished(Some(RpcStatus { status: 14-UNAVAILABLE, details: Some(\"failed to connect to all addresses\") })))"] [to_addr=10.0.1.11:10000]
[2023/07/20 12:47:26.961 +08:00] [WARN] [raft_client.rs:199] ["send to 10.0.1.11:10000 fail, the gRPC connection could be broken"]
[2023/07/20 12:47:26.961 +08:00] [ERROR] [transport.rs:163] ["send raft msg err"] [err="Other(\"[src/server/raft_client.rs:208]: RaftClient send fail\")"]

And the errors on the TiDB nodes during the same period:

[2023/07/20 04:47:20.901 +00:00] [INFO] [client_batch.go:348] ["batchRecvLoop fails when receiving, needs to reconnect"] [target=10.0.1.11:10000] [error="rpc error: code = Unavailable desc = transport is closing"]
[2023/07/20 04:47:20.917 +00:00] [INFO] [client_batch.go:348] ["batchRecvLoop fails when receiving, needs to reconnect"] [target=10.0.1.11:10000] [error="rpc error: code = Unavailable desc = transport is closing"]
[2023/07/20 04:47:20.917 +00:00] [INFO] [client_batch.go:348] ["batchRecvLoop fails when receiving, needs to reconnect"] [target=10.0.1.11:10000] [error="rpc error: code = Unavailable desc = transport is closing"]
[2023/07/20 04:47:20.917 +00:00] [INFO] [client_batch.go:348] ["batchRecvLoop fails when receiving, needs to reconnect"] [target=10.0.1.11:10000] [error="rpc error: code = Unavailable desc = transport is closing"]
[2023/07/20 04:47:20.952 +00:00] [WARN] [client_batch.go:530] ["no available connections"] [target=10.0.1.11:10000]
[2023/07/20 04:47:20.953 +00:00] [WARN] [client_batch.go:530] ["no available connections"] [target=10.0.1.11:10000]
[2023/07/20 04:47:20.955 +00:00] [WARN] [client_batch.go:530] ["no available connections"] [target=10.0.1.11:10000]
[2023/07/20 04:47:20.957 +00:00] [WARN] [client_batch.go:530] ["no available connections"] [target=10.0.1.11:10000]
[2023/07/20 04:47:20.959 +00:00] [WARN] [client_batch.go:530] ["no available connections"] [target=10.0.1.11:10000]

Here 10.0.1.11:10000 is the TiKV node that went down.

Aren't you deployed with Docker? Try scaling out a new TiKV node first.

It has already been recovered by scaling out a new TiKV, stopping the broken one, and restarting the remaining two. But I still need to find the root cause.

I'm not sure which of those steps actually did the trick. In any case, after those operations the broken TiKV is still in the stopped state. I haven't taken it offline via pd-ctl yet; I'm keeping it around to investigate the cause.
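A minimal sketch of how the stopped store can be inspected now and taken offline later, once the investigation is done (the PD address and store ID are placeholders):

# List all stores with their state (Up / Disconnected / Down / Offline / Tombstone):
pd-ctl -u http://<pd-ip>:2379 store
# After the investigation, mark the broken store for removal so PD
# rebalances its Regions onto the healthy stores:
pd-ctl -u http://<pd-ip>:2379 store delete <store-id>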

What is your raftstore.sync-log parameter set to?

It's false.
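For anyone checking the same thing: the value can be read with the same kind of SHOW config query used earlier (SHOW config WHERE NAME LIKE '%sync-log%';) or straight from the TiKV configuration file. A sketch, assuming the config file is mounted at /etc/tikv/tikv.toml (a placeholder path for a Docker deployment):

grep 'sync-log' /etc/tikv/tikv.toml
# The option lives in the [raftstore] section; here it was sync-log = false,
# which skips fsync on Raft log writes, trading durability for write speed.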

https://docs.pingcap.com/zh/tidb/v5.1/tidb-troubleshooting-map#41-tikv-panic-启动不了
https://docs.pingcap.com/zh/tidb/v5.1/release-5.0.0#配置文件参数
https://github.com/pingcap/tidb/issues/17099
Version 4.0 has this problem: with raftstore.sync-log set to false, a TiKV node that restarts after something like a power failure can panic and fail to start, and the affected Regions must then be recovered with the tikv-ctl tool. The parameter was removed in 5.0, where the behavior is effectively always true.
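For completeness, a rough sketch of the kind of tikv-ctl local-mode recovery referred to above; the data path, PD address, and Region ID are placeholders, the TiKV process must be stopped first, and the troubleshooting map linked above should be followed before running anything like this against real data:

# Scan the stopped TiKV's data for damaged Regions (local mode):
tikv-ctl --db /path/to/tikv/data/db bad-regions
# If healthy replicas of a damaged Region exist on the other stores, it can be
# marked tombstone on this store so the cluster re-replicates it elsewhere:
tikv-ctl --db /path/to/tikv/data/db tombstone -p <pd-ip>:2379 -r <region-id>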

OK. Then we'll look at upgrading.

But why did it make the whole cluster unavailable? Was it because that TiKV kept restarting over and over?

Version 4.0 is quite old by now, so problems like this are hard to avoid. I'd recommend upgrading to at least 5.4, which is much more stable.

Do you mean it's unavoidable that the whole cluster becomes unavailable after a TiKV panic?

I haven't used 4.0. On 5.x and 6.x, with 3 replicas across 3 TiKV nodes, losing one node definitely does not stop the remaining two from serving traffic. For an issue on 4.0 it's hard to pin down the root cause unless you dig through the source code; the community staff will probably just recommend upgrading...

OK, thanks a lot :grinning: