普罗米修斯
1
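【TiDB environment】Production
【TiDB version】TiDB 3.0
【Problem】
After a power outage in the machine room we brought TiDB back up, but one TiKV instance fails to start. Its log shows the following error: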
[FATAL] [lib.rs:499] ["[region 27] 11522647 unexpected raft log index: last_index 5177432 < applied_index 5177434"]
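To make the numbers in this message easier to compare across restarts, here is a minimal Python sketch that pulls them out of a log line. The pattern is derived only from the single message quoted in this thread, not from a documented TiKV log format, and the function name, the `peer` label for the second number, and the `missing_entries` field are illustrative assumptions.

```python
import re

# Pattern for the message text seen in this thread. The wording is an
# assumption based on that one log line, not on a documented format.
# The second number (11522647) is read here as the peer ID; that too is
# an assumption.
RAFT_INDEX_ERROR = re.compile(
    r"\[region (?P<region>\d+)\] (?P<peer>\d+) unexpected raft log index: "
    r"last_index (?P<last_index>\d+) < applied_index (?P<applied_index>\d+)"
)


def parse_raft_index_error(line):
    """Return the numeric fields of the error as a dict, or None if absent."""
    match = RAFT_INDEX_ERROR.search(line)
    if match is None:
        return None
    fields = {name: int(value) for name, value in match.groupdict().items()}
    # How far the applied index is ahead of the last raft log index.
    fields["missing_entries"] = fields["applied_index"] - fields["last_index"]
    return fields


if __name__ == "__main__":
    sample = (
        '[FATAL] [lib.rs:499] ["[region 27] 11522647 unexpected raft log index: '
        'last_index 5177432 < applied_index 5177434"]'
    )
    print(parse_raft_index_error(sample))
```

For the line above, the applied index is ahead of the last raft log index by 2 entries, which fits a raft log that lost its most recent writes during the power loss.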
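【Actions taken】
1. Checked for bad regions: /home/rds/tidb-v3.0-linux-amd64/bin/tikv-ctl --db /TiDBDisk3/deploy/data/db bad-regions
2. Stopped the scheduling service.
3. Shut down the TiKV service on every node.
4. On every TiKV node, ran /home/rds/tidb-v3.0-linux-amd64/bin/tikv-ctl --db /TiDBDisk3/deploy/data/db unsafe-recover remove-fail-stores -s 170154 -r 27
(170154 is the failed TiKV store.)
5. Running this command on the failed TiKV node produced the following error:
【Resource configuration】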
普罗米修斯
2
[2023/03/15 19:16:46.109 +08:00] [FATAL] [lib.rs:499] ["[region 27] 11522647 unexpected raft log index: last_index 5177432 < applied_index 5177434"] [backtrace="stack backtrace:\n 0: 0x55bfe319b51d - backtrace::backtrace::libunwind::trace::h0500f4f2825a5d17\n at /rust/registry/src/github.com-1ecc6299db9ec823/backtrace-0.2.3/src/backtrace/libunwind.rs:54\n - backtrace::backtrace::trace::h4187244de1605a06\n at /rust/registry/src/github.com-1ecc6299db9ec823/backtrace-0.2.3/src/backtrace/mod.rs:70\n 1: 0x55bfe318fcd0 - tikv_util::set_panic_hook::{{closure}}::h195100b0bbd49cfb\n at /home/jenkins/.target/release/build/backtrace-e20a32a05fd0b8fe/out/capture.rs:79\n 2: 0x55bfe333464f - std::panicking::rust_panic_with_hook::h8d2408723e9a2bd4\n at src/libstd/panicking.rs:479\n 3: 0x55bfe333442d - std::panicking::continue_panic_fmt::hb2aaa9386c4e5e80\n at src/libstd/panicking.rs:382\n 4: 0x55bfe33343db - std::panicking::begin_panic_fmt::h1c91fada5a982dcd\n at src/libstd/panicking.rs:337\n 5: 0x55bfe291a49a - tikv::raftstore::store::peer::Peer::h114b9c5233192fb4\n at src/raftstore/store/peer.rs:0\n 6: 0x55bfe27b70c5 - tikv::raftstore::store::fsm::peer::PeerFsm::create::h796ac694c7f2d0e5\n at src/raftstore/store/fsm/peer.rs:151\n 7: 0x55bfe26a0973 - tikv::raftstore::store::fsm::store::RaftPollerBuilder<T,C>::init::{{closure}}::h4c08e0783f93ab81\n at src/raftstore/store/fsm/store.rs:750\n - engine::iterable::scan_impl::h0450662012195a8e\n at /home/jenkins/workspace/release_tidb_3.0/tikv/components/engine/src/iterable.rs:198\n - engine::iterable::Iterable::scan_cf::h6d7aa7d8bbcfb1ed\n at /home/jenkins/workspace/release_tidb_3.0/tikv/components/engine/src/iterable.rs:174\n - tikv::raftstore::store::fsm::store::RaftPollerBuilder<T,C>::init::h75225d858b84b1b1\n at src/raftstore/store/fsm/store.rs:721\n - tikv::raftstore::store::fsm::store::RaftBatchSystem::spawn::h4291f89945cb9126\n at src/raftstore/store/fsm/store.rs:1003\n 8: 0x55bfe267c982 - tikv::server::node::Node::start_store::hb6e41ce5d17f8092\n at src/server/node.rs:341\n - tikv::server::node::Node::start::h502723d45c7f0962\n at src/server/node.rs:148\n - tikv::binutil::server::run_raft_server::he798388cc8852c3a\n at src/binutil/server.rs:276\n 9: 0x55bfe26451b1 - tikv::binutil::server::run_tikv::h17cbdf211cde42b2\n at src/binutil/server.rs:79\n 10: 0x55bfe263a805 - tikv_server::main::h28eadcb59f5aa918\n at src/bin/tikv-server.rs:159\n 11: 0x55bfe25b1362 - std::rt::lang_start::{{closure}}::hd8df218522d5a046\n at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/rt.rs:64\n 12: 0x55bfe263c128 - main\n 13: 0x7f69f5917b96 - __libc_start_main\n 14: 0x55bfe25871a8 - \n 15: 0x0 - "] [location=src/raftstore/store/peer_storage.rs:494] [thread_name=main]
[2023/03/15 19:17:01.850 +08:00] [FATAL] [lib.rs:499] ["[region 27] 11522647 unexpected raft log index: last_index 5177432 < applied_index 5177434"] [backtrace="stack backtrace:\n 0: 0x561944e9c51d - backtrace::backtrace::libunwind::trace::h0500f4f2825a5d17\n at /rust/registry/src/github.com-1ecc6299db9ec823/backtrace-0.2.3/src/backtrace/libunwind.rs:54\n - backtrace::backtrace::trace::h4187244de1605a06\n at /rust/registry/src/github.com-1ecc6299db9ec823/backtrace-0.2.3/src/backtrace/mod.rs:70\n 1: 0x561944e90cd0 - tikv_util::set_panic_hook::{{closure}}::h195100b0bbd49cfb\n at /home/jenkins/.target/release/build/backtrace-e20a32a05fd0b8fe/out/capture.rs:79\n 2: 0x56194503564f - std::panicking::rust_panic_with_hook::h8d2408723e9a2bd4\n at src/libstd/panicking.rs:479\n 3: 0x56194503542d - std::panicking::continue_panic_fmt::hb2aaa9386c4e5e80\n at src/libstd/panicking.rs:382\n 4: 0x5619450353db - std::panicking::begin_panic_fmt::h1c91fada5a982dcd\n at src/libstd/panicking.rs:337\n 5: 0x56194461b49a - tikv::raftstore::store::peer::Peer::h114b9c5233192fb4\n at src/raftstore/store/peer.rs:0\n 6: 0x5619444b80c5 - tikv::raftstore::store::fsm::peer::PeerFsm::create::h796ac694c7f2d0e5\n at src/raftstore/store/fsm/peer.rs:151\n 7: 0x5619443a1973 - tikv::raftstore::store::fsm::store::RaftPollerBuilder<T,C>::init::{{closure}}::h4c08e0783f93ab81\n at src/raftstore/store/fsm/store.rs:750\n - engine::iterable::scan_impl::h0450662012195a8e\n at /home/jenkins/workspace/release_tidb_3.0/tikv/components/engine/src/iterable.rs:198\n - engine::iterable::Iterable::scan_cf::h6d7aa7d8bbcfb1ed\n at /home/jenkins/workspace/release_tidb_3.0/tikv/components/engine/src/iterable.rs:174\n - tikv::raftstore::store::fsm::store::RaftPollerBuilder<T,C>::init::h75225d858b84b1b1\n at src/raftstore/store/fsm/store.rs:721\n - tikv::raftstore::store::fsm::store::RaftBatchSystem::spawn::h4291f89945cb9126\n at src/raftstore/store/fsm/store.rs:1003\n 8: 0x56194437d982 - tikv::server::node::Node::start_store::hb6e41ce5d17f8092\n at src/server/node.rs:341\n - tikv::server::node::Node::start::h502723d45c7f0962\n at src/server/node.rs:148\n - tikv::binutil::server::run_raft_server::he798388cc8852c3a\n at src/binutil/server.rs:276\n 9: 0x5619443461b1 - tikv::binutil::server::run_tikv::h17cbdf211cde42b2\n at src/binutil/server.rs:79\n 10: 0x56194433b805 - tikv_server::main::h28eadcb59f5aa918\n at src/bin/tikv-server.rs:159\n 11: 0x5619442b2362 - std::rt::lang_start::{{closure}}::hd8df218522d5a046\n at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/rt.rs:64\n 12: 0x56194433d128 - main\n 13: 0x7f642ab11b96 - __libc_start_main\n 14: 0x5619442881a8 - \n 15: 0x0 - "] [location=src/raftstore/store/peer_storage.rs:494] [thread_name=main]
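First of all, you have plenty of nodes, so losing one is not critical. Don't panic, and don't run commands at random; take it one step at a time.
Step 1: run tiup cluster display to check the node status and pd-ctl store to check the store status, then report back to me with the results.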
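For the store check suggested above, a small Python sketch like the following can pick out the stores that are not healthy from saved pd-ctl store output. It assumes the output is JSON with a top-level "stores" list whose entries carry "store.id", "store.address" and "store.state_name"; verify that against your own PD version. The function name and the sample addresses are made up for illustration; only store ID 170154 comes from this thread.

```python
import json


def stores_not_up(pd_store_json):
    """List (id, address, state) for every store whose state is not "Up"."""
    data = json.loads(pd_store_json)
    result = []
    for entry in data.get("stores", []):
        store = entry.get("store", {})
        # A missing state is reported as "Unknown" rather than assumed healthy.
        state = store.get("state_name", "Unknown")
        if state != "Up":
            result.append((store.get("id"), store.get("address"), state))
    return result


if __name__ == "__main__":
    # Hypothetical output: the addresses and store 1 are invented.
    sample = json.dumps({
        "count": 2,
        "stores": [
            {"store": {"id": 1, "address": "10.0.0.1:20160", "state_name": "Up"}},
            {"store": {"id": 170154, "address": "10.0.0.2:20160", "state_name": "Down"}},
        ],
    })
    for store_id, address, state in stores_not_up(sample):
        print(f"store {store_id} at {address} is {state}")
```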
普罗米修斯
6
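The cluster was deployed with ansible; we wrote our own web page for monitoring TiDB.
Step 1: run tiup cluster display to check the node status and pd-ctl store to check the store status, then report back to me with the results. Can you do that?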
普罗米修斯
8
The cluster was deployed with ansible, so there is no tiup tool. The store status is the same as above.
普罗米修斯
10
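The problem is solved. All regions and leaders on the down TiKV had already been transferred elsewhere, so I manually scaled that node in and then scaled out again, and the cluster is back to normal.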
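The precondition mentioned here, that the down store no longer holds any regions or leaders, can be checked before scaling in. The sketch below is one way to do that for a single entry from pd-ctl store output. It assumes the entry has a "status" object with "leader_count" and "region_count" fields; the function names are illustrative. A counter that is absent is deliberately not treated as zero, so if your PD version omits zero-valued counters, this check will be too strict and needs adjusting.

```python
def remaining_load(store_entry):
    """Return (leader_count, region_count); None marks an absent counter."""
    status = store_entry.get("status", {})
    return status.get("leader_count"), status.get("region_count")


def is_drained(store_entry):
    """True only when both counters are present and equal to zero."""
    leaders, regions = remaining_load(store_entry)
    return leaders == 0 and regions == 0


if __name__ == "__main__":
    # Hypothetical entry for the failed store from this thread.
    entry = {
        "store": {"id": 170154, "state_name": "Down"},
        "status": {"leader_count": 0, "region_count": 0},
    }
    print("safe to scale in" if is_drained(entry) else "still holds data")
```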
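Keep at least four TiKV nodes with plenty of disk space and you will run into far fewer problems; at the very least, the chance of the cluster becoming unavailable is fairly small.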
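One possible reading of the four-node advice is simple replica arithmetic. The sketch below assumes the default of 3 replicas per region, at most one replica of a region per store, and that a failed store costs each affected region exactly one replica. The function names are illustrative, and this is a back-of-the-envelope model, not how PD actually schedules.

```python
def quorum(replicas):
    """Smallest majority of a raft group with the given replica count."""
    return replicas // 2 + 1


def keeps_quorum(replicas, lost):
    """True if a region that lost `lost` replicas still has a majority."""
    return replicas - lost >= quorum(replicas)


def can_restore_full_replication(total_stores, failed_stores, replicas):
    """True if enough healthy stores remain to host every replica again."""
    return total_stores - failed_stores >= replicas


if __name__ == "__main__":
    replicas = 3  # assumed default replica count
    for stores in (3, 4):
        print(
            f"{stores} stores, 1 failed:",
            "quorum kept" if keeps_quorum(replicas, 1) else "quorum lost",
            "| full replication restorable:",
            can_restore_full_replication(stores, 1, replicas),
        )
```

Under these assumptions, a 3-store cluster that loses one store keeps quorum but has nowhere to place the third replica, while a 4-store cluster can re-replicate onto the remaining three stores, which matches what happened in this thread before the scale-in.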