One TiKV node cannot start, reporting [FATAL] [lib.rs:491] ["attempt to overwrite compacted entries in

[TiDB environment] Production / Test / PoC
Production
[TiDB version]
v6.1.0
[Reproduction path] What operations were performed before the problem appeared

[Problem encountered: symptoms and impact]
The TiKV node is in Down state
[Resource configuration]
[Attachments: screenshots / logs / monitoring]
[FATAL] [lib.rs:491] [“attempt to overwrite compacted entries in 227990773”] [backtrace=" 0: tikv_util::set_panic_hook::{{closure}}\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/lib.rs:490:18\n 1: std::panicking::rust_panic_with_hook\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:702:17\n 2: std::panicking::begin_panic_handler::{{closure}}\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:588:13\n 3: std::sys_common::backtrace::_rust_end_short_backtrace\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:138:18\n 4: rust_begin_unwind\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:584:5\n 5: core::panicking::panic_fmt\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panicking.rs:143:14\n 6: raft_engine::memtable::MemTable::prepare_append\n 7: raft_engine::memtable::MemTable::append\n at /rust/git/checkouts/raft-engine-35ec7b0b2c07ddd2/0e066f8/src/memtable.rs:334:13\n raft_engine::memtable::MemTableAccessor::apply_append_writes\n at /rust/git/checkouts/raft-engine-35ec7b0b2c07ddd2/0e066f8/src/memtable.rs:965:21\n 8: <raft_engine::memtable::MemTableRecoverContext as raft_engine::file_pipe_log::pipe_builder::ReplayMachine>::replay\n at /rust/git/checkouts/raft-engine-35ec7b0b2c07ddd2/0e066f8/src/memtable.rs:1112:33\n raft_engine::file_pipe_log::pipe_builder::DualPipesBuilder::recover_queue::{{closure}}\n at /rust/git/checkouts/raft-engine-35ec7b0b2c07ddd2/0e066f8/src/file_pipe_log/pipe_builder.rs:265:33\n core::ops::function::impls::<impl core::ops::function::FnMut for &F>::call_mut\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:247:13\n core::ops::function::impls::<impl core::ops::function::FnOnce for &mut F>::call_once\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:280:13\n core::option::Option::map\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/option.rs:906:29\n <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/iter/adapters/map.rs:103:9\n rayon::iter::plumbing::Folder::consume_iter\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-1.5.0/src/iter/plumbing/mod.rs:178:21\n <rayon::iter::map::MapFolder<C,F> as rayon::iter::plumbing::Folder>::consume_iter\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-1.5.0/src/iter/map.rs:248:21\n rayon::iter::plumbing::Producer::fold_with\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-1.5.0/src/iter/plumbing/mod.rs:110:9\n rayon::iter::plumbing::bridge_producer_consumer::helper\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-1.5.0/src/iter/plumbing/mod.rs:438:13\n 9: rayon::iter::plumbing::bridge_producer_consumer::helper::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-1.5.0/src/iter/plumbing/mod.rs:418:21\n rayon_core::join::join_context::call_a::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/join/mod.rs:124:17\n 
<core::panic::unwind_safe::AssertUnwindSafe as core::ops::function::FnOnce<()>>::call_once\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:271:9\n std::panicking::try::do_call\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:492:40\n std::panicking::try\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:456:19\n std::panic::catch_unwind\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panic.rs:137:14\n rayon_core::unwind::halt_unwinding\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/unwind.rs:17:5\n rayon_core::join::join_context::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/join/mod.rs:141:24\n 10: rayon_core::registry::in_worker\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/registry.rs:879:13\n rayon_core::join::join_context\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/join/mod.rs:132:5\n rayon::iter::plumbing::bridge_producer_consumer::helper\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-1.5.0/src/iter/plumbing/mod.rs:416:47\n 11: rayon::iter::plumbing::bridge_producer_consumer::helper::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-1.5.0/src/iter/plumbing/mod.rs:427:21\n rayon_core::join::join_context::call_b::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/join/mod.rs:129:25\n <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::call::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/job.rs:113:21\n <core::panic::unwind_safe::AssertUnwindSafe as core::ops::function::FnOnce<()>>::call_once\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:271:9\n std::panicking::try::do_call\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:492:40\n std::panicking::try\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:456:19\n std::panic::catch_unwind\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panic.rs:137:14\n rayon_core::unwind::halt_unwinding\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/unwind.rs:17:5\n <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/job.rs:119:38\n 12: rayon_core::job::JobRef::execute\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/job.rs:59:9\n rayon_core::registry::WorkerThread::execute\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/registry.rs:753:9\n rayon_core::registry::WorkerThread::wait_until_cold\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/registry.rs:730:17\n 13: rayon_core::registry::WorkerThread::wait_until\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/registry.rs:704:13\n rayon_core::registry::main

What operation were you performing when this error was reported? Or did it suddenly end up like this?
Please describe the operation in detail.

It just happened suddenly.

Have you tried restarting? What is the current status? Is the rest of the service still working normally?

I restarted the service and also rebooted the server, but it still won't come up.

My guess is that it's a bug. You'll probably have to handle it by scaling out and then scaling in. Preserve the scene (logs and data) for now and wait for the official team to confirm.

Yes, the plan is to add a new node first and kick out the faulty node after the balance completes.

Doesn't this version have a known Raft Engine bug :thinking:? I'd recommend upgrading to the latest 6.1 release.
Alternatively, set Raft Engine recovery to a single thread, e.g. with a config change like the sketch below.
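A minimal sketch of the single-thread workaround, assuming the raft-engine.recovery-threads key from the TiKV configuration file applies here (verify the exact key against the config reference for your version; cluster name and node address are placeholders):

tiup cluster edit-config <cluster-name>
# under server_configs -> tikv, add:
#   raft-engine.recovery-threads: 1
tiup cluster restart <cluster-name> -N <tikv-host>:20160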

We don't dare to upgrade yet; the priority is to get back to three healthy nodes as quickly as possible.

Recover through scale-out/scale-in first, then.

Added a new node; it's almost there. Leaders are fully balanced, regions still have a bit to go.

The new node has joined TiKV. Now I want to kick the old, problematic TiKV out of the cluster and then add it back in.

1. scale-in the original TiKV
After executing it, the TiKV stayed in Pending Offline state indefinitely.
2. scale-in --force the original TiKV
The TiKV no longer appears in tiup.
3. scale-out the original TiKV back into the cluster
The logs report an error: a TiKV with the same IP but a different ID already exists, so the new TiKV cannot start.
4. pd-ctl can in fact still see the original TiKV's information (it shows 300+ regions remaining); running pd-ctl delete on the original TiKV's store ID returns success, but the information is still there (see the pd-ctl sketch after this list).
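For reference, a sketch of the pd-ctl checks behind step 4, using the PD address and the old store ID 8 that appear in the final steps further down (both taken from this thread; adjust for your cluster):

tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 store 8          # shows state_name and region_count (300+ here)
tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 store delete 8   # returns success, but only marks the store Offline
# the store entry is not removed until its region count drops to 0 and it turns Tombstone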

The data and deployment directories of the original TiKV are now empty, so tikv-ctl cannot be used to clear the original TiKV's region information that PD is still showing.

I've thought about changing this server's IP, which should let it scale back into the cluster normally. Is there any other way besides that?

The scale-in steps were wrong; you'll have to do an unsafe recovery now (a sketch follows).
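If it really does come to unsafe recovery (normally only needed when a majority of a Region's replicas is lost), a hedged sketch of the v6.1 online unsafe recovery commands in pd-ctl, reusing the PD address and store ID 8 from later in this thread:

tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 unsafe remove-failed-stores 8
tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 unsafe remove-failed-stores show   # check progress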

tiup cluster display no longer shows this TiKV node. Can I still use unsafe recovery?

Yes. tiup is just for management and display; the actual metadata is still in PD. You can check with pd-ctl store, as in the sketch below.
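A quick check along those lines (PD address taken from the final steps below); the old node's address should still appear in the store list with its state_name and region_count:

tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 store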

In the end, the original faulty TiKV was re-added to the cluster through the following steps:
1. Remove all of the original TiKV's region peers from PD
for i in $(tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 region store 8 | grep -B 1 start_key | grep id | awk '{print $2}' | sed 's/,//')
do
tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 operator add remove-peer $i 8
done
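Before step 2, it may help to confirm that the old store has actually drained; it only becomes eligible for remove-tombstone once its region count reaches 0 and its state changes to Tombstone:

tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 store 8   # watch region_count go to 0 and state_name become Tombstone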

2. Clear all of the original TiKV's remaining information from PD
tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 store remove-tombstone

3. Scale out to add the original TiKV node back into the cluster
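A sketch of that scale-out, assuming a topology file named scale-out.yaml that lists the original host (the file name and host value are illustrative):

# scale-out.yaml
# tikv_servers:
#   - host: <original-tikv-ip>
tiup cluster scale-out <cluster-name> scale-out.yaml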

The new node is now balancing normally. Thanks to everyone for the enthusiastic support, and special thanks to @h5n1: the very first suggestion solved my problem!

This is a known issue (https://github.com/tikv/tikv/issues/13123) that has been fixed in 6.1.1. It's recommended to upgrade to the latest release in the 6.1.x series.
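When the upgrade can be scheduled, a sketch of the tiup command (the target version is an example; pick the latest 6.1.x patch release available):

tiup cluster upgrade <cluster-name> v6.1.1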

Many thanks.
