v7.5.6: TiKV node fails after increasing the Region size

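【TiDB environment】Production
【TiDB version】v7.5.6
【Steps to reproduce】
Deployed a v7.5.6 cluster, changed the three parameters below, and ran a load test against the cluster: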
set config pd schedule.max-merge-region-size=96;
set config pd schedule.max-merge-region-keys=960000;
coprocessor.region-split-size: 128MB

【Problem: symptoms and impact】
While generating the load-test data, one TiKV node failed.

【ERROR log excerpt】
[2025/05/14 11:59:36.432 +08:00] [INFO] [resource_group.rs:182] [“add resource group”] [ru=2147483647] [name=default] [thread_id=1]
[2025/05/14 11:59:36.432 +08:00] [INFO] [resource_group.rs:182] [“add resource group”] [ru=2147483647] [name=default] [thread_id=16]
[2025/05/14 11:59:36.433 +08:00] [INFO] [client.rs:413] [“[global_config] start watch global config”] [revision=7094] [path=resource_group/settings] [thread_id=16]
[2025/05/14 11:59:36.433 +08:00] [FATAL] [common.rs:180] [“panic_mark_file /data1/tidb-data/tikv-20160/panic_mark_file exists, there must be something wrong with the db. Do not remove the panic_mark_file and force the TiKV node to restart. Please contact TiKV maintainers to investigate the issue. If needed, use scale in and scale out to replace the TiKV node. https://docs.pingcap.com/tidb/stable/scale-tidb-using-tiup”] [thread_id=1]

This is not where the first panic happened. Please post the first error in the log, or provide the complete logs.
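To locate the first panic in a large tikv.log, a short script can help. The following is a minimal sketch, assuming each log entry starts with `[timestamp] [LEVEL] [source]` as in the excerpts in this thread; the function name `first_fatal` is made up for illustration and is not part of any TiKV tooling.

```python
import re
from typing import Iterable, Optional

# Matches the "[timestamp] [LEVEL] [file:line]" prefix seen in the excerpts above.
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<src>[^\]]+)\]")


def first_fatal(lines: Iterable[str]) -> Optional[str]:
    """Return the first log line whose level is FATAL, or None if there is none."""
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("level") == "FATAL":
            return line.rstrip("\n")
    return None


# Usage (hypothetical path):
#   with open("tikv.log", encoding="utf-8", errors="replace") as f:
#       print(first_fatal(f))
```

Because the file is read line by line, this works on logs of the size attached here without loading them fully into memory.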

tikv_stderr.log (12.5 MB)
tikv.log (12.6 MB)

[2025/05/14 11:29:27.614 +08:00] [FATAL] [lib.rs:512] [“Failed to recover sst file: /016283.sst, error: file still exists, it may belong L0, damaged_files:[name:"/016283.sst", smallest_key:[122, 116, 128, 0, 0, 0, 0, 0, 0, 255, 116, 95, 105, 128, 0, 0, 0, 0, 255, 0, 0, 1, 3, 128, 0, 0, 0, 255, 5, 243, 228, 139, 3, 128, 0, 0, 255, 0, 0, 128, 164, 139, 0, 0, 0, 252, 249, 164, 204, 183, 235, 255, 255, 253], largest_key:[122, 116, 128, 0, 0, 0, 0, 0, 0, 255, 116, 95, 105, 128, 0, 0, 0, 0, 255, 0, 0, 1, 3, 128, 0, 0, 0, 255, 5, 253, 121, 30, 3, 128, 0, 0, 255, 0, 0, 157, 21, 93, 0, 0, 0, 252, 249, 164, 204, 179, 3, 195, 255, 247], elapsed_secs:10.976487963]”] [backtrace=" 0: tikv_util::set_panic_hook::{{closure}}\n at /workspace/source/tikv/components/tikv_util/src/lib.rs:511:18\n 1: <alloc::boxed::Box<F,A> as core::ops::function::Fn>::call\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2032:9\n std::panicking::rust_panic_with_hook\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:692:13\n 2: std::panicking::begin_panic_handler::{{closure}}\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:579:13\n 3: std::sys_common::backtrace::__rust_end_short_backtrace\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:137:18\n 4: rust_begin_unwind\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:575:5\n 5: core::panicking::panic_fmt\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panicking.rs:65:14\n 6: engine_rocks_helper::sst_recovery::RecoveryRunner::set_panic_mark_and_panic\n at 
/workspace/source/tikv/components/engine_rocks_helper/src/sst_recovery.rs:191:9\n engine_rocks_helper::sst_recovery::RecoveryRunner::must_file_not_exist\n at /workspace/source/tikv/components/engine_rocks_helper/src/sst_recovery.rs:202:17\n engine_rocks_helper::sst_recovery::RecoveryRunner::check_overlap_damaged_regions\n at /workspace/source/tikv/components/engine_rocks_helper/src/sst_recovery.rs:175:13\n 7: alloc::vec::Vec<T,A>::retain_mut::process_loop\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/vec/mod.rs:1641:21\n alloc::vec::Vec<T,A>::retain_mut\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/vec/mod.rs:1667:9\n alloc::vec::Vec<T,A>::retain\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/vec/mod.rs:1561:9\n engine_rocks_helper::sst_recovery::RecoveryRunner::check_damaged_files\n at /workspace/source/tikv/components/engine_rocks_helper/src/sst_recovery.rs:143:9\n <engine_rocks_helper::sst_recovery::RecoveryRunner as tikv_util::worker::pool::RunnableWithTimer>::on_timeout\n at /workspace/source/tikv/components/engine_rocks_helper/src/sst_recovery.rs:65:9\n tikv_util::worker::pool::Worker::start_with_timer_impl::{{closure}}\n at /workspace/source/tikv/components/tikv_util/src/worker/pool.rs:511:25\n <core::future::from_generator::GenFuture as core::future::future::Future>::poll\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/future/mod.rs:91:19\n <tracker::tls::TrackedFuture as core::future::future::Future>::poll::{{closure}}\n at /workspace/source/tikv/components/tracker/src/tls.rs:64:23\n std::thread::local::LocalKey::try_with\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/local.rs:446:16\n std::thread::local::LocalKey::with\n at 
/root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/local.rs:422:9\n <tracker::tls::TrackedFuture as core::future::future::Future>::poll\n at /workspace/source/tikv/components/tracker/src/tls.rs:62:9\n <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll\n at /workspace/.cargo/registry/src/mirrors.tuna.tsinghua.edu.cn-df7c3c540f42cdbd/futures-util-0.3.31/src/future/future/map.rs:55:37\n <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll\n at /workspace/.cargo/registry/src/mirrors.tuna.tsinghua.edu.cn-df7c3c540f42cdbd/futures-util-0.3.31/src/lib.rs:86:13\n yatp::task::future::RawTask::poll\n at /workspace/.cargo/git/checkouts/yatp-e704b73c3ee279b6/5572a78/src/task/future.rs:59:9\n 8: yatp::task::future::TaskCell::poll\n at /workspace/.cargo/git/checkouts/yatp-e704b73c3ee279b6/5572a78/src/task/future.rs:103:9\n <yatp::task::future::Runner as yatp::pool::runner::Runner>::handle\n at /workspace/.cargo/git/checkouts/yatp-e704b73c3ee279b6/5572a78/src/task/future.rs:387:20\n 9: <tikv_util::yatp_pool::YatpPoolRunner as yatp::pool::runner::Runner>::handle\n at /workspace/source/tikv/components/tikv_util/src/yatp_pool/mod.rs:199:24\n yatp::pool::worker::WorkerThread<T,R>::run\n at /workspace/.cargo/git/checkouts/yatp-e704b73c3ee279b6/5572a78/src/pool/worker.rs:48:13\n yatp::pool::builder::LazyBuilder::build::{{closure}}\n at /workspace/.cargo/git/checkouts/yatp-e704b73c3ee279b6/5572a78/src/pool/builder.rs:114:25\n std::sys_common::backtrace::rust_begin_short_backtrace\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:121:18\n 10: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:551:17\n <core::panic::unwind_safe::AssertUnwindSafe as 
core::ops::function::FnOnce<()>>::call_once\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:271:9\n std::panicking::try::do_call\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:483:40\n std::panicking::try\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:447:19\n std::panic::catch_unwind\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panic.rs:137:14\n std::thread::Builder::spawn_unchecked::{{closure}}\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:550:30\n core::ops::function::FnOnce::call_once{{vtable.shim}}\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:513:5\n 11: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2000:9\n <alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2000:9\n std::sys::unix::thread::thread::new::thread_start\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys/unix/thread.rs:108:17\n 12: start_thread\n 13: clone\n"] [location=components/engine_rocks_helper/src/sst_recovery.rs:191] [thread_name=sst-recovery-0] [thread_id=21]
[2025/05/14 11:29:44.918 +08:00] [INFO] [lib.rs:88] [“Welcome to TiKV”] [thread_id=1]

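This should be the error. After it, this TiKV node restarts roughly once a minute; searching the log for "Welcome to TiKV" returns a huge number of hits.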
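The restart frequency can be measured rather than estimated by collecting the timestamps of the "Welcome to TiKV" entries. This is a rough sketch, assuming the timestamp format shown in the log excerpts (`2025/05/14 11:29:44.918 +08:00`); the helper names are hypothetical.

```python
from datetime import datetime
from typing import Iterable, List


def restart_times(lines: Iterable[str]) -> List[datetime]:
    """Collect the timestamp of every 'Welcome to TiKV' startup entry."""
    times = []
    for line in lines:
        if "Welcome to TiKV" not in line or not line.startswith("["):
            continue
        end = line.find("]")
        if end == -1:
            continue
        # Timestamp looks like: 2025/05/14 11:29:44.918 +08:00
        times.append(datetime.strptime(line[1:end], "%Y/%m/%d %H:%M:%S.%f %z"))
    return times


def restart_gaps_seconds(times: List[datetime]) -> List[float]:
    """Seconds elapsed between each pair of consecutive startups."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]
```

A run of gaps close to 60 seconds would confirm the once-a-minute restart loop described above.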

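https://github.com/tikv/tikv/issues/16308 looks like the same problem was hit before. The log shows that a corrupted SST file caused the error, and that TiKV had already detected the corruption before it panicked. The disk is the most likely culprit, but the root cause is hard to confirm.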

[2025/05/14 11:29:16.164 +08:00] [WARN] [event_listener.rs:131] [“detected rocksdb background error”] [err=“Corruption: block checksum mismatch: stored = 2499878532, computed = 4049305679, type = 1 in /data1/tidb-data/tikv-20160/db /016283.sst offset 2575206 size 14610”] [sst=/016283.sst] [reason=compaction] [thread_id=176]
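To check that the file named in this warning is the same one the later panic complains about, the corruption message can be parsed into its fields. This is a sketch only, assuming the message wording shown in the entry above; the regular expression and the function name `parse_corruption` are illustrative assumptions, not a stable log format.

```python
import re
from typing import Dict, Optional

# Field layout copied from the WARN entry above; wording may differ in other versions.
CORRUPTION_RE = re.compile(
    r"block checksum mismatch: stored = (?P<stored>\d+), "
    r"computed = (?P<computed>\d+).*? in (?P<path>.*?\.sst) "
    r"offset (?P<offset>\d+) size (?P<size>\d+)"
)


def parse_corruption(line: str) -> Optional[Dict[str, object]]:
    """Extract the SST file name and checksum details from a corruption message."""
    m = CORRUPTION_RE.search(line)
    if m is None:
        return None
    return {
        "sst": m.group("path").rsplit("/", 1)[-1],  # file name without directory
        "stored": int(m.group("stored")),
        "computed": int(m.group("computed")),
        "offset": int(m.group("offset")),
        "size": int(m.group("size")),
    }
```

Applied to the entry above, the `sst` field comes out as `016283.sst`, matching the file in the "Failed to recover sst file" panic eleven seconds later.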

The workaround is to use tikv-ctl to remove the Regions affected by the damaged SST file and then recover this TiKV node: https://docs.pingcap.com/zh/tidb/v7.5/tikv-control/#打印损坏的-sst-文件信息

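Did you set the larger values at initial deployment, or change them afterwards? Could the mismatch, with some settings at 96 and others at 128, be a factor?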