After upgrading TiDB to 7.1, the TiKV CDC module frequently causes TiKV instances to crash and restart

[TiDB environment] Production / Test / PoC
Production
[TiDB version]
TiDB 7.1.6
[Reproduction path] What operations led to the problem
Upgraded from 5.0 to 7.1

[Problem: symptoms and impact]
Symptom:
TiKV instances frequently panic (tikv_util::set_panic_hook::{{closure}}) and restart

Log output:
[2025/01/23 07:33:00.927 +08:00] [INFO] [peer.rs:3832] [“starts destroy”] [is_latest_initialized=false] [is_peer_initialized=true] [merged_by_target=true] [peer_id=648414849] [region_id=648414846]
[2025/01/23 07:33:00.927 +08:00] [INFO] [peer.rs:1537] [“begin to destroy”] [peer_id=648414849] [region_id=648414846]
[2025/01/23 07:33:00.927 +08:00] [INFO] [peer_storage.rs:1051] [“finish clear peer meta”] [takes=5.988µs] [raft_key=1] [apply_key=1] [meta_key=1] [region_id=648414846]
[2025/01/23 07:33:00.928 +08:00] [INFO] [peer.rs:1641] [“peer destroy itself”] [keep_data=true] [clean=true] [takes=728.058µs] [peer_id=648414849] [region_id=648414846]
[2025/01/23 07:33:00.928 +08:00] [INFO] [router.rs:283] [“shutdown mailbox”] [region_id=648414846]
[2025/01/23 07:33:00.929 +08:00] [INFO] [peer.rs:3832] [“starts destroy”] [is_latest_initialized=false] [is_peer_initialized=true] [merged_by_target=true] [peer_id=648414894] [region_id=648414891]
[2025/01/23 07:33:00.929 +08:00] [INFO] [peer.rs:1537] [“begin to destroy”] [peer_id=648414894] [region_id=648414891]
[2025/01/23 07:33:00.929 +08:00] [INFO] [peer_storage.rs:1051] [“finish clear peer meta”] [takes=5.338µs] [raft_key=1] [apply_key=1] [meta_key=1] [region_id=648414891]
[2025/01/23 07:33:00.929 +08:00] [INFO] [peer.rs:1641] [“peer destroy itself”] [keep_data=true] [clean=true] [takes=804.331µs] [peer_id=648414894] [region_id=648414891]
[2025/01/23 07:33:00.929 +08:00] [INFO] [router.rs:283] [“shutdown mailbox”] [region_id=648414891]
[2025/01/23 07:33:00.931 +08:00] [INFO] [peer.rs:3832] [“starts destroy”] [is_latest_initialized=false] [is_peer_initialized=true] [merged_by_target=true] [peer_id=648414872] [region_id=648414869]
[2025/01/23 07:33:00.931 +08:00] [INFO] [peer.rs:1537] [“begin to destroy”] [peer_id=648414872] [region_id=648414869]
[2025/01/23 07:33:00.931 +08:00] [INFO] [peer_storage.rs:1051] [“finish clear peer meta”] [takes=7.7µs] [raft_key=1] [apply_key=1] [meta_key=1] [region_id=648414869]
[2025/01/23 07:33:00.932 +08:00] [INFO] [peer.rs:1641] [“peer destroy itself”] [keep_data=true] [clean=true] [takes=661.759µs] [peer_id=648414872] [region_id=648414869]
[2025/01/23 07:33:00.932 +08:00] [INFO] [router.rs:283] [“shutdown mailbox”] [region_id=648414869]
[2025/01/23 07:33:01.715 +08:00] [INFO] [endpoint.rs:577] [“the max gap of leader resolved-ts is large”] [last_resolve_attempt=None] [duration_to_last_update_safe_ts=19961ms] [min_memory_lock=None] [txn_num=1] [lock_num=109412] [min_lock=“Some((TimeStamp(455498356728266754), TxnLocks { lock_count: 109412, sample_lock: Some(74800000000000A1815F69800000000000000303800000000591723B03800000046F1D633B) }))”] [applied_index=6] [read_state=“ReadState { idx: 6, ts: 455498356728266754 }”] [gap=445765ms] [region_id=648417214]
[2025/01/23 07:33:01.715 +08:00] [INFO] [endpoint.rs:599] [“the max gap of follower safe-ts is large”] [oldest_candidate=None] [latest_candidate=None] [applied_index=6] [duration_to_last_consume_leader=19836ms] [resolved_ts=455498356728266754] [safe_ts=455498356728266754] [gap=445765ms] [region_id=648417210]
[2025/01/23 07:33:01.715 +08:00] [INFO] [endpoint.rs:620] [“the max gap of follower resolved-ts is large; it’s the same region that has the min safe-ts”]
[2025/01/23 07:33:03.896 +08:00] [FATAL] [lib.rs:510] [“region 7771722 commit_ts: TimeStamp(455498448898097162), resolved_ts: TimeStamp(455498473814884365)”] [backtrace=" 0: tikv_util::set_panic_hook::{{closure}}\n at /workspace/source/tikv/components/tikv_util/src/lib.rs:509:18\n 1: <alloc::boxed::Box<F,A> as core::ops::function::Fn>::call\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2032:9\n std::panicking::rust_panic_with_hook\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:692:13\n 2: std::panicking::begin_panic_handler::{{closure}}\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:579:13\n 3: std::sys_common::backtrace::__rust_end_short_backtrace\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:137:18\n 4: rust_begin_unwind\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:575:5\n 5: core::panicking::panic_fmt\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panicking.rs:65:14\n 6: cdc::delegate::Delegate::sink_txn_put\n at /workspace/source/tikv/components/cdc/src/delegate.rs:859:21\n cdc::delegate::Delegate::sink_put\n at /workspace/source/tikv/components/cdc/src/delegate.rs:818:13\n cdc::delegate::Delegate::sink_data\n at /workspace/source/tikv/components/cdc/src/delegate.rs:678:33\n cdc::delegate::Delegate::on_batch\n at /workspace/source/tikv/components/cdc/src/delegate.rs:547:17\n 7: cdc::endpoint::Endpoint<T,E,S>::on_multi_batch\n at /workspace/source/tikv/components/cdc/src/endpoint.rs:833:33\n <cdc::endpoint::Endpoint<T,E,S> as tikv_util::worker::pool::Runnable>::run\n at 
/workspace/source/tikv/components/cdc/src/endpoint.rs:1203:18\n 8: tikv_util::worker::pool::Worker::start_with_timer_impl::{{closure}}\n at /workspace/source/tikv/components/tikv_util/src/worker/pool.rs:502:25\n <core::future::from_generator::GenFuture as core::future::future::Future>::poll\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/future/mod.rs:91:19\n yatp::task::future::RawTask::poll\n at /workspace/.cargo/git/checkouts/yatp-e704b73c3ee279b6/5523a9a/src/task/future.rs:59:9\n 9: yatp::task::future::TaskCell::poll\n at /workspace/.cargo/git/checkouts/yatp-e704b73c3ee279b6/5523a9a/src/task/future.rs:103:9\n <yatp::task::future::Runner as yatp::pool::runner::Runner>::handle\n at /workspace/.cargo/git/checkouts/yatp-e704b73c3ee279b6/5523a9a/src/task/future.rs:387:20\n 10: <tikv_util::yatp_pool::YatpPoolRunner as yatp::pool::runner::Runner>::handle\n at /workspace/source/tikv/components/tikv_util/src/yatp_pool/mod.rs:193:24\n yatp::pool::worker::WorkerThread<T,R>::run\n at /workspace/.cargo/git/checkouts/yatp-e704b73c3ee279b6/5523a9a/src/pool/worker.rs:48:13\n yatp::pool::builder::LazyBuilder::build::{{closure}}\n at /workspace/.cargo/git/checkouts/yatp-e704b73c3ee279b6/5523a9a/src/pool/builder.rs:114:25\n std::sys_common::backtrace::rust_begin_short_backtrace\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:121:18\n 11: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:551:17\n <core::panic::unwind_safe::AssertUnwindSafe as core::ops::function::FnOnce<()>>::call_once\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:271:9\n std::panicking::try::do_call\n at 
/root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:483:40\n std::panicking::try\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:447:19\n std::panic::catch_unwind\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panic.rs:137:14\n std::thread::Builder::spawn_unchecked::{{closure}}\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:550:30\n core::ops::function::FnOnce::call_once{{vtable.shim}}\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:513:5\n 12: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2000:9\n <alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2000:9\n std::sys::unix::thread::thread::new::thread_start\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys/unix/thread.rs:108:17\n 13: start_thread\n 14: clone\n"] [location=components/cdc/src/delegate.rs:859] [thread_name=cdc-0]
[2025/01/23 07:33:26.113 +08:00] [INFO] [lib.rs:88] [“Welcome to TiKV”]
[2025/01/23 07:33:26.114 +08:00] [INFO] [lib.rs:93] [“Release Version: 7.1.6”]
[2025/01/23 07:33:26.114 +08:00] [INFO] [lib.rs:93] [“Edition: Community”]
[2025/01/23 07:33:26.114 +08:00] [INFO] [lib.rs:93] [“Git Commit Hash: 3a9f24b7e7cb5cc18b4c0b1d799b666cb1aa2175”]
[2025/01/23 07:33:26.114 +08:00] [INFO] [lib.rs:93] [“Git Commit Branch: HEAD”]
[2025/01/23 07:33:26.114 +08:00] [INFO] [lib.rs:93] [“UTC Build Time: Unknown (env var does not exist when building)”]
[2025/01/23 07:33:26.114 +08:00] [INFO] [lib.rs:93] [“Rust Version: rustc 1.67.0-nightly (96ddd32c4 2022-11-14)”]
[2025/01/23 07:33:26.114 +08:00] [INFO] [lib.rs:93] [“Enable Features: pprof-fp jemalloc mem-profiling portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine cloud-aws cloud-gcp cloud-azure”]
[2025/01/23 07:33:26.114 +08:00] [INFO] [lib.rs:93] [“Profile: dist_release”]
[2025/01/23 07:33:26.114 +08:00] [INFO] [mod.rs:80] [“cgroup quota: memory=Some(9223372036854771712), cpu=None, cores={12, 27, 8, 32, 2, 33, 13, 26, 23, 36, 44, 21, 24, 25, 10, 31, 4, 15, 37, 38, 42, 45, 9, 14, 22, 34, 54, 41, 46, 3, 49, 16, 53, 20, 0, 35, 19, 1, 39, 55, 5, 7, 43, 50, 40, 48, 30, 11, 18, 17, 28, 6, 47, 51, 52, 29}”]
[2025/01/23 07:33:26.114 +08:00] [INFO] [mod.rs:87] [“memory limit in bytes: 405208932352, cpu cores quota: 56”]
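
To make the FATAL line above easier to read, here is a minimal sketch that pulls the region ID and the two timestamps out of the panic message and converts them to wall-clock time. It relies on the TSO layout used by TiDB/TiKV (low 18 bits are a logical counter, the remaining bits are Unix milliseconds). The helper names (`parse_cdc_panic`, `tso_to_utc`) are hypothetical and not part of any TiKV tooling, and the regular expression is only matched against the message format seen in this log.

```python
import re
from datetime import datetime, timezone

# TSO layout: physical Unix milliseconds in the high bits,
# an 18-bit logical counter in the low bits.
TSO_LOGICAL_BITS = 18

# Matches the panic message text seen in the log above (assumed format).
PANIC_RE = re.compile(
    r"region (\d+) commit_ts: TimeStamp\((\d+)\), resolved_ts: TimeStamp\((\d+)\)"
)


def tso_physical_ms(ts: int) -> int:
    """Return the physical part of a TSO timestamp, in Unix milliseconds."""
    return ts >> TSO_LOGICAL_BITS


def tso_to_utc(ts: int) -> datetime:
    """Convert a TSO timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(tso_physical_ms(ts) / 1000, tz=timezone.utc)


def parse_cdc_panic(line: str):
    """Extract region_id, commit_ts, resolved_ts and their gap from a log line.

    Returns None when the line does not contain the panic message.
    """
    m = PANIC_RE.search(line)
    if m is None:
        return None
    region_id, commit_ts, resolved_ts = (int(g) for g in m.groups())
    return {
        "region_id": region_id,
        "commit_ts": commit_ts,
        "resolved_ts": resolved_ts,
        # Positive when resolved_ts is ahead of commit_ts.
        "lag_ms": tso_physical_ms(resolved_ts) - tso_physical_ms(commit_ts),
    }


if __name__ == "__main__":
    sample = (
        "region 7771722 commit_ts: TimeStamp(455498448898097162), "
        "resolved_ts: TimeStamp(455498473814884365)"
    )
    info = parse_cdc_panic(sample)
    print(info)
    print("commit_ts  :", tso_to_utc(info["commit_ts"]).isoformat())
    print("resolved_ts:", tso_to_utc(info["resolved_ts"]).isoformat())
```

For the line in this post, the resolved_ts is ahead of the commit_ts by roughly 95 seconds, which is the condition the CDC delegate appears to reject at delegate.rs:859.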

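Hello, this is a bug: #17656 introduces a panic into TiKV cdc module · Issue #18142 · tikv/tikv · GitHub
We already have a PR that fixes it: cdc: fix the panic introduced by #17656 by hicqu · Pull Request #18143 · tikv/tikv · GitHub
We will push to get the fix merged within the next couple of days, but the patch release will not be that quick and will probably have to wait until after the Lunar New Year holiday.
The versions currently affected by this issue are v7.5.5 and v7.1.6. If you cannot wait for v7.1.7, we suggest upgrading to v7.5.4 as a workaround for now.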
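
As a small illustration of the version guidance above, the following sketch checks a TiKV version string against the two releases named in this reply. The affected set reflects only what is stated in this thread at the time of writing, and the function names are hypothetical.

```python
import re

# Affected releases as stated in this thread (v7.1.6 and v7.5.5).
# This set is an assumption taken from the reply, not an official list.
AFFECTED_VERSIONS = {(7, 1, 6), (7, 5, 5)}

VERSION_RE = re.compile(r"v?(\d+)\.(\d+)\.(\d+)")


def parse_version(text: str):
    """Parse 'v7.1.6' or '7.1.6' into a (major, minor, patch) tuple."""
    m = VERSION_RE.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in m.groups())


def is_affected(version: str) -> bool:
    """Return True if the version is one of the releases named in this thread."""
    return parse_version(version) in AFFECTED_VERSIONS


if __name__ == "__main__":
    for v in ("7.1.6", "v7.5.4", "v7.5.5"):
        print(v, "affected" if is_affected(v) else "not listed as affected")
```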

OK, thank you very much.