TiCDC replication tasks stuck; after restarting CDC, TiKV keeps restarting

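【TiDB environment】Production
【TiDB version】v5.4.0
【Problem encountered: symptoms and impact】
In the early morning, the 4 CDC tasks on this cluster fell more than 30 minutes behind. The task state was shown as normal, but the checkpoint could not advance. We had hit similar problems before and recovered by restarting CDC.
Unexpectedly, after the restart only two of the 4 tasks recovered; the other two still could not advance their checkpoint.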


At the same time, the TiKV nodes started going down and restarting.

The TiKV log keeps reporting: cdc initialize fail,failed to send extra message,cdc initialize fail: Request error message: "peer is not leader for region

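We then tried restarting the TiDB, PD, and TiKV components. The two CDC tasks still did not recover, and TiKV kept restarting.
After that we paused the problematic tasks, and TiKV stopped restarting. What it looks like is that these two tasks could not fetch TiKV change data starting from the point in time where the problem occurred, so initialization failed and TiKV kept restarting.
Later we confirmed with the business side that replication could resume from a point some time after the failure, so we recreated the tasks with a specified timestamp that skipped half an hour. The new tasks have no problems, and the checkpoint advances normally.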
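
For readers who want to reproduce the "skip half an hour" step, here is a minimal sketch of how a wall-clock time can be turned into a TSO-style timestamp. It relies on the TiDB TSO layout, where the physical time in milliseconds is shifted left by 18 bits and the low 18 bits hold a logical counter. The function name `datetime_to_tso` and the example time (00:03 GMT+8 on 2023-08-10, the approximate stall time mentioned in this thread) are illustrative assumptions, not something taken from the original poster's commands.

```python
from datetime import datetime, timedelta, timezone

LOGICAL_BITS = 18  # the low 18 bits of a TSO hold the logical counter
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def datetime_to_tso(dt: datetime) -> int:
    """Convert a timezone-aware datetime to a TSO with logical part 0.

    Hypothetical helper for illustration only.
    """
    if dt.tzinfo is None:
        raise ValueError("use a timezone-aware datetime")
    # Integer milliseconds since the Unix epoch (no float rounding).
    physical_ms = (dt - EPOCH) // timedelta(milliseconds=1)
    return physical_ms << LOGICAL_BITS


if __name__ == "__main__":
    gmt8 = timezone(timedelta(hours=8))
    # Assumed stall time, taken from the "around 00:03" remark in the thread.
    stuck_at = datetime(2023, 8, 10, 0, 3, tzinfo=gmt8)
    # Skip 30 minutes past the stall, as agreed with the business side.
    start_ts = datetime_to_tso(stuck_at + timedelta(minutes=30))
    print(start_ts)
```

The resulting number could then be supplied as the start timestamp when the changefeed is recreated, for example through the `--start-ts` option of `cdc cli changefeed create`. The exact command used here is not shown in the thread, and any changes committed inside the skipped window are not replicated by the new task.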

【Resource configuration】

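【Attachments: screenshots/logs/monitoring】
TiKV monitoring for the last 24 hours and TiCDC monitoring for the period when the problem occurred: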
tidb-cluster3-TiKV-Details_2023-08-10T05_58_59.975Z.json (35.9 MB)
tidb-cluster3-TiCDC_2023-08-10T06_02_33.040Z.json (2.9 MB)

Run admin show ddl to check whether there are any DDL jobs, then work out which table the region in the error belongs to.


There have been no DDL operations.


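This is the main error. It is almost certainly tied to the tables that CDC replicates, because once the two tasks were paused the error disappeared and TiKV stopped restarting. We just don't know what caused it, so we still need to find out why the tasks got stuck at around 00:03.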



I see a lot of Transport(Full). Is the bandwidth between the clusters insufficient?

TiCDC replication needs to scan regions; it does not start replicating until the region scan has finished.

Right now you should fix the TiKV restart problem first.

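  1. The cause of the TiKV restarts needs to be investigated. Is it OOM?
  2. TiCDC uses quite a lot of memory while scanning regions. You aren't running TiCDC and TiKV on the same machines, are you?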

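This needs a balance point. Try adjusting the CDC parameters, such as the number of CDC worker threads and the CDC memory limit, to improve CDC performance and stability.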




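Looking at network bandwidth, the early morning of August 10 shows no obvious peak compared with other times, and the usage is not high either.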

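1. Confirmed it is not OOM. The restarts match the error in the log, cdc initialize fail; once the stuck CDC tasks were paused, TiKV stopped restarting.
2. They are deployed separately, and machine load is not high.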

Sounds like a bug.

  1. Either try upgrading,
  2. or recreate this TiCDC task.

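Yes, it was recreated. The original task was paused, and the new one started replicating from a specified time that skipped 30 minutes; after that it was normal.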

Panic information from the TiKV log:

2023-08-10 00:44:56 (GMT+8)
UNKNOWN
TiKV x.x.x.x:20160
[lib.rs:465] ["assertion failed: `(left >= right)`\n left: `TimeStamp(443442001931665411)`,\n right: `TimeStamp(443442007383998887)`"] [backtrace=" 0: tikv_util::set_panic_hook::{{closure}}\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/lib.rs:464:18\n 1: std::panicking::rust_panic_with_hook\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:626:17\n 2: std::panicking::begin_panic_handler::{{closure}}\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:519:13\n 3: std::sys_common::backtrace::__rust_end_short_backtrace\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:141:18\n 4: rust_begin_unwind\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:515:5\n 5: std::panicking::begin_panic_fmt\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:457:5\n 6: tikv::storage::mvcc::reader::scanner::seek_for_valid_write\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/src/storage/mvcc/reader/scanner/mod.rs:448:17\n tikv::storage::mvcc::reader::scanner::seek_for_valid_value\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/src/storage/mvcc/reader/scanner/mod.rs:495:9\n 7: <tikv::storage::mvcc::reader::scanner::forward::DeltaEntryPolicy as tikv::storage::mvcc::reader::scanner::forward::ScanPolicy<S>>::handle_lock\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/src/storage/mvcc/reader/scanner/forward.rs:694:29\n tikv::storage::mvcc::reader::scanner::forward::ForwardScanner<S,P>::read_next\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/src/storage/mvcc/reader/scanner/forward.rs:271:42\n <tikv::storage::mvcc::reader::scanner::forward::ForwardScanner<S,P> as tikv::storage::txn::store::TxnEntryScanner>::next_entry\n at 
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/src/storage/mvcc/reader/scanner/forward.rs:837:12\n 8: cdc::endpoint::Initializer<E>::do_scan\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/endpoint.rs:1183:19\n cdc::endpoint::Initializer<E>::scan_batch::{{closure}}\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/endpoint.rs:1221:13\n <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/future/mod.rs:80:19\n cdc::endpoint::Initializer<E>::async_incremental_scan::{{closure}}\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/endpoint.rs:1133:27\n 9: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/future/mod.rs:80:19\n cdc::endpoint::Initializer<E>::on_change_cmd_response::{{closure}}\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/endpoint.rs:1066:13\n <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/future/mod.rs:80:19\n 10: cdc::endpoint::Initializer<E>::initialize::{{closure}}\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/endpoint.rs:1054:25\n <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/future/mod.rs:80:19\n cdc::endpoint::Endpoint<T,E>::on_register::{{closure}}\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/endpoint.rs:628:19\n <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll\n at 
/rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/future/mod.rs:80:19\n tokio::runtime::task::core::CoreStage<T>::poll::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/core.rs:161:17\n tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/loom/std/unsafe_cell.rs:14:9\n tokio::runtime::task::core::CoreStage<T>::poll\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/core.rs:151:13\n tokio::runtime::task::harness::poll_future::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/harness.rs:461:19\n <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panic.rs:347:9\n std::panicking::try::do_call\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:401:40\n std::panicking::try\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:365:19\n std::panic::catch_unwind\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panic.rs:434:14\n tokio::runtime::task::harness::poll_future\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/harness.rs:449:18\n 11: tokio::runtime::task::harness::Harness<T,S>::poll_inner\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/harness.rs:98:27\n tokio::runtime::task::harness::Harness<T,S>::poll\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/harness.rs:53:15\n tokio::runtime::task::raw::poll\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/raw.rs:113:5\n 12: tokio::runtime::task::raw::RawTask::poll\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/raw.rs:70:18\n tokio::runtime::task::LocalNotified<S>::run\n at 
/rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/mod.rs:343:9\n tokio::runtime::thread_pool::worker::Context::run_task::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/thread_pool/worker.rs:420:13\n tokio::coop::with_budget::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/coop.rs:106:9\n std::thread::local::LocalKey<T>::try_with\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/thread/local.rs:399:16\n std::thread::local::LocalKey<T>::with\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/thread/local.rs:375:9\n tokio::coop::with_budget\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/coop.rs:99:5\n tokio::coop::budget\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/coop.rs:76:5\n tokio::runtime::thread_pool::worker::Context::run_task\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/thread_pool/worker.rs:419:9\n 13: tokio::runtime::thread_pool::worker::Context::run\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/thread_pool/worker.rs:386:24\n tokio::runtime::thread_pool::worker::run::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/thread_pool/worker.rs:371:17\n tokio::macros::scoped_tls::ScopedKey<T>::set\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/macros/scoped_tls.rs:61:9\n tokio::runtime::thread_pool::worker::run\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/thread_pool/worker.rs:368:5\n tokio::runtime::thread_pool::worker::Launch::launch::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/thread_pool/worker.rs:347:45\n <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll\n at 
/rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/blocking/task.rs:42:21\n tokio::runtime::task::core::CoreStage<T>::poll::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/core.rs:161:17\n tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/loom/std/unsafe_cell.rs:14:9\n tokio::runtime::task::core::CoreStage<T>::poll\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/core.rs:151:13\n tokio::runtime::task::harness::poll_future::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/harness.rs:461:19\n <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panic.rs:347:9\n std::panicking::try::do_call\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:401:40\n std::panicking::try\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:365:19\n std::panic::catch_unwind\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panic.rs:434:14\n tokio::runtime::task::harness::poll_future\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/harness.rs:449:18\n tokio::runtime::task::harness::Harness<T,S>::poll_inner\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/harness.rs:98:27\n tokio::runtime::task::harness::Harness<T,S>::poll\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/harness.rs:53:15\n tokio::runtime::task::raw::poll\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/raw.rs:113:5\n 14: tokio::runtime::task::raw::RawTask::poll\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/raw.rs:70:18\n tokio::runtime::task::UnownedTask<S>::run\n at 
/rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/mod.rs:379:9\n tokio::runtime::blocking::pool::Inner::run\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/blocking/pool.rs:265:17\n tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/blocking/pool.rs:245:17\n std::sys_common::backtrace::__rust_begin_short_backtrace\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:125:18\n 15: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/thread/mod.rs:476:17\n <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panic.rs:347:9\n std::panicking::try::do_call\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:401:40\n std::panicking::try\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:365:19\n std::panic::catch_unwind\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panic.rs:434:14\n std::thread::Builder::spawn_unchecked::{{closure}}\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/thread/mod.rs:475:30\n core::ops::function::FnOnce::call_once{{vtable.shim}}\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/ops/function.rs:227:5\n 16: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/alloc/src/boxed.rs:1572:9\n <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/alloc/src/boxed.rs:1572:9\n std::sys::unix::thread::Thread::new::thread_start\n at 
/rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys/unix/thread.rs:91:17\n 17: start_thread\n 18: clone\n"] [location=/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/src/storage/mvcc/reader/scanner/mod.rs:448] [thread_name=cdcwkr]
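
The two `TimeStamp(...)` values in the failed assertion are TSOs, so they can be decoded back to wall-clock time to see how they relate to the incident. The sketch below assumes the usual TSO layout (physical milliseconds in the high bits, an 18-bit logical counter in the low bits); the helper name `tso_to_datetime` is hypothetical. Decoded this way, both values fall at about 00:03 on 2023-08-10 (GMT+8), roughly 20.8 seconds apart, which lines up with the time the tasks were reported as stuck. This only locates the timestamps in time; it does not by itself explain the panic.

```python
from datetime import datetime, timedelta, timezone

LOGICAL_BITS = 18  # the low 18 bits of a TSO hold the logical counter
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tso_to_datetime(tso: int, tz: timezone = timezone.utc) -> datetime:
    """Return the physical (wall-clock) part of a TSO in the given timezone.

    Hypothetical helper for illustration only.
    """
    physical_ms = tso >> LOGICAL_BITS  # drop the logical counter
    return (EPOCH + timedelta(milliseconds=physical_ms)).astimezone(tz)


# Values copied from the assertion message in the panic log above.
LEFT_TS = 443442001931665411
RIGHT_TS = 443442007383998887

if __name__ == "__main__":
    gmt8 = timezone(timedelta(hours=8))
    left = tso_to_datetime(LEFT_TS, gmt8)
    right = tso_to_datetime(RIGHT_TS, gmt8)
    print("left :", left.isoformat())
    print("right:", right.isoformat())
    print("gap  :", right - left)
```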
