tiflash不同步数据,执行完【创建副本】执行链接不到tiflash

【 TiDB 使用环境】生产环境
【 TiDB 版本】v6.5.1
【复现路径】
执行 ALTER TABLE sxt_order SET TIFLASH REPLICA 1;
过几分钟 发现集群无法链接tiflash,且有报错日志。
日志如下:

[2023/03/30 17:24:39.182 +08:00] [ERROR] [Exception.cpp:89] ["Code: 49, e.displayText() = DB::Exception: invalid flag 83 in write cf: physical_table_id=10479: (while preHandleSnapshot region_id=6513, index=45900, term=7), e.what() = DB::Exception, Stack trace:\n\n\n       0x17225ce\tDB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+24257998]\n                \tdbms/src/Common/Exception.h:46\n       0x6b08404\tDB::RegionCFDataBase<DB::RegionWriteCFDataTrait>::insert(DB::StringObject<true>&&, DB::StringObject<false>&&) [tiflash+112231428]\n                \tdbms/src/Storages/Transaction/RegionCFDataBase.cpp:46\n       0x6a949db\tDB::DM::SSTFilesToBlockInputStream::read() [tiflash+111757787]\n                \tdbms/src/Storages/DeltaMerge/SSTFilesToBlockInputStream.cpp:137\n       0x696ca15\tDB::DM::readNextBlock(std::__1::shared_ptr<DB::IBlockInputStream> const&) [tiflash+110545429]\n                \tdbms/src/Storages/DeltaMerge/DeltaMergeHelpers.h:253\n       0x6a97412\tDB::DM::PKSquashingBlockInputStream<true>::read() [tiflash+111768594]\n                \tdbms/src/Storages/DeltaMerge/PKSquashingBlockInputStream.h:68\n       0x696ca15\tDB::DM::readNextBlock(std::__1::shared_ptr<DB::IBlockInputStream> const&) [tiflash+110545429]\n                \tdbms/src/Storages/DeltaMerge/DeltaMergeHelpers.h:253\n       0x16d57e5\tDB::DM::DMVersionFilterBlockInputStream<1>::initNextBlock() [tiflash+23943141]\n                \tdbms/src/Storages/DeltaMerge/DMVersionFilterBlockInputStream.h:137\n       0x16d360c\tDB::DM::DMVersionFilterBlockInputStream<1>::read(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul>*&, bool) [tiflash+23934476]\n                \tdbms/src/Storages/DeltaMerge/DMVersionFilterBlockInputStream.cpp:51\n       0x6a96868\tDB::DM::BoundedSSTFilesToBlockInputStream::read() [tiflash+111765608]\n                \tdbms/src/Storages/DeltaMerge/SSTFilesToBlockInputStream.cpp:307\n       0x16d9044\tDB::DM::SSTFilesToDTFilesOutputStream<std::__1::shared_ptr<DB::DM::BoundedSSTFilesToBlockInputStream> >::write() [tiflash+23957572]\n                \tdbms/src/Storages/DeltaMerge/SSTFilesToDTFilesOutputStream.cpp:200\n       0x6a8d38f\tDB::KVStore::preHandleSSTsToDTFiles(std::__1::shared_ptr<DB::Region>, DB::SSTViewVec, unsigned long, unsigned long, DB::DM::FileConvertJobType, DB::TMTContext&) [tiflash+111727503]\n                \tdbms/src/Storages/Transaction/ApplySnapshot.cpp:360\n       0x6a8ca64\tDB::KVStore::preHandleSnapshotToFiles(std::__1::shared_ptr<DB::Region>, DB::SSTViewVec, unsigned long, unsigned long, DB::TMTContext&) [tiflash+111725156]\n                \tdbms/src/Storages/Transaction/ApplySnapshot.cpp:275\n       0x6ae7d66\tPreHandleSnapshot [tiflash+112098662]\n                \tdbms/src/Storages/Transaction/ProxyFFI.cpp:388\n  0x7f693aa9a228\tengine_store_ffi::_$LT$impl$u20$engine_store_ffi..interfaces..root..DB..EngineStoreServerHelper$GT$::pre_handle_snapshot::hec57f9b0ef29a0bb [libtiflash_proxy.so+17646120]\n  0x7f693aa91d09\tengine_store_ffi::observer::pre_handle_snapshot_impl::h0b40090f59175b24 [libtiflash_proxy.so+17612041]\n  0x7f693aa84b86\tyatp::task::future::RawTask$LT$F$GT$::poll::hd3296fb5cae316b9 [libtiflash_proxy.so+17558406]\n  0x7f693c910dc3\t_$LT$yatp..task..future..Runner$u20$as$u20$yatp..pool..runner..Runner$GT$::handle::h0056e31c4da70e35 [libtiflash_proxy.so+49589699]\n  0x7f693c9036ac\tstd::sys_common::backtrace::__rust_begin_short_backtrace::h747afb2668c16dcb [libtiflash_proxy.so+49534636]\n  0x7f693c9041cc\tcore::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h83ec6721ad8db87f [libtiflash_proxy.so+49537484]\n  0x7f693c071555\tstd::sys::unix::thread::Thread::new::thread_start::hd2791a9cabec1fda [libtiflash_proxy.so+40547669]\n                \t/rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/std/src/sys/unix/thread.rs:108\n  0x7f69397b1ea5\tstart_thread [libpthread.so.0+32421]\n  0x7f6938bb6b0d\tclone [libc.so.6+1043213]"] [source="DB::RawCppPtr DB::PreHandleSnapshot(DB::EngineStoreServerWrap *, DB::BaseBuffView, uint64_t, DB::SSTViewVec, uint64_t, uint64_t)"] [thread_id=30]

tiflash 报错日志:

集群配置:


执行前状态是正常的,执行ALTER TABLE sxt_order SET TIFLASH REPLICA 1 后就Disconnected了

tiflash已经尝试过缩容再扩容都无法解决
报错后tiflash部署目录不停的出现core这个文件,大小1G左右,直观感觉像是啥溢出

这个bug我验证了一下,验证结果是:只要下游tidb集群接收过上游cdc同步的数据,下游的tiflash肯定崩,无论下游tiflash再怎么缩容再扩容都无法再正常启动。在v6.5.0,v6.5.1,v7.0.0都是必现

1 个赞

集群状态看下

1 个赞

启动tiflash节点,有报错信息么?

启动时候直接提示超时了,目前因为上面那个日志一直刷新。且日志中发现tiflash在不断的重复启动和宕机,最终磁盘满了 导致现在无法启动了

select * from information_schema.tiflash_replica;
pd-ctl region 6513
报错日志中的region_id 是有很多不同的还是集中再某几个?

你是不是还是那个我修复的集群

不是,这是新部署的集群,我补充了集群拓扑了,麻烦看下


只有1个

这套Tidb集群的上游,有没有ticdc给这套集群同步数据?有的话,v6.5.0是有这个bug的,tiflash会无限重启,产生core文件

get

有上游cdc在给这套tidb集群同步数据,但是我停了上游也不行。停止是tiup ctl:v6.5.0 cdc changefeed pause暂停,没有删除cdc任务

有解决办法吗

这个bug我验证了一下,验证结果是:只要下游tidb集群接收过上游cdc同步的数据,下游的tiflash肯定崩,无论下游tiflash再怎么缩容再扩容都无法再正常启动。在v6.5.0,v6.5.1,v7.0.0都是必现

https://github.com/pingcap/tiflash/issues/7212 相关的问题,已经提 issue 记录,会尽快解决

https://github.com/tikv/tikv/pull/13777/files#diff-1c7cb9c3480464a0b9a7b2da401f6d40c5daeea381790bc40f50d529f2290b38R33
问题是由这个 pr 引发的

此话题已在最后回复的 60 天后被自动关闭。不再允许新回复。