TiFlash suddenly crashed: Detected invalid null

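[TiDB environment] Production
[TiDB version] v6.5.5
[Reproduction path] What operations led to the problem
No operations were performed. It suddenly crashed, and the automatic restart failed.

[Problem encountered: symptoms and impact]

[Resource configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: screenshots/logs/monitoring]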


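What does display show for the cluster status?

Can you check the cluster status?

It looks like the whole cluster suddenly went down?

Check the monitoring to see whether the network traffic on any TiKV node is abnormal.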

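Sorry, I grabbed the wrong log. It has been corrected.

I pulled the wrong log. I have taken a new screenshot.

At first it showed as disconnected, and now it is down. I pulled the wrong log and have taken a new screenshot.

All other components are normal. This TiFlash instance first showed as disconnected, and now it is down. I pulled the wrong log and have taken a new screenshot.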

[2023/11/24 10:10:37.234 +08:00] [ERROR] [Exception.cpp:89] [“Code: 49, e.displayText() = DB::Exception: Detected invalid null when decoding data of column denomination with column type Decimal64: physical_table_id=3668: (while preHandleSnapshot region_id=2680177673, index=847, term=21), e.what() = DB::Exception, Stack trace:\n\n\n 0x1718afe\tDB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&, int) [tiflash+24218366]\n \tdbms/src/Common/Exception.h:46\n 0x6b1536a\tbool DB::appendRowV2ToBlockImpl(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&, std::__1::__map_const_iterator<std::__1::__tree_const_iterator<std::__1::__value_type<long, unsigned long>, std::__1::__tree_node<std::__1::__value_type<long, unsigned long>, void*>, long> >, std::__1::__map_const_iterator<std::__1::__tree_const_iterator<std::__1::__value_type<long, unsigned long>, std::__1::__tree_node<std::__1::__value_type<long, unsigned long>, void>, long> >, DB::Block&, unsigned long, std::__1::vector<TiDB::ColumnInfo, std::__1::allocatorTiDB::ColumnInfo > const&, long, bool, bool) [tiflash+112284522]\n \tdbms/src/Storages/Transaction/RowCodec.cpp:487\n 0x6b13824\tDB::appendRowToBlock(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&, std::__1::__map_const_iterator<std::__1::__tree_const_iterator<std::__1::__value_type<long, unsigned long>, std::__1::__tree_node<std::__1::__value_type<long, unsigned long>, void>, long> >, std::__1::__map_const_iterator<std::__1::__tree_const_iterator<std::__1::__value_type<long, unsigned long>, std::__1::__tree_node<std::__1::__value_type<long, unsigned long>, void>, long> >, DB::Block&, unsigned long, std::__1::shared_ptr<DB::DecodingStorageSchemaSnapshot const> const&, bool) [tiflash+112277540]\n \tdbms/src/Storages/Transaction/RowCodec.cpp:349\n 0x6ae0e53\tbool DB::RegionBlockReader::readImpl<(DB::TMTPKType)0>(DB::Block&, 
std::__1::vector<std::__1::tuple<DB::RawTiDBPK, unsigned char, unsigned long, std::__1::shared_ptr<DB::StringObject const> >, std::__1::allocator<std::__1::tuple<DB::RawTiDBPK, unsigned char, unsigned long, std::__1::shared_ptr<DB::StringObject const> > > > const&, bool) [tiflash+112070227]\n \tdbms/src/Storages/Transaction/RegionBlockReader.cpp:146\n 0x6abac58\tDB::GenRegionBlockDataWithSchema(std::__1::shared_ptrDB::Region const&, std::__1::shared_ptr<DB::DecodingStorageSchemaSnapshot const> const&, unsigned long, bool, DB::TMTContext&) [tiflash+111914072]\n \tdbms/src/Storages/Transaction/PartitionStreams.cpp:598\n 0x6a7089a\tDB::DM::SSTFilesToBlockInputStream::readCommitedBlock() [tiflash+111610010]\n \tdbms/src/Storages/DeltaMerge/SSTFilesToBlockInputStream.cpp:255\n 0x6a6f30e\tDB::DM::SSTFilesToBlockInputStream::read() [tiflash+111604494]\n \tdbms/src/Storages/DeltaMerge/SSTFilesToBlockInputStream.cpp:154\n 0x6946ea5\tDB::DM::readNextBlock(std::__1::shared_ptrDB::IBlockInputStream const&) [tiflash+110390949]\n \tdbms/src/Storages/DeltaMerge/DeltaMergeHelpers.h:253\n 0x6a71dec\tDB::DM::PKSquashingBlockInputStream::read() [tiflash+111615468]\n \tdbms/src/Storages/DeltaMerge/PKSquashingBlockInputStream.h:78\n 0x6946ea5\tDB::DM::readNextBlock(std::__1::shared_ptrDB::IBlockInputStream const&) [tiflash+110390949]\n \tdbms/src/Storages/DeltaMerge/DeltaMergeHelpers.h:253\n 0x16cbd35\tDB::DM::DMVersionFilterBlockInputStream<1>::initNextBlock() [tiflash+23903541]\n \tdbms/src/Storages/DeltaMerge/DMVersionFilterBlockInputStream.h:137\n 0x16cb56b\tDB::DM::DMVersionFilterBlockInputStream<1>::read(DB::PODArray<unsigned char, 4096ul, Allocator, 15ul, 16ul>&, bool) [tiflash+23901547]\n \tdbms/src/Storages/DeltaMerge/DMVersionFilterBlockInputStream.cpp:323\n 0x6a71018\tDB::DM::BoundedSSTFilesToBlockInputStream::read() [tiflash+111611928]\n \tdbms/src/Storages/DeltaMerge/SSTFilesToBlockInputStream.cpp:307\n 
0x16cf574\tDB::DM::SSTFilesToDTFilesOutputStream<std::__1::shared_ptrDB::DM::BoundedSSTFilesToBlockInputStream >::write() [tiflash+23917940]\n \tdbms/src/Storages/DeltaMerge/SSTFilesToDTFilesOutputStream.cpp:200\n 0x6a67b3f\tDB::KVStore::preHandleSSTsToDTFiles(std::__1::shared_ptrDB::Region, DB::SSTViewVec, unsigned long, unsigned long, DB::DM::FileConvertJobType, DB::TMTContext&) [tiflash+111573823]\n \tdbms/src/Storages/Transaction/ApplySnapshot.cpp:360\n 0x6a67214\tDB::KVStore::preHandleSnapshotToFiles(std::__1::shared_ptrDB::Region, DB::SSTViewVec, unsigned long, unsigned long, DB::TMTContext&) [tiflash+111571476]\n \tdbms/src/Storages/Transaction/ApplySnapshot.cpp:275\n 0x6ac2516\tPreHandleSnapshot [tiflash+111944982]\n \tdbms/src/Storages/Transaction/ProxyFFI.cpp:388\n 0x7fd8813cc228\tengine_store_ffi::$LT$impl$u20$engine_store_ffi…interfaces…root…DB…EngineStoreServerHelper$GT$::pre_handle_snapshot::hec57f9b0ef29a0bb [libtiflash_proxy.so+17646120]\n 0x7fd8813c3d09\tengine_store_ffi::observer::pre_handle_snapshot_impl::h0b40090f59175b24 [libtiflash_proxy.so+17612041]\n 0x7fd8813b6b86\tyatp::task::future::RawTask$LT$F$GT$::poll::hd3296fb5cae316b9 [libtiflash_proxy.so+17558406]\n 0x7fd883242f13\t$LT$yatp…task…future…Runner$u20$as$u20$yatp…pool…runner…Runner$GT$::handle::h0056e31c4da70e35 [libtiflash_proxy.so+49590035]\n 0x7fd8832357fc\tstd::sys_common::backtrace::__rust_begin_short_backtrace::h747afb2668c16dcb [libtiflash_proxy.so+49534972]\n 0x7fd88323631c\tcore::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h83ec6721ad8db87f [libtiflash_proxy.so+49537820]\n 0x7fd8829a36a5\tstd::sys::unix::thread::thread::new::thread_start::hd2791a9cabec1fda [libtiflash_proxy.so+40548005]\n \t/rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/std/src/sys/unix/thread.rs:108\n 0x7fd8800e3e25\tstart_thread [libpthread.so.0+32293]\n 0x7fd87f4e9bad\tclone [libc.so.6+1043373]”] [source=“DB::RawCppPtr DB::PreHandleSnapshot(DB::EngineStoreServerWrap *, 
DB::BaseBuffView, uint64_t, DB::SSTViewVec, uint64_t, uint64_t)”] [thread_id=197]

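Just now, all of the TiFlash instances went down.

Deleting it and rebuilding it will fix it.

TiFlash is fairly easy to repair.

The main problem is that there is a lot of data, replication is slow, and production is urgent :scream:

From the stack trace, it looks like TiFlash ran into data that it could not correctly decode into a column.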

select `table_schema`, `table_name`, '' as partition_name from information_schema.tables where tidb_table_id = 3668
union
select `table_schema`, `table_name`, `partition_name` from information_schema.partitions where tidb_partition_id = 3668;

Use the SQL above to find out which table 3668 belongs to, then check that table's schema and which DDL operations were run on it recently.
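Once the lookup query returns a schema and table name, the follow-up checks could look like the sketch below. `db_name` and `table_name` are placeholders for whatever the lookup returns, and the window of 50 jobs is an arbitrary choice, not something stated in this thread.

```sql
-- Placeholders: replace db_name / table_name with the result of the lookup query.
-- Show the current table definition, including the denomination column and the indexes.
SHOW CREATE TABLE db_name.table_name;

-- List recent DDL jobs for this table (50 is an assumed window; widen it if needed).
ADMIN SHOW DDL JOBS 50 WHERE db_name = 'db_name' AND table_name = 'table_name';
```

The DDL job list should show the drop/add index jobs and their timestamps, which can then be compared with the time of the first TiFlash error.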


Add a new TiFlash node.

We dropped a unique index and added a unique index. That is the only recent operation.

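To restore service, one option is to set the TiFlash replica count to 0 for the table with table_id=3668, then scale out a new TiFlash node and rebuild the replica. This should restore service.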
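As a rough sketch of the SQL side of this first option (scaling out the new TiFlash node is done outside SQL and is not shown), assuming `db_name.table_name` is the table that owns physical_table_id 3668 and that one replica is wanted afterwards:

```sql
-- Placeholders: db_name / table_name are whatever table owns physical_table_id 3668.
-- Remove the TiFlash replica of the affected table.
ALTER TABLE db_name.table_name SET TIFLASH REPLICA 0;

-- List the tables that still have TiFlash replicas; the affected table
-- should no longer appear here (or should show REPLICA_COUNT 0).
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica;

-- After the new TiFlash node is up, recreate the replica (1 is an assumed count).
ALTER TABLE db_name.table_name SET TIFLASH REPLICA 1;
```

The `AVAILABLE` and `PROGRESS` columns can be polled with the same query to watch the rebuild.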

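Another option is to copy the table's data into another table, table_new, that has no TiFlash replica, and then run ALTER TABLE table_old DROP COLUMN denomination on the original table. That way TiFlash no longer decodes that column's data, and you can try whether this works around the problem on the original TiFlash node. However, because the root cause of the bug is unclear, there is no guarantee that this method will restore service.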
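A minimal sketch of this second option, using the `table_old` and `table_new` names from the reply. The explicit replica reset on the copy and the batching remark are assumptions added for safety, not steps given in this thread.

```sql
-- Placeholders: table_old / table_new follow the names used in the reply.
-- 1. Create an empty copy of the table structure.
CREATE TABLE table_new LIKE table_old;

-- 2. Make sure the copy has no TiFlash replica (explicit, in case the setting was carried over).
ALTER TABLE table_new SET TIFLASH REPLICA 0;

-- 3. Copy the data. A very large table may exceed the transaction size limit
--    and need to be copied in batches instead.
INSERT INTO table_new SELECT * FROM table_old;

-- 4. Destructive: drop the column on the original table so TiFlash stops decoding it.
ALTER TABLE table_old DROP COLUMN denomination;
```

Step 4 discards the column's data in `table_old`, so verify the copy in `table_new` first. If `denomination` is covered by an index, the `DROP COLUMN` may be rejected until that index is handled.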


We plan to stop replicating this table and its database to TiFlash, because the database this table belongs to is not currently using TiFlash.
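If the whole database is being taken off TiFlash, a database-level statement is one possible way to do it. In this sketch `db_name` is a placeholder for the database that owns the affected table.

```sql
-- Placeholder: db_name is the database that owns the affected table.
-- Remove TiFlash replicas for every table in the database.
ALTER DATABASE db_name SET TIFLASH REPLICA 0;

-- Confirm that nothing in that database is still replicated to TiFlash.
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT
FROM information_schema.tiflash_replica
WHERE TABLE_SCHEMA = 'db_name';
```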

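We dropped a unique index and added a unique index. That is the only recent operation.

Did the drop and add unique index operations involve the denomination column?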