Adding a table to TiFlash causes the node to go down

[TiDB version] v5.0.1

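[Problem description] A newly added TiFlash node was running normally.
ALTER TABLE xxx.yyy SET TIFLASH REPLICA 1; Adding a small table worked fine.
ALTER TABLE aaa.bbb SET TIFLASH REPLICA 1; Adding a large table (about 100 million rows) failed.
The TiFlash node went down while the large table was being replicated.
[Data source]: The data in TiDB was migrated from RDS via DM. TiDB Lightning was not used for the import.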

The node restarts frequently and generates a large number of dump files.

The error logs are as follows:

tiflash.log

 [ERROR] [<unknown>] ["DB::DM::DeltaMergeStore::DeltaMergeStore(DB::Context&, bool, const String&, const String&, const ColumnDefines&, const DB::DM::ColumnDefine&, bool, size_t, const DB::DM::DeltaMergeStore::Settings&): Code: 49, e.displayText() = DB::Exception: PageFile binary version not match, unknown [version=0] [file=/tidb/tidb-data/tiflash-9000/data/t_4189/log/page_89_0/meta], e.what() = DB::Exception, Stack trace:\
\
0. bin/tiflash/tiflash(StackTrace::StackTrace()+0x15) [0x367c835]\
1. bin/tiflash/tiflash(DB::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)+0x25) [0x36733c5]\
2. bin/tiflash/tiflash(DB::PageFile::MetaMergingReader::moveNext(unsigned int*)+0xdb4) [0x73b0944]\
3. bin/tiflash/tiflash(DB::PageStorage::restore()+0xfae) [0x73bdc7e]\
4. bin/tiflash/tiflash(DB::DM::StoragePool::restore()+0x22) [0x7192a22]\
5. bin/tiflash/tiflash(DB::DM::DeltaMergeStore::DeltaMergeStore(DB::Context&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<DB::DM::ColumnDefine, std::allocator<DB::DM::ColumnDefine> > const&, DB::DM::ColumnDefine const&, bool, unsigned long, DB::DM::DeltaMergeStore::Settings const&)+0x901) [0x7139e31]\
6. bin/tiflash/tiflash(DB::StorageDeltaMerge::StorageDeltaMerge(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::optional<std::reference_wrapper<TiDB::TableInfo const> >, DB::ColumnsDescription const&, std::shared_ptr<DB::IAST> const&, unsigned long, DB::Context&)+0x127e) [0x70b5f1e]\
7. bin/tiflash/tiflash() [0x71948e5]\
8. bin/tiflash/tiflash() [0x71951a7]\
9. bin/tiflash/tiflash(DB::StorageFactory::get(DB::ASTCreateQuery&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, DB::Context&, DB::Context&, DB::ColumnsDescription const&, bool, bool) const+0x1ba) [0x70c955a]\
10. bin/tiflash/tiflash(DB::createTableFromDefinition(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, DB::Context&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x10b) [0x68e5d4b]\
11. bin/tiflash/tiflash(DB::DatabaseLoading::loadTable(DB::Context&, DB::IDatabase&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)+0x2da) [0x68e7bda]\
12. bin/tiflash/tiflash() [0x68de07d]\
13. bin/tiflash/tiflash(ThreadPool::worker()+0x166) [0x7a37bf6]\
14. bin/tiflash/tiflash() [0x86ab60e]\
15. /lib64/libpthread.so.0(+0x7dd4) [0x7f5244cd2dd4]\
16. /lib64/libc.so.6(clone+0x6c) [0x7f52446fa02c]\
"] [thread_id=3]

tiflash_error.log

2021.05.17 15:57:36.846075 [ 1 ] <Error> Application: DB::Exception: Cannot create table from metadata file /tidb/tidb-data/tiflash-9000/metadata/db_4167/t_4189.sql, error: DB::Exception: PageFile binary version not match, unknown [version=0] [file=/tidb/tidb-data/tiflash-9000/data/t_4189/log/page_89_0/meta], stack trace:
0. bin/tiflash/tiflash(StackTrace::StackTrace()+0x15) [0x367c835]
1. bin/tiflash/tiflash(DB::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)+0x25) [0x36733c5]
2. bin/tiflash/tiflash(DB::PageFile::MetaMergingReader::moveNext(unsigned int*)+0xdb4) [0x73b0944]
3. bin/tiflash/tiflash(DB::PageStorage::restore()+0xfae) [0x73bdc7e]
4. bin/tiflash/tiflash(DB::DM::StoragePool::restore()+0x22) [0x7192a22]
5. bin/tiflash/tiflash(DB::DM::DeltaMergeStore::DeltaMergeStore(DB::Context&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<DB::DM::ColumnDefine, std::allocator<DB::DM::ColumnDefine> > const&, DB::DM::ColumnDefine const&, bool, unsigned long, DB::DM::DeltaMergeStore::Settings const&)+0x901) [0x7139e31]
6. bin/tiflash/tiflash(DB::StorageDeltaMerge::StorageDeltaMerge(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::optional<std::reference_wrapper<TiDB::TableInfo const> >, DB::ColumnsDescription const&, std::shared_ptr<DB::IAST> const&, unsigned long, DB::Context&)+0x127e) [0x70b5f1e]
7. bin/tiflash/tiflash() [0x71948e5]
8. bin/tiflash/tiflash() [0x71951a7]
9. bin/tiflash/tiflash(DB::StorageFactory::get(DB::ASTCreateQuery&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, DB::Context&, DB::Context&, DB::ColumnsDescription const&, bool, bool) const+0x1ba) [0x70c955a]
10. bin/tiflash/tiflash(DB::createTableFromDefinition(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, DB::Context&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x10b) [0x68e5d4b]
11. bin/tiflash/tiflash(DB::DatabaseLoading::loadTable(DB::Context&, DB::IDatabase&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)+0x2da) [0x68e7bda]
12. bin/tiflash/tiflash() [0x68de07d]
13. bin/tiflash/tiflash(ThreadPool::worker()+0x166) [0x7a37bf6]
14. bin/tiflash/tiflash() [0x86ab60e]
15. /lib64/libpthread.so.0(+0x7dd4) [0x7fbddb4c6dd4]
16. /lib64/libc.so.6(clone+0x6c) [0x7fbddaeee02c]
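
When the log is large, it can help to list every PageFile that triggers this exception, since each path also reveals the affected table directory (for example `t_4189`). The following is a minimal sketch, not an official tool: the function name `affected_page_files` is hypothetical, and the regular expression only assumes the exception wording shown in the logs above.

```python
import re

# Matches the exception text exactly as it appears in the logs above.
PAGEFILE_ERROR = re.compile(
    r"PageFile binary version not match, unknown "
    r"\[version=(?P<version>\d+)\] \[file=(?P<file>[^\]]+)\]"
)
# Assumption: table directories appear as ".../data/t_<id>/..." in the path.
TABLE_DIR = re.compile(r"/data/(t_\d+)/")


def affected_page_files(lines):
    """Return sorted, de-duplicated (table_dir, version, file) tuples."""
    found = set()
    for line in lines:
        for match in PAGEFILE_ERROR.finditer(line):
            path = match.group("file")
            table = TABLE_DIR.search(path)
            table_dir = table.group(1) if table else "?"
            found.add((table_dir, int(match.group("version")), path))
    return sorted(found)


if __name__ == "__main__":
    # Hypothetical usage: scan the error log in the current directory.
    with open("tiflash_error.log", encoding="utf-8", errors="replace") as f:
        for table_dir, version, path in affected_page_files(f):
            print(table_dir, version, path)
```

Running it against `tiflash_error.log` shows whether only `t_4189` is affected or whether other tables report the same error.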

Hello, could you upload the complete tiflash.log covering the first occurrence of this error and the 2 hours before it?

Also, please run ls -l /tidb/tidb-data/tiflash-9000/data/t_4189/log/page_89_0/ to list the files in this directory.

-rw-r--r-- 1 tidb tidb  1514107 May 16 00:34 meta
-rw-r--r-- 1 tidb tidb 10141200 May 16 00:34 page

There are too many logs and they are too large, so they are hard to extract and upload.

Or would it be OK to put them on a cloud drive?

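You can locate the first occurrence of this error, cut tiflash.log down to that point plus the 2 hours before it to reduce the size, then compress it and upload it to a cloud drive. Please also pack, compress, and upload the directory /tidb/tidb-data/tiflash-9000/data/t_4189/log/page_89_0.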
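
The trimming and packing described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes each tiflash.log entry begins with a timestamp of the form `[YYYY/MM/DD HH:MM:SS.mmm +ZZ:ZZ]` (the timestamps are not visible in the excerpt posted in this thread), that the file is in chronological order with a single time zone, and the function names `window_before_first_error` and `archive_dir` are hypothetical.

```python
import os
import re
import tarfile
from datetime import datetime, timedelta

# Assumption: entries start with "[YYYY/MM/DD HH:MM:SS.mmm +ZZ:ZZ]".
TS = re.compile(r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) [+-]\d{2}:\d{2}\]")
MARKER = "PageFile binary version not match"


def parse_ts(line):
    """Return the entry timestamp, or None for continuation lines."""
    m = TS.match(line)
    return datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S.%f") if m else None


def window_before_first_error(lines, hours=2):
    """Return the lines from `hours` before the first error up to its stack trace."""
    lines = list(lines)
    first = next((i for i, l in enumerate(lines) if MARKER in l), None)
    if first is None:
        return []
    # Walk back to the line that carries the timestamp of the failing entry.
    ts_idx = first
    while ts_idx >= 0 and parse_ts(lines[ts_idx]) is None:
        ts_idx -= 1
    if ts_idx < 0:
        return lines[: first + 1]  # no timestamps found: keep everything so far
    cutoff = parse_ts(lines[ts_idx]) - timedelta(hours=hours)
    start = 0
    for i in range(ts_idx + 1):
        ts = parse_ts(lines[i])
        if ts is not None and ts >= cutoff:
            start = i
            break
    # Include untimestamped continuation lines (the stack trace) after the error.
    end = first + 1
    while end < len(lines) and parse_ts(lines[end]) is None:
        end += 1
    return lines[start:end]


def archive_dir(src_dir, out_path):
    """Pack a directory such as page_89_0 into a gzip-compressed tarball."""
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(src_dir, arcname=os.path.basename(os.path.normpath(src_dir)))
```

For example, writing the result of `window_before_first_error(open("tiflash.log", errors="replace"))` to a new file and calling `archive_dir("/tidb/tidb-data/tiflash-9000/data/t_4189/log/page_89_0", "page_89_0.tar.gz")` would produce two much smaller files to upload. It is safer to stop the TiFlash process before archiving the data directory so the files do not change mid-copy.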

I would also like to confirm: was this TiFlash node newly deployed, or was it upgraded to 5.0.1 from an older version?

Will this be fixed in 5.0.2? Judging by the timeline, 5.0.2 should be released soon, right?

Please provide the logs first.

Sent by private message. I linked it to the wrong thread; you will see when you check the message.

Please update this thread with the problem and how it was resolved once it is done. Thank you.

I have sent the logs to @懂的都懂 by private message.

https://share.weiyun.com/jSzEGrR5

OK, he is analyzing them now.

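This has now been identified as a bug: https://github.com/pingcap/tics/issues/1932
For nodes that have already hit this problem, we recommend forcibly taking the affected TiFlash node offline and manually cleaning up its data.
The bug will be fixed in 5.0.2.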

The link returns 404.
I will wait for 5.0.2. Roughly what date is the release expected?

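5.0.2 is expected to be released around the end of this month. Alternatively, you can click "Contact community expert" and we will provide you with a hotfix build.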

Related issue: https://github.com/pingcap/tics/issues/1932
