TiFlash restarts abnormally: checksum not match

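【TiDB environment】Production
【TiDB version】
6.1.0
【Problem encountered】
Earlier, queries against a table with a TiFlash replica failed. After the error, I set the table's replica count to 0. The error was as follows (errors only, no restart at that point):

Now I have re-enabled the replica for this table. When replication reached about 50%, TiFlash started reporting errors and the service began restarting continuously.
tiflash.log shows the same error as above. Part of the log follows: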
[2022/07/27 16:56:48.764 +08:00] [ERROR] [Exception.cpp:85] ["void DB::BackgroundProcessingPool::threadFunction():Code: 40, e.displayText() = DB::Exception: Page[167976] field[1] checksum not match, broken file: /data01/deploy/data/data/t_17630/log/page_56_0/page, expected: b0d1876a36fa4582, but: a3e70a0a2ff59556, e.what() = DB::Exception, Stack trace:


0x1d272d3\tStackTrace::StackTrace() [tiflash+30569171]
\tdbms/src/Common/StackTrace.cpp:23
0x1d248d6\tDB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&, int) [tiflash+30558422]
\tdbms/src/Common/Exception.h:41
0x79f8633\tDB::PS::V2::PageFile::Reader::read(std::__1::vector<DB::PS::V2::PageFile::Reader::FieldReadInfo, std::__1::allocatorDB::PS::V2::PageFile::Reader::FieldReadInfo >&, std::__1::shared_ptrDB::ReadLimiter const&) [tiflash+127895091]
\tdbms/src/Storages/Page/V2/PageFile.cpp:1050
0x7a0c089\tDB::PS::V2::PageStorage::readImpl(unsigned long, std::__1::vector<std::__1::pair<unsigned long, std::__1::vector<unsigned long, std::__1::allocator > >, std::__1::allocator<std::__1::pair<unsigned long, std::__1::vector<unsigned long, std::__1::allocator > > > > const&, std::__1::shared_ptrDB::ReadLimiter const&, std::__1::shared_ptrDB::PageStorageSnapshot, bool) [tiflash+127975561]
\tdbms/src/Storages/Page/V2/PageStorage.cpp:783
0x7a8fbf0\tDB::PageReaderImplNormal::read(std::__1::vector<std::__1::pair<unsigned long, std::__1::vector<unsigned long, std::__1::allocator > >, std::__1::allocator<std::__1::pair<unsigned long, std::__1::vector<unsigned long, std::__1::allocator > > > > const&) const [tiflash+128515056]
\tdbms/src/Storages/Page/PageStorage.cpp:113
0x7a8d3f2\tDB::PageReader::read(std::__1::vector<std::__1::pair<unsigned long, std::__1::vector<unsigned long, std::__1::allocator > >, std::__1::allocator<std::__1::pair<unsigned long, std::__1::vector<unsigned long, std::__1::allocator > > > > const&) const [tiflash+128504818]
\tdbms/src/Storages/Page/PageStorage.cpp:415
0x7898948\tDB::DM::ColumnFileTiny::readFromDisk(DB::PageReader const&, std::__1::vector<DB::DM::ColumnDefine, std::__1::allocatorDB::DM::ColumnDefine > const&, unsigned long, unsigned long) const [tiflash+126454088]
\tdbms/src/Storages/DeltaMerge/ColumnFile/ColumnFileTiny.cpp:79
0x7899124\tDB::DM::ColumnFileTiny::fillColumns(DB::PageReader const&, std::__1::vector<DB::DM::ColumnDefine, std::__1::allocatorDB::DM::ColumnDefine > const&, unsigned long, std::__1::vector<COWPtrDB::IColumn::immutable_ptrDB::IColumn, std::__1::allocator<COWPtrDB::IColumn::immutable_ptrDB::IColumn > >&) const [tiflash+126456100]
\tdbms/src/Storages/DeltaMerge/ColumnFile/ColumnFileTiny.cpp:115
0x789a3f6\tDB::DM::ColumnFileTinyReader::readRows(std::__1::vector<COWPtrDB::IColumn::mutable_ptrDB::IColumn, std::__1::allocator<COWPtrDB::IColumn::mutable_ptrDB::IColumn > >&, unsigned long, unsigned long, DB::DM::RowKeyRange const*) [tiflash+126460918]
\tdbms/src/Storages/DeltaMerge/ColumnFile/ColumnFileTiny.cpp:237
0x788fa13\tDB::DM::ColumnFileSetReader::readRows(std::__1::vector<COWPtrDB::IColumn::mutable_ptrDB::IColumn, std::__1::allocator<COWPtrDB::IColumn::mutable_ptrDB::IColumn > >&, unsigned long, unsigned long, DB::DM::RowKeyRange const*) [tiflash+126417427]
\tdbms/src/Storages/DeltaMerge/ColumnFile/ColumnFileSetReader.cpp:160
0x788f485\tDB::DM::ColumnFileSetReader::readPKVersion(unsigned long, unsigned long) [tiflash+126416005]
\tdbms/src/Storages/DeltaMerge/ColumnFile/ColumnFileSetReader.cpp:115
0x788fb51\tDB::DM::ColumnFileSetReader::getPlaceItems(std::__1::vector<DB::DM::BlockOrDelete, std::__1::allocatorDB::DM::BlockOrDelete >&, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long) [tiflash+126417745]
\tdbms/src/Storages/DeltaMerge/ColumnFile/ColumnFileSetReader.cpp:185
0x78c55c0\tDB::DM::DeltaValueReader::getPlaceItems(unsigne…

/var/log/message shows the following errors:
systemd: tiflash-9000.service: main process exited, code=killed, status=6/ABRT
systemd: Unit tiflash-9000.service entered failed state.
systemd: tiflash-9000.service failed.
systemd: tiflash-9000.service holdoff time over, scheduling restart.
systemd: Stopped tiflash service.
systemd: Started tiflash service.
bash: sync …
bash: real#0110m0.103s
bash: user#0110m0.000s
bash: sys#0110m0.074s
bash: ok

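Follow-up actions:
I took the TiFlash node offline. The service kept restarting during the removal, so I switched to a forced removal. The node has since been brought back online.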

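It looks like the data file was corrupted when it was written.
Is the folder /data01/deploy/data/data/t_17630/log/page_56_0/ still there? Could you package it and send it to us so we can analyze the file contents?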

The files were not backed up. Only the logs remain.

Were any operations performed on the TiFlash cluster before this problem occurred? Restarts, for example?

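The earlier restart problem was confirmed to be caused by the continuous profiling feature (https://asktug.com/t/topic/695336). After we turned that feature off, there were no more restarts.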

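Doesn't setting the replica count to 0 clean up the existing data files? If not, how should they be cleaned up in day-to-day operations? And how can we detect whether a data file is corrupted?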

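It looks like continuous flame graph profiling caused the repeated restarts, which triggered this problem and left the file corrupted. We will try to reproduce the problem internally.
After set tiflash replica 0, the data is reclaimed gradually rather than deleted immediately. In extreme cases it may not be cleaned up completely (6.2 further fixes cases where data was not cleaned up).
Sorry, TiFlash currently cannot proactively detect file corruption. We plan to add this capability later.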
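
Until such a built-in check exists, one stopgap is to scan tiflash.log for the checksum-mismatch message after the fact. The following is only a minimal sketch, not a TiFlash tool: the regular expression is modelled on the single log line quoted in this thread, so the wording may differ in other versions, and the names `find_checksum_mismatches` and `ChecksumMismatch` are made up for illustration.

```python
import re
from typing import Iterable, List, NamedTuple

# Assumption: the message format matches the one log line quoted in this
# thread; other TiFlash versions may phrase it differently.
CHECKSUM_RE = re.compile(
    r"Page\[(?P<page_id>\d+)\] field\[(?P<field>\d+)\] checksum not match, "
    r"broken file: (?P<path>\S+), "
    r"expected: (?P<expected>[0-9a-f]+), but: (?P<actual>[0-9a-f]+)"
)


class ChecksumMismatch(NamedTuple):
    """One checksum-mismatch report extracted from a log line (hypothetical type)."""
    page_id: int
    field: int
    path: str
    expected: str
    actual: str


def find_checksum_mismatches(lines: Iterable[str]) -> List[ChecksumMismatch]:
    """Return every checksum-mismatch report found in the given log lines."""
    hits = []
    for line in lines:
        m = CHECKSUM_RE.search(line)
        if m:
            hits.append(
                ChecksumMismatch(
                    page_id=int(m["page_id"]),
                    field=int(m["field"]),
                    path=m["path"],
                    expected=m["expected"],
                    actual=m["actual"],
                )
            )
    return hits
```

It can be fed an open log file, for example `find_checksum_mismatches(open("tiflash.log", errors="replace"))`. Note that this only reports corruption TiFlash has already hit while reading a page; it does not verify files that have not been read.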

This issue also has a user report of an IO error from a corrupted file: https://github.com/pingcap/tiflash/issues/5292

That report is actually from my cluster, it just wasn't me who filed it~ :sweat:

:rofl:
