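Use case: pull data in batches from a TiDB table and import it into another data source; the whole table has 300 million rows.
SQL executed: select * from settlement_fee_auto_mapping where (97 <= fee_id AND fee_id < 86864439), where fee_id is the primary key.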
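The statement above is one batch of a primary-key range scan. A minimal sketch of how such half-open ranges could be generated on the client, in Python; the helper names `batch_ranges` and `batch_sql` and the step size are hypothetical, not taken from the post.

```python
from typing import Iterator, Tuple


def batch_ranges(start: int, end: int, step: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open [lo, hi) ranges covering [start, end) with no gaps or overlap."""
    if step <= 0:
        raise ValueError("step must be positive")
    lo = start
    while lo < end:
        hi = min(lo + step, end)
        yield lo, hi
        lo = hi


def batch_sql(lo: int, hi: int) -> str:
    """Build a range query in the same shape as the statement in the post."""
    # Table and column names come from the post. The bounds are integers, so
    # plain formatting is safe here; real client code should use bind parameters.
    return (
        "select * from settlement_fee_auto_mapping "
        f"where ({lo} <= fee_id AND fee_id < {hi})"
    )


if __name__ == "__main__":
    # The step of 20,000,000 is an arbitrary illustration value.
    for lo, hi in batch_ranges(97, 86864439, 20_000_000):
        print(batch_sql(lo, hi))
```

A smaller step only reduces how much data a single query reads; whether that avoids the error is not established in this thread.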
The error is shown below. It looks like a failure while reading a file from S3; a retry sometimes succeeds and sometimes fails again.
other error for mpp stream: Code: 0, e.displayText() = DB::Exception: cannot load checksum framed data from tiflash-remote-data/s18195849145/data/t_38167/dmf_610890/2.dat (errno = 0): (while reading from DTFile: s3://s18195849145/data/t_38167/dmf_610890), e.what() = DB::Exception, - java.sql.SQLException: other error for mpp stream: Code: 0, e.displayText() = DB::Exception: cannot load checksum framed data from tiflash-remote-data/s18195849145/data/t_38167/dmf_610890/2.dat (errno = 0): (while reading from DTFile: s3://s18195849145/data/t_38167/dmf_610890), e.what() = DB::Exception
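It looks like an error from reading S3, but the cause is not clear.
Please provide the detailed logs from around the time of the error.
The error occurred again today, and this time many retries all failed; each failure reported a different file. Below are the error logs from the TiFlash Compute node.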
[2024/12/02 16:06:34.704 +08:00] [ERROR] [S3RandomAccessFile.cpp:98] ["Cannot read from istream, size=1048592 gcount=589262 state=0x06 cur_offset=0 content_length=1504148 errmsg=Success cost=5215266ns"] [source=s18195849145/data/t_82592/dmf_420808/7.dat] [thread_id=16]
[2024/12/02 16:06:34.738 +08:00] [WARN] [Task.cpp:140] ["error occurred and cancel the query"] [source="MPP<gather_id:<gather_id:1, query_ts:1733126669131441552, local_query_id:3191, server_id:1447, start_ts:454328757548220421, resource_group: default>,task_id:11> 1"] [thread_id=16]
[2024/12/02 16:06:34.738 +08:00] [WARN] [PipelineExecutorContext.cpp:79] ["error cannot load checksum framed data from tiflash-remote-data/s18195849145/data/t_82592/dmf_420808/7.dat (errno = 0): (while reading from DTFile: s3://s18195849145/data/t_82592/dmf_420808) occured and cancel the query"] [source="MPP<gather_id:<gather_id:1, query_ts:1733126669131441552, local_query_id:3191, server_id:1447, start_ts:454328757548220421, resource_group: default>,task_id:11>"] [thread_id=16]
[2024/12/02 16:06:35.183 +08:00] [ERROR] [MPPTask.cpp:647] ["task running meets error: Code: 0, e.displayText() = DB::Exception: cannot load checksum framed data from tiflash-remote-data/s18195849145/data/t_82592/dmf_420808/7.dat (errno = 0): (while reading from DTFile: s3://s18195849145/data/t_82592/dmf_420808), e.what() = DB::Exception, Stack trace:\n\n\n 0x1ee9431\tDB::TiFlashException::TiFlashException(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, DB::TiFlashError const&) [tiflash+32412721]\n \tdbms/src/Common/TiFlashException.h:263\n 0x1da7e37\tDB::FramedChecksumReadBuffer<DB::Digest::XXH3>::expectRead(char*, unsigned long) [tiflash+31096375]\n \tdbms/src/IO/ChecksumBuffer.h:353\n 0x1da801d\tDB::FramedChecksumReadBuffer<DB::Digest::XXH3>::nextImpl() [tiflash+31096861]\n \tdbms/src/IO/ChecksumBuffer.h:391\n 0x1daae38\tDB::CompressedReadBufferBase<false>::readCompressedData(unsigned long&, unsigned long&) [tiflash+31108664]\n \tdbms/src/IO/CompressedReadBufferBase.cpp:53\n 0x1dd9363\tDB::CompressedReadBufferFromFileProvider<false>::nextImpl() [tiflash+31298403]\n \tdbms/src/Encryption/CompressedReadBufferFromFileProvider.cpp:32\n 0x785f603\tvoid DB::deserializeBinarySSE2<2>(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul>&, DB::PODArray<unsigned long, 4096ul, Allocator<false>, 15ul, 16ul>&, DB::ReadBuffer&, unsigned long) [tiflash+126219779]\n \tdbms/src/DataTypes/DataTypeString.cpp:128\n 0x76aac30\tDB::DM::DMFileReader::readColumn(DB::DM::ColumnDefine const&, COWPtr<DB::IColumn>::immutable_ptr<DB::IColumn>&, unsigned long, unsigned long, unsigned long, unsigned long) [tiflash+124431408]\n \tdbms/src/Storages/DeltaMerge/File/DMFileReader.cpp:838\n 0x76a83d1\tDB::DM::DMFileReader::read() [tiflash+124421073]\n \tdbms/src/Storages/DeltaMerge/File/DMFileReader.cpp:742\n 0x769bb25\tDB::DM::DMFileBlockInputStream::read() [tiflash+124369701]\n 
\tdbms/src/Storages/DeltaMerge/File/DMFileBlockInputStream.h:62\n 0x760f6bd\tDB::DM::ConcatSkippableBlockInputStream<false>::read() [tiflash+123795133]\n \tdbms/src/Storages/DeltaMerge/SkippableBlockInputStream.h:185\n 0x7635921\tDB::DM::readBlock(std::__1::shared_ptr<DB::DM::SkippableBlockInputStream>&, std::__1::shared_ptr<DB::DM::SkippableBlockInputStream>&) [tiflash+123951393]\n \tdbms/src/Storages/DeltaMerge/ReadUtil.cpp:33\n 0x7657fd8\tDB::DM::BitmapFilterBlockInputStream::readImpl(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul>*&, bool) [tiflash+124092376]\n \tdbms/src/Storages/DeltaMerge/BitmapFilter/BitmapFilterBlockInputStream.cpp:40\n 0x7658554\tDB::DM::BitmapFilterBlockInputStream::readImpl() [tiflash+124093780]\n \tdbms/src/Storages/DeltaMerge/BitmapFilter/BitmapFilterBlockInputStream.h:46\n 0x77a9c15\tDB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul>*&, bool) [tiflash+125475861]\n \tdbms/src/DataStreams/IProfilingBlockInputStream.cpp:82\n 0x77dd763\tDB::DM::Remote::RNSegmentSourceOp::executeIOImpl() [tiflash+125687651]\n \tdbms/src/Storages/DeltaMerge/Remote/RNSegmentSourceOp.cpp:132\n 0x891fe04\tDB::Operator::executeIO() [tiflash+143785476]\n \tdbms/src/Operators/Operator.cpp:81\n 0x8852b7a\tDB::PipelineTaskBase::runExecuteIO() [tiflash+142945146]\n \tdbms/src/Flash/Pipeline/Schedule/Tasks/PipelineTaskBase.h:88\n 0x89412ca\tDB::Task::executeIO() [tiflash+143921866]\n \tdbms/src/Flash/Pipeline/Schedule/Tasks/Task.cpp:140\n 0x1e9cf05\tDB::TaskThreadPool<DB::IOImpl>::loop(unsigned long) [tiflash+32100101]\n \tdbms/src/Flash/Pipeline/Schedule/ThreadPool/TaskThreadPool.cpp:59\n 0x1e9d636\tvoid* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (DB::TaskThreadPool<DB::IOImpl>::*)(unsigned long), DB::TaskThreadPool<DB::IOImpl>*, unsigned long> >(void*) [tiflash+32101942]\n 
\t/usr/local/bin/../include/c++/v1/thread:291\n 0x7f78506c8ac3\t<unknown symbol> [libc.so.6+608963]\n 0x7f785075a850\t<unknown symbol> [libc.so.6+1206352]"] [source="MPP<gather_id:<gather_id:1, query_ts:1733126669131441552, local_query_id:3191, server_id:1447, start_ts:454328757548220421, resource_group: default>,task_id:11>"] [thread_id=78]
[2024/12/02 16:06:35.183 +08:00] [WARN] [MPPTask.cpp:745] ["Begin abort task: MPP<gather_id:<gather_id:1, query_ts:1733126669131441552, local_query_id:3191, server_id:1447, start_ts:454328757548220421, resource_group: default>,task_id:11>, abort type: ONERROR"] [source="MPP<gather_id:<gather_id:1, query_ts:1733126669131441552, local_query_id:3191, server_id:1447, start_ts:454328757548220421, resource_group: default>,task_id:11>"] [thread_id=78]
[2024/12/02 16:06:35.183 +08:00] [WARN] [ExchangeReceiver.cpp:982] ["connection end. meet error: true, err msg: Code: 0, e.displayText() = DB::Exception: cannot load checksum framed data from tiflash-remote-data/s18195849145/data/t_82592/dmf_420808/7.dat (errno = 0): (while reading from DTFile: s3://s18195849145/data/t_82592/dmf_420808), e.what() = DB::Exception,, current alive connections: 3"] [source="MPP<gather_id:<gather_id:1, query_ts:1733126669131441552, local_query_id:3191, server_id:1447, start_ts:454328757548220421, resource_group: default>,task_id:15> ExchangeReceiver_340 local tunnel11+15"] [thread_id=78]
[2024/12/02 16:06:35.183 +08:00] [WARN] [ExchangeReceiver.cpp:1003] ["Finish receiver channels, meet error: true, error message: Code: 0, e.displayText() = DB::Exception: cannot load checksum framed data from tiflash-remote-data/s18195849145/data/t_82592/dmf_420808/7.dat (errno = 0): (while reading from DTFile: s3://s18195849145/data/t_82592/dmf_420808), e.what() = DB::Exception,"] [source="MPP<gather_id:<gather_id:1, query_ts:1733126669131441552, local_query_id:3191, server_id:1447, start_ts:454328757548220421, resource_group: default>,task_id:15> ExchangeReceiver_340"] [thread_id=78]
[2024/12/02 16:06:35.183 +08:00] [WARN] [MPPTask.cpp:774] ["Finish abort task from running"] [source="MPP<gather_id:<gather_id:1, query_ts:1733126669131441552, local_query_id:3191, server_id:1447, start_ts:454328757548220421, resource_group: default>,task_id:11>"] [thread_id=78]
[2024/12/02 16:06:35.186 +08:00] [WARN] [MPPTaskManager.cpp:277] ["Begin to abort gather: <gather_id:1, query_ts:1733126669131441552, local_query_id:3191, server_id:1447, start_ts:454328757548220421, resource_group: default>, abort type: ONCANCELLATION, reason: Receive cancel request from TiDB"] [thread_id=837]
[2024/12/02 16:06:35.186 +08:00] [WARN] [MPPTaskManager.cpp:321] ["Remaining task in gather <gather_id:1, query_ts:1733126669131441552, local_query_id:3191, server_id:1447, start_ts:454328757548220421, resource_group: default> are: MPP<gather_id:<gather_id:1, query_ts:1733126669131441552, local_query_id:3191, server_id:1447, start_ts:454328757548220421, resource_group: default>,task_id:15> "] [thread_id=837]
[2024/12/02 16:06:35.186 +08:00] [WARN] [MPPTask.cpp:745] ["Begin abort task: MPP<gather_id:<gather_id:1, query_ts:1733126669131441552, local_query_id:3191, server_id:1447, start_ts:454328757548220421, resource_group: default>,task_id:15>, abort type: ONCANCELLATION"] [source="MPP<gather_id:<gather_id:1, query_ts:1733126669131441552, local_query_id:3191, server_id:1447, start_ts:454328757548220421, resource_group: default>,task_id:15>"] [thread_id=837]
[2024/12/02 16:06:35.186 +08:00] [WARN] [ExchangeReceiver.cpp:982] ["connection end. meet error: true, err msg: Exchange receiver meet error : Receive cancel request from TiDB, current alive connections: 2"] [source="MPP<gather_id:<gather_id:1, query_ts:1733126669131441552, local_query_id:3191, server_id:1447, start_ts:454328757548220421, resource_group: default>,task_id:15> ExchangeReceiver_340 async tunnel12+15"] [thread_id=43]
[2024/12/02 16:06:35.187 +08:00] [WARN] [ExchangeReceiver.cpp:1003] ["Finish receiver channels, meet error: true, error message: Code: 0, e.displayText() = DB::Exception: cannot load checksum framed data from tiflash-remote-data/s18195849145/data/t_82592/dmf_420808/7.dat (errno = 0): (while reading from DTFile: s3://s18195849145/data/t_82592/dmf_420808), e.what() = DB::Exception,"] [source="MPP<gather_id:<gather_id:1, query_ts:1733126669131441552, local_query_id:3191, server_id:1447, start_ts:454328757548220421, resource_group: default>,task_id:15> ExchangeReceiver_340"] [thread_id=43]
[2024/12/02 16:06:35.187 +08:00] [WARN] [ExchangeReceiver.cpp:982] ["connection end. meet error: true, err msg: Exchange receiver meet error : Receive cancel request from TiDB, current alive connections: 0"] [source="MPP<gather_id:<gather_id:1, query_ts:1733126669131441552, local_query_id:3191, server_id:1447, start_ts:454328757548220421, resource_group: default>,task_id:15> ExchangeReceiver_340 async tunnel10+15"] [thread_id=223]
[2024/12/02 16:06:35.187 +08:00] [WARN] [ExchangeReceiver.cpp:1003] ["Finish receiver channels, meet error: true, error message: Code: 0, e.displayText() = DB::Exception: cannot load checksum framed data from tiflash-remote-data/s18195849145/data/t_82592/dmf_420808/7.dat (errno = 0): (while reading from DTFile: s3://s18195849145/data/t_82592/dmf_420808), e.what() = DB::Exception,"] [source="MPP<gather_id:<gather_id:1, query_ts:1733126669131441552, local_query_id:3191, server_id:1447, start_ts:454328757548220421, resource_group: default>,task_id:15> ExchangeReceiver_340"] [thread_id=223]
[2024/12/02 16:06:35.187 +08:00] [WARN] [ExchangeReceiver.cpp:982] ["connection end. meet error: true, err msg: Exchange receiver meet error : Receive cancel request from TiDB, current alive connections: 1"] [source="MPP<gather_id:<gather_id:1, query_ts:1733126669131441552, local_query_id:3191, server_id:1447, start_ts:454328757548220421, resource_group: default>,task_id:15> ExchangeReceiver_340 async tunnel13+15"] [thread_id=29]
[2024/12/02 16:06:35.187 +08:00] [WARN] [ExchangeReceiver.cpp:1003] ["Finish receiver channels, meet error: true, error message: Code: 0, e.displayText() = DB::Exception: cannot load checksum framed data from tiflash-remote-data/s18195849145/data/t_82592/dmf_420808/7.dat (errno = 0): (while reading from DTFile: s3://s18195849145/data/t_82592/dmf_420808), e.what() = DB::Exception,"] [source="MPP<gather_id:<gather_id:1, query_ts:1733126669131441552, local_query_id:3191, server_id:1447, start_ts:454328757548220421, resource_group: default>,task_id:15> ExchangeReceiver_340"] [thread_id=29]
[2024/12/02 16:06:35.190 +08:00] [WARN] [MPPTask.cpp:774] ["Finish abort task from running"] [source="MPP<gather_id:<gather_id:1, query_ts:1733126669131441552, local_query_id:3191, server_id:1447, start_ts:454328757548220421, resource_group: default>,task_id:15>"] [thread_id=837]
[2024/12/02 16:06:35.190 +08:00] [WARN] [MPPTaskManager.cpp:339] ["Finish abort gather: <gather_id:1, query_ts:1733126669131441552, local_query_id:3191, server_id:1447, start_ts:454328757548220421, resource_group: default>"] [thread_id=837]
The log shows that the error is raised at S3RandomAccessFile.cpp:98, but why does the read still fail when the logged value is errmsg=Success?
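I asked an AI to analyze this error message. It said the cause was network jitter during the download, and monitoring shows that traffic at that time did reach the network limit of the instance type. So the questions become:
- Is there a way to add retries, or to throttle the network read speed (this is already confirmed as not supported at present)?
- Why do multiple retries still fail? In theory the data should be cached to local disk in the background, shouldn't it?
Below is the AI's analysis: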
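- State analysis:
  state=0x06 is the iostream combination eofbit (0x02) | failbit (0x04), using the bit values of libc++ and libstdc++.
  - This means the stream reached EOF and also recorded a failure.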
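- Data analysis:
  - Requested read size: size=1048592 (about 1 MB)
  - Bytes actually read: gcount=589262 (about 589 KB)
  - Current offset: cur_offset=0 (reading from the start of the file)
  - Total file length: content_length=1504148 (about 1.5 MB)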
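- Error condition:
  if (gcount < size && (!istr.eof() || cur_offset + gcount != static_cast<size_t>(content_length)))
  This condition triggers when:
  - fewer bytes were read than requested (gcount < size)
  - and either the stream has not reached EOF, or the current offset plus the bytes read does not equal the total file length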
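- Possible causes:
  - The S3 data stream may have been interrupted, or a network problem occurred during the read
  - The underlying stream implementation in the AWS SDK may have a problem that triggers EOF early when reading large chunks
  - The stream state flags (failbit and eofbit both set) indicate that the read terminated abnormally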
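- Suggested solutions:
  - Consider adding a retry mechanism (although the code already has retry logic)
  - Try reducing the size of each read
  - Add more detailed error-state checks and logging
  - Consider a chunked download strategy when reading large files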
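This kind of error is usually related to network transfer or to the S3 service's response. Suggested checks:
- Stability of the network connection
- Whether the S3 access permissions are correct
- Whether concurrent access is causing problems
- Whether the AWS SDK version needs updating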
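The state flags and the error condition quoted in the analysis can be checked against the logged values. The sketch below is a Python transcription, not TiFlash code: the function names are hypothetical, and the bit values are those used by libc++ and libstdc++ (the C++ standard leaves the concrete values implementation-defined).

```python
# iostate bit values as defined by libc++ and libstdc++ (assumption: TiFlash is
# built with one of them; the stack trace's std::__1 namespace suggests libc++).
BADBIT, EOFBIT, FAILBIT = 0x01, 0x02, 0x04


def decode_state(state: int) -> list:
    """Return the names of the iostate flags set in `state`."""
    flags = [("badbit", BADBIT), ("eofbit", EOFBIT), ("failbit", FAILBIT)]
    return [name for name, bit in flags if state & bit]


def is_short_read_error(size, gcount, state, cur_offset, content_length) -> bool:
    """Python transcription of the C++ condition quoted in the analysis."""
    eof = bool(state & EOFBIT)  # stands in for istr.eof()
    return gcount < size and (not eof or cur_offset + gcount != content_length)


# Values copied from the S3RandomAccessFile.cpp:98 log line.
logged = dict(size=1048592, gcount=589262, state=0x06,
              cur_offset=0, content_length=1504148)

if __name__ == "__main__":
    print(decode_state(logged["state"]))                 # ['eofbit', 'failbit']
    print(is_short_read_error(**logged))                 # True
    print(logged["content_length"] - logged["gcount"])   # 914886 bytes never arrived
```

With the logged values, the stream reported EOF after 589262 bytes although the object is 1504148 bytes long, so the offset check fails and the short read is treated as an error. A short read that ends exactly at content_length would pass the check.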
The data volume is too large and the query timed out; set a longer timeout.
Which timeout do you mean? Which parameters control it?
[2024/12/02 16:06:34.704 +08:00] [ERROR] [S3RandomAccessFile.cpp:98] ["Cannot read from istream, size=1048592 gcount=589262 state=0x06 cur_offset=0 content_length=1504148 errmsg=Success cost=5215266ns"] [source=s18195849145/data/t_82592/dmf_420808/7.dat] [thread_id=16]
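This log line shows that an error occurred partway through the read. At that point the stream being read from S3 was in state=0x06, which means the read from S3 failed. This is error information returned by the AWS S3 API; exactly what the error was is not yet clear. We will improve this logic later: emit clearer error information and add automatic retries.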
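Until automatic retry exists on the TiFlash side, a client could retry a failed batch itself. A minimal sketch in Python, assuming the client can re-run a batch as a callable; `retry_with_backoff` and `is_checksum_read_error` are hypothetical helpers, and the attempt count and delays are illustrative values only. As reported earlier in this thread, repeated retries did not always succeed, so this is a mitigation at best.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    run: Callable[[], T],
    is_retryable: Callable[[Exception], bool],
    attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call run() until it succeeds, hits a non-retryable error, or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return run()
        except Exception as exc:
            # Re-raise on the last attempt or when the error is not the one we target.
            if attempt == attempts - 1 or not is_retryable(exc):
                raise
            # Exponential backoff with jitter, so retries do not add load in lockstep
            # while the network is already saturated.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
    raise AssertionError("unreachable")


def is_checksum_read_error(exc: Exception) -> bool:
    """Match the message text reported in this thread."""
    return "cannot load checksum framed data" in str(exc)
```

Here `run` would wrap one batch query; injecting `sleep` keeps the helper testable without real waiting.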
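Will this fix land in a later 7.5 patch release? We just upgraded to 7.5.4, and the earliest we could do another major-version upgrade is probably June next year.
It is a small change, so it should go into a 7.5 patch release.
Should I create a GitHub issue, or will you track it internally?
Please create an issue on GitHub. That way you will also get a notification on GitHub once the problem is fixed.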