TiFlash store offline

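[TiDB environment] Production / Test / PoC
[TiDB version] 7.1
[TiDB Operator version] 1.4
[K8s version] 1.20
A few days ago I noticed that TiDB monitoring showed one store in the Offline state:
Querying the database showed that the offline store is the TiFlash store: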


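However, TiFlash itself is running normally, and restarting the TiFlash pod did not help:

I also created a new TiFlash replica, and it has never succeeded:

Please help analyze the cause. Thanks.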
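
For readers reproducing the check in the question, listing stores that are not Up can be scripted against PD's store list. This is only a sketch under assumptions: it assumes `/pd/api/v1/stores` returns a `stores` array whose entries carry `store.id`, `store.address`, `store.state_name`, and `store.labels` (with `engine=tiflash` on TiFlash stores). The sample payload and addresses are invented for illustration (only store ID 49317 comes from this thread), and treating a store without an `engine` label as TiKV is also an assumption.

```python
import json


def stores_not_up(payload):
    """Return (id, address, engine, state_name) for every store whose state is not Up."""
    result = []
    for item in payload.get("stores", []):
        store = item.get("store", {})
        state = store.get("state_name", "")
        if state == "Up":
            continue
        # Labels arrive as a list of {"key": ..., "value": ...} objects (assumed shape).
        labels = {lb.get("key"): lb.get("value") for lb in store.get("labels", [])}
        # ASSUMPTION: a store without an "engine" label is a TiKV store.
        engine = labels.get("engine", "tikv")
        result.append((store.get("id"), store.get("address"), engine, state))
    return result


# Hypothetical sample payload; addresses are made up for illustration.
SAMPLE = json.loads("""
{
  "count": 2,
  "stores": [
    {"store": {"id": 1, "address": "tikv-0.example:20160", "state_name": "Up"}},
    {"store": {"id": 49317, "address": "tiflash-0.example:3930",
               "labels": [{"key": "engine", "value": "tiflash"}],
               "state_name": "Offline"}}
  ]
}
""")

print(stores_not_up(SAMPLE))
```

In a real cluster the payload would come from an HTTP GET to the PD address rather than from an inline sample.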

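Run kubectl logs advanced-tidb-tiflash-0 errorlog -n tidb-admin and check the TiFlash error log.

You can check the logs of the TiKV node that hosts this Offline store to find out what happened to it. Also check whether uneven data distribution caused any hotspot issues while the store was down.

The log only contains these entries. Could you take a look? Is the memory setting too low?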
[2023/06/27 00:44:20.992 +08:00] [ERROR] [MPPTask.cpp:469] ["task running meets error: Code: 0, e.displayText() = DB::TiFlashException: Memory limit (total) exceeded caused by 'out of memory quota for data computing' : would use 5.60 GiB for data computing (attempt to allocate chunk of 8388608 bytes), limit of memory for data computing: 5.60 GiB, e.what() = DB::TiFlashException, Stack trace:\n\n\n 0x1c116d1\tDB::TiFlashException::TiFlashException(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&, DB::TiFlashError const&) [tiflash+29431505]\n \tdbms/src/Common/TiFlashException.h:250\n 0x1c10c5e\tMemoryTracker::alloc(long, bool) [tiflash+29428830]\n \tdbms/src/Common/MemoryTracker.cpp:154\n 0x1c108d5\tMemoryTracker::alloc(long, bool) [tiflash+29427925]\n \tdbms/src/Common/MemoryTracker.cpp:165\n 0x1c108d5\tMemoryTracker::alloc(long, bool) [tiflash+29427925]\n \tdbms/src/Common/MemoryTracker.cpp:165\n 0x1c2098b\tAllocator::alloc(unsigned long, unsigned long) [tiflash+29493643]\n \tdbms/src/Common/Allocator.cpp:68\n 0x7db2f7c\tDB::ColumnString::reserve(unsigned long) [tiflash+131805052]\n \tdbms/src/Columns/ColumnString.cpp:285\n 0x7983941\tDB::Aggregator::prepareBlocksAndFillSingleLevel(DB::AggregatedDataVariants&, bool) const [tiflash+127416641]\n \tdbms/src/Interpreters/Aggregator.cpp:1581\n 0x79aad08\tDB::MergingBuckets::getDataForSingleLevel() [tiflash+127577352]\n \tdbms/src/Interpreters/Aggregator.cpp:2262\n 0x791c77c\tDB::MergingAndConvertingBlockInputStream::readImpl() [tiflash+126994300]\n \tdbms/src/DataStreams/MergingAndConvertingBlockInputStream.h:39\n 0x764cff5\tDB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator, 15ul, 16ul>&, bool) [tiflash+124047349]\n \tdbms/src/DataStreams/IProfilingBlockInputStream.cpp:75\n 0x764cce5\tDB::IProfilingBlockInputStream::read() [tiflash+124046565]\n \tdbms/src/DataStreams/IProfilingBlockInputStream.cpp:43\n 0x7a49c05\tDB::AggregatingBlockInputStream::readImpl() [tiflash+128228357]\n \tdbms/src/DataStreams/AggregatingBlockInputStream.cpp:79\n 0x764cff5\tDB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator, 15ul, 16ul>&, bool) [tiflash+124047349]\n \tdbms/src/DataStreams/IProfilingBlockInputStream.cpp:75\n 0x764cce5\tDB::IProfilingBlockInputStream::read() [tiflash+124046565]\n \tdbms/src/DataStreams/IProfilingBlockInputStream.cpp:43\n 0x7644bfe\tDB::ExpressionBlockInputStream::readImpl() [tiflash+124013566]\n \tdbms/src/DataStreams/ExpressionBlockInputStream.cpp:39\n 0x764cff5\tDB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator, 15ul, 16ul>&, bool) [tiflash+124047349]\n \tdbms/src/DataStreams/IProfilingBlockInputStream.cpp:75\n 0x764cce5\tDB::IProfilingBlockInputStream::read() [tiflash+124046565]\n \tdbms/src/DataStreams/IProfilingBlockInputStream.cpp:43\n 0x7644bfe\tDB::ExpressionBlockInputStream::readImpl() [tiflash+124013566]\n \tdbms/src/DataStreams/ExpressionBlockInputStream.cpp:39\n 0x764cff5\tDB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator, 15ul, 16ul>&, bool) [tiflash+124047349]\n \tdbms/src/DataStreams/IProfilingBlockInputStream.cpp:75\n 0x764cce5\tDB::IProfilingBlockInputStream::read() [tiflash+124046565]\n \tdbms/src/DataStreams/IProfilingBlockInputStream.cpp:43\n 0x7644bfe\tDB::ExpressionBlockInputStream::readImpl() [tiflash+124013566]\n \tdbms/src/DataStreams/ExpressionBlockInputStream.cpp:39\n 0x764cff5\tDB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator, 15ul, 16ul>&, bool) [tiflash+124047349]\n \tdbms/src/DataStreams/IProfilingBlockInputStream.cpp:75\n 0x764cce5\tDB::IProfilingBlockInputStream::read() [tiflash+124046565]\n \tdbms/src/DataStreams/IProfilingBlockInputStream.cpp:43\n 0x7644bfe\tDB::ExpressionBlockInputStream::readImpl() [tiflash+124013566]\n \tdbms/src/DataStreams/ExpressionBlockInputStream.cpp:39\n 0x764cff5\tDB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator, 15ul, 16ul>&, bool) [tiflash+124047349]\n \tdbms/src/DataStreams/IProfilingBlockInputStream.cpp:75\n 0x764cce5\tDB::IProfilingBlockInputStream::read() [tiflash+124046565]\n \tdbms/src/DataStreams/IProfilingBlockInputStream.cpp:43\n 0x820a3cb\tDB::ExchangeSenderBlockInputStream::readImpl() [tiflash+136356811]\n \tdbms/src/DataStreams/ExchangeSenderBlockInputStream.cpp:40\n 0x764cff5\tDB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator, 15ul, 16ul>*&, bool) [tiflash+124047349]\n \tdbms/src/DataStreams/IProfilingBlockInputStream.cpp:75\n 0x764cce5\tDB::IProfilingBlockInputStream::read() [tiflash+124046565]\n \tdbms/src/DataStreams/IProfilingBlockInputStream.cpp:43\n 0x82b5284\tDB::DataStreamExecutor::execute(DB::ResultHandler&&) [tiflash+137056900]\n \tdbms/src/Flash/Executor/DataStreamExecutor.cpp:44\n 0x8269089\tDB::MPPTask::runImpl() [tiflash+136745097]\n \tdbms/src/Flash/Mpp/MPPTask.cpp:408\n 0x1d06d28\tauto DB::wrapInvocable<std::__1::function<void ()> >(bool, std::__1::function<void ()>&&)::'lambda'()::operator()() [tiflash+30436648]\n \tdbms/src/Common/wrapInvocable.h:36"] [source="MPP<query:<query_ts:1687797841423682071, local_query_id:3, server_id:3328292, start_ts:442446077336223751>,task_id:3>"] [thread_id=288]
[2023/06/27 00:44:20.992 +08:00] [WARN] [MPPTaskManager.cpp:155] ["Begin to abort query: <query_ts:1687797841423682071, local_query_id:3, server_id:3328292, start_ts:442446077336223751>, abort type: ONERROR, reason: From MPP<query:<query_ts:1687797841423682071, local_query_id:3, server_id:3328292, start_ts:442446077336223751>,task_id:3>: Code: 0, e.displayText() = DB::TiFlashException: Memory limit (total) exceeded caused by 'out of memory quota for data computing' : would use 5.60 GiB for data computing (attempt to allocate chunk of 8388608 bytes), limit of memory for data computing: 5.60 GiB, e.what() = DB::TiFlashException,"] [thread_id=288]
[2023/06/27 00:44:20.992 +08:00] [WARN] [MPPTaskManager.cpp:198] ["Remaining task in query <query_ts:1687797841423682071, local_query_id:3, server_id:3328292, start_ts:442446077336223751> are: MPP<query:<query_ts:1687797841423682071, local_query_id:3, server_id:3328292, start_ts:442446077336223751>,task_id:3> "] [thread_id=288]
[2023/06/27 00:44:20.992 +08:00] [WARN] [MPPTask.cpp:511] ["Begin abort task: MPP<query:<query_ts:1687797841423682071, local_query_id:3, server_id:3328292, start_ts:442446077336223751>,task_id:3>, abort type: ONERROR"] [source="MPP<query:<query_ts:1687797841423682071, local_query_id:3, server_id:3328292, start_ts:442446077336223751>,task_id:3>"] [thread_id=288]
[2023/06/27 00:44:20.992 +08:00] [WARN] [MPPTask.cpp:540] ["Finish abort task from running"] [source="MPP<query:<query_ts:1687797841423682071, local_query_id:3, server_id:3328292, start_ts:442446077336223751>,task_id:3>"] [thread_id=288]
[2023/06/27 00:44:20.992 +08:00] [WARN] [MPPTaskManager.cpp:210] ["Finish abort query: <query_ts:1687797841423682071, local_query_id:3, server_id:3328292, start_ts:442446077336223751>"] [thread_id=288]
[2023/06/27 00:44:20.993 +08:00] [WARN] [MPPTaskManager.cpp:155] ["Begin to abort query: <query_ts:1687797841423682071, local_query_id:3, server_id:3328292, start_ts:442446077336223751>, abort type: ONCANCELLATION, reason: Receive cancel request from TiDB"] [thread_id=294]
[2023/06/27 00:44:20.993 +08:00] [WARN] [MPPTaskManager.cpp:165] ["<query_ts:1687797841423682071, local_query_id:3, server_id:3328292, start_ts:442446077336223751> does not found in task manager, skip abort"] [thread_id=294]
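
The key numbers in these entries are easy to miss inside the stack trace. As a rough sketch, the memory-quota message can be pulled apart with a regular expression. The pattern below is written against the wording seen in this log only and may not match other TiFlash versions; `parse_quota_error` is a hypothetical helper name.

```python
import re

# Pattern derived from the message text in the log above (assumption: wording is stable).
QUOTA_RE = re.compile(
    r"would use (?P<used>[\d.]+) (?P<used_unit>[KMGT]iB) for data computing "
    r"\(attempt to allocate chunk of (?P<chunk>\d+) bytes\), "
    r"limit of memory for data computing: (?P<limit>[\d.]+) (?P<limit_unit>[KMGT]iB)"
)


def parse_quota_error(line):
    """Extract usage, limit, and requested chunk size from a quota-exceeded log line."""
    m = QUOTA_RE.search(line)
    if m is None:
        return None
    return {
        "used": float(m["used"]),
        "used_unit": m["used_unit"],
        "limit": float(m["limit"]),
        "limit_unit": m["limit_unit"],
        "chunk_bytes": int(m["chunk"]),
        "chunk_mib": int(m["chunk"]) / 2**20,
    }


# Excerpt of the message from the log in this thread.
LINE = (
    "Memory limit (total) exceeded caused by 'out of memory quota for data computing' : "
    "would use 5.60 GiB for data computing (attempt to allocate chunk of 8388608 bytes), "
    "limit of memory for data computing: 5.60 GiB"
)

print(parse_quota_error(LINE))
```

For this entry the query was already at the 5.60 GiB computing limit when it asked for another 8 MiB chunk, so the failure is a query-level memory quota error rather than a store-level fault.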

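What I see is that the offline store is the one TiFlash runs on, so shouldn't I be looking at the TiFlash logs instead?

This looks like a memory problem. Did you set a maximum memory limit for TiFlash? How much? Try raising it.

It was previously set to 7g. I changed it to 16g and tried again, and the error log is now empty: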
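
A side note on the numbers: the 5.60 GiB computing limit in the log is consistent with 80% of the 7g setting. The sketch below assumes, without confirmation in this thread, that TiFlash caps query memory at a ratio of 0.8 of available memory (believed to be the default of `max_memory_usage_for_all_queries` in recent versions; verify against the TiFlash configuration docs for your version). The function name is hypothetical.

```python
# ASSUMPTION: query memory is capped at 0.8 of the memory available to TiFlash.
ASSUMED_QUERY_MEMORY_RATIO = 0.8


def data_computing_limit_gib(total_gib, ratio=ASSUMED_QUERY_MEMORY_RATIO):
    """Estimate the memory available for data computing, in GiB."""
    return round(total_gib * ratio, 2)


print(data_computing_limit_gib(7))   # matches the 5.60 GiB limit reported in the log
print(data_computing_limit_gib(16))  # estimated limit after raising the setting to 16g
```

Under that assumption, raising the setting to 16g would lift the computing limit to roughly 12.8 GiB, which fits with the error log becoming empty.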


But the store is still in the Offline state:

Check the TiFlash log again: kubectl logs tidb-tiflash-0 tiflash -n tidb-admin

tiflash.log (124.2 KB)
Could you please take a look?

This log looks quite normal. Is the store still offline now?

Try bringing the store back online manually: curl -X POST http://127.0.0.1:2379/pd/api/v1/store/49317/state?state=Up
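
The same request can be issued from a script, which is handy when curl is not available inside the pod. This is a minimal sketch using only the Python standard library. The endpoint path and `state=Up` come from the curl command above; the function names, the timeout, and the way the response is returned are illustrative assumptions.

```python
import urllib.parse
import urllib.request


def store_state_url(pd_addr, store_id, state="Up"):
    """Build the PD store-state URL used in the curl command above."""
    query = urllib.parse.urlencode({"state": state})
    return f"{pd_addr.rstrip('/')}/pd/api/v1/store/{int(store_id)}/state?{query}"


def set_store_state(pd_addr, store_id, state="Up", timeout=5.0):
    """POST the state change to PD and return (HTTP status, response body)."""
    req = urllib.request.Request(store_state_url(pd_addr, store_id, state), method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


print(store_state_url("http://127.0.0.1:2379", 49317))
```

Calling `set_store_state("http://127.0.0.1:2379", 49317)` only works from somewhere that can reach PD, for example inside the PD pod or through a port-forward.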

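I did what you suggested and the store recovered. Thank you very much. It had probably been taken offline at some point earlier without my noticing.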
