TiDB Bug List

[Critical bug] TiFlash crashes randomly when the async grpc server is enabled

Affected versions

6.1.0 6.1.1 6.1.2 6.1.3 6.1.4 6.1.5 6.1.6
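
When triaging many clusters, the version list above can be checked mechanically. The following is a minimal sketch, not an official tool: the helper name is_affected is hypothetical, and it encodes only the versions listed in this entry (6.1.0 through 6.1.6), so it says nothing about release series that this entry does not mention.

```python
import re

# Patch releases listed as affected in this entry: 6.1.0 through 6.1.6.
AFFECTED = {(6, 1, patch) for patch in range(0, 7)}


def is_affected(version: str) -> bool:
    """Return True if `version` (e.g. "v6.1.3" or "6.1.3") is in the list above.

    Hypothetical helper for illustration; it only accepts plain
    MAJOR.MINOR.PATCH strings with an optional leading "v".
    """
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups()) in AFFECTED
```

For example, is_affected("v6.1.3") is true, while is_affected("6.1.7") is false because v6.1.7 contains the fix.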

Issue

https://github.com/pingcap/tiflash/issues/7325

Root Cause

The implementation of the async grpc server contains a data race.

Diagnostic Steps

TiFlash crashes randomly while the async grpc server is enabled, and the crash stack contains frames related to EstablishCallData. For example:
[2024/01/19 19:06:28.901 +08:00] [ERROR] [BaseDaemon.cpp:570] ["BaseDaemon:\n 0x1bf3ca4\tfaultSignalHandler(int, siginfo_t*, void*) [tiflash+29310116]\n \tlibs/libdaemon/src/BaseDaemon.cpp:221\n 0xfffc96c207c0\t [linux-vdso.so.1+1984]\n 0x648fc54\tDB::EstablishCallData::proceed() [tiflash+105446484]\n \tdbms/src/Flash/EstablishCall.cpp:151\n 0x1ab2280\tDB::handleRpcs(grpc_impl::ServerCompletionQueue*, Poco::Logger*) [tiflash+27992704]\n \tdbms/src/Server/Server.cpp:546\n 0x1afe6b0\tauto DB::wrapInvocable<std::__1::function<void ()> >(bool, std::__1::function<void ()>&&)::'lambda'()::operator()() [tiflash+28305072]\n \tdbms/src/Common/wrapInvocable.h:36\n 0x1afe828\tstd::__1::packaged_task<void ()>::operator()() [tiflash+28305448]\n \t/usr/local/bin/…/include/c++/v1/future:2089\n 0x1af0ca8\tDB::DynamicThreadPool::executeTask(std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_deleteDB::IExecutableTask >&) [tiflash+28249256]\n \tdbms/src/Common/DynamicThreadPool.cpp:101\n 0x1af0684\tDB::DynamicThreadPool::fixedWork(unsigned long) [tiflash+28247684]\n \tdbms/src/Common/DynamicThreadPool.cpp:115\n 0x1af1804\tauto std::__1::thread DB::ThreadFactory::newThread<void (DB::DynamicThreadPool::)(unsigned long), DB::DynamicThreadPool, unsigned long&>(bool, std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator >, void (DB::DynamicThreadPool::&&)(unsigned long), DB::DynamicThreadPool&&, unsigned long&)::'lambda'(auto&&…)::operator()<DB::DynamicThreadPool*, unsigned long>(auto&&…) const [tiflash+28252164]\n \tdbms/src/Common/ThreadFactory.h:47\n 0x1af1628\tvoid* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_deletestd::__1::__thread_struct >, std::__1::thread DB::ThreadFactory::newThread<void (DB::DynamicThreadPool::)(unsigned long), DB::DynamicThreadPool, unsigned long&>(bool, std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator >, void (DB::DynamicThreadPool::&&)(unsigned long), DB::DynamicThreadPool&&, unsigned long&)::'lambda'(auto&&…), DB::DynamicThreadPool*, unsigned long> >(void*) [tiflash+28251688]\n \t/usr/local/bin/…/include/c++/v1/thread:291\n 0xfffc939a88cc\t [libpthread.so.0+35020]"] [thread_id=3266]
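
To look for this signature across collected logs, a small scanner such as the one below may help. It is a sketch under stated assumptions: the function name looks_like_async_grpc_crash is hypothetical, and the check simply looks for a log line that contains both the fault signal handler frame and an EstablishCallData frame, as in the sample above. A match is only a hint that the crash may be this bug, not proof.

```python
def looks_like_async_grpc_crash(log_text: str) -> bool:
    """Return True if any single log line shows the crash signature above.

    Hypothetical helper: it checks that one line contains both the
    faultSignalHandler frame and an EstablishCallData frame.
    """
    for line in log_text.splitlines():
        if "faultSignalHandler" in line and "EstablishCallData" in line:
            return True
    return False
```

The function takes the log contents as a string, so it can be applied to whichever TiFlash error log file the deployment writes.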

Resolution

Upgrade to v6.1.7 or later.

Workaround

Disable the TiFlash async grpc server.
For clusters deployed with TiUP, run
tiup cluster edit-config cluster_name
to edit the cluster configuration, and add the following under tiflash in server_configs:
profiles.default.enable_async_server: false
Then run
tiup cluster reload cluster_name -R tiflash
to reload the configuration and restart the TiFlash nodes.
For clusters not deployed with TiUP, edit the tiflash.toml file directly. Under
[profiles]
[profiles.default]
add
enable_async_server = false
and then restart the corresponding TiFlash server.