TiFlash v7.1.1 reports errors with disaggregated storage and compute

TiDBer_Lee · September 26, 2023, 01:54

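[TiDB environment] Production / Test / PoC
[TiDB version] v7.1.1
The cluster is deployed with the disaggregated storage and compute architecture, and the compute node's log keeps reporting this error: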

[2023/09/26 01:48:55.096 +00:00] [ERROR] [DiagnosticsService.cpp:57] 
["TiFlashRaftProxyHelper is null, `DiagnosticsService::server_info` is useless"] 
[source=DiagnosticsService] [thread_id=351]

Has anyone else run into this problem?

TiDBer_oHSwKxOH · September 26, 2023, 03:57

Please post your deployment architecture.

有猫万事足 · September 26, 2023, 05:37

https://github.com/pingcap/tiflash/blob/803b58ed10185d2a663fa7dbd742fbced1095358/dbms/src/Storages/KVStore/FFI/ProxyFFI.h#L101


      
        : RawRustPtrWrap(inner_)
    {}
};

class MockSetFFI
{
    friend struct MockRaftStoreProxy;
    static void MockSetRustGcHelper(void (*)(RawVoidPtr, RawRustPtrType));
};

struct TiFlashRaftProxyHelper : RaftStoreProxyFFIHelper
{
    RaftProxyStatus getProxyStatus() const;
    bool checkEncryptionEnabled() const;
    EncryptionMethod getEncryptionMethod() const;
    FileEncryptionInfo getFile(const std::string &) const;
    FileEncryptionInfo newFile(const std::string &) const;
    FileEncryptionInfo deleteFile(const std::string &) const;
    FileEncryptionInfo linkFile(const std::string &, const std::string &) const;
    BatchReadIndexRes batchReadIndex_v1(const std::vector<kvrpcpb::ReadIndexRequest> &, uint64_t) const;
    BatchReadIndexRes batchReadIndex(const std::vector<kvrpcpb::ReadIndexRequest> &, uint64_t) const;

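TiFlashRaftProxyHelper inherits from RaftStoreProxyFFIHelper.

The role of RaftStoreProxyFFIHelper is described as follows:

TiFlash and the Proxy each wrap their FFI functions in a Helper object, and each side then holds a pointer to the other side's Helper. RaftStoreProxyFFIHelper is the handle that the Proxy gives TiFlash to call, and it wraps the RaftStoreProxy object. Through this handle, TiFlash can perform ReadIndex, parse SSTs, fetch Region information, and do Encryption-related work.

In other words, the handle that the TiKV side gives TiFlash to call is null. That TiFlash node therefore cannot use the handle for ReadIndex, SST parsing, fetching Region information, Encryption, and so on.

I suspect TiFlash replication will not work properly. I suggest checking whether there are any other log entries.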
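
To make the "handle is null" point concrete, here is a minimal sketch of the general pattern: one side holds a pointer to a helper made of an opaque pointer plus function pointers, and must check it before use. All names here (FakeProxy, ProxyHelperSketch, ServerSketch, readIndexImpl) are hypothetical and are not the real TiFlash types; this only illustrates the idea, not the actual implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the proxy-side object (not the real RaftStoreProxy).
struct FakeProxy
{
    std::uint64_t applied_index = 42;
};

// Hypothetical handle: an opaque pointer plus a function pointer, the general
// shape of a helper that one side of an FFI boundary hands to the other.
struct ProxyHelperSketch
{
    const FakeProxy * proxy_ptr = nullptr;
    std::uint64_t (*fn_read_index)(const FakeProxy *) = nullptr;
};

// Proxy-side implementation that the handle points at.
inline std::uint64_t readIndexImpl(const FakeProxy * proxy)
{
    return proxy->applied_index;
}

// Caller-side holder. The helper pointer may be null.
struct ServerSketch
{
    const ProxyHelperSketch * helper = nullptr;

    // Returns false and leaves `out` untouched when no usable handle is present.
    bool tryReadIndex(std::uint64_t & out) const
    {
        if (helper == nullptr || helper->proxy_ptr == nullptr || helper->fn_read_index == nullptr)
            return false;
        out = helper->fn_read_index(helper->proxy_ptr);
        return true;
    }
};

// Usage: a null handle makes the call unavailable; a wired handle works.
inline void demoNullHelper()
{
    ServerSketch server;
    std::uint64_t index = 0;
    assert(!server.tryReadIndex(index)); // no helper yet

    FakeProxy proxy;
    ProxyHelperSketch helper;
    helper.proxy_ptr = &proxy;
    helper.fn_read_index = &readIndexImpl;
    server.helper = &helper;
    assert(server.tryReadIndex(index) && index == 42);
}
```

With a null helper, every operation routed through the handle is simply unavailable to the caller, which is why the log message calls server_info "useless" in that state.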

TiDBer_Lee · September 26, 2023, 07:14

So far only the TiFlash compute node shows this error;
the tables were all re-replicated and their status is available;
there is a warning on the PD node:
[grpclog.go:60] ["transport: http2Server.HandleStreams failed to read frame: read tcp 10.60.71.229:2379->10.60.76.129:47208: read: connection reset by peer"]

TiDBer_小阿飞 · September 26, 2023, 07:20

This one is too hard for me! Waiting for an expert to explain and for a best answer.

有猫万事足 · September 26, 2023, 07:37

https://github.com/pingcap/tiflash/blob/v7.1.1/dbms/src/Flash/DiagnosticsService.cpp#L45

Just before this error, the code does check whether the node is a TiFlash compute node.

tiflash compute node should be managed by AutoScaler instead of PD, this grpc should not be called be AutoScaler for now

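The gist is that a compute node should be managed by AutoScaler rather than by PD. Since execution gets as far as line 57 before logging the error, the check at line 43 evidently did not take effect. Could this be a bug?

Even if it is a bug, though, it looks like the log level is just alarming, and most likely there is no real impact.
Put plainly, the node failed to identify itself as a compute node and then printed a log message it should not have printed.
Let's wait for others to take a look. I'm out of ideas.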
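
The reasoning above can be sketched as a two-step control flow. This is only a sketch under assumptions: NodeContextSketch, early_check_matches, ServerInfoResult, and handleServerInfo are hypothetical names, and early_check_matches stands for whatever condition the real early check evaluates, which is not reproduced here. The two log strings are copied from the messages quoted in this thread.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical view of the state the handler looks at.
struct NodeContextSketch
{
    bool early_check_matches = false;   // outcome of the early compute-node check (assumed)
    const void * proxy_helper = nullptr; // opaque stand-in for the helper pointer
};

enum class ServerInfoResult
{
    RejectedEarly, // the early check matched and the call returned at once
    HelperMissing, // fell through the early check, then found a null helper
    Ok
};

// Mirrors the order described in the post: early check first, null check second.
inline ServerInfoResult handleServerInfo(const NodeContextSketch & ctx, std::vector<std::string> & log)
{
    if (ctx.early_check_matches)
    {
        log.push_back("[ERROR] tiflash compute node should be managed by AutoScaler instead of PD, "
                      "this grpc should not be called be AutoScaler for now");
        return ServerInfoResult::RejectedEarly;
    }
    if (ctx.proxy_helper == nullptr)
    {
        log.push_back("[ERROR] TiFlashRaftProxyHelper is null, `DiagnosticsService::server_info` is useless");
        return ServerInfoResult::HelperMissing;
    }
    return ServerInfoResult::Ok;
}

// Usage: the second message can only appear when the early check did not match.
inline void demoServerInfoFlow()
{
    std::vector<std::string> log;
    NodeContextSketch ctx; // early check false, helper null
    assert(handleServerInfo(ctx, log) == ServerInfoResult::HelperMissing);

    ctx.early_check_matches = true;
    assert(handleServerInfo(ctx, log) == ServerInfoResult::RejectedEarly);
    assert(log.size() == 2);
}
```

In this shape, seeing the "TiFlashRaftProxyHelper is null" line in the log implies that the early check evaluated to false for that request, which matches the observation in the post.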

ajin0514 · October 3, 2023, 01:34

You could try a newer version and see whether it still happens.