After one TiKV machine went down, connecting to TiDB became very slow, and queries became very slow too

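To help us respond efficiently, please provide the following information when asking a question. Clearly described questions get priority.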

  • [TiDB version]: 3.0.1
  • [Problem description]: The cluster currently has 7 TiKV machines running 14 TiKV instances, plus 8 TiDB nodes and 3 PD nodes.

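After one of the TiKV machines went down, opening a connection to TiDB hangs for more than 10 seconds, and a query on a table with only about 1,000 rows never returns a result.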

P.S. Other machines have gone down before, and we did not run into this problem then.

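If your question is about performance tuning or troubleshooting, please download and run the script. Be sure to select all of the output printed in the terminal, then copy and paste it here.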

Does TiDB print any error logs for this issue?

[2020/04/09 15:16:00.830 +08:00] [ERROR] [client.go:197] ["batchRecvLoop error when receive"] [target=down_tikv_ip:20164] [error="rpc error: code = Unavailable desc = transport is closing"] [stack="github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).batchRecvLoop\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:197"]
[2020/04/09 15:16:01.854 +08:00] [ERROR] [client.go:169] ["batchRecvLoop re-create streaming fail"] [target=down_tikv_ip:20164] [error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp down_tikv_ip:20164: i/o timeout\""] [stack="github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).reCreateStreamingClient\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:169\
github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).batchRecvLoop\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:203"]
[2020/04/09 15:16:02.854 +08:00] [ERROR] [client.go:197] ["batchRecvLoop error when receive"] [target=down_tikv_ip:20164] [error="rpc error: code = Unavailable desc = transport is closing"] [stack="github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).batchRecvLoop\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:197"]
[2020/04/09 15:16:05.346 +08:00] [ERROR] [client.go:169] ["batchRecvLoop re-create streaming fail"] [target=down_tikv_ip:20164] [error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp down_tikv_ip:20164: i/o timeout\""] [stack="github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).reCreateStreamingClient\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:169\
github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).batchRecvLoop\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:203"]
[2020/04/09 15:16:06.346 +08:00] [ERROR] [client.go:197] ["batchRecvLoop error when receive"] [target=down_tikv_ip:20164] [error="rpc error: code = Unavailable desc = transport is closing"] [stack="github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).batchRecvLoop\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:197"]
[2020/04/09 15:16:07.790 +08:00] [ERROR] [client.go:169] ["batchRecvLoop re-create streaming fail"] [target=down_tikv_ip:20163] [error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp down_tikv_ip:20163: i/o timeout\""] [stack="github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).reCreateStreamingClient\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:169\
github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).batchRecvLoop\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:203"]
[2020/04/09 15:16:07.798 +08:00] [ERROR] [client.go:169] ["batchRecvLoop re-create streaming fail"] [target=down_tikv_ip:20163] [error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp down_tikv_ip:20163: i/o timeout\""] [stack="github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).reCreateStreamingClient\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:169\
github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).batchRecvLoop\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:203"]
[2020/04/09 15:16:08.033 +08:00] [ERROR] [client.go:169] ["batchRecvLoop re-create streaming fail"] [target=down_tikv_ip:20164] [error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp down_tikv_ip:20164: i/o timeout\""] [stack="github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).reCreateStreamingClient\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:169\
github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).batchRecvLoop\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:203"]
[2020/04/09 15:16:08.790 +08:00] [ERROR] [client.go:197] ["batchRecvLoop error when receive"] [target=down_tikv_ip:20163] [error="rpc error: code = Unavailable desc = transport is closing"] [stack="github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).batchRecvLoop\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:197"]
[2020/04/09 15:16:08.798 +08:00] [ERROR] [client.go:197] ["batchRecvLoop error when receive"] [target=down_tikv_ip:20163] [error="rpc error: code = Unavailable desc = transport is closing"] [stack="github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).batchRecvLoop\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:197"]
[2020/04/09 15:16:08.829 +08:00] [ERROR] [client.go:169] ["batchRecvLoop re-create streaming fail"] [target=down_tikv_ip:20163] [error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp down_tikv_ip:20163: i/o timeout\""] [stack="github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).reCreateStreamingClient\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:169\
github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).batchRecvLoop\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:203"]
[2020/04/09 15:16:08.829 +08:00] [ERROR] [client.go:169] ["batchRecvLoop re-create streaming fail"] [target=down_tikv_ip:20163] [error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp down_tikv_ip:20163: i/o timeout\""] [stack="github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).reCreateStreamingClient\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:169\
github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).batchRecvLoop\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:203"]
[2020/04/09 15:16:08.940 +08:00] [ERROR] [client.go:169] ["batchRecvLoop re-create streaming fail"] [target=down_tikv_ip:20164] [error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp down_tikv_ip:20164: i/o timeout\""] [stack="github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).reCreateStreamingClient\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:169\
github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).batchRecvLoop\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:203"]
[2020/04/09 15:16:09.034 +08:00] [ERROR] [client.go:197] ["batchRecvLoop error when receive"] [target=down_tikv_ip:20164] [error="rpc error: code = Unavailable desc = transport is closing"] [stack="github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).batchRecvLoop\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:197"]
[2020/04/09 15:16:09.804 +08:00] [ERROR] [client.go:169] ["batchRecvLoop re-create streaming fail"] [target=down_tikv_ip:20163] [error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp down_tikv_ip:20163: i/o timeout\""] [stack="github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).reCreateStreamingClient\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:169\
github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).batchRecvLoop\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:203"]
[2020/04/09 15:16:09.811 +08:00] [ERROR] [client.go:169] ["batchRecvLoop re-create streaming fail"] [target=down_tikv_ip:20164] [error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp down_tikv_ip:20164: i/o timeout\""] [stack="github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).reCreateStreamingClient\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:169\
github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).batchRecvLoop\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:203"]
[2020/04/09 15:16:09.829 +08:00] [ERROR] [client.go:197] ["batchRecvLoop error when receive"] [target=down_tikv_ip:20163] [error="rpc error: code = Unavailable desc = transport is closing"] [stack="github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).batchRecvLoop\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:197"]
[2020/04/09 15:16:09.829 +08:00] [ERROR] [client.go:197] ["batchRecvLoop error when receive"] [target=down_tikv_ip:20163] [error="rpc error: code = Unavailable desc = transport is closing"] [stack="github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).batchRecvLoop\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:197"]
[2020/04/09 15:16:09.940 +08:00] [ERROR] [client.go:197] ["batchRecvLoop error when receive"] [target=down_tikv_ip:20164] [error="rpc error: code = Unavailable desc = transport is closing"] [stack="github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).batchRecvLoop\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:197"]
[2020/04/09 15:16:10.804 +08:00] [ERROR] [client.go:197] ["batchRecvLoop error when receive"] [target=down_tikv_ip:20163] [error="rpc error: code = Unavailable desc = transport is closing"] [stack="github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).batchRecvLoop\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:197"]
[2020/04/09 15:16:10.812 +08:00] [ERROR] [client.go:197] ["batchRecvLoop error when receive"] [target=down_tikv_ip:20164] [error="rpc error: code = Unavailable desc = transport is closing"] [stack="github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).batchRecvLoop\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:197"]
[2020/04/09 15:16:13.806 +08:00] [ERROR] [client.go:169] ["batchRecvLoop re-create streaming fail"] [target=down_tikv_ip:20163] [error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp down_tikv_ip:20163: i/o timeout\""] [stack="github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).reCreateStreamingClient\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:169\
github.com/pingcap/tidb/store/tikv.(*batchCommandsClient).batchRecvLoop\
\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/store/tikv/client.go:203"]
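The excerpt above repeats two messages ("batchRecvLoop error when receive" and "batchRecvLoop re-create streaming fail") for two targets on the same host, ports 20163 and 20164, which fits the layout of two TiKV instances per machine. To summarize a longer log the same way, the following is a minimal sketch, assuming the log format shown above (a `[client.go:NNN] ["message"]` field followed by a `[target=host:port]` field); `countErrors` is a hypothetical helper, not part of TiDB.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// targetRe captures the address in a [target=host:port] field.
var targetRe = regexp.MustCompile(`\[target=([^\]]+)\]`)

// msgRe captures the quoted message right after a [client.go:NNN] field.
var msgRe = regexp.MustCompile(`\[client\.go:\d+\] \["([^"]+)"\]`)

// countErrors tallies ERROR lines by target address and then by message.
// Stack-trace continuation lines contain no "[ERROR]" and are skipped.
func countErrors(logText string) map[string]map[string]int {
	counts := make(map[string]map[string]int)
	sc := bufio.NewScanner(strings.NewReader(logText))
	sc.Buffer(make([]byte, 1024*1024), 1024*1024) // log lines can be long
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, "[ERROR]") {
			continue
		}
		t := targetRe.FindStringSubmatch(line)
		m := msgRe.FindStringSubmatch(line)
		if t == nil || m == nil {
			continue
		}
		if counts[t[1]] == nil {
			counts[t[1]] = make(map[string]int)
		}
		counts[t[1]][m[1]]++
	}
	return counts
}

func main() {
	// Three lines taken from the excerpt above; error fields are shortened.
	sample := `[2020/04/09 15:16:00.830 +08:00] [ERROR] [client.go:197] ["batchRecvLoop error when receive"] [target=down_tikv_ip:20164] [error="rpc error: code = Unavailable desc = transport is closing"]
[2020/04/09 15:16:07.790 +08:00] [ERROR] [client.go:169] ["batchRecvLoop re-create streaming fail"] [target=down_tikv_ip:20163] [error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure"]
[2020/04/09 15:16:08.033 +08:00] [ERROR] [client.go:169] ["batchRecvLoop re-create streaming fail"] [target=down_tikv_ip:20164] [error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure"]`

	counts := countErrors(sample)
	targets := make([]string, 0, len(counts))
	for t := range counts {
		targets = append(targets, t)
	}
	sort.Strings(targets)
	for _, t := range targets {
		fmt.Println(t, counts[t])
	}
}
```

If every failing target belongs to the machine that went down, the errors themselves are expected; the question is why TiDB keeps waiting on that machine instead of moving on.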

TiDB 3.0.2 and later versions fix a series of bugs that caused abnormal network connections between TiDB and TiKV, including but not limited to:

  1. Region cache: https://github.com/pingcap/tidb/pull/11344
  2. tikvclient: https://github.com/pingcap/tidb/pull/11531
  3. https://github.com/pingcap/tidb/pull/11370

If possible, we recommend upgrading to the latest 3.0 release and then observing again.

OK, I'll upgrade and keep an eye on it.

OK. If you run into more problems, feel free to open a new thread. Thanks for the reply.