After restarting the whole TiKV cluster, already-connected client-go clients cannot recover

【TiDB Environment】Production / Test / PoC
Test
【TiDB Version】
Bare TiKV + PD cluster (no TiDB), v8.1.0
【Operating System】

【Deployment Method】Cloud deployment (which cloud) / bare-metal deployment (machine spec, disk type)
High-spec physical machines with SSDs
【Cluster Data Volume】
90 billion
【Number of Cluster Nodes】
7
【Reproduction Steps】Operations performed before the problem appeared
We access the TiKV cluster with the official client-go. While the clients were connected, the entire TiKV cluster was restarted with tiup cluster reload. After the reload finished, the already-connected clients never recovered and kept reporting errors.
The errors indicate that loadStore from PD failed.
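For reference, a minimal sketch of the access pattern, assuming the RawKV client is used (the PD address and the key/value names below are placeholders, not the real ones). Before the reload the loop runs fine; after the whole cluster is reloaded, the calls keep failing and the process only recovers after being restarted:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/tikv/client-go/v2/rawkv"
)

func main() {
	ctx := context.Background()

	// Placeholder PD endpoint; replace with the real PD address(es).
	cli, err := rawkv.NewClientWithOpts(ctx, []string{"127.0.0.1:2379"})
	if err != nil {
		log.Fatalf("create rawkv client: %v", err)
	}
	defer cli.Close()

	// Keep issuing simple reads and writes. Once the whole cluster has been
	// reloaded, these calls keep failing and the client log keeps printing
	// the errors pasted below.
	for {
		if err := cli.Put(ctx, []byte("test-key"), []byte("test-value")); err != nil {
			log.Printf("put failed: %v", err)
		}
		if _, err := cli.Get(ctx, []byte("test-key")); err != nil {
			log.Printf("get failed: %v", err)
		}
		time.Sleep(time.Second)
	}
}
```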
【Problem: Symptoms and Impact】
The client was accessing the TiKV cluster normally; then the entire TiKV cluster was restarted. After the restart completed, the client could not recover and kept reporting errors.
【Resource Configuration】Go to TiDB Dashboard - Cluster Info - Hosts and attach a screenshot of that page
【Copy and Paste the ERROR Logs】
[2025/07/09 18:16:59.708 +08:00] [ERROR] [store_cache.go:440] ["loadStore from PD failed"] [id=6] [error="rpc error: code = Unavailable desc = not leader"] [errorVerbose="rpc error: code = Unavailable desc = not leader\ngithub.com/tikv/pd/client.(*client).respForErr\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20240320081713-c00c42e77b31/client.go:1596\ngithub.com/tikv/pd/client.(*client).GetStore\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20240320081713-c00c42e77b31/client.go:1163\ngithub.com/tikv/client-go/v2/internal/locate.(*storeCacheImpl).fetchStore\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240913090512-3777c384feb1/internal/locate/store_cache.go:116\ngithub.com/tikv/client-go/v2/internal/locate.(*Store).reResolve\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240913090512-3777c384feb1/internal/locate/store_cache.go:430\ngithub.com/tikv/client-go/v2/internal/locate.(*RegionCache).checkAndResolve\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240913090512-3777c384feb1/internal/locate/region_cache.go:762\ngithub.com/tikv/client-go/v2/internal/locate.NewRegionCache.func1\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240913090512-3777c384feb1/internal/locate/region_cache.go:703\ngithub.com/tikv/client-go/v2/internal/locate.(*bgRunner).scheduleWithTrigger.func1\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240913090512-3777c384feb1/internal/locate/region_cache.go:602\nruntime.goexit\n\t/opt/go1.23.0/src/runtime/asm_amd64.s:1700"] [stack="github.com/tikv/client-go/v2/internal/locate.(*Store).reResolve\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240913090512-3777c384feb1/internal/locate/store_cache.go:440\ngithub.com/tikv/client-go/v2/internal/locate.(*RegionCache).checkAndResolve\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240913090512-3777c384feb1/internal/locate/region_cache.go:762\ngithub.com/tikv/client-go/v2/internal/locate.NewRegionCache.func1\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240913090512-3777c384feb1/internal/locate/region_cache.go:703\ngithub.com/tikv/client-go/v2/internal/locate.(*bgRunner).scheduleWithTrigger.func1\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240913090512-3777c384feb1/internal/locate/region_cache.go:602"]
[2025/07/09 18:16:59.708 +08:00] [ERROR] [error.go:339] ["encountered error"] [error="rpc error: code = Unavailable desc = not leader"] [errorVerbose="rpc error: code = Unavailable desc = not leader\ngithub.com/tikv/pd/client.(*client).respForErr\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20240320081713-c00c42e77b31/client.go:1596\ngithub.com/tikv/pd/client.(*client).GetStore\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20240320081713-c00c42e77b31/client.go:1163\ngithub.com/tikv/client-go/v2/internal/locate.(*storeCacheImpl).fetchStore\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240913090512-3777c384feb1/internal/locate/store_cache.go:116\ngithub.com/tikv/client-go/v2/internal/locate.(*Store).reResolve\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240913090512-3777c384feb1/internal/locate/store_cache.go:430\ngithub.com/tikv/client-go/v2/internal/locate.(*RegionCache).checkAndResolve\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240913090512-3777c384feb1/internal/locate/region_cache.go:762\ngithub.com/tikv/client-go/v2/internal/locate.NewRegionCache.func1\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240913090512-3777c384feb1/internal/locate/region_cache.go:703\ngithub.com/tikv/client-go/v2/internal/locate.(*bgRunner).scheduleWithTrigger.func1\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240913090512-3777c384feb1/internal/locate/region_cache.go:602\nruntime.goexit\n\t/opt/go1.23.0/src/runtime/asm_amd64.s:1700"] [stack="github.com/tikv/client-go/v2/error.Log\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240913090512-3777c384feb1/error/error.go:339\ngithub.com/tikv/client-go/v2/internal/locate.(*RegionCache).checkAndResolve\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240913090512-3777c384feb1/internal/locate/region_cache.go:763\ngithub.com/tikv/client-go/v2/internal/locate.NewRegionCache.func1\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240913090512-3777c384feb1/internal/locate/region_cache.go:703\ngithub.com/tikv/client-go/v2/internal/locate.(*bgRunner).scheduleWithTrigger.func1\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240913090512-3777c384feb1/internal/locate/region_cache.go:602"] [stack="github.com/tikv/client-go/v2/error.Log\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240913090512-3777c384feb1/error/error.go:339\ngithub.com/tikv/client-go/v2/internal/locate.(*RegionCache).checkAndResolve\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240913090512-3777c384feb1/internal/locate/region_cache.go:763\ngithub.com/tikv/client-go/v2/internal/locate.NewRegionCache.func1\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240913090512-3777c384feb1/internal/locate/region_cache.go:703\ngithub.com/tikv/client-go/v2/internal/locate.(*bgRunner).scheduleWithTrigger.func1\n\t/home/linzongbin/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240913090512-3777c384feb1/internal/locate/region_cache.go:602"]
【Other Attachments: Screenshots / Logs / Monitoring】

RawKV is used by relatively few community users.

It is recommended to use a regular TiDB cluster instead.

  1. Check the PD cluster status
  • Use pd-ctl (via tiup) to confirm that the PD cluster has elected a leader normally
  • Example: tiup ctl:v8.1.0 pd -u http://pd-ip:2379 member (the output lists the members and the current leader)
  2. Refresh the client connections
  • Restart the client application so it re-establishes its connection to PD (see the sketch after this list)
  • If a connection pool is used, make sure it is configured with a reasonable retry mechanism
  3. Check network connectivity
  • Verify network connectivity between the client and the PD nodes
  • Check whether firewall rules block the PD client port (default 2379)
  4. Monitor the PD cluster load
  • Check the CPU and memory usage of the PD nodes in TiDB Dashboard
  • If the load is too high, consider adding PD nodes or tuning the scheduling parameters
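
On point 2, if restarting the whole application is too disruptive, a client-side workaround is to rebuild the RawKV client after a run of consecutive failures, so that its store/region caches are repopulated from PD. This is only a sketch, not a root-cause fix; the PD address is a placeholder, and the recreate helper and the failure threshold are made up for illustration:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/tikv/client-go/v2/rawkv"
)

// recreate closes the old client (if any) and builds a fresh one so that the
// store/region caches are repopulated from PD. The name and retry policy are
// illustrative only.
func recreate(ctx context.Context, old *rawkv.Client, pdAddrs []string) (*rawkv.Client, error) {
	if old != nil {
		_ = old.Close()
	}
	return rawkv.NewClientWithOpts(ctx, pdAddrs)
}

func main() {
	ctx := context.Background()
	pdAddrs := []string{"127.0.0.1:2379"} // placeholder PD endpoint

	cli, err := recreate(ctx, nil, pdAddrs)
	if err != nil {
		log.Fatalf("create rawkv client: %v", err)
	}

	consecutiveFailures := 0
	for {
		if err := cli.Put(ctx, []byte("test-key"), []byte("test-value")); err != nil {
			consecutiveFailures++
			log.Printf("put failed (%d in a row): %v", consecutiveFailures, err)
			// After a burst of failures (e.g. right after a full cluster
			// reload), rebuild the client instead of retrying forever.
			if consecutiveFailures >= 5 {
				newCli, rerr := recreate(ctx, cli, pdAddrs)
				if rerr != nil {
					log.Printf("recreate client failed: %v", rerr)
				} else {
					cli = newCli
					consecutiveFailures = 0
				}
			}
		} else {
			consecutiveFailures = 0
		}
		time.Sleep(time.Second)
	}
}
```

Rebuilding the client only masks the symptom; it is still worth confirming on the server side why the existing client never re-resolves the stores after the reload.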