Data import hits "switch region leader to specific leader due to kv return NotLeader"

[TiDB version]
TiDB 4.0

[Problem description]
While importing data at 2000/s, the system keeps raising this alert:

[Warning][prod][tidb] - tidb_tikvclient_backoff_seconds_count
cluster:xxxxx instance: xxxxx:10080, values:31.794871794871792

I checked the log on the TiDB machine:

[2021/01/12 16:20:47.851 +08:00] [INFO] [region_cache.go:619] [“switch region peer to next due to send request fail”] [conn=2366722] [current=“region ID: 2, meta: id:2 start_key:“t\200\000\000\000\000\000\000=” region_epoch:<conf_ver:5 version:29 > peers:<id:3 store_id:1 > peers:<id:66 store_id:4 > peers:<id:73 store_id:5 > , peer: id:73 store_id:5 , addr: 10.89.20.7:30160, idx: 2, reqStoreType: TiKvOnly, runStoreType: tikv”] [needReload=false] [error=“rpc error: code = Unavailable desc = transport is closing”]
[2021/01/12 16:20:47.913 +08:00] [INFO] [region_cache.go:839] [“switch region leader to specific leader due to kv return NotLeader”] [regionID=2] [currIdx=0] [leaderStoreID=5]
[2021/01/12 16:20:48.323 +08:00] [INFO] [region_cache.go:619] [“switch region peer to next due to send request fail”] [current=“region ID: 18, meta: id:18 start_key:“t\200\000\000\000\000\000\000\017” end_key:“t\200\000\000\000\000\000\000\021” region_epoch:<conf_ver:5 version:8 > peers:<id:19 store_id:1 > peers:<id:52 store_id:4 > peers:<id:75 store_id:5 > , peer: id:75 store_id:5 , addr: 10.89.20.7:30160, idx: 2, reqStoreType: TiKvOnly, runStoreType: tikv”] [needReload=false] [error=“rpc error: code = Unavailable desc = transport is closing”]
[2021/01/12 16:20:48.391 +08:00] [INFO] [region_cache.go:839] [“switch region leader to specific leader due to kv return NotLeader”] [regionID=18] [currIdx=0] [leaderStoreID=5]
[2021/01/12 16:20:51.553 +08:00] [INFO] [region_cache.go:619] [“switch region peer to next due to send request fail”] [conn=2366722] [current=“region ID: 2, meta: id:2 start_key:“t\200\000\000\000\000\000\000=” region_epoch:<conf_ver:5 version:29 > peers:<id:3 store_id:1 > peers:<id:66 store_id:4 > peers:<id:73 store_id:5 > , peer: id:73 store_id:5 , addr: 10.89.20.7:30160, idx: 2, reqStoreType: TiKvOnly, runStoreType: tikv”] [needReload=false] [error=“rpc error: code = Unavailable desc = transport is closing”]
[2021/01/12 16:20:51.553 +08:00] [INFO] [region_cache.go:619] [“switch region peer to next due to send request fail”] [current=“region ID: 18, meta: id:18 start_key:“t\200\000\000\000\000\000\000\017” end_key:“t\200\000\000\000\000\000\000\021” region_epoch:<conf_ver:5 version:8 > peers:<id:19 store_id:1 > peers:<id:52 store_id:4 > peers:<id:75 store_id:5 > , peer: id:75 store_id:5 , addr: 10.89.20.7:30160, idx: 2, reqStoreType: TiKvOnly, runStoreType: tikv”] [needReload=false] [error=“rpc error: code = Unavailable desc = transport is closing”]
[2021/01/12 16:20:51.668 +08:00] [INFO] [region_cache.go:839] [“switch region leader to specific leader due to kv return NotLeader”] [regionID=2] [currIdx=0] [leaderStoreID=5]
[2021/01/12 16:20:51.740 +08:00]

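The business has not been affected so far, but monitoring keeps reporting that TiDB cannot reach TiKV. Judging from the errors, something is going wrong when regions are allocated or accessed.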

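See this thread: TiDB frequently alerts on `tidb_tikvclient_backoff_seconds_count`
If regionMiss does not occur often, it can be ignored. The alert fires because regions are moved away by balance and similar scheduling operations. That is normal, and the request is retried. If the cluster has had no scale-out or scale-in operations and regionMiss is still severe, it needs a closer look.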
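
To make the "it will be retried" point concrete, here is a minimal toy sketch of the general pattern: a client caches each region's leader, and when a store answers NotLeader with a hint, the client switches its cached leader, backs off briefly, and retries. This is not TiDB's actual implementation. `ToyRegionCache`, `NotLeader`, `make_store`, and `send_with_retry` are hypothetical names invented for illustration, and the retry count and sleep values are arbitrary assumptions.

```python
import time


class NotLeader(Exception):
    """Raised by the toy store when a request hits a non-leader peer."""

    def __init__(self, leader_store_id):
        super().__init__(f"not leader, try store {leader_store_id}")
        self.leader_store_id = leader_store_id


class ToyRegionCache:
    """Hypothetical stand-in for a client-side cache of region leaders."""

    def __init__(self, leaders):
        self.leaders = dict(leaders)  # region_id -> cached leader store id

    def switch_leader(self, region_id, store_id):
        self.leaders[region_id] = store_id


def make_store(actual_leaders):
    """Build a fake send function that rejects requests to non-leaders."""

    def send(store_id, region_id):
        leader = actual_leaders[region_id]
        if store_id != leader:
            raise NotLeader(leader)
        return f"ok from store {store_id}"

    return send


def send_with_retry(cache, region_id, send, max_retries=5, base_sleep=0.002):
    """Send a request; on NotLeader, update the cache, back off, retry.

    Returns (result, backoff_count).
    """
    backoffs = 0
    for attempt in range(max_retries + 1):
        store_id = cache.leaders[region_id]
        try:
            return send(store_id, region_id), backoffs
        except NotLeader as err:
            if attempt == max_retries:
                raise
            # Follow the leader hint, then wait before the next attempt.
            cache.switch_leader(region_id, err.leader_store_id)
            backoffs += 1
            time.sleep(base_sleep * (2 ** attempt))  # exponential backoff


if __name__ == "__main__":
    cache = ToyRegionCache({2: 1})   # stale: cache thinks store 1 leads region 2
    send = make_store({2: 5})        # the leader has actually moved to store 5
    print(send_with_retry(cache, 2, send))
```

In this toy model each backoff is one retry that the caller never sees as an error, which is loosely why a backoff counter can climb during leader moves while the business is unaffected.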

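1. Consider raising the alert threshold to avoid frequent alerts. The current threshold is rather low (the threshold in the official docs is a suggestion and should be adjusted to your own situation). You can also look up the backoff-related monitoring metrics; the threshold given in the official docs is not this low.
2. Frequent writes trigger hot-region scheduling, region splits, leader scheduling, and so on. This is normal.
3. So far this is only an alert. If there is no added latency and tidb-server shows no abnormal behavior, adjusting the alert threshold is enough.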
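
To judge whether these events are frequent enough to matter, one option is to tally them from the TiDB log. The sketch below is a rough helper, not an official tool. It assumes the log lines look like the ones pasted in the question (the fields `regionID`, `leaderStoreID`, and `addr` are taken from those lines), and the function name `tally` and the file name `tidb.log` are hypothetical.

```python
import re
from collections import Counter

# Patterns for the two messages quoted in this thread.
NOT_LEADER = re.compile(
    r"switch region leader to specific leader due to kv return NotLeader"
    r".*?\[regionID=(\d+)\].*?\[leaderStoreID=(\d+)\]"
)
SEND_FAIL = re.compile(
    r"switch region peer to next due to send request fail"
    r".*?region ID: (\d+),.*?addr: ([\d.]+:\d+)"
)


def tally(lines):
    """Count NotLeader switches per region and send failures per store address."""
    not_leader = Counter()
    send_fail = Counter()
    for line in lines:
        m = NOT_LEADER.search(line)
        if m:
            not_leader[int(m.group(1))] += 1
            continue
        m = SEND_FAIL.search(line)
        if m:
            send_fail[m.group(2)] += 1
    return not_leader, send_fail


if __name__ == "__main__":
    # "tidb.log" is a placeholder path; point it at the real log file.
    with open("tidb.log", encoding="utf-8", errors="replace") as f:
        regions, stores = tally(f)
    print("NotLeader switches per region:", regions.most_common(10))
    print("Send failures per store address:", stores.most_common(10))
```

If nearly all send failures point at a single store address, as in the pasted log where every failure names `10.89.20.7:30160`, that store is worth checking first.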

:+1::+1::+1:
