PD stuck at "coordinator is stopping" for over 24 hours, causing QPS to drop to zero

【TiDB Environment】Production
【TiDB Version】v5.3.0
【Problem Encountered】The pd leader lease expired, and the "coordinator is stopping" step took extremely long: only after more than 24 hours did it reach "coordinator is stopped", after which the cluster returned to normal.
【Reproduction Path】
Cluster scale: 2000 tikv-server instances; 200k regions; 3 pd nodes
【Symptoms and Impact】
The problem appeared after the lease expired;
During that period CPU idle dropped from 80% to 30%, while memory and disk IO showed no obvious change;

Logs:
[kvstore@ip-xxx log]$ grep --color coordinator pd.log
[2022/06/19 01:31:04.686 +08:00] [INFO] [cluster.go:372] ["coordinator is stopping"]
[2022/06/19 01:31:04.686 +08:00] [INFO] [coordinator.go:285] ["drive push operator has been stopped"]
[2022/06/19 01:31:04.686 +08:00] [INFO] [coordinator.go:220] ["check suspect key ranges has been stopped"]
[2022/06/19 01:39:51.198 +08:00] [INFO] [coordinator.go:796] ["scheduler has been stopped"] [scheduler-name=balance-leader-scheduler] [error="context canceled"]
[2022/06/19 02:41:52.496 +08:00] [INFO] [coordinator.go:110] ["patrol regions has been stopped"]
[2022/06/19 16:17:36.003 +08:00] [INFO] [coordinator.go:796] ["scheduler has been stopped"] [scheduler-name=balance-hot-region-scheduler] [error="context canceled"]
[2022/06/20 08:42:31.097 +08:00] [INFO] [coordinator.go:796] ["scheduler has been stopped"] [scheduler-name=balance-region-scheduler] [error="context canceled"]
[2022/06/20 08:42:31.097 +08:00] [INFO] [cluster.go:368] ["coordinator has been stopped"]
[2022/06/20 08:42:33.496 +08:00] [INFO] [coordinator.go:296] ["coordinator starts to collect cluster information"]
[2022/06/20 08:47:33.497 +08:00] [INFO] [coordinator.go:299] ["coordinator has finished cluster information preparation"]
[2022/06/20 08:47:33.497 +08:00] [INFO] [coordinator.go:309] ["coordinator starts to run schedulers"]
[2022/06/20 08:47:33.498 +08:00] [INFO] [coordinator.go:357] ["create scheduler with independent configuration"] [scheduler-name=balance-hot-region-scheduler]
[2022/06/20 08:47:33.499 +08:00] [INFO] [coordinator.go:357] ["create scheduler with independent configuration"] [scheduler-name=balance-leader-scheduler]
[2022/06/20 08:47:33.500 +08:00] [INFO] [coordinator.go:357] ["create scheduler with independent configuration"] [scheduler-name=balance-region-scheduler]
[2022/06/20 08:47:33.500 +08:00] [INFO] [coordinator.go:379] ["create scheduler"] [scheduler-name=balance-region-scheduler] [scheduler-args="[]"]
[2022/06/20 08:47:33.501 +08:00] [INFO] [coordinator.go:379] ["create scheduler"] [scheduler-name=balance-leader-scheduler] [scheduler-args="[]"]
[2022/06/20 08:47:33.502 +08:00] [INFO] [coordinator.go:379] ["create scheduler"] [scheduler-name=balance-hot-region-scheduler] [scheduler-args="[]"]
[2022/06/20 08:47:33.505 +08:00] [INFO] [coordinator.go:279] ["coordinator begins to actively drive push operator"]
[2022/06/20 08:47:33.505 +08:00] [INFO] [coordinator.go:102] ["coordinator starts patrol regions"]
[2022/06/20 08:47:33.505 +08:00] [INFO] [coordinator.go:214] ["coordinator begins to check suspect key ranges"]

  1. You could check whether the network was normal during that period.
  2. Does a single tikv really hold 200k regions? That is far too many; consider scaling out tikv to spread the load.

The network was normal, and the cluster has 200k regions in total.
The cluster is deployed on HDD disks, so the pd lease expires easily; after adjusting the lease to 5s, pd no longer switches leader frequently.
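
For reference, a minimal sketch of that lease change, assuming the leader lease is the `lease` parameter in the PD configuration file (in seconds, default 3); the file name and comments are illustrative, and the change typically needs a restart of the pd nodes to take effect:

```toml
# pd.toml (illustrative): lengthen the PD leader lease so that slow HDD I/O
# is less likely to let the lease expire and trigger a leader switch.
# The default is 3 seconds; 5 matches the value mentioned above.
lease = 5
```

The trade-off is that a longer lease also delays failover when the pd leader really is down, which is part of why a later reply still recommends faster disks.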

The issue being reported here is mainly that the coordinator stop takes too long.

Analysis points to two causes:

  1. During the incident, the pd instance consumed a very large amount of CPU (about 50 cores), which slowed the schedulers down so that they could not finish in time;
  2. The scheduler's computation is too expensive: 2000 (tikv nodes) * 10 (retries) * 2000 (tikv nodes) * 5 (filters) comes to nearly 200 million evaluations. Even on a machine under normal load, a single scheduling pass was measured to take tens of minutes. That is far too long: until the scheduling finishes, pd stays in a no-leader state, which affects client access (see the rough sketch after this list).
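
To make the arithmetic in point 2 concrete, below is a rough, hypothetical cost model in Go. The constants come from the numbers above; the nested loops only illustrate how store count, retries, and filters multiply, and are not PD's actual scheduler code:

```go
package main

import "fmt"

func main() {
	const (
		stores  = 2000 // tikv-server instances in the cluster
		retries = 10   // retry limit per scheduling attempt
		filters = 5    // filters evaluated for each candidate store
	)

	// For every retry the scheduler walks candidate source stores, and for
	// each source it checks every candidate target store against each filter.
	evaluations := 0
	for r := 0; r < retries; r++ {
		for src := 0; src < stores; src++ {
			for dst := 0; dst < stores; dst++ {
				evaluations += filters
			}
		}
	}

	// 2000 * 10 * 2000 * 5 = 200,000,000 filter evaluations per round,
	// matching the "nearly 200 million" figure in the analysis above.
	fmt.Println("filter evaluations per scheduling round:", evaluations)
}
```

At that volume, even a cheap per-evaluation cost adds up to the tens of minutes per pass mentioned above, and because the coordinator waits for the running schedulers to exit (as the log timestamps show), the "coordinator is stopping" phase inherits that latency.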

For pd, SSD is still recommended, at least a SATA one. The Raft protocol is inherently a hard real-time protocol; stretching the timeouts too far defeats its purpose.

Understood. This cluster is mainly used for archival data storage, with no requirements on read performance. Because the cluster has so many tikv-server instances, it exposed that PD scheduling is very time-consuming and that it cannot bail out midway while the coordinator is shutting down.
