【TiDB Usage Environment】Production
【TiDB Version】v5.3.0
【Problem Encountered】The PD leader lease expired, and the "coordinator is stopping" step took extremely long: it took about 24 hours to reach "coordinator is stopped", after which the cluster returned to normal.
【Reproduction Path】
Cluster scale: 2,000 tikv-server instances; 200,000 Regions; 3 PD nodes.
【Symptoms and Impact】
The issue appeared after the lease expired.
At that point CPU idle dropped from 80% to 30%; memory and disk I/O showed no obvious change.
Logs:
[kvstore@ip-xxx log]$ grep --color coordinator pd.log
[2022/06/19 01:31:04.686 +08:00] [INFO] [cluster.go:372] ["coordinator is stopping"]
[2022/06/19 01:31:04.686 +08:00] [INFO] [coordinator.go:285] ["drive push operator has been stopped"]
[2022/06/19 01:31:04.686 +08:00] [INFO] [coordinator.go:220] ["check suspect key ranges has been stopped"]
[2022/06/19 01:39:51.198 +08:00] [INFO] [coordinator.go:796] ["scheduler has been stopped"] [scheduler-name=balance-leader-scheduler] [error="context canceled"]
[2022/06/19 02:41:52.496 +08:00] [INFO] [coordinator.go:110] ["patrol regions has been stopped"]
[2022/06/19 16:17:36.003 +08:00] [INFO] [coordinator.go:796] ["scheduler has been stopped"] [scheduler-name=balance-hot-region-scheduler] [error="context canceled"]
[2022/06/20 08:42:31.097 +08:00] [INFO] [coordinator.go:796] ["scheduler has been stopped"] [scheduler-name=balance-region-scheduler] [error="context canceled"]
[2022/06/20 08:42:31.097 +08:00] [INFO] [cluster.go:368] ["coordinator has been stopped"]
[2022/06/20 08:42:33.496 +08:00] [INFO] [coordinator.go:296] ["coordinator starts to collect cluster information"]
[2022/06/20 08:47:33.497 +08:00] [INFO] [coordinator.go:299] ["coordinator has finished cluster information preparation"]
[2022/06/20 08:47:33.497 +08:00] [INFO] [coordinator.go:309] ["coordinator starts to run schedulers"]
[2022/06/20 08:47:33.498 +08:00] [INFO] [coordinator.go:357] ["create scheduler with independent configuration"] [scheduler-name=balance-hot-region-scheduler]
[2022/06/20 08:47:33.499 +08:00] [INFO] [coordinator.go:357] ["create scheduler with independent configuration"] [scheduler-name=balance-leader-scheduler]
[2022/06/20 08:47:33.500 +08:00] [INFO] [coordinator.go:357] ["create scheduler with independent configuration"] [scheduler-name=balance-region-scheduler]
[2022/06/20 08:47:33.500 +08:00] [INFO] [coordinator.go:379] ["create scheduler"] [scheduler-name=balance-region-scheduler] [scheduler-args="[]"]
[2022/06/20 08:47:33.501 +08:00] [INFO] [coordinator.go:379] ["create scheduler"] [scheduler-name=balance-leader-scheduler] [scheduler-args="[]"]
[2022/06/20 08:47:33.502 +08:00] [INFO] [coordinator.go:379] ["create scheduler"] [scheduler-name=balance-hot-region-scheduler] [scheduler-args="[]"]
[2022/06/20 08:47:33.505 +08:00] [INFO] [coordinator.go:279] ["coordinator begins to actively drive push operator"]
[2022/06/20 08:47:33.505 +08:00] [INFO] [coordinator.go:102] ["coordinator starts patrol regions"]
[2022/06/20 08:47:33.505 +08:00] [INFO] [coordinator.go:214] ["coordinator begins to check suspect key ranges"]
Update:
The network is healthy, and the total number of Regions in the cluster is 200,000.
The cluster is deployed on HDD disks, so the PD lease expires easily; after raising the lease to 5s, PD no longer switches leader frequently.
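For reference, raising the lease on a tiup-managed cluster would look roughly like the sketch below. This is only an illustration: `lease` is PD's leader lease TTL in seconds (default 3), and the placeholder cluster name and reload step are assumptions about the deployment.

```yaml
# tiup cluster edit-config <cluster-name>, then reload the PD nodes
server_configs:
  pd:
    lease: 5   # PD leader lease TTL in seconds (default 3); a longer lease tolerates slower disks
```

The trade-off is that a longer lease also delays leader re-election when a PD node genuinely fails.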
The problem being reported here is mainly that stopping the coordinator takes far too long.
Our analysis points to two causes:
- During the incident, the PD instance consumed a lot of CPU (around 50 cores), which slowed the schedulers down so much that they could not finish in time;
- The schedulers' per-round computation is very expensive: 2000 (TiKV nodes) × 10 (retries) × 2000 (TiKV nodes) × 5 (filters) ≈ 200 million evaluations, as sketched below. Even under normal machine load, our tests show a single scheduling round takes tens of minutes. That is far too long: until the scheduling finishes, PD stays leaderless, which affects client access.
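To make the order of magnitude concrete, here is a minimal, self-contained sketch of where the ~200 million figure comes from. It is not PD's actual scheduler code; the constants and loop structure only mirror the estimate above.

```go
package main

import "fmt"

func main() {
	// Hypothetical constants taken from the estimate above, not from PD itself.
	const (
		stores  = 2000 // tikv-server instances (both source and candidate target stores)
		retries = 10   // retry limit per scheduling attempt
		filters = 5    // filters applied to each source/target pair
	)

	// Each retry re-evaluates every source store against every candidate
	// target store, and every pair passes through all filters.
	evaluations := stores * retries * stores * filters
	fmt.Printf("approx. filter evaluations per scheduling round: %d\n", evaluations) // ~2e8
}
```

Because the cost grows quadratically with the number of stores, a 2,000-instance cluster pays a far higher price per round than a typical deployment.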