PD leader reports errors

As the title says, the PD leader node hit an error and a leader switch occurred.

The log from the abnormal PD leader node is as follows:

[2024/06/03 09:51:22.727 +08:00] [WARN] [raft.go:363] ["leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk"] [to=cd8ec02f78c9a682] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=63.080465ms]
[2024/06/01 09:51:22.733 +08:00] [WARN] [raft.go:363] ["leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk"] [to=8344550797e5521f] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=69.866043ms]
[2024/06/01 09:51:45.824 +08:00] [WARN] [raft.go:363] ["leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk"] [to=cd8ec02f78c9a682] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=161.18652ms]
[2024/06/01 09:51:45.829 +08:00] [WARN] [raft.go:363] ["leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk"] [to=8344550797e5521f] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=166.628628ms]
[2024/06/01 09:42:09.772 +08:00] [WARN] [raft.go:363] ["leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk"] [to=cd8ec02f78c9a682] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=108.628828ms]
[2024/06/01 09:42:09.780 +08:00] [WARN] [raft.go:363] ["leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk"] [to=8344550797e5521f] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=117.112641ms]
[2024/06/01 09:52:26.796 +08:00] [WARN] [raft.go:363] ["leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk"] [to=cd8ec02f78c9a682] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=16.07931ms]
[2024/06/01 09:52:26.806 +08:00] [WARN] [raft.go:363] ["leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk"] [to=8344550797e5521f] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=25.979491ms]
[2024/06/01 09:52:54.888 +08:00] [ERROR] [server.go:229] ["region syncer send data meet error"] [error="rpc error: code = Canceled desc = context canceled"]
[2024/06/01 09:52:54.906 +08:00] [ERROR] [server.go:229] ["region syncer send data meet error"] [error="rpc error: code = Canceled desc = context canceled"]
[2024/06/01 09:52:54.907 +08:00] [INFO] [server.go:238] ["region syncer delete the stream"] [stream=pd_kv03]
[2024/06/01 09:52:54.907 +08:00] [INFO] [server.go:238] ["region syncer delete the stream"] [stream=pd_kv05]
[2024/06/01 09:43:39.161 +08:00] [WARN] [node.go:408] ["e4b9b3f5dd8bcee4 (leader true) A tick missed to fire. Node blocks too long!"]

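It looks like the disk on your leader is slow. The heartbeats (500ms interval) are not being sent on time, so region sync cannot proceed.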
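To see how often and how badly the leader overran its heartbeat deadline, the warnings in the log above can be summarized per follower. This is only a rough sketch in Python: the function name is made up for illustration, and it assumes the overrun is always printed in `ms` or `s` as in the lines posted here (other Go duration units are not handled).

```python
import re
from collections import defaultdict

# Captures the follower id and the overrun from the raft heartbeat warnings.
HEARTBEAT_RE = re.compile(
    r"\[to=([0-9a-f]+)\].*\[exceeded-duration=([0-9.]+)(ms|s)\]"
)


def summarize_heartbeat_overruns(lines):
    """Return {follower_id: (warning_count, worst_overrun_ms)}."""
    stats = defaultdict(lambda: [0, 0.0])
    for line in lines:
        if "failed to send out heartbeat on time" not in line:
            continue
        match = HEARTBEAT_RE.search(line)
        if not match:
            continue
        peer, value, unit = match.group(1), float(match.group(2)), match.group(3)
        overrun_ms = value * 1000 if unit == "s" else value
        stats[peer][0] += 1
        stats[peer][1] = max(stats[peer][1], overrun_ms)
    return {peer: (count, worst) for peer, (count, worst) in stats.items()}
```

It can be fed an open log file, for example `summarize_heartbeat_overruns(open("pd.log"))`, where `pd.log` is a placeholder path. If the overruns cluster around the same timestamps for every follower, that points at the leader machine itself rather than at one network link.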

The load on this machine is probably high. Also, could this cluster be upgraded to a newer version?

Is it a mixed deployment?

Yes, mixed deployment. PD, server, and KV are all on the same machine.

It is a resource problem. Check the CPU, I/O, and so on of that machine directly and pin down which process is responsible.
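If tools such as `iotop` are not at hand, one rough way to find the process doing the writing on a Linux host is to diff the `write_bytes` counter in `/proc/<pid>/io` between two snapshots. The sketch below assumes Linux and enough privileges to read other processes' `io` files; all function names are illustrative.

```python
import os


def parse_proc_io(text):
    """Parse the contents of /proc/<pid>/io into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters


def snapshot_write_bytes(proc_root="/proc"):
    """Return {pid: (command_name, write_bytes)} for every readable process."""
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "io")) as f:
                io_counters = parse_proc_io(f.read())
            with open(os.path.join(proc_root, entry, "comm")) as f:
                name = f.read().strip()
        except OSError:
            continue  # the process exited, or we lack permission to read it
        result[int(entry)] = (name, io_counters.get("write_bytes", 0))
    return result


def top_writers(before, after, n=5):
    """Rank processes by bytes written between two snapshots, largest first."""
    deltas = []
    for pid, (name, after_bytes) in after.items():
        if pid in before:
            deltas.append((after_bytes - before[pid][1], pid, name))
    return sorted(deltas, reverse=True)[:n]
```

Calling `snapshot_write_bytes()` twice a few seconds apart and passing both results to `top_writers` gives a quick ranking. On a mixed deployment like this one, the interesting question is whether `tikv-server` or something else is saturating the disk that PD also writes to.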

Time to upgrade~
There happens to be an upgrade campaign going on right now:

That is a really old version~~

I recommend upgrading; the version is too old.

Go ahead and upgrade. The community is running an upgrade campaign right now too.

That version is pretty ancient.

I suggest upgrading and seeing whether that helps.

3.0.9

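After PD elected a new leader, the new PD leader's log keeps printing the messages below. Even though a new PD leader has been elected, access to the cluster is still broken. Why is that? If I restart the new leader so that leadership goes back to the previous node, things recover quickly.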

[2024/06/04 19:38:52.576 +08:00] [WARN] [cluster_info.go:92] ["region is stale"] [error="region is stale: region id:890419759 start_key:"t\200\000\000\000\000\020\317\377\316_r\200\000\000\000\000\377\003\333\241\000\000\000\000\000\372" end_key:"t\200\000\000\000\000\020\317\377\316_r\200\000\000\000\000\377\003\333\360\000\000\000\000\000\372" region_epoch:<conf_ver:1625 version:366449 > peers:<id:890419760 store_id:1 > peers:<id:890419761 store_id:4 > peers:<id:890419762 store_id:8 > origin id:890395107 start_key:"t\200\000\000\000\000\020\317\377\316_r\200\000\000\000\000\377\003\333\241\000\000\000\000\000\372" end_key:"t\200\000\000\000\000\020\317\377\316_r\200\000\000\000\000\377\003\333\360\000\000\000\000\000\372" region_epoch:<conf_ver:1625 version:366477 > peers:<id:890395108 store_id:1 > peers:<id:890395109 store_id:4 > peers:<id:890395110 store_id:8 > "] [origin=]
[2024/06/04 19:38:52.576 +08:00] [WARN] [cluster_info.go:92] ["region is stale"] [error="region is stale: region id:890419776 start_key:"t\200\000\000\000\000\0234\377#_r\200\000\000\000\001\377{\252\003\000\000\000\000\000\372" end_key:"t\200\000\000\000\000\0234\377#_r\200\000\000\000\001\377\202\235h\000\000\000\000\000\372" region_epoch:<conf_ver:3998 version:399473 > peers:<id:890419777 store_id:8 > peers:<id:890419778 store_id:4 > peers:<id:890972994 store_id:10 > origin id:890311137 start_key:"t\200\000\000\000\000\0234\377#_r\200\000\000\000\001\377{\252\003\000\000\000\000\000\372" end_key:"t\200\000\000\000\000\0234\377#_r\200\000\000\000\001\377\202\235h\000\000\000\000\000\372" region_epoch:<conf_ver:3992 version:399487 > peers:<id:890311138 store_id:8 > peers:<id:890311139 store_id:4 > peers:<id:890311140 store_id:10 > "] [origin=]
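Each of these messages carries two `region_epoch` values: the first belongs to the region being reported, the second (after `origin`) to the region the new PD leader already holds for the same key range. In both lines above the reported `version` is lower than the origin's (366449 vs 366477, and 399473 vs 399487), which is consistent with the heartbeat being rejected as stale. The exact rule PD applies is not shown in this thread, so the sketch below only extracts and compares the two epochs; the function name is illustrative.

```python
import re

# Matches each region_epoch:<conf_ver:N version:M > group in the message.
EPOCH_RE = re.compile(r"region_epoch:<conf_ver:(\d+) version:(\d+)\s*>")


def compare_epochs(log_line):
    """Compare the reported region epoch with the origin epoch in one log line."""
    epochs = [(int(conf), int(ver)) for conf, ver in EPOCH_RE.findall(log_line)]
    if len(epochs) != 2:
        return None  # not a "region is stale" line in the expected shape
    (rep_conf, rep_ver), (org_conf, org_ver) = epochs
    return {
        "reported": {"conf_ver": rep_conf, "version": rep_ver},
        "origin": {"conf_ver": org_conf, "version": org_ver},
        "version_behind": rep_ver < org_ver,
        "conf_ver_behind": rep_conf < org_conf,
    }
```

Running this over the whole log shows whether every rejected heartbeat is behind on `version`, which would suggest the new leader's cached region information and what the stores report have diverged, rather than a problem with a few individual regions.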

Is your TiKV reporting to PD normally? Maybe check the network first: can TiKV reach the PD port?
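A quick reachability check can be run from each TiKV host. This is a minimal sketch: the helper name is made up, and a successful TCP connect only shows that the port is reachable, not that heartbeats are being processed in time.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, DNS failure, or timed out
        return False
```

For example, `can_connect("pd-host.example", 2379)`, where the hostname is a placeholder and 2379 is PD's default client port; substitute whatever address and port this cluster actually uses.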

It may be that the new PD node, because of the mixed deployment, is slow to respond, or that there is a communication problem.