Issues I have found so far:
1. A PD leader switch happened at around 05:01 and node 25 became the leader. This looks like a slow-disk problem.
[2022/10/25 05:00:30.858 +08:00] [WARN] [wal.go:712] ["slow fdatasync"] [took=2.426173237s] [expected-duration=1s]
[2022/10/25 05:00:30.858 +08:00] [WARN] [raft.go:363] ["leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk"] [to=a8a1d6bded45bf4b] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=1.471525001s]
While node 25 was the leader, many PD monitoring items had no data. There was probably a communication problem with TiKV.
[2022/10/25 05:01:30.546 +08:00] [ERROR] [client.go:171] ["region sync with leader meet error"] [error="[PD:grpc:ErrGRPCRecv]rpc error: code = Canceled desc = context canceled"]
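To see how often the disk stalls, it may help to scan the PD log for "slow fdatasync" warnings. Below is a minimal Python sketch, assuming the bracketed log layout shown in the entries above. The helper names (`to_seconds`, `parse_line`, `slow_fsyncs`) are made up for illustration and are not part of any PD tooling, and only durations written in `s` or `ms` are handled.

```python
import re

# Two entries copied from the PD log quoted in this post.
SAMPLE_LOG = '''\
[2022/10/25 05:00:30.858 +08:00] [WARN] [wal.go:712] ["slow fdatasync"] [took=2.426173237s] [expected-duration=1s]
[2022/10/25 05:00:30.858 +08:00] [WARN] [raft.go:363] ["leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk"] [to=a8a1d6bded45bf4b] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=1.471525001s]
'''

# Matches one [...] section that contains no nested brackets.
BRACKET = re.compile(r"\[([^\[\]]*)\]")


def to_seconds(value):
    """Convert a duration such as '500ms' or '2.4s' to seconds.

    Assumption: only the 's' and 'ms' suffixes are supported.
    """
    if value.endswith("ms"):
        return float(value[:-2]) / 1000.0
    if value.endswith("s"):
        return float(value[:-1])
    raise ValueError(f"unsupported duration: {value}")


def parse_line(line):
    """Split a log line into time, level, source, message and key=value fields."""
    parts = BRACKET.findall(line)
    if len(parts) < 4:
        return None
    timestamp, level, source, message = parts[:4]
    fields = dict(p.split("=", 1) for p in parts[4:] if "=" in p)
    return {
        "time": timestamp,
        "level": level,
        "source": source,
        "message": message.strip('"'),
        "fields": fields,
    }


def slow_fsyncs(text):
    """Return (time, took_seconds, expected_seconds) for each slow fdatasync entry."""
    hits = []
    for line in text.splitlines():
        entry = parse_line(line)
        if entry is None or entry["message"] != "slow fdatasync":
            continue
        took = to_seconds(entry["fields"]["took"])
        expected = to_seconds(entry["fields"]["expected-duration"])
        if took > expected:
            hits.append((entry["time"], took, expected))
    return hits


for when, took, expected in slow_fsyncs(SAMPLE_LOG):
    print(f"{when}: fdatasync took {took:.3f}s (expected <= {expected:.0f}s)")
```

Running the same scan over the full log of each PD node would show whether the stall at 05:00:30 was a one-off or part of a recurring pattern on node 25's disk.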
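2. While node 25 was the leader, the transfer-leader operations from store 2 (17.21) and store 1 (17.35) to store 7 (17.32) could not complete normally and all timed out. My guess is that this is related to TiKV being busy. Later transfers from store 7 to stores 1 and 2 did succeed, though.
As for the root cause, I will leave that to the official experts to analyze. @neilshen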