All TiKV nodes report 14-UNAVAILABLE, after which every PD node goes down and the TiKV nodes show status N/A. How should I handle this?

[2024/09/24 20:23:49.977 +08:00] [WARN] [client.rs:149] ["failed to update PD client"] [error="Other("[components/pd_client/src/util.rs:337]: cancel reconnection due to too small interval")"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [gc_manager.rs:369] ["failed to get safe point from pd"] [err_code=KV:Storage:Unknown] [err="Error(Other("[src/server/gc_worker/gc_worker.rs:80]: failed to get safe point from PD: Other(\"[components/pd_client/src/util.rs:421]: request retry exceeds limit\")"))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [WARN] [client.rs:149] ["failed to update PD client"] [error="Other("[components/pd_client/src/util.rs:337]: cancel reconnection due to too small interval")"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]

[2024/09/24 20:23:49.977 +08:00] [WARN] [client.rs:149] ["failed to update PD client"] [error="Other("[components/pd_client/src/util.rs:337]: cancel reconnection due to too small interval")"]

Can the PD nodes come up if you start them on their own?

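The cluster comes back up by itself after a minute or two. What I want to understand now is why the cluster goes down once every half hour.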
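To confirm the half-hourly pattern from the logs themselves, one option is to count the 14-UNAVAILABLE errors per minute and see where the bursts start. Below is a minimal sketch, assuming the TiKV log lines have the same leading `[timestamp] [LEVEL]` layout as the excerpt above. The function name `unavailable_per_minute` is made up for illustration and is not part of any TiDB tooling.

```python
import re
from collections import Counter

# Matches the leading "[2024/09/24 20:23:49.977 +08:00] [ERROR]" part of a log line
# and captures the timestamp truncated to the minute, plus the log level.
LINE_RE = re.compile(
    r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}):\d{2}\.\d{3} [+-]\d{2}:\d{2}\] \[(\w+)\]"
)


def unavailable_per_minute(lines):
    """Count ERROR lines that mention 14-UNAVAILABLE, keyed by 'YYYY/MM/DD HH:MM'."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group(2) == "ERROR" and "14-UNAVAILABLE" in line:
            counts[m.group(1)] += 1
    return counts


def print_report(path):
    """Print one 'minute count' row per minute that saw at least one error."""
    with open(path, encoding="utf-8", errors="replace") as f:
        for minute, n in sorted(unavailable_per_minute(f).items()):
            print(minute, n)
```

Calling `print_report` on a TiKV log file (the path is whatever your deployment uses) gives a list of minutes that can be lined up against the disk latency graphs. If the first error minute of each burst consistently comes after the latency spike begins, that suggests the I/O problem comes first.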

Check your disk utilization. The disk may be so busy that PD cannot respond in time.

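Disk utilization is fairly high during the problem windows, but how do I tell whether a cluster problem is causing the high I/O or the high I/O is causing the problem?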

In Grafana, look at the I/O during the restart window on the tidb-test-disk-performance page.

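1. You can also check the connectivity between the TiKV nodes and PD: ping continuously and see whether ping also fails while the problem is occurring.
2. What is the current disk utilization on TiKV? Also look at Top SQL for the problem window (available in the Dashboard) to see whether any SQL statements are consuming a lot of CPU.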
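For the connectivity check in point 1, plain `ping` only tests ICMP. A timestamped TCP connect probe against the PD client port can be easier to line up with the error bursts. Here is a rough sketch, run from a TiKV host. The helper names `probe_once` and `probe_loop` are invented for this example, and the host and port are whatever your PD actually listens on (2379 is PD's default client port).

```python
import socket
import time


def probe_once(host, port, timeout=1.0):
    """Return the TCP connect time in milliseconds, or None if the connect failed."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return None


def probe_loop(host, port, interval=1.0, rounds=60):
    """Probe repeatedly and print one timestamped line per attempt."""
    for _ in range(rounds):
        ms = probe_once(host, port)
        stamp = time.strftime("%Y/%m/%d %H:%M:%S")
        print(stamp, "FAILED" if ms is None else f"{ms:.1f}ms")
        time.sleep(interval)
```

Leaving something like `probe_loop("<pd-host>", 2379, rounds=3600)` running across one of the half-hourly incidents shows whether the network path itself degrades. If connects stay fast while PD still answers "not leader", the network is probably not the cause.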

Check the TiKV disk usage: look at Disk Latency in the Disk-Performance monitoring dashboard.


The cluster did indeed start misbehaving at both of these times, 18:00 and 18:22.

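The disk latency is far too high. It should generally stay below 10ms, but the write latency on sdb has reached 1min. You need to replace it with a higher-performance SSD; the official recommendation is NVMe.
The unavailable status on all TiKV nodes is caused by this excessive disk latency.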
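If you want to cross-check the Grafana numbers on the host itself, the average write latency can be derived from two snapshots of `/proc/diskstats` on Linux: the increase in milliseconds spent writing divided by the increase in completed writes, which is the same ratio `iostat` reports as `w_await`. The sketch below assumes the standard Linux diskstats field layout. The helper names are made up, and `sdb` is just the device mentioned in this thread.

```python
def parse_diskstats(text):
    """Map device name -> (writes_completed, ms_spent_writing) from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Fields: major minor name reads ... ; index 7 = writes completed,
        # index 10 = milliseconds spent writing.
        if len(parts) >= 11:
            stats[parts[2]] = (int(parts[7]), int(parts[10]))
    return stats


def avg_write_latency_ms(before, after, device):
    """Average ms per completed write between two snapshots, or None if no writes happened."""
    w0, t0 = before[device]
    w1, t1 = after[device]
    writes = w1 - w0
    if writes <= 0:
        return None
    return (t1 - t0) / writes


def read_snapshot(path="/proc/diskstats"):
    """Read and parse the current counters (Linux only)."""
    with open(path) as f:
        return parse_diskstats(f.read())
```

Taking `read_snapshot()` twice a few seconds apart and passing both results to `avg_write_latency_ms(..., "sdb")` gives a figure to compare against the roughly 10ms guideline above.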
