PD status abnormal

[TiDB Usage Environment] Development
[TiDB Version] v7.1.0
[Reproduction Path] Unexpected power outage in the data center; services came back in an abnormal state
[Encountered Problem: Symptoms and Impact]
PD status is abnormal

tiup log:

PD log:

TiKV log:

[2024/05/15 12:41:59.671 +08:00] [INFO] [<unknown>] ["subchannel 0xffe8e05c5800 {address=ipv4:172.16.12.113:2379, args=grpc.client_channel_factory=0xfffd00510cf8, grpc.default_authority=172.16.12.113:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0xfffd00201a40, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0xfffd000efd80, grpc.server_uri=dns:///172.16.12.113:2379}: Retry in 1000 milliseconds"]
[2024/05/15 12:41:59.671 +08:00] [ERROR] [util.rs:682] ["connect failed"] [error="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [endpoints=http://172.16.12.113:2379]
[2024/05/15 12:41:59.671 +08:00] [INFO] [util.rs:604] ["connecting to PD endpoint"] [endpoints=http://172.16.12.111:2379]
[2024/05/15 12:41:59.672 +08:00] [INFO] [<unknown>] ["subchannel 0xffe8e05c5c00 {address=ipv4:172.16.12.111:2379, args=grpc.client_channel_factory=0xfffd00510cf8, grpc.default_authority=172.16.12.111:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0xfffd00201a40, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0xfffd000efd80, grpc.server_uri=dns:///172.16.12.111:2379}: connect failed: {\"created\":\"@1715748119.672112390\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.10.3+1.44.0-patched/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:172.16.12.111:2379\"}"]
[2024/05/15 12:41:59.672 +08:00] [INFO] [<unknown>] ["subchannel 0xffe8e05c5c00 {address=ipv4:172.16.12.111:2379, args=grpc.client_channel_factory=0xfffd00510cf8, grpc.default_authority=172.16.12.111:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0xfffd00201a40, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0xfffd000efd80, grpc.server_uri=dns:///172.16.12.111:2379}: Retry in 1000 milliseconds"]
[2024/05/15 12:41:59.672 +08:00] [ERROR] [util.rs:682] ["connect failed"] [error="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [endpoints=http://172.16.12.111:2379]
[2024/05/15 12:41:59.672 +08:00] [INFO] [util.rs:604] ["connecting to PD endpoint"] [endpoints=http://172.16.12.112:2379]
[2024/05/15 12:41:59.751 +08:00] [WARN] [client.rs:152] ["failed to update PD client"] [error="Other(\"[components/pd_client/src/util.rs:343]: cancel reconnection due to too small interval\")"]
[2024/05/15 12:41:59.993 +08:00] [ERROR] [util.rs:462] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2024/05/15 12:41:59.993 +08:00] [WARN] [pd.rs:1842] ["report min resolved_ts failed"] [err="Other(\"[components/pd_client/src/util.rs:427]: request retry exceeds limit\")"]
[2024/05/15 12:41:59.994 +08:00] [ERROR] [util.rs:462] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2024/05/15 12:41:59.994 +08:00] [INFO] [util.rs:604] ["connecting to PD endpoint"] [endpoints=http://172.16.12.113:2379]
[2024/05/15 12:41:59.997 +08:00] [ERROR] [util.rs:462] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2024/05/15 12:41:59.999 +08:00] [ERROR] [util.rs:462] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2024/05/15 12:42:00.002 +08:00] [ERROR] [util.rs:462] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2024/05/15 12:42:00.007 +08:00] [ERROR] [util.rs:462] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2024/05/15 12:42:00.009 +08:00] [ERROR] [util.rs:462] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2024/05/15 12:42:00.012 +08:00] [ERROR] [util.rs:462] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2024/05/15 12:42:00.015 +08:00] [ERROR] [util.rs:462] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2024/05/15 12:42:00.017 +08:00] [ERROR] [util.rs:462] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2024/05/15 12:42:00.023 +08:00] [ERROR] [util.rs:462] ["request
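
The log above shows TiKV cycling through all three PD endpoints (172.16.12.111/112/113:2379) and hitting either "Connection refused" or "not leader", which suggests no PD member is up and holding leadership. Before attempting recovery, it may be worth probing each endpoint directly; a minimal sketch using PD's HTTP health API, with the IPs taken from the log above:

# Probe each PD endpoint from the TiKV log. "Connection refused" here means
# the pd-server process itself is down, not merely leaderless.
for ep in 172.16.12.111 172.16.12.112 172.16.12.113; do
  echo "== ${ep} =="
  curl -s --connect-timeout 2 "http://${ep}:2379/pd/api/v1/health" || echo "unreachable"
done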

How many PD nodes do you have?

Three nodes, mixed (co-located) deployment.

https://docs.pingcap.com/zh/tidb/stable/pd-recover

Learned something new.

Looks like PD data files are missing; you can restore PD with pd-recover (use the pd-recover version that matches your cluster, here v7.1.0 — see the restart sketch after these commands):
cat /data2/tidb-deploy/pd-2881/log/pd.log | grep "init cluster id"
cat /data2/tidb-deploy/pd-2881/log/pd*.log | grep "idAllocator allocates a new id" | awk -F'=' '{print $2}' | awk -F']' '{print $1}' | sort -r -n | head -n 1
tiup pd-recover:v7.1.0 -endpoints http://xx.xx.xx.xx:2379 -cluster-id xx -alloc-id xx
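
After pd-recover reports success, PD still needs a restart to serve the rewritten metadata, and the pd-recover doc linked above recommends restarting the cluster afterwards. A rough sketch, assuming a hypothetical tiup cluster name of tidb-test:

# Restart PD first so it comes up with the recovered cluster-id/alloc-id,
# then verify component status before resuming traffic.
tiup cluster restart tidb-test -R pd
tiup cluster display tidb-test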

Check how many PD nodes there are first.
