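【TiDB Environment】Production
【TiDB Version】
PD version: release-5.0
etcd version: 3.4.3
【Reproduction Steps】
A PD cluster with three nodes; kill the process on the leader node directly.
【Problem: Symptoms and Impact】
The problem is intermittent. In most cases, after the leader is killed, one of the followers is elected leader, but on one occasion the following happened:
The etcd server on the leader node never finished exiting and kept printing gRPC WARN logs.
pd-0 (leader) log: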
[2024/10/30 18:03:03.064 +08:00] [INFO] [server.go:1369] [“server is closed, return pd leader loop”]
[2024/10/30 18:03:03.064 +08:00] [INFO] [etcd.go:360] [“closing etcd server”] [name=pd-0] [data-dir=/pd-0/data] [advertise-peer-urls=“[http://pd-0:2380]”] [advertise-client-urls=“[http://10.15.252.37:2379]”]
[2024/10/30 18:03:03.064 +08:00] [WARN] [grpclog.go:60] [“grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0 }. Err :connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: connect: connection refused". Reconnecting…”]
[2024/10/30 18:03:04.065 +08:00] [WARN] [grpclog.go:60] [“grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0 }. Err :connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: connect: connection refused". Reconnecting…”]
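The other nodes then could not elect a new leader either, because they still considered the original leader's etcd server to be alive.
pd-1 (follower) log:
pd-1 considers pd-0's etcd server to be still alive, so it cannot take over as leader.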
[2024/10/30 18:03:03.569 +08:00] [WARN] [grpclog.go:60] [“grpc: addrConn.createTransport failed to connect to {http://10.15.252.37:2379 0 }. Err :connection error: desc = "transport: Error while dialing dial tcp 10.15.252.37:2379: connect: connection refused". Reconnecting…”]
[2024/10/30 18:03:03.669 +08:00] [INFO] [server.go:1399] [“skip campaigning of pd leader and check later”] [server-name=pd-1] [etcd-leader-id=11612914899710741714] [member-id=764943496391167270]
[2024/10/30 18:03:03.871 +08:00] [INFO] [server.go:1399] [“skip campaigning of pd leader and check later”] [server-name=pd-1] [etcd-leader-id=11612914899710741714] [member-id=764943496391167270]
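To make the pd-1 log easier to read, here is a minimal Python sketch that pulls the `[key=value]` fields out of a "skip campaigning" line and compares the two IDs. It assumes, based only on the log fields above, that a PD member campaigns for PD leader only when it is itself the etcd leader. The helper names `parse_fields` and `should_campaign` are made up for illustration and are not part of PD.

```python
import re

# Hypothetical helper: matches [key=value] fields in a PD log line.
# Timestamp, level, source location and the quoted message do not match.
FIELD_RE = re.compile(r"\[([\w-]+)=([^\]]*)\]")


def parse_fields(line: str) -> dict:
    """Return the key=value fields of one log line as a dict of strings."""
    return dict(FIELD_RE.findall(line))


def should_campaign(etcd_leader_id: int, member_id: int) -> bool:
    """Assumed rule: campaign only if this member is the etcd leader."""
    return etcd_leader_id == member_id


# Line copied from the pd-1 log above (message quotes normalized to ASCII).
line = (
    '[2024/10/30 18:03:03.669 +08:00] [INFO] [server.go:1399] '
    '["skip campaigning of pd leader and check later"] [server-name=pd-1] '
    '[etcd-leader-id=11612914899710741714] [member-id=764943496391167270]'
)

fields = parse_fields(line)
etcd_leader = int(fields["etcd-leader-id"])
member = int(fields["member-id"])

# The IDs differ, so under the assumed rule pd-1 keeps skipping the campaign.
print(fields["server-name"], should_campaign(etcd_leader, member))  # pd-1 False

# etcd tools usually print member IDs in hex, so print that form as well
# to make it easier to match against a member list.
print(f"etcd leader id (hex): {etcd_leader:x}")
print(f"member id (hex):      {member:x}")
```

If this reading is right, pd-1 will keep logging this line for as long as etcd still reports the other member (presumably pd-0) as its leader, which matches the symptom described above.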
【Resource Configuration】Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots/Logs/Monitoring】