TiDB deployed on K8s with TiDB Operator: a node intermittently fails to rejoin the cluster after a failure and restart

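Bug report
The cluster is deployed on K8s with TiDB Operator. When simulating a power loss and recovery of one node, PD intermittently fails to rejoin the cluster after the node recovers.
[TiDB version] 6.5.8, TiDB Operator 1.5.3
[Impact of the bug]
After the node recovers from the failure, the PD on that node intermittently fails to rejoin the cluster. Reliability is reduced, and the cluster becomes unavailable if another node then fails.
[Possible steps to reproduce]
With a 3-node cluster running normally, power off one node so that it goes offline, then bring it back online after a while.
[Unexpected behavior observed]
After the node powers back on, pd keeps trying to get information from discovery, but keeps failing: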

kubectl logs -f -n namespace basic-pd-0

Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name: basic-pd-0.basic-pd-peer.namespace.svc
Address 1: 100.126.210.188 basic-pd-0.basic-pd-peer.namespace.svc.cluster.local
nslookup domain basic-pd-0.basic-pd-peer.namespace.svc.svc success
waiting for discovery service to return start args …
waiting for discovery service to return start args …
waiting for discovery service to return start args …
waiting for discovery service to return start args …
waiting for discovery service to return start args …

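[Expected behavior]
pd should rejoin the cluster normally.

[Related components and versions]

[Other background information or screenshots]

Looking at the startup script, pd stays stuck in the logic below, but the logging is too sparse to diagnose further: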

then
    until result=$(wget -qO- -T 3 http://${discovery_url}/new/${encoded_domain_url} 2>/dev/null); do
        echo "waiting for discovery service to return start args …"
        sleep $((RANDOM % 5))
    done
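The loop above discards wget's stderr with 2>/dev/null, which is why the pod log shows only the "waiting" message. A minimal sketch of a more talkative variant follows, for debugging only. The function name wait_for_start_args and the variables FETCH_CMD, RETRY_SLEEP_MOD and the max_tries argument are hypothetical and are not part of the TiDB Operator start script.

```shell
# Hypothetical debugging variant of the wait loop: it keeps the fetch error
# and prints it with every retry instead of sending it to /dev/null.
# FETCH_CMD and RETRY_SLEEP_MOD are illustrative overrides, not operator settings.
wait_for_start_args() {
    url=$1
    max_tries=${2:-0}          # 0 = retry forever, like the original loop
    tries=0
    err_file=$(mktemp)
    until result=$(${FETCH_CMD:-wget -qO- -T 3} "$url" 2>"$err_file"); do
        tries=$((tries + 1))
        echo "waiting for discovery service to return start args (attempt ${tries}): $(cat "$err_file")" >&2
        if [ "$max_tries" -gt 0 ] && [ "$tries" -ge "$max_tries" ]; then
            rm -f "$err_file"
            return 1
        fi
        sleep $(( ${RANDOM:-0} % ${RETRY_SLEEP_MOD:-5} ))
    done
    rm -f "$err_file"
    echo "$result"
}

# Example call, using the variables from the excerpt above:
#   wait_for_start_args "http://${discovery_url}/new/${encoded_domain_url}"
```

With this in place the pod log would carry the wget error text (for example the HTTP 500 shown later in this post) next to each retry.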

Running wget against discovery from the failing pd returns a 500 error:

/ # wget -qO- -T 3 http://basic-discovery.namespace.svc:10261/new/YmFzaWMtcGQtMC5iYXNpYy1wZC1wZWVyLm1haXB1LWJkY2FtcHVzLnN2YzoyMzgwCg==
wget: server returned error: HTTP/1.1 500 Internal Server Error
/ # nslookup basic-discovery
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name: basic-discovery
Address 1: 10.99.76.247 basic-discovery.namespace.svc.cluster.local
/ # exit
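The last path segment of the /new/ URL is a base64 string, so decoding it shows which peer address discovery is being asked about. The sketch below round-trips such a value. The helper names encode_peer_url and decode_peer_url are hypothetical, and the assumption that the start script builds the segment by base64-encoding "<domain>:2380" is inferred from the request above, not confirmed here.

```shell
# Hypothetical helpers for inspecting the /new/<segment> part of the URL.
# Assumption: the segment is plain base64 of "<pd peer domain>:2380".
encode_peer_url() {
    # echo appends a newline, then base64 output is joined onto one line
    echo "$1" | base64 | tr -d '\n'
}

decode_peer_url() {
    echo "$1" | base64 -d
}

# Example:
#   seg=$(encode_peer_url "basic-pd-0.basic-pd-peer.namespace.svc:2380")
#   decode_peer_url "$seg"
```

Comparing the decoded value with the advertisePeerUrl line in the discovery log is a quick way to confirm that both sides refer to the same pd member.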

The corresponding discovery log shows the following errors:

I0925 07:28:46.750851 1 discovery.go:80] advertisePeerUrl is: basic-pd-0.basic-pd-peer.namespace.svc:2380
E0925 07:28:46.778935 1 server.go:91] failed to discover: basic-pd-0.basic-pd-peer.namespace.svc:2380
, Get "http://basic-pd.namespace:2379/pd/api/v1/members": dial tcp 10.109.90.57:2379: connect: connection refused, register-type is: pd

I0925 07:28:50.789559 1 discovery.go:80] advertisePeerUrl is: basic-pd-0.basic-pd-peer.namespace.svc:2380
E0925 07:28:55.799493 1 server.go:91] failed to discover: basic-pd-0.basic-pd-peer.namespace.svc:2380
, Get "http://basic-pd.namespace:2379/pd/api/v1/members": context deadline exceeded (Client.Timeout exceeded while awaiting headers), register-type is: pd
I0925 07:28:55.799611 1 discovery.go:80] advertisePeerUrl is: basic-pd-0.basic-pd-peer.namespace.svc:2380
E0925 07:29:00.815882 1 server.go:91] failed to discover: basic-pd-0.basic-pd-peer.namespace.svc:2380
, Get "http://basic-pd.namespace:2379/pd/api/v1/members": context deadline exceeded (Client.Timeout exceeded while awaiting headers), register-type is: pd
I0925 07:29:01.810966 1 discovery.go:80] advertisePeerUrl is: basic-pd-0.basic-pd-peer.namespace.svc:2380
E0925 07:29:01.822615 1 server.go:91] failed to discover: basic-pd-0.basic-pd-peer.namespace.svc:2380
, Get "http://basic-pd.namespace:2379/pd/api/v1/members": dial tcp 10.109.90.57:2379: connect: connection refused, register-type is: pd
I0925 07:29:03.832399 1 discovery.go:80] advertisePeerUrl is: basic-pd-0.basic-pd-peer.namespace.svc:2380
E0925 07:29:08.844371 1 server.go:91] failed to discover: basic-pd-0.basic-pd-peer.namespace.svc:2380
, Get "http://basic-pd.namespace:2379/pd/api/v1/members": context deadline exceeded (Client.Timeout exceeded while awaiting headers), register-type is: pd
I0925 07:29:12.839433 1 discovery.go:80] advertisePeerUrl is: basic-pd-0.basic-pd-peer.namespace.svc:2380
E0925 07:29:17.851134 1 server.go:91] failed to discover: basic-pd-0.basic-pd-peer.namespace.svc:2380
, Get "http://basic-pd.namespace:2379/pd/api/v1/members": context deadline exceeded (Client.Timeout exceeded while awaiting headers), register-type is: pd
I0925 07:29:19.848595 1 discovery.go:80] advertisePeerUrl is: basic-pd-0.basic-pd-peer.namespace.svc:2380
E0925 07:29:19.858338 1 server.go:91] failed to discover: basic-pd-0.basic-pd-peer.namespace.svc:2380
, Get "http://basic-pd.namespace:2379/pd/api/v1/members": dial tcp 10.109.90.57:2379: connect: connection refused, register-type is: pd
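The log above mixes two different failure modes for the same request: "connection refused" and a client timeout. A small sketch for tallying them from a saved or piped log follows. The function name summarize_discovery_errors is hypothetical, and the two patterns are simply the error strings visible in the excerpt above.

```shell
# Hypothetical helper: count the two failure modes seen in the discovery log.
# Reads log text on stdin and prints one summary line.
summarize_discovery_errors() {
    awk '
        /connect: connection refused/ { refused++ }
        /context deadline exceeded/   { timeout++ }
        END { printf "refused=%d timeout=%d\n", refused, timeout }
    '
}

# Example (the pod name is whatever your discovery pod is called):
#   kubectl logs -n namespace <discovery-pod> | summarize_discovery_errors
```

Whether the two modes alternate or one dominates may help narrow down if the address behind basic-pd is rejecting connections or not answering at all, though the counts alone do not identify the cause.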

Checking this URL from inside the discovery pod: DNS resolution fails, and the address is exactly this unknown IP, 10.109.90.57:
[root@basic-discovery-5f4687dbb9-w64t2 /]# wget http://basic-pd.namespace:2379/pd/api/v1/members
bash: wget: command not found
[root@basic-discovery-5f4687dbb9-w64t2 /]# nslookup basic-pd
Server: 10.96.0.10
Address: 10.96.0.10#53

Name: basic-pd.namespace.svc.cluster.local
Address: 10.109.90.57
** server can't find basic-pd.namespace.svc.cluster.local: NXDOMAIN

[root@basic-discovery-5f4687dbb9-w64t2 /]# exit
exit
command terminated with exit code 1
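Since wget is not available in the discovery container, the TCP port can still be probed from the same shell. The sketch below assumes the container has bash built with /dev/tcp support and the timeout utility, which is not verified here. The function name tcp_probe is hypothetical.

```shell
# Hypothetical helper: succeed only if a TCP connection to host:port opens
# within 3 seconds. Relies on bash's /dev/tcp pseudo-device.
tcp_probe() {
    host=$1
    port=$2
    timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Example, run inside the discovery pod:
#   tcp_probe basic-pd.namespace 2379 && echo "port open" || echo "port closed or unreachable"
```

A failed probe against basic-pd.namespace:2379 would match the "connection refused" and timeout errors in the discovery log; a successful one would point the investigation at the HTTP layer instead.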

Is the PD leader healthy (in the cluster that lost power)?

If the data on the PD follower is corrupted, it is simpler to scale that member in and then scale out again.

So the network is simply unreachable?
Get "http://basic-pd.namespace:2379/pd/api/v1/members": dial tcp 10.109.90.57:2379: connect: connection refused, register-type is: pd
This shows that the request to the API was refused.
Probably an environment issue.

An environment issue? What kind of environment issue? What can cause discovery to return a 500 error?

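This error means the GET request to this pd API failed. The request fetches pd's member information from pd itself. If the cluster was reachable at that time, pd is healthy and only this API call is failing. If you can call the API manually but the call still fails here, look into why.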