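Bug report
With a cluster deployed on K8s via TiDB Operator, simulating a power loss and recovery on one node intermittently leaves PD unable to rejoin the cluster after the node recovers.
【TiDB version】 6.5.8, TiDB Operator 1.5.3
【Impact of the bug】
After a node recovers from a failure, the PD on that node intermittently fails to rejoin the cluster. This reduces reliability: if another node then fails, the cluster becomes unavailable.
【Possible steps to reproduce】
With a 3-node cluster running normally, power off one node so that it goes offline, then bring it back online after a while.
【Unexpected behavior observed】
After the node powers back on, PD keeps trying to fetch its start arguments from discovery and keeps failing: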
kubectl logs -f -n namespace basic-pd-0
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: basic-pd-0.basic-pd-peer.namespace.svc
Address 1: 100.126.210.188 basic-pd-0.basic-pd-peer.namespace.svc.cluster.local
nslookup domain basic-pd-0.basic-pd-peer.namespace.svc.svc success
waiting for discovery service to return start args …
waiting for discovery service to return start args …
waiting for discovery service to return start args …
waiting for discovery service to return start args …
waiting for discovery service to return start args …
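【Expected behavior】
PD should rejoin the cluster normally.
【Related components and versions】
【Other background information or screenshots】
According to the startup script, PD is stuck in the loop below, but the logs are not detailed enough to diagnose further: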
then
until result=$(wget -qO- -T 3 http://${discovery_url}/new/${encoded_domain_url} 2>/dev/null); do
echo "waiting for discovery service to return start args …"
sleep $((RANDOM % 5))
done
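Because the loop above sends wget's stderr to /dev/null, the pod log never shows why discovery is rejecting the request. A minimal sketch of a variant that keeps the error text is shown here. The helper name `retry_logging_errors` and the `RETRY_SLEEP` variable are hypothetical and are not part of the TiDB Operator startup script; the sketch assumes bash and `mktemp` are available in the image.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the TiDB Operator script): run the given
# command until it succeeds, printing the command's stderr on every failure
# instead of discarding it.
retry_logging_errors() {
  local err_file result
  err_file=$(mktemp)
  until result=$("$@" 2>"${err_file}"); do
    # Surface the real failure reason, e.g. "server returned error: HTTP/1.1 500 ..."
    echo "waiting for discovery service to return start args: $(cat "${err_file}")" >&2
    # RETRY_SLEEP is an assumed override; the default mirrors the original random sleep.
    sleep "${RETRY_SLEEP:-$((RANDOM % 5))}"
  done
  rm -f "${err_file}"
  printf '%s\n' "${result}"
}

# Possible use in place of the original loop (left commented out here):
# result=$(retry_logging_errors wget -qO- -T 3 "http://${discovery_url}/new/${encoded_domain_url}")
```

With this in place, the HTTP 500 from discovery would have appeared directly in the output of kubectl logs for the PD pod.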
Running wget against discovery from the failing PD pod returns a 500 error:
/ # wget -qO- -T 3 http://basic-discovery.namespace.svc:10261/new/YmFzaWMtcGQtMC5iYXNpYy1wZC1wZWVyLm1haXB1LWJkY2FtcHVzLnN2YzoyMzgwCg==
wget: server returned error: HTTP/1.1 500 Internal Server Error
/ # nslookup basic-discovery
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: basic-discovery
Address 1: 10.99.76.247 basic-discovery.namespace.svc.cluster.local
/ # exit
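For readers wondering what the long path segment in the request is: it appears to be the base64 encoding of the PD peer address (with a trailing newline), which matches the advertisePeerUrl printed in the discovery log. A small sketch for encoding and decoding such a value follows. The function names are hypothetical, and it assumes a `base64` binary that accepts `-d` (GNU coreutils or BusyBox).

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting the /new/<...> path segment.

# Encode a peer address the way the request path appears to be built:
# base64 of the address plus a trailing newline, with line wraps removed.
encode_peer_url() {
  echo "$1" | base64 | tr -d '\n'
}

# Decode a path segment back to the peer address it represents.
decode_peer_url() {
  printf '%s' "$1" | base64 -d
}

# Example (left commented out):
# decode_peer_url "$(encode_peer_url 'basic-pd-0.basic-pd-peer.namespace.svc:2380')"
```

Decoding the segment is a quick way to confirm that the PD pod is asking discovery about the peer address you expect.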
The corresponding discovery logs show the following errors:
I0925 07:28:46.750851 1 discovery.go:80] advertisePeerUrl is: basic-pd-0.basic-pd-peer.namespace.svc:2380
E0925 07:28:46.778935 1 server.go:91] failed to discover: basic-pd-0.basic-pd-peer.namespace.svc:2380
, Get "http://basic-pd.namespace:2379/pd/api/v1/members": dial tcp 10.109.90.57:2379: connect: connection refused, register-type is: pd
I0925 07:28:50.789559 1 discovery.go:80] advertisePeerUrl is: basic-pd-0.basic-pd-peer.namespace.svc:2380
E0925 07:28:55.799493 1 server.go:91] failed to discover: basic-pd-0.basic-pd-peer.namespace.svc:2380
, Get "http://basic-pd.namespace:2379/pd/api/v1/members": context deadline exceeded (Client.Timeout exceeded while awaiting headers), register-type is: pd
I0925 07:28:55.799611 1 discovery.go:80] advertisePeerUrl is: basic-pd-0.basic-pd-peer.namespace.svc:2380
E0925 07:29:00.815882 1 server.go:91] failed to discover: basic-pd-0.basic-pd-peer.namespace.svc:2380
, Get "http://basic-pd.namespace:2379/pd/api/v1/members": context deadline exceeded (Client.Timeout exceeded while awaiting headers), register-type is: pd
I0925 07:29:01.810966 1 discovery.go:80] advertisePeerUrl is: basic-pd-0.basic-pd-peer.namespace.svc:2380
E0925 07:29:01.822615 1 server.go:91] failed to discover: basic-pd-0.basic-pd-peer.namespace.svc:2380
, Get "http://basic-pd.namespace:2379/pd/api/v1/members": dial tcp 10.109.90.57:2379: connect: connection refused, register-type is: pd
I0925 07:29:03.832399 1 discovery.go:80] advertisePeerUrl is: basic-pd-0.basic-pd-peer.namespace.svc:2380
E0925 07:29:08.844371 1 server.go:91] failed to discover: basic-pd-0.basic-pd-peer.namespace.svc:2380
, Get "http://basic-pd.namespace:2379/pd/api/v1/members": context deadline exceeded (Client.Timeout exceeded while awaiting headers), register-type is: pd
I0925 07:29:12.839433 1 discovery.go:80] advertisePeerUrl is: basic-pd-0.basic-pd-peer.namespace.svc:2380
E0925 07:29:17.851134 1 server.go:91] failed to discover: basic-pd-0.basic-pd-peer.namespace.svc:2380
, Get "http://basic-pd.namespace:2379/pd/api/v1/members": context deadline exceeded (Client.Timeout exceeded while awaiting headers), register-type is: pd
I0925 07:29:19.848595 1 discovery.go:80] advertisePeerUrl is: basic-pd-0.basic-pd-peer.namespace.svc:2380
E0925 07:29:19.858338 1 server.go:91] failed to discover: basic-pd-0.basic-pd-peer.namespace.svc:2380
, Get "http://basic-pd.namespace:2379/pd/api/v1/members": dial tcp 10.109.90.57:2379: connect: connection refused, register-type is: pd
Checking this URL from inside the discovery pod shows that the DNS lookup fails, and the address returned is exactly this unknown IP, 10.109.90.57:
[root@basic-discovery-5f4687dbb9-w64t2 /]# wget http://basic-pd.namespace:2379/pd/api/v1/members
bash: wget: command not found
[root@basic-discovery-5f4687dbb9-w64t2 /]# nslookup basic-pd
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: basic-pd.namespace.svc.cluster.local
Address: 10.109.90.57
** server can't find basic-pd.namespace.svc.cluster.local: NXDOMAIN
[root@basic-discovery-5f4687dbb9-w64t2 /]# exit
exit
command terminated with exit code 1
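Since wget is not installed in the discovery pod, the PD client port could not be tested from there with that tool. One possible workaround, sketched under the assumption that bash in that image has /dev/tcp support and that the coreutils `timeout` command is present, is a plain TCP probe. The function name `probe_tcp` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return 0 if a TCP connection to host:port can be
# opened within 3 seconds, non-zero otherwise. Relies on bash's built-in
# /dev/tcp redirection, so no wget or curl is needed.
probe_tcp() {
  local host="$1" port="$2"
  # In the child shell, $0 is the host and $1 is the port.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "${host}" "${port}" 2>/dev/null
}

# Example from inside the discovery pod (left commented out):
# probe_tcp basic-pd.namespace 2379 && echo reachable || echo unreachable
```

This only checks whether something accepts connections on the port; it does not issue the /pd/api/v1/members request itself.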