TiDB v3.0.20 fails to start after deployment with “wait until the PD health page is available”

【Overview】After deploying TiDB v3.0.20, startup fails with the error “wait until the PD health page is available”.

【Background】Eight virtual machines are laid out as 3 TiKV (192.168.100.131/132/133) + 3 PD (192.168.100.134/135/136) + 1 TiDB (192.168.100.137) + 1 monitor (192.168.100.138). After deploying with ansible 2.5.0, startup fails during “TASK [wait until the PD health page is available]” with the following error:

FAILED - RETRYING: wait until the PD health page is available (12 retries left).
...
FAILED - RETRYING: wait until the PD health page is available (1 retries left).
fatal: [192.168.100.136]: FAILED! => {"attempts": 12, "changed": false, "content": "", "msg": "Status code was -1 and not [200]: Connection failure: timed out", "redirected": false, "status": -1, "url": "http://192.168.100.136:2379/health"}
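
The retry loop in that task can be approximated outside ansible to reproduce the failure by hand. This is only a minimal sketch, assuming Python 3 is available on the control machine. `wait_for_health` and its parameters are hypothetical names, and the delay and timeout values are illustrative, not the playbook's actual settings (only the 12 attempts come from the output above).

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url, retries=12, delay=1.0, timeout=3.0,
                    opener=urllib.request.urlopen):
    """Poll `url` until it answers HTTP 200 or the retries run out.

    Returns (ok, last_error). `opener` is injectable so the loop can be
    exercised without a live PD.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            with opener(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return True, None
                last_error = f"unexpected status {resp.status}"
        except (urllib.error.URLError, OSError) as exc:
            # Covers timeouts, refused connections and "no route to host".
            last_error = str(exc)
        if attempt < retries:
            time.sleep(delay)
    return False, last_error
```

Calling `wait_for_health("http://192.168.100.136:2379/health")` from the control machine should then show whether the timeout is specific to ansible or also happens with a plain HTTP request.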

Following the guidance in other reports of the same error, I checked the PD log and found the following:
[2021/07/01 10:38:58.766 +08:00] [WARN] [server.go:2045] ["failed to publish local member to cluster through raft"] [local-member-id=3e41d22d4ba507c2] [local-member-attributes="{Name:pd_pd-1 ClientURLs:[http://192.168.100.134:2379]}"] [request-path=/0/members/3e41d22d4ba507c2/attributes] [publish-timeout=11s] [error="etcdserver: request timed out"]
[2021/07/01 10:38:58.780 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=7450b719d5e549d8] [rtt=0s] [error="dial tcp 192.168.100.136:2380: connect: no route to host"]
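
With a longer log it can help to pull out just the peer addresses that PD failed to dial. The following is a rough sketch, assuming the `[error="..."]` field layout seen in the two lines above. `dial_failures` is a hypothetical helper, not part of any PD tooling, and the patterns accept both straight and typographic quotes because forum software tends to rewrite one into the other.

```python
import re

# Matches the trailing [error="..."] field of a PD log line.
ERROR_FIELD = re.compile(r'\[error=["“](?P<error>.*?)["”]\]')
# Matches Go's "dial tcp <ip>:<port>: connect: <reason>" error text.
DIAL_ERROR = re.compile(r"dial tcp (?P<addr>[\d.]+:\d+): connect: (?P<reason>.+)")


def dial_failures(lines):
    """Return {address: reason} for every 'dial tcp' error in the log lines."""
    failures = {}
    for line in lines:
        field = ERROR_FIELD.search(line)
        if not field:
            continue
        dial = DIAL_ERROR.search(field.group("error"))
        if dial:
            failures[dial.group("addr")] = dial.group("reason")
    return failures
```

Fed the two log lines above, this would report only `192.168.100.136:2380` with the reason `no route to host`. The first line carries an etcd timeout rather than a dial error, so it is skipped.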

Logging in with pd-ctl and running the health command gives the following:
[tidb@controller bin]$ ./pd-ctl -i -u http://192.168.100.134:2379
» health
Get http://192.168.100.134:2379/pd/health: dial tcp 192.168.100.134:2379: connect: no route to host
» exit
[tidb@controller bin]$ ./pd-ctl -i -u http://192.168.100.135:2379
» health
Get http://192.168.100.135:2379/pd/health: dial tcp 192.168.100.135:2379: connect: no route to host
» exit
[tidb@controller bin]$ ./pd-ctl -i -u http://192.168.100.136:2379
» health
Get http://192.168.100.136:2379/pd/health: dial tcp 192.168.100.136:2379: connect: no route to host

Every node's IP responds to ping, and the PD ports all look normal, as shown below:
TASK [wait until the PD port is up] *************************************************************************************************************************
ok: [192.168.100.134]
ok: [192.168.100.135]
ok: [192.168.100.136]

[root@pd-1 ~]# ss -tpnl | grep 23
LISTEN 0 32768 192.168.100.134:2379 : users:(("pd-server",pid=10233,fd=7))
LISTEN 0 32768 192.168.100.134:2380 : users:(("pd-server",pid=10233,fd=6))

[root@pd-2 ~]# ss -tpnl | grep 23
LISTEN 0 32768 192.168.100.135:2379 : users:(("pd-server",pid=9780,fd=7))
LISTEN 0 32768 192.168.100.135:2380 : users:(("pd-server",pid=9780,fd=6))

[root@pd-3 ~]# ss -tpnl | grep 23
LISTEN 0 32768 192.168.100.136:2379 : users:(("pd-server",pid=9770,fd=7))
LISTEN 0 32768 192.168.100.136:2380 : users:(("pd-server",pid=9770,fd=6))
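
A successful ping and a local `ss` listing only show that the host is reachable over ICMP and that pd-server is listening locally. Neither shows that a TCP connection from another machine gets through. The sketch below probes the PD ports remotely and separates “no route to host” from “connection refused” and timeouts. `probe`, `classify` and `report` are hypothetical helper names. The interpretations in `classify` are general heuristics (for example, a host firewall rule that rejects traffic can produce “no route to host” even though ping works), and the thread itself does not confirm the root cause.

```python
import errno
import socket

PD_HOSTS = ["192.168.100.134", "192.168.100.135", "192.168.100.136"]
PD_PORTS = [2379, 2380]  # the two ports shown in the ss output above


def classify(err_no):
    """Map a connect() errno to a likely interpretation (heuristic only)."""
    if err_no == errno.EHOSTUNREACH:
        return "no route to host (routing problem or a firewall rejecting the connection)"
    if err_no == errno.ECONNREFUSED:
        return "connection refused (nothing listening on that address and port)"
    return f"other error (errno {err_no})"


def probe(host, port, timeout=3.0):
    """Attempt one TCP connection and describe the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        return "timed out (packets may be dropped silently)"
    except OSError as exc:
        return classify(exc.errno)


def report(hosts=PD_HOSTS, ports=PD_PORTS):
    """Print one line per host/port pair."""
    for host in hosts:
        for port in ports:
            print(f"{host}:{port} -> {probe(host, port)}")
```

Running `report()` from the control machine and again from each PD node would show whether the failure is limited to particular source hosts, which helps tell a per-host firewall apart from a wider network problem.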
【Symptom】

【Business impact】

【TiDB version】v3.0.20

【Attachments】

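Try searching for this error first. There is an FAQ entry for it: [FAQ] wait until the PD health page is available

Also, we no longer maintain the ansible deployment; we recommend deploying the cluster with TiUP instead.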