TiDB cluster fails to start

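I have a question. All the earlier steps reported success, but every time I reach the step that starts the TiDB cluster it fails. It gets stuck at TASK [wait until the PD port is up] and at TASK [wait until the PD health page is available].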

The error messages are, respectively:

fatal: [192.168.36.152]: FAILED! => {"changed": false, "elapsed": 300, "msg": "the PD port 2379 is not up"}

and

FAILED - RETRYING: wait until the PD health page is available (12 retries left).

There are three machines in total. The pd-server on the control machine starts, but the pd-server on the other two machines does not.

ERROR MESSAGE SUMMARY *****************************************************************************************************
[192.168.36.152]: Ansible FAILED! => playbook: start.yml; TASK: wait until the PD port is up; message: {"changed": false, "elapsed": 300, "msg": "the PD port 2379 is not up"}

[192.168.36.151]: Ansible FAILED! => playbook: start.yml; TASK: wait until the PD port is up; message: {"changed": false, "elapsed": 300, "msg": "the PD port 2379 is not up"}

[192.168.104.65]: Ansible FAILED! => playbook: start.yml; TASK: wait until the PD health page is available; message: {"attempts": 12, "changed": false, "content": "", "msg": "Status code was -1 and not [200]: Request failed: <urlopen error [Errno 111] Connection refused>", "redirected": false, "status": -1, "url": "http://192.168.104.65:2379/health"}
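The two failing tasks boil down to a TCP check on port 2379 and an HTTP GET on /health, so both can be reproduced by hand from the control machine without rerunning the playbook. Below is a minimal sketch of that idea, not the playbook's actual implementation. The host list and port come from the error output above; the function names (port_is_up, health_status, check_all) and the 3-second timeout are assumptions made for illustration.

```python
import socket
import urllib.error
import urllib.request

# PD hosts and client port taken from the error output above.
PD_HOSTS = ["192.168.104.65", "192.168.36.151", "192.168.36.152"]
PD_PORT = 2379


def port_is_up(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def health_status(host, port, timeout=3.0):
    """Return the HTTP status code of /health, or None if the request fails."""
    url = f"http://{host}:{port}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a non-2xx status.
        return exc.code
    except (urllib.error.URLError, OSError):
        # No HTTP answer at all, e.g. "Connection refused".
        return None


def check_all(hosts=PD_HOSTS, port=PD_PORT):
    """Print one summary line per PD host."""
    for host in hosts:
        up = port_is_up(host, port)
        status = health_status(host, port) if up else None
        print(f"{host}:{port} port_up={up} health_status={status}")
```

Calling check_all() prints one line per host. A host where port_up is False has no process listening on that address, which points at pd.log on that machine as the next place to look.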

Please fill in the required information as prompted. Also, start troubleshooting from pd.log.

  • OS version & kernel version: CentOS 7
  • TiDB version:

tidb_version = v2.1.8

  • Disk model:
  • Cluster node distribution:

[tidb_servers]
192.168.104.65
192.168.36.151

[tikv_servers]
192.168.104.65
192.168.36.151
192.168.36.152

[pd_servers]
192.168.104.65
192.168.36.151
192.168.36.152

[spark_master]

[spark_slaves]

[lightning_server]

[importer_server]

[monitoring_servers]
192.168.104.65

[grafana_servers]
192.168.104.65

[monitored_servers]
192.168.104.65
192.168.36.151
192.168.36.152

[alertmanager_servers]
192.168.104.65

  • Data volume & number of Regions & number of replicas:
  • Problem description (what I did): running start.yml, it fails at the point where pd-server is started:

PLAY [pd_servers]

TASK [start PD by supervise] **********************************************************************************************

TASK [start PD by systemd] ************************************************************************************************
changed: [192.168.36.152]
changed: [192.168.36.151]
changed: [192.168.104.65]

TASK [wait until the PD port is up] ***************************************************************************************
ok: [192.168.104.65]
fatal: [192.168.36.152]: FAILED! => {"changed": false, "elapsed": 300, "msg": "the PD port 2379 is not up"}
fatal: [192.168.36.151]: FAILED! => {"changed": false, "elapsed": 300, "msg": "the PD port 2379 is not up"}

TASK [wait until the PD health page is available] *************************************************************************
FAILED - RETRYING: wait until the PD health page is available (12 retries left).
FAILED - RETRYING: wait until the PD health page is available (11 retries left).
FAILED - RETRYING: wait until the PD health page is available (10 retries left).
FAILED - RETRYING: wait until the PD health page is available (9 retries left).
FAILED - RETRYING: wait until the PD health page is available (8 retries left).
FAILED - RETRYING: wait until the PD health page is available (7 retries left).
FAILED - RETRYING: wait until the PD health page is available (6 retries left).
FAILED - RETRYING: wait until the PD health page is available (5 retries left).
FAILED - RETRYING: wait until the PD health page is available (4 retries left).
FAILED - RETRYING: wait until the PD health page is available (3 retries left).
FAILED - RETRYING: wait until the PD health page is available (2 retries left).
FAILED - RETRYING: wait until the PD health page is available (1 retries left).
fatal: [192.168.104.65]: FAILED! => {"attempts": 12, "changed": false, "content": "", "msg": "Status code was -1 and not [200]: Request failed: <urlopen error [Errno 111] Connection refused>", "redirected": false, "status": -1, "url": "http://192.168.104.65:2379/health"}
        to retry, use: --limit @/home/tidb/tidb-ansible/retry_files/start.retry

PLAY RECAP ****************************************************************************************************************
192.168.104.65 : ok=19 changed=6 unreachable=0 failed=1
192.168.36.151 : ok=10 changed=3 unreachable=0 failed=1
192.168.36.152 : ok=10 changed=3 unreachable=0 failed=1
localhost : ok=1 changed=0 unreachable=0 failed=0


  • Keywords:

Please provide the pd.log file.

pd.log (917.6 KB)

Each PD server has its own pd.log. Please also provide the logs from the other two PD servers.
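Since there is one pd.log per PD server and they can be large, it may help to pull out only the suspicious lines on each machine before uploading. Here is a rough sketch of that filtering step. The keyword list (error, fatal, panic) is an assumption and not a statement about the exact PD log format, and suspicious_lines is a hypothetical helper name.

```python
import re

# Assumed keywords; adjust them to whatever your pd.log actually contains.
PATTERN = re.compile(r"error|fatal|panic", re.IGNORECASE)


def suspicious_lines(lines, limit=20):
    """Return the last `limit` lines that mention error, fatal, or panic."""
    hits = [line.rstrip("\n") for line in lines if PATTERN.search(line)]
    return hits[-limit:]
```

To use it, open the log on each PD machine (the path depends on your deploy directory, so it is left to you) and pass the file object in, for example `suspicious_lines(open(path, errors="replace"))`. The most recent matches are usually the ones written just before the process exited.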

pd.log from machine 151 (15.5 KB)
pd.log from machine 152 (38.5 KB)

I had already left for the day yesterday and didn't see the message... Thanks a lot for the help!

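I looked at the PD error logs. Check whether this is caused by a mix-up between internal and external network IPs. There is a similar issue you can refer to: "PD端口无法起来" (PD port fails to come up).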
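One quick way to test the internal/external IP theory is to check, on each PD machine, whether the address listed for it in the inventory is actually assigned to a local network interface. The sketch below assumes the failure mode is that PD is told to listen on an address the host does not own (for example a NAT or public address); that is a guess from the symptoms, not something confirmed by the logs. ip_is_local is a hypothetical helper name.

```python
import errno
import socket


def ip_is_local(ip):
    """Return True if this machine can bind a TCP socket to `ip`.

    Binding to port 0 (any free port) fails with EADDRNOTAVAIL when the
    address is not assigned to any local network interface.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((ip, 0))
        return True
    except OSError as exc:
        if exc.errno == errno.EADDRNOTAVAIL:
            return False
        raise  # Any other error is unexpected here, so surface it.
    finally:
        sock.close()
```

Run it on each PD machine with that machine's own inventory address, for example ip_is_local("192.168.36.151") on the 151 machine. If it returns False, a process cannot listen on that address on that host, and the inventory should use the address that the interface really carries.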

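Right, my problem looks very similar to that one. I already suspected this yesterday afternoon, but startup still failed after I changed the IP. I'll redeploy from scratch when I have time today.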

Thanks. I redeployed this morning and the problem is solved. It was indeed an IP problem; changing the IPs fixed it.