Deploying 3.0.20: starting the cluster reports that all TiKV nodes failed to start

The inventory.ini configuration file is as follows (all port-related settings under the group_vars directory have been commented out):
[tidb_servers]
192.168.1.82
192.168.1.83
192.168.1.84

[tikv_servers]
TiKV-17005-01 ansible_host=192.168.1.85 tikv_data_dir=/work/tidb-data/tikv-17005 tikv_port=17005 tikv_status_port=18005 labels="dc=tjtx,rack=K06,host=tikv01"
TiKV-17005-02 ansible_host=192.168.1.86 tikv_data_dir=/work/tidb-data/tikv-17005 tikv_port=17005 tikv_status_port=18005 labels="dc=tjtx,rack=K11,host=tikv02"
TiKV-17005-03 ansible_host=192.168.1.87 tikv_data_dir=/work/tidb-data/tikv-17005 tikv_port=17005 tikv_status_port=18005 labels="dc=tjtx,rack=K08,host=tikv03"

[pd_servers]
192.168.1.82
192.168.1.83
192.168.1.84

[spark_master]

[spark_slaves]

[monitoring_servers]
192.168.1.81

[grafana_servers]
192.168.1.81

[monitored_servers]
192.168.1.81
192.168.1.82
192.168.1.83
192.168.1.84
192.168.1.85
192.168.1.86
192.168.1.87

[alertmanager_servers]

[kafka_exporter_servers]

[pump_servers]

[drainer_servers]

[pd_servers:vars]

location_labels = ["dc","rack","host"]

[all:vars]

deploy_dir = /work/tidb-deploy/tidb-17005
tidb_port = 15005
tidb_status_port = 16005
pump_port = 23005
drainer_port = 24005
pd_client_port = 13005
pd_peer_port = 14005
prometheus_port = 19005
pushgateway_port = 25005
node_exporter_port = 11005
blackbox_exporter_port = 12005
kafka_exporter_port = 26005
grafana_port = 20005
grafana_collector_port = 27005
alertmanager_port = 21005
alertmanager_cluster_port = 22005
ansible_user = tidb
cluster_name = tidb-test-05
tidb_version = v3.0.20
process_supervision = supervise
timezone = Asia/Shanghai
enable_firewalld = False
enable_ntpd = True
set_hostname = False
enable_binlog = False
kafka_addrs = ""
zookeeper_addrs = ""
enable_slow_query_log = False
enable_tls = False
deploy_without_tidb = False
alertmanager_target = "192.168.1.81:21005"
grafana_admin_user = "admin"
grafana_admin_password = "admin"
collect_log_recent_hours = 2
enable_bandwidth_limit = True
collect_bandwidth_limit = 10005
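
To double-check that these custom ports were actually rendered onto the TiKV hosts after deploy.yml, something like the following can be run. This is only a sketch: it assumes the standard tidb-ansible layout in which the run script ends up at scripts/run_tikv.sh under deploy_dir (conf/tikv.toml is confirmed by the start script quoted further down, the scripts/ path is an assumption):

# run from the control machine; host and paths come from the inventory above
ssh tidb@192.168.1.85 'grep -n "17005\|18005\|13005" /work/tidb-deploy/tidb-17005/scripts/run_tikv.sh'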

The deployment succeeded, but TiKV reports an error on startup:
ERROR MESSAGE SUMMARY *****************************************************************************************************************************************************
[TiKV-17005-02]: Ansible FAILED! => playbook: start.yml; TASK: wait until the TiKV port is up; message: {"changed": false, "elapsed": 300, "msg": "the TiKV port 17005 is not up"}

[TiKV-17005-03]: Ansible FAILED! => playbook: start.yml; TASK: wait until the TiKV port is up; message: {"changed": false, "elapsed": 300, "msg": "the TiKV port 17005 is not up"}

[TiKV-17005-01]: Ansible FAILED! => playbook: start.yml; TASK: wait until the TiKV port is up; message: {"changed": false, "elapsed": 300, "msg": "the TiKV port 17005 is not up"}

The TiKV log shows the following error:

[2021/06/30 18:26:29.844 +08:00] [INFO] [util.rs:402] ["connecting to PD endpoint"] [endpoints=192.168.1.82:13005]
[2021/06/30 18:26:29.846 +08:00] [INFO] [subchannel.cc:841] ["New connected subchannel at 0x7fcd6be3a150 for subchannel 0x7fcd6ba21000"]
[2021/06/30 18:26:29.847 +08:00] [INFO] [util.rs:402] ["connecting to PD endpoint"] [endpoints=192.168.1.83:13005]
[2021/06/30 18:26:29.848 +08:00] [INFO] [subchannel.cc:841] ["New connected subchannel at 0x7fcd6be3a120 for subchannel 0x7fcd6ba21200"]
[2021/06/30 18:26:29.849 +08:00] [INFO] [util.rs:402] ["connecting to PD endpoint"] [endpoints=192.168.1.84:13005]
[2021/06/30 18:26:29.850 +08:00] [INFO] [subchannel.cc:841] ["New connected subchannel at 0x7fcd6be3a0c0 for subchannel 0x7fcd6ba21400"]
[2021/06/30 18:26:29.851 +08:00] [INFO] [util.rs:402] ["connecting to PD endpoint"] [endpoints=http://192.168.1.83:13005]
[2021/06/30 18:26:29.852 +08:00] [INFO] [subchannel.cc:841] ["New connected subchannel at 0x7fcd6be3a060 for subchannel 0x7fcd6ba21600"]
[2021/06/30 18:26:29.853 +08:00] [INFO] [util.rs:402] ["connecting to PD endpoint"] [endpoints=http://192.168.1.82:13005]
[2021/06/30 18:26:29.854 +08:00] [INFO] [subchannel.cc:841] ["New connected subchannel at 0x7fcd6be3a030 for subchannel 0x7fcd6ba21800"]
[2021/06/30 18:26:29.855 +08:00] [INFO] [util.rs:461] ["connected to PD leader"] [endpoints=http://192.168.1.82:13005]
[2021/06/30 18:26:29.855 +08:00] [INFO] [util.rs:390] ["all PD endpoints are consistent"] [endpoints="["192.168.1.82:13005", "192.168.1.83:13005", "192.168.1.84:13005"]"]
[2021/06/30 18:26:29.855 +08:00] [INFO] [server.rs:70] ["connect to PD cluster"] [cluster_id=6979528352769834832]
[2021/06/30 18:26:29.855 +08:00] [INFO] [mod.rs:334] ["starting working thread"] [worker=addr-resolver]
[2021/06/30 18:26:29.904 +08:00] [INFO] [scheduler.rs:257] ["Scheduler::new is called to initialize the transaction scheduler"]
[2021/06/30 18:26:29.977 +08:00] [INFO] [scheduler.rs:278] ["Scheduler::new is finished, the transaction scheduler is initialized"]
[2021/06/30 18:26:29.977 +08:00] [INFO] [mod.rs:334] ["starting working thread"] [worker=gc-worker]
[2021/06/30 18:26:29.977 +08:00] [INFO] [mod.rs:722] ["Storage started."]
[2021/06/30 18:26:29.980 +08:00] [INFO] [server.rs:147] ["listening on addr"] [addr=0.0.0.0:17005]
[2021/06/30 18:26:29.980 +08:00] [INFO] [mod.rs:334] ["starting working thread"] [worker=region-collector-worker]
[2021/06/30 18:26:29.981 +08:00] [FATAL] [server.rs:264] ["failed to start node: "[src/server/node.rs:188]: cluster ID mismatch, local 6979487783307586567 != remote 6979528352769834832, you are trying to connect to another cluster, please reconnect to the correct PD""]

TiKV start script
------TiKV start script begin-------
#!/bin/bash
set -e
ulimit -n 1000000

cd "/work/tidb-deploy/tidb-17005" || exit 1

export RUST_BACKTRACE=1

export TZ=${TZ:-/etc/localtime}

echo -n 'sync ... '
stat=$(time sync)
echo ok
echo $stat

echo $$ > "status/tikv.pid"

exec bin/tikv-server \
    --addr "0.0.0.0:17005" \
    --advertise-addr "192.168.1.85:17005" \
    --status-addr "192.168.1.85:18005" \
    --pd "192.168.1.82:13005,192.168.1.83:13005,192.168.1.84:13005" \
    --data-dir "/work/tidb-data/tikv-17005" \
    --config conf/tikv.toml \
    --log-file "/work/tidb-deploy/tidb-17005/log/tikv.log" 2>> "/work/tidb-deploy/tidb-17005/log/tikv_stderr.log"

------TiKV start script end-------

PD start script
------PD start script begin-------
#!/bin/bash
set -e
ulimit -n 1000000

DEPLOY_DIR=/work/tidb-deploy/tidb-17005

cd "${DEPLOY_DIR}" || exit 1

exec bin/pd-server \
    --name="pd_dba-10-242-6-82" \
    --client-urls="http://192.168.1.82:13005" \
    --advertise-client-urls="http://192.168.1.82:13005" \
    --peer-urls="http://192.168.1.82:14005" \
    --advertise-peer-urls="http://192.168.1.82:14005" \
    --data-dir="/work/tidb-deploy/tidb-17005/data.pd" \
    --initial-cluster="pd_dba-10-242-6-82=http://192.168.1.82:14005,pd_dba-10-242-6-83=http://192.168.1.83:14005,pd_dba-10-242-6-84=http://192.168.1.84:14005" \
    --config=conf/pd.toml \
    --log-file="/work/tidb-deploy/tidb-17005/log/pd.log" 2>> "/work/tidb-deploy/tidb-17005/log/pd_stderr.log"

------PD start script end-------

Telnet from the TiKV machines to PD works:
[root@dba-192-168-1-85 ~]# telnet 192.168.1.82 13005
Trying 192.168.1.82...
Connected to 192.168.1.82.
Escape character is '^]'.

It looks like the cluster information has been lost:

[2021/06/30 18:26:29.981 +08:00] [FATAL] [server.rs:264] ["failed to start node: "[src/server/node.rs:188]: cluster ID mismatch, local 6979487783307586567 != remote 6979528352769834832, you are trying to connect to another cluster, please reconnect to the correct PD""]
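
Before rebuilding anything, it is worth confirming which cluster ID the current PD leader is actually serving and comparing it with the "local" value above, which is the ID persisted in the TiKV data directory. A quick check, assuming PD's v1 HTTP API path is unchanged in v3.0:

# ask PD which cluster it thinks it belongs to; the "id" field should match
# the "remote" value in the FATAL line (6979528352769834832)
curl -s http://192.168.1.82:13005/pd/api/v1/cluster
# if the TiKV data under /work/tidb-data/tikv-17005 was bootstrapped against an
# earlier PD cluster, its stored ID (the "local" 6979487783307586567) will keep
# differing until one side is fixed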

How about trying to rebuild the cluster manually? See this document:
https://docs.pingcap.com/zh/tidb/v3.0/pd-recover

You can try running:
./pd-recover -endpoints http://10.0.1.13:2379 -cluster-id 6747551640615446306 -alloc-id 10000
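
The endpoints and cluster-id in that example are just the placeholders from the doc. A rough sketch of how it might be adapted to this cluster, following the recovery flow in the linked page, and assuming the "local" ID from the TiKV log (6979487783307586567) is the original cluster ID you want to keep (the alloc-id of 100000000 is only a deliberately large guess):

# 1. stop all three PD instances on 192.168.1.82/83/84 and move data.pd aside
# 2. start PD again so it comes up as an empty cluster
# 3. run pd-recover against one PD endpoint with the original cluster ID
./pd-recover -endpoints http://192.168.1.82:13005 \
    -cluster-id 6979487783307586567 \
    -alloc-id 100000000
# 4. restart the PD cluster, then run start.yml for TiKV again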