Cluster startup problem on v5.0.3

seiang · November 1, 2021 07:15

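To save time, please provide the information below. A clearly described problem gets resolved faster:
[TiDB environment]
v5.0.3

[Overview] Scenario and problem summary
I deployed a new cluster in production. Both the check and check --apply commands passed without problems, the network is reachable between all hosts, and SSH mutual trust is working.

Topology: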
# TiDB Config
global:
  user: "tidb"
  ssh_port: 22
  deploy_dir: "/data/tidb-deploy"
  data_dir: "/data/tidb-data"
monitored:
  node_exporter_port: 9100
  blackbox_exporter_port: 9115
server_configs:
  tidb:
    performance.txn-total-size-limit: 1073741824
  tikv:
    readpool.storage.use-unified-pool: false
    readpool.coprocessor.use-unified-pool: true
  pd:
    schedule.leader-schedule-limit: 4
    schedule.region-schedule-limit: 2048
    schedule.replica-schedule-limit: 64
    replication.enable-placement-rules: true
pd_servers:
  - host: 10.22.xx.36
  - host: 10.22.xx.37
  - host: 10.22.xx.38
tidb_servers:
  - host: 10.22.xx.36
  - host: 10.22.xx.37
  - host: 10.22.xx.38
tikv_servers:
  - host: 10.22.xx.30
  - host: 10.22.xx.31
  - host: 10.22.xx.32
  - host: 10.22.xx.33
  - host: 10.22.xx.34
  - host: 10.22.xx.35
monitoring_servers:
  - host: 10.22.xx.39
grafana_servers:
  - host: 10.22.xx.39
alertmanager_servers:
  - host: 10.22.xx.39

However, starting the cluster failed with the following error:

[ Serial ] - UpdateTopology: cluster=tidb-prod004
{"level":"warn","ts":"2021-11-01T15:09:04.763+0800","logger":"etcd-client","caller":"v3@v3.5.0/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00024ee00/#initially=[10.22.128.36:2379;10.22.xx.37:2379;10.22.xx.38:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.22.xx.38:2379: connect: no route to host\""}

Error: context deadline exceeded

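Judging from the error, this looks like a network communication problem with the PD nodes, but I checked and the PD service is running normally on all three nodes. Where exactly is the problem?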

HHHHHHULK · November 1, 2021 07:35

From the error, this still looks like a network problem. Is the firewall turned off?

CuteRay · November 1, 2021 07:36

Check whether the firewall is turned off on 10.22.xx.38, and also check whether the hostnames are configured in /etc/hosts.

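seiang · November 1, 2021 07:40

The firewall rules have already been opened for this, and the hostnames are configured in /etc/hosts.

The clusters I deployed earlier went through exactly the same procedure, so I don't understand why this one hits the error above at startup.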

seiang · November 1, 2021 07:41

The firewall rules have already been opened for this, and the hostnames are configured in /etc/hosts.

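HHHHHHULK · November 1, 2021 08:12

That makes it hard to troubleshoot. I ran into a similar problem before: the network was reachable and the policies were all in place, yet it still reported no route to host. In the end it turned out to be the firewall after all.

So I suggest turning the firewall off first as a test. If it still fails, turn the firewall back on and continue troubleshooting in other directions.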
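
One way to tell a firewall rejection apart from a service that is simply down is to probe the PD client port directly from the control machine. The sketch below is not from the thread: `probe_tcp` and its return labels are hypothetical names, and port 2379 is taken from the error log above. On Linux, a firewall rule that rejects with an ICMP host-prohibited reply typically surfaces as "no route to host" (EHOSTUNREACH) even when ping succeeds, which would match the symptom reported here.

```python
import errno
import socket
import sys


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try one TCP connection and return a short label for the outcome.

    The labels ("open", "timeout", "refused", "no-route") are illustrative,
    not part of any TiDB or TiUP tooling.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all: packets are probably being dropped silently.
        return "timeout"
    except ConnectionRefusedError:
        # Host is reachable, but nothing listens on the port.
        return "refused"
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            # Same condition as "connect: no route to host" in the log.
            return "no-route"
        return f"error: {exc}"


if __name__ == "__main__":
    # Pass the PD hosts on the command line; 2379 is the PD client port
    # shown in the error message.
    for pd_host in sys.argv[1:]:
        print(pd_host, probe_tcp(pd_host, 2379))
```

Run it from the machine that executes the start command and pass the three PD addresses as arguments. A result of "no-route" while the PD process is up points toward a firewall rule on that host or along the path rather than toward PD itself.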

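Kongdom · November 1, 2021 08:15

I once hit a case where a DNS failure on the server made the hosts unreachable. After setting the DNS server to 114.114.114.114, everything worked normally.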
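
If the machines are referred to by name anywhere (for example through /etc/hosts entries), a quick resolution check can help rule DNS in or out. This is a minimal sketch with a hypothetical helper name, `resolve`; it goes through the system resolver, so depending on the resolver configuration it reflects /etc/hosts as well as DNS.

```python
import socket
import sys


def resolve(name: str) -> list:
    """Return the sorted, de-duplicated addresses the system resolver gives for name.

    An empty list means the name could not be resolved.
    """
    try:
        infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Pass the hostnames to check on the command line.
    for hostname in sys.argv[1:]:
        print(hostname, resolve(hostname) or "unresolved")
```

An empty result for a cluster hostname, or an address that differs from the one in the topology file, would suggest looking at the resolver configuration before anything else.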

seiang · November 2, 2021 01:46

Solved. It was indeed caused by the firewall. Thanks!

system · October 31, 2022 19:16

This topic was automatically closed 1 minute after the last reply. New replies are no longer allowed.