Cluster restart fails

After upgrading the cluster from 4.0.0 to 4.0.9, we only reloaded the configuration and did not restart.
This restart attempt ran into problems.
tiup reports that PD failed to start, yet PD shows as Up.
A stop followed by a start gives the same result.

[tidb@b16 ~]$ tiup cluster display test-cluster
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.3.1/tiup-cluster display test-cluster
Cluster type:       tidb
Cluster name:       test-cluster
Cluster version:    v4.0.9
SSH type:           builtin
Dashboard URL:      http://192.168.241.26:12379/dashboard
ID                    Role          Host            Ports        OS/Arch       Status        Data Dir                                        Deploy Dir
--                    ----          ----            -----        -------       ------        --------                                        ----------
192.168.241.7:9093    alertmanager  192.168.241.7   9093/9094    linux/x86_64  inactive      /home/tidb/deploy/data.alertmanager             /home/tidb/deploy
192.168.241.7:3000    grafana       192.168.241.7   3000         linux/x86_64  inactive      -                                               /home/tidb/deploy
192.168.241.24:12379  pd            192.168.241.24  12379/12380  linux/x86_64  Up            /disk1/pd/data.pd                               /disk1/pd
192.168.241.26:12379  pd            192.168.241.26  12379/12380  linux/x86_64  Up|UI         /disk1/pd/data.pd                               /disk1/pd
192.168.241.49:12379  pd            192.168.241.49  12379/12380  linux/x86_64  Up|L          /disk1/pd/data.pd                               /disk1/pd
192.168.241.7:9090    prometheus    192.168.241.7   9090         linux/x86_64  inactive      /home/tidb/deploy/prometheus2.0.0.data.metrics  /home/tidb/deploy
192.168.241.26:4000   tidb          192.168.241.26  4000/10080   linux/x86_64  Down          -                                               /disk1/pd
192.168.241.7:4000    tidb          192.168.241.7   4000/10080   linux/x86_64  Down          -                                               /home/tidb/deploy
192.168.241.11:20160  tikv          192.168.241.11  20160/20180  linux/x86_64  Disconnected  /disk1/tikv/data                                /disk1/tikv
192.168.241.53:20160  tikv          192.168.241.53  20160/20180  linux/x86_64  Disconnected  /disk2/tikv/data                                /disk2/tikv
192.168.241.56:20160  tikv          192.168.241.56  20160/20180  linux/x86_64  Disconnected  /disk1/tikv/data                                /disk1/tikv
192.168.241.58:20160  tikv          192.168.241.58  20160/20180  linux/x86_64  Disconnected  /disk1/tikv/data                                /disk1/tikv
Total nodes: 12
[tidb@b16 ~]$ tiup cluster start test-cluster
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.3.1/tiup-cluster start test-cluster
Starting cluster test-cluster...
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/test-cluster/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/test-cluster/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=192.168.241.7
+ [Parallel] - UserSSH: user=tidb, host=192.168.241.24
+ [Parallel] - UserSSH: user=tidb, host=192.168.241.26
+ [Parallel] - UserSSH: user=tidb, host=192.168.241.11
+ [Parallel] - UserSSH: user=tidb, host=192.168.241.56
+ [Parallel] - UserSSH: user=tidb, host=192.168.241.7
+ [Parallel] - UserSSH: user=tidb, host=192.168.241.26
+ [Parallel] - UserSSH: user=tidb, host=192.168.241.58
+ [Parallel] - UserSSH: user=tidb, host=192.168.241.53
+ [Parallel] - UserSSH: user=tidb, host=192.168.241.7
+ [Parallel] - UserSSH: user=tidb, host=192.168.241.49
+ [Parallel] - UserSSH: user=tidb, host=192.168.241.7
+ [ Serial ] - StartCluster
Starting component pd
	Starting instance pd 192.168.241.26:12379
	Starting instance pd 192.168.241.49:12379
	Starting instance pd 192.168.241.24:12379
	Start pd 192.168.241.49:12379 success
	Start pd 192.168.241.26:12379 success
	Start pd 192.168.241.24:12379 success
Starting component node_exporter
	Starting instance 192.168.241.26

Error: failed to start: pd 192.168.241.26:12379, please check the instance's log(/disk1/pd/log) for more detail.: timed out waiting for port 9100 to be started after 2m0s

Verbose debug logs has been written to /home/tidb/.tiup/logs/tiup-cluster-debug-2021-01-13-15-42-43.log.
Error: run `/home/tidb/.tiup/components/cluster/v1.3.1/tiup-cluster` (wd:/home/tidb/.tiup/data/SLxR0Yk) failed: exit status 1

[root@b26 log]# tail -f pd.log 
[2021/01/13 15:49:09.577 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=192.168.241.7:4000] [interval=2s] [error="dial tcp 192.168.241.7:4000: connect: connection refused"]
[2021/01/13 15:49:09.577 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=192.168.241.7:10080] [interval=2s] [error="dial tcp 192.168.241.7:10080: connect: connection refused"]
[2021/01/13 15:49:11.577 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=192.168.241.26:4000] [interval=2s] [error="dial tcp 192.168.241.26:4000: connect: connection refused"]
[2021/01/13 15:49:11.577 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=192.168.241.26:10080] [interval=2s] [error="dial tcp 192.168.241.26:10080: connect: connection refused"]
[2021/01/13 15:49:11.577 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=192.168.241.7:4000] [interval=2s] [error="dial tcp 192.168.241.7:4000: connect: connection refused"]
[2021/01/13 15:49:11.577 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=192.168.241.7:10080] [interval=2s] [error="dial tcp 192.168.241.7:10080: connect: connection refused"]
[2021/01/13 15:49:13.577 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=192.168.241.26:10080] [interval=2s] [error="dial tcp 192.168.241.26:10080: connect: connection refused"]
[2021/01/13 15:49:13.577 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=192.168.241.26:4000] [interval=2s] [error="dial tcp 192.168.241.26:4000: connect: connection refused"]
[2021/01/13 15:49:13.577 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=192.168.241.7:4000] [interval=2s] [error="dial tcp 192.168.241.7:4000: connect: connection refused"]
[tidb@b16 ~]$ tiup cluster -v
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.3.1/tiup-cluster -v
tiup version v1.3.1 tiup
Go Version: go1.13
Git Branch: release-1.3
GitHash: d51bd0c

First, this problem exists: after the TiUP upgrade, restarting reports a node_exporter error.

After migrating from ansible to tiup, the start scripts live in /home/tidb/deploy/scripts, so I fixed the path to bin/node_exporter there.
The problem persisted, so I kept digging.
My original bin directory was at /disk1/pd/bin.
I copied that directory to /home/tidb/deploy,
and that solved the problem: the cluster starts.
But this raises a question: which one is my deploy directory, /home/tidb/deploy or /disk1/pd/?
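The workaround above boils down to copying the old ansible-era bin directory into the directory tiup's start scripts expect. A minimal sketch, using temp directories to stand in for the real host paths (/disk1/pd and /home/tidb/deploy) so it can run anywhere:

```shell
# Temp dirs stand in for the real paths on the affected host.
src=$(mktemp -d)   # stands in for /disk1/pd (old ansible deploy dir)
dst=$(mktemp -d)   # stands in for /home/tidb/deploy (dir tiup points at)
mkdir -p "$src/bin"
: > "$src/bin/node_exporter"   # placeholder for the real binary
# The actual fix on the host was: cp -r /disk1/pd/bin /home/tidb/deploy/
cp -r "$src/bin" "$dst/"
ls "$dst/bin"
```

On the real host the single `cp -r` is all that was needed; the temp-dir scaffolding here only makes the sketch self-contained.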

The edit-config entry for this node:

pd_servers:
- host: 192.168.241.26
  ssh_port: 17717
  imported: true
  name: pd_b26
  client_port: 12379
  peer_port: 12380
  deploy_dir: /disk1/pd
  data_dir: /disk1/pd/data.pd
  log_dir: /disk1/pd/log
  arch: amd64
  os: linux

Finally, I modified the deploy_dir in run_node_exporter.

This looks like two separate issues:
1. Please upload the complete tiup log.
2. node_exporter is a shared component used to monitor basic host metrics;
it is not managed as part of the pd_servers deployment.
https://github.com/pingcap/docs-cn/blob/release-4.0/config-templates/complex-mini.yaml


Thanks for the guidance. So the issue is simply that /home/tidb/deploy has no bin directory; copying one over from the original directory is enough.

Correct. You can check which shell script the node-exporter service registered in systemctl actually references.
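One way to do that check (a sketch; on a live host you would inspect the real unit, e.g. with `systemctl cat node_exporter-9100.service`, where the unit name follows tiup's defaults and may differ on an imported cluster). Here a sample unit file stands in for the real one:

```shell
# Sketch: extract the ExecStart line from a node_exporter systemd unit to
# see which run script (and hence which deploy dir) it points at.
unit=$(mktemp)
cat > "$unit" <<'EOF'
[Service]
ExecStart=/home/tidb/deploy/scripts/run_node_exporter.sh
EOF
# The value after ExecStart= is the script path tiup pushed to the host.
grep '^ExecStart=' "$unit" | cut -d= -f2-
```

If the path printed here disagrees with where the bin directory actually lives, the start will time out exactly as in the log above.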

tiup deploy -> push service -> run.sh -> node_exporter exec
That should be the flow.
Components such as tidb, tikv, and pd have the `imported` option, which integrates the original ansible configuration.
But node_exporter and blackbox_exporter do not have this setting, so they must be deployed strictly according to the deployment rules.
You can refer to this source file, which marks the attribute tags that can be used:
https://github.com/pingcap/tiup/blob/master/pkg/cluster/spec/spec.go#L74
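Since node_exporter is deployed per host rather than per component, its directories come from the topology's `monitored` section rather than from any `pd_servers` entry. A sketch of that section, based on the tiup topology format (the ports are tiup's defaults; the paths are illustrative, not this cluster's actual values):

```yaml
monitored:
  node_exporter_port: 9100
  blackbox_exporter_port: 9115
  deploy_dir: /home/tidb/deploy/monitor-9100
  data_dir: /home/tidb/deploy/monitor-9100/data
  log_dir: /home/tidb/deploy/monitor-9100/log
```

Setting these explicitly avoids node_exporter falling back to a default deploy dir that an imported cluster never populated.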