node_exporter error reported on restart after upgrading TiUP

The cluster was upgraded from a lower version to 4.0.2. The tiup version used for the cluster upgrade was v1.0.7 (checked with tiup cluster -v), and the upgraded cluster worked fine.

However, during later maintenance I ran tiup update cluster, which upgraded the tiup cluster component to v1.0.9 (checked with tiup cluster -v). Since then, every tiup maintenance operation reports a node_exporter error.

The monitoring configuration is:
global:
  user: tidb
  ssh_port: 22
  deploy_dir: /data/tidb/deploy
  data_dir: /data/tidb/deploy/data
  log_dir: /data/tidb/deploy/log
  os: linux
  arch: amd64
monitored:
  node_exporter_port: 9100
  blackbox_exporter_port: 9115
  deploy_dir: deploy/monitor-9100
  data_dir: data/monitor-9100
  log_dir: deploy/monitor-9100/log

I have looked at https://asktug.com/t/topic/36551, but I could not find an already-started node-exporter in my cluster, so that thread does not apply.

tiup cluster start preprod-tidb-cluster
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.0.9/tiup-cluster start preprod-tidb-cluster
Starting cluster preprod-tidb-cluster...

  • [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/preprod-tidb-cluster/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/preprod-tidb-cluster/ssh/id_rsa.pub
  • [Parallel] - UserSSH: user=tidb, host=10.x.x.145
  • [Parallel] - UserSSH: user=tidb, host=10.x.x.118
  • [Parallel] - UserSSH: user=tidb, host=10.x.x.119
  • [Parallel] - UserSSH: user=tidb, host=10.x.x.120
  • [Parallel] - UserSSH: user=tidb, host=10.x.x.119
  • [Parallel] - UserSSH: user=tidb, host=10.x.x.121
  • [Parallel] - UserSSH: user=tidb, host=10.x.x.122
  • [Parallel] - UserSSH: user=tidb, host=10.x.x.116
  • [Parallel] - UserSSH: user=tidb, host=10.x.x.118
  • [Parallel] - UserSSH: user=tidb, host=10.x.x.118
  • [Parallel] - UserSSH: user=tidb, host=10.x.x.121
  • [Parallel] - UserSSH: user=tidb, host=10.x.x.117
  • [Parallel] - UserSSH: user=tidb, host=10.x.x.123
  • [Parallel] - UserSSH: user=tidb, host=10.x.x.116
  • [Parallel] - UserSSH: user=tidb, host=10.x.x.145
  • [Parallel] - UserSSH: user=tidb, host=10.x.x.145
  • [ Serial ] - StartCluster
    Starting component pd
    Starting instance pd 10.x.x.119:2379
    Starting instance pd 10.x.x.120:2379
    Starting instance pd 10.x.x.118:2379
    Start pd 10.x.x.120:2379 success
    Start pd 10.x.x.119:2379 success
    Start pd 10.x.x.118:2379 success
    Starting component node_exporter
    Starting instance 10.x.x.119
    retry error: operation timed out after 2m0s
    10.x.x.119 failed to start: timed out waiting for port 9100 to be started after 2m0s

Error: 10.x.x.119 failed to start: timed out waiting for port 9100 to be started after 2m0s: timed out waiting for port 9100 to be started after 2m0s

Verbose debug logs has been written to /home/tidb/logs/tiup-cluster-debug-2020-08-23-13-57-03.log.
Error: run /home/tidb/.tiup/components/cluster/v1.0.9/tiup-cluster (wd:/home/tidb/.tiup/data/S8SrFDy) failed: exit status 1

Full logs are in the attached tiup-cluster-debug-2020-08-23-13-57-03.log (80.5 KB).

  1. The reported error is TaskFinish {"task": "StartCluster", "error": "\t10.10.104.119 failed to start: timed out waiting for port 9100 to be started after 2m0s: timed out waiting for port 9100 to be started after 2m0s", "errorVerbose": "timed out waiting for port 9100 to be started after

  2. Please run tiup cluster display <cluster-name> and share the current configuration, thanks.

  3. What is port 9100 on 119? Does that process not exist?

[tidb@k8s-work-161 preprod-tidb-cluster]$ tiup cluster display preprod-tidb-cluster
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.0.9/tiup-cluster display preprod-tidb-cluster
TiDB Cluster: preprod-tidb-cluster
TiDB Version: v4.0.2
ID                   Role          Host           Ports                            OS/Arch       Status    Data Dir                                        Deploy Dir
--                   ----          ----           -----                            -------       ------    --------                                        ----------
10.10.104.145:9093   alertmanager  10.10.104.145  9093/9094                        linux/x86_64  inactive  /data/tidb/deploy/data.alertmanager             /data/tidb/deploy
10.10.104.116:8249   drainer       10.10.104.116  8249                             linux/x86_64  Down      /data/tidb/deploy/data/drainer-8249             /data/tidb/deploy/drainer-8249
10.10.104.145:3000   grafana       10.10.104.145  3000                             linux/x86_64  inactive  -                                               /data/tidb/deploy
10.10.104.118:2379   pd            10.10.104.118  2379/2380                        linux/x86_64  Down      /data/tidb/deploy/data.pd                       /data/tidb/deploy
10.10.104.119:2379   pd            10.10.104.119  2379/2380                        linux/x86_64  Down      /data/tidb/deploy/data.pd                       /data/tidb/deploy
10.10.104.120:2379   pd            10.10.104.120  2379/2380                        linux/x86_64  Down      /data/tidb/deploy/data.pd                       /data/tidb/deploy
10.10.104.145:9090   prometheus    10.10.104.145  9090                             linux/x86_64  inactive  /data/tidb/deploy/prometheus2.0.0.data.metrics  /data/tidb/deploy
10.10.104.118:8250   pump          10.10.104.118  8250                             linux/x86_64  Down      /data/tidb/deploy/data/pump-8250                /data/tidb/deploy/pump-8250
10.10.104.119:8250   pump          10.10.104.119  8250                             linux/x86_64  Down      /data/tidb/deploy/data/pump-8250                /data/tidb/deploy/pump-8250
10.10.104.116:4000   tidb          10.10.104.116  4000/10080                       linux/x86_64  Down      -                                               /data/tidb/deploy
10.10.104.117:4000   tidb          10.10.104.117  4000/10080                       linux/x86_64  Down      -                                               /data/tidb/deploy
10.10.104.118:4000   tidb          10.10.104.118  4000/10080                       linux/x86_64  Down      -                                               /data/tidb/deploy
10.10.104.121:9000   tiflash       10.10.104.121  9000/8123/3930/20170/20292/8234  linux/x86_64  Down      /data/tidb/deploy-tiflash/data-tiflash-9000     /data/tidb/deploy-tiflash
10.10.104.121:20160  tikv          10.10.104.121  20160/20180                      linux/x86_64  Down      /data/tidb/deploy/data                          /data/tidb/deploy
10.10.104.122:20160  tikv          10.10.104.122  20160/20180                      linux/x86_64  Down      /data/tidb/deploy/data                          /data/tidb/deploy
10.10.104.123:20160  tikv          10.10.104.123  20160/20180                      linux/x86_64  Down      /data/tidb/deploy/data                          /data/tidb/deploy

It is node-exporter; that process cannot start normally. After the upgrade, the monitored section of the config file defaults to these directories:

monitored:
  node_exporter_port: 9100
  blackbox_exporter_port: 9115
  deploy_dir: deploy/monitor-9100
  data_dir: data/monitor-9100
  log_dir: deploy/monitor-9100/log

But node-exporter is actually under the /data/tidb/deploy directory.
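
To confirm where node-exporter actually lives and whether anything is listening on 9100, checks along these lines can be run against the affected node (standard Linux tools; the paths are the ones from this cluster):

ssh tidb@10.10.104.119 'ps -ef | grep -v grep | grep node_exporter'   # is the process running?
ssh tidb@10.10.104.119 'ss -tlnp | grep 9100'                         # is port 9100 listening?
ssh tidb@10.10.104.119 'ls -l /data/tidb/deploy/bin | grep -i node'   # where is the binary?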
So I edited the tiup config file to point at the correct directory. The change was made as follows:

I edited the file behind edit-config directly with vi (tiup cluster edit-config would not accept the change; on save it reported "Nothing changed", so I used vi instead), setting the monitored deploy path to the correct one.
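The screenshot of the edit is not preserved. Presumably the change pointed the monitored paths at the actual deploy directory, along these lines (the exact values below are inferred from this cluster's layout, not taken from the screenshot):

monitored:
  node_exporter_port: 9100
  blackbox_exporter_port: 9115
  deploy_dir: /data/tidb/deploy
  data_dir: /data/tidb/deploy/data
  log_dir: /data/tidb/deploy/log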

After the change I ran tiup reload again, but the path used to launch node-exporter is still wrong: the command contains an extra node_exporter directory level (the correct path is bin/node_exporter).
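
One way to see the mismatch is to compare what the generated startup script invokes with where the binary actually sits. The scripts/ location below is an assumption based on typical tiup deploy layouts, not confirmed in this thread:

ssh tidb@10.10.104.119 'grep -n node_exporter /data/tidb/deploy/scripts/run_node_exporter.sh'   # what path does the script call?
ssh tidb@10.10.104.119 'ls -lR /data/tidb/deploy/bin | grep -i node'                            # where does the binary actually sit?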


node-exporter still fails to start.

Try manually moving or copying the node_exporter binary to the directory one level up. https://asktug.com/t/topic/36551 should be the post you referenced.
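
For reference, a rough sketch of that manual move, assuming the binary ended up nested one directory deeper than the path the start script expects (verify your actual layout first with the checks above; the temporary rename avoids the clash between the node_exporter directory and the node_exporter file name):

ssh tidb@10.10.104.119 'mv /data/tidb/deploy/bin/node_exporter/node_exporter /tmp/node_exporter.bin && rmdir /data/tidb/deploy/bin/node_exporter && mv /tmp/node_exporter.bin /data/tidb/deploy/bin/node_exporter'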

The manual move works. Two friendlier approaches come to mind:
1. On the control machine, edit run_node_exporter.sh to remove the extra node_exporter directory level, then reload it out to every node.
2. Redeploy node_exporter according to the control machine's configuration.

Can either of these be done at the moment, and how?

Move it manually for now; a fix for this issue is already underway.

OK. When upgrading, can tiup update cluster target a specific version? I want to avoid the situation where the test environment used v1.0.7, testing passed, and then running the same step in production some time later picks up v1.0.9.

tiup update cluster upgrades to the latest version by default, which should be fine. What is the concern? tiup is just a tool.

I do have a concern.
Both my production and test databases were on 3.0 and needed to be upgraded to 4.0.
When I upgraded the test environment, running tiup update cluster brought tiup to v1.0.7, and the whole upgrade went smoothly.
Later, when I upgraded the production environment, tiup update cluster brought tiup to v1.0.9, and after that upgrade the node-exporter startup problem appeared, which cost a lot of troubleshooting time.

If the versions can be kept identical, that is the safer way to go.

You can run an older version of the component, for example tiup cluster:v1.0.7 reload <cluster-name>.
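
To keep environments on the same tool version, something along these lines should work (tiup install and tiup list are standard tiup subcommands; double-check the flags against your tiup version):

tiup install cluster:v1.0.7                          # fetch that exact component version
tiup cluster:v1.0.7 display preprod-tidb-cluster     # invoke the pinned version explicitly
tiup list --installed                                # verify which versions are present locally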