tiup upgrade from 3.0.8 to 4.0.6: stopping node_exporter fails

【TiDB environment】Production
【TiDB version】After upgrading from 3.0.8 to 4.0.6
【Problem encountered】Failed to stop node_exporter-9100.service: Unit node_exporter-9100.service not loaded

failed to stop: 10.9.137.108 node_exporter-9100.service, please check the instance's log() for more detail.:
timed out waiting for port 9100 to be stopped after 2m0s
【Steps to reproduce】tiup cluster upgrade
【Symptoms and impact】

Check the logs on node 10.9.137.108 to see what is preventing node exporter from starting.
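A minimal sketch of what could be checked on that node, assuming shell access with sudo. The unit name comes from the error message; the helper name `summarize_unit_state` is made up for illustration. Given the "not loaded" error, the unit would be expected to report a load state such as `not-found`, but that is an assumption to confirm on the node.

```shell
#!/usr/bin/env bash
# Unit name taken from the error message in the upgrade output.
UNIT="node_exporter-9100.service"

# Hypothetical helper: reads `systemctl show` output (KEY=VALUE lines) on stdin
# and prints LoadState and ActiveState on a single line, in input order.
summarize_unit_state() {
  awk -F= '$1 == "LoadState" || $1 == "ActiveState" {
             printf "%s%s=%s", sep, $1, $2; sep = " "
           }
           END { print "" }'
}

# Usage on the affected node (not executed here):
#   systemctl show -p LoadState -p ActiveState "$UNIT" | summarize_unit_state
#   sudo ss -lntp | grep ':9100 '                  # which process still holds port 9100
#   systemctl list-units --all | grep -i node_exporter   # any differently named unit
```

If the unit is not known to systemd while something still listens on 9100, the stop request cannot reach that process, which would match the "timed out waiting for port 9100" message.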

This is a problem caused by prometheus.

time="2022-09-23T08:55:12+08:00" level=info msg=" - tcpstat" source="node_exporter.go:97"
time="2022-09-23T08:55:12+08:00" level=info msg=" - textfile" source="node_exporter.go:97"
time="2022-09-23T08:55:12+08:00" level=info msg=" - time" source="node_exporter.go:97"
time="2022-09-23T08:55:12+08:00" level=info msg=" - timex" source="node_exporter.go:97"
time="2022-09-23T08:55:12+08:00" level=info msg=" - uname" source="node_exporter.go:97"
time="2022-09-23T08:55:12+08:00" level=info msg=" - vmstat" source="node_exporter.go:97"
time="2022-09-23T08:55:12+08:00" level=info msg=" - xfs" source="node_exporter.go:97"
time="2022-09-23T08:55:12+08:00" level=info msg=" - zfs" source="node_exporter.go:97"
time="2022-09-23T08:55:12+08:00" level=info msg="Listening on :9100" source="node_exporter.go:111"

node exporter has also started:
tidb 4050 0.5 0.1 115600 16380 ? Ssl 08:55 0:07 bin/node_exporter --web.listen-address=:9100 --collector.tcpstat --collector.systemd --collector.mountstats --collector.meminfo_numa --collector.interrupts --collector.vmstat.fields=^.* --log.level=info

Was this TiDB cluster originally deployed with tiup? If not, the ansible environment has to be migrated first.

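It was deployed with ansible and then switched to tiup for the upgrade. By ansible environment migration, do you mean importing the configuration?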

Yes, import it into tiup.

The import is already done. Right now node_exporter is actually running, but the upgrade process did not run to completion and the version still shows 3.0.8.

Full log from the upgrade:

Stopping component node_exporter
Stopping instance 10.9.99.96
Stopping instance 10.9.137.108
Stopping instance 10.19.100.221
Stopping instance 10.9.16.130
Stopping instance 10.9.48.230
Stopping instance 10.9.14.188
Stopping instance 10.9.130.145
Stopping instance 10.9.175.106
Stop 10.9.48.230 success
Failed to stop node_exporter-9100.service: Unit node_exporter-9100.service not loaded.

Failed to stop node_exporter-9100.service: Unit node_exporter-9100.service not loaded.

Stop 10.19.100.221 success

Error: failed to stop: 10.9.99.96 node_exporter-9100.service, please check the instance's log() for more detail.: timed out waiting for port 9100 to be stopped after 2m0s

When the ansible topology was imported into tiup, was anything in the topology left out?

After the import, once tiup has taken over, can it operate the tidb cluster normally?
For example, stopping the cluster services with tiup and then starting them again.

Nothing was left out.

During the tiup upgrade, pd/tidb/tikv all completed their upgrade and restart normally, and tiup cluster display also looks normal:

10.9.137.108:9093 alertmanager 10.9.137.108 9093/9094 linux/x86_64 Up /home/tidb/deploy/data.alertmanager /home/tidb/deploy
10.9.48.230:8249 drainer 10.9.48.230 8249 linux/x86_64 Up /home/tidb/deploy/data.drainer /home/tidb/deploy
10.9.137.108:12379 pd 10.9.137.108 12379/12380 linux/x86_64 Up /home/tidb/deploy/data.pd /home/tidb/deploy
10.9.14.188:12379 pd 10.9.14.188 12379/12380 linux/x86_64 Up|UI /home/tidb/deploy/data.pd /home/tidb/deploy
10.9.175.106:12379 pd 10.9.175.106 12379/12380 linux/x86_64 Up|L /home/tidb/deploy/data.pd /home/tidb/deploy
10.19.100.221:9090 prometheus 10.19.100.221 9090 linux/x86_64 Up /home/tidb/deploy/prometheus2.0.0.data.metrics /home/tidb/deploy
10.9.48.230:8250 pump 10.9.48.230 8250 linux/x86_64 Up /home/tidb/deploy/data.pump /home/tidb/deploy
10.9.130.145:4000 tidb 10.9.130.145 4000/10080 linux/x86_64 Up - /home/tidb/deploy
10.9.16.130:4000 tidb 10.9.16.130 4000/10080 linux/x86_64 Up - /home/tidb/deploy
10.9.99.96:4000 tidb 10.9.99.96 4000/10080 linux/x86_64 Up - /home/tidb/deploy
10.9.130.145:20160 tikv 10.9.130.145 20160/20180 linux/x86_64 Up /home/tidb/deploy/data /home/tidb/deploy
10.9.16.130:20160 tikv 10.9.16.130 20160/20180 linux/x86_64 Up /home/tidb/deploy/data /home/tidb/deploy
10.9.99.96:20160 tikv 10.9.99.96 20160/20180 linux/x86_64 Up /home/tidb/deploy/data /home/tidb/deploy

So what is still the problem? :rofl:

The upgrade process did not run to completion and the version number has not been updated: Cluster version: v3.0.8
This part of the monitoring has no data:

Try scaling in prometheus first, then run the version upgrade again.

If this is a production environment, I would suggest setting up a new cluster and moving the data over.

2022-09-23T14:11:44.005+0800 INFO SSHCommand {"host": "10.9.48.230", "port": "22", "cmd": "export LANG=C; PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin /usr/bin/sudo -H bash -c "systemctl daemon-reload && systemctl stop node_exporter-9100.service"", "stdout": "", "stderr": ""}

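I went through the upgrade log. The machines that reported errors have no entry like the one above, which gives the impression that the stop command was never sent to those target machines.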
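One way to make that comparison less manual is to list every host for which the log contains an SSHCommand entry with the stop command, then compare that list with the "Stopping instance" lines. This is only a sketch: the function name `hosts_with_stop_command` is made up, the log location in the usage comment is an assumption about where tiup writes its debug logs, and the parsing assumes the entry format shown in the log line above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a tiup debug log on stdin and prints, one per line,
# the hosts that have an SSHCommand entry containing `systemctl stop <unit>`.
hosts_with_stop_command() {
  local unit="$1"
  grep 'SSHCommand' \
    | grep -F "systemctl stop ${unit}" \
    | sed -n 's/.*"host": *"\([^"]*\)".*/\1/p' \
    | LC_ALL=C sort -u
}

# Usage (log path is an assumption, adjust to the real debug log file):
#   hosts_with_stop_command node_exporter-9100.service < ~/.tiup/logs/tiup-cluster-debug-<timestamp>.log
```

Any host that appears under "Stopping instance" but not in this output would be a candidate for "the stop command never arrived".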

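Also, scaling in prometheus first and then upgrading still runs into the problem. However, after scaling it back out, the chart above has data again.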

Check the cluster version with tiup. If the version is still 3.0.8,
you will have to go through every node and find out which one was not upgraded successfully…

The command-line binaries on each node are also an effective way to verify the version.
For example:
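A rough sketch of such a per-node check, assuming the binaries sit under `bin/` in the deploy directory `/home/tidb/deploy` shown in the display output (the `bin/` subpath is an assumption) and that each binary prints a `Release Version:` line for `-V`. The helper name `release_version` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads the output of `<binary> -V` on stdin and prints
# the release version without a leading "v", so values from different
# components can be compared directly.
release_version() {
  sed -n 's/^Release Version:[[:space:]]*v\{0,1\}\([0-9][^[:space:]]*\).*/\1/p' \
    | head -n 1
}

# Usage on each node (paths are assumptions, not executed here):
#   /home/tidb/deploy/bin/tidb-server -V | release_version
#   /home/tidb/deploy/bin/tikv-server -V | release_version
#   /home/tidb/deploy/bin/pd-server   -V | release_version
```

A node that still prints 3.0.8 here would be one where the binary was not replaced during the upgrade.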