TiUP stopping component node_exporter failed

jimmyjan0824 · November 3, 2020, 02:12

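Hi consultants,

TiDB version: v4.0.8
TiUP version: v1.2.3

I recently deployed a TiDB cluster on GCP following the officially recommended Dev specifications. The specs are as follows:

Deploying and starting the cluster both worked without errors, but stopping the cluster through TiUP produces the error below.
(I have confirmed that SELinux and the firewall are disabled on every node.)

P.S. This did not happen when I previously deployed on local VMs.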

[tidb@dev-tidb-tidb1 ~]$ tiup cluster stop tidbcluster
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.2.3/tiup-cluster stop tidbcluster

[ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidbcluster/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidbcluster/ssh/id_rsa.pub
[Parallel] - UserSSH: user=tidb, host=10.210.1.116
[Parallel] - UserSSH: user=tidb, host=10.210.1.114
[Parallel] - UserSSH: user=tidb, host=10.210.1.115
[Parallel] - UserSSH: user=tidb, host=10.210.1.116
[Parallel] - UserSSH: user=tidb, host=10.210.1.111
[Parallel] - UserSSH: user=tidb, host=10.210.1.116
[Parallel] - UserSSH: user=tidb, host=10.210.1.112
[Parallel] - UserSSH: user=tidb, host=10.210.1.113
[ Serial ] - StopCluster
Stopping component alertmanager
Stopping instance 10.210.1.116
Stop alertmanager 10.210.1.116:9093 success
Stopping component grafana
Stopping instance 10.210.1.116
Stop grafana 10.210.1.116:3000 success
Stopping component prometheus
Stopping instance 10.210.1.116
Stop prometheus 10.210.1.116:9090 success
Stopping component node_exporter
retry error: operation timed out after 2m0s
** prometheus 10.210.1.116:9090 failed to stop: timed out waiting for port 9100 to be stopped after 2m0s**

Error: prometheus 10.210.1.116:9090 failed to stop: timed out waiting for port 9100 to be stopped after 2m0s: timed out waiting for port 9100 to be stopped after 2m0s

Verbose debug logs has been written to /home/tidb/logs/tiup-cluster-debug-2020-11-03-09-10-24.log.
Error: run /home/tidb/.tiup/components/cluster/v1.2.3/tiup-cluster (wd:/home/tidb/.tiup/data/SFEgypc) failed: exit status 1

The debug log file is attached:
tiup-cluster-debug-2020-11-03-09-10-24.log (110.6 KB)

node_exporter.log from the host running Grafana:
https://drive.google.com/file/d/1s0n8HgsBHqv902HCQVFievSDe6FrnvBm/view?usp=sharing

The main error I see in it:
time="2020-11-03T09:08:12+08:00" level=fatal msg="listen tcp :9100: bind: address already in use" source="node_exporter.go:114"
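For anyone hitting the same `bind: address already in use` line, a minimal sketch for finding out what already holds port 9100 on the host. The helper name `listeners_on_port` is made up for this example, and it assumes the usual `ss -ltnp` column layout (state first, local address:port fourth):

```shell
#!/usr/bin/env bash
# Print the lines of `ss -ltnp`-style output whose local address ends in :PORT.
# Reads the socket listing from stdin, so it also works on saved output.
# (listeners_on_port is a hypothetical helper name, not part of TiUP.)
listeners_on_port() {
  local port="$1"
  # Column 4 of `ss -ltn` output is "Local Address:Port".
  awk -v p=":${port}" '$1 == "LISTEN" && $4 ~ (p "$")'
}

# Usage on the affected host (root is needed to see process names):
#   sudo ss -ltnp | listeners_on_port 9100
```

If the process shown is not the node_exporter that TiUP deployed, something else was already bound to the port before TiUP's copy started.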

The topology.yml file is attached:
topology.yml (4.5 KB)

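jimmyjan0824 · November 3, 2020, 03:09

Hi consultants,

I found the root cause: when provisioning machines, our MIS team enables a node_exporter Docker container by default, and it conflicted with the node_exporter deployed by TiDB. After stopping that Docker container, everything works normally.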
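A rough sketch of how such a conflicting container could be spotted. The helper name `containers_publishing_port` and the `|`-separated format string are choices made for this example. It only catches containers that publish the port explicitly; a container running with host networking shows an empty Ports column and would need the `ss` route instead:

```shell
#!/usr/bin/env bash
# Filter `docker ps --format '{{.Names}}|{{.Ports}}'` output (read from stdin)
# down to the names of containers that publish the given host port.
# (containers_publishing_port is a hypothetical helper name.)
containers_publishing_port() {
  local port="$1"
  # A published port appears as e.g. "0.0.0.0:9100->9100/tcp" in the Ports field.
  awk -F'|' -v pat=":${port}->" 'index($2, pat) { print $1 }'
}

# Usage on the affected host:
#   docker ps --format '{{.Names}}|{{.Ports}}' | containers_publishing_port 9100
#   docker stop <name>    # then re-run: tiup cluster stop tidbcluster
```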

北京大爷 · November 3, 2020, 03:23

Judging from the error message, node_exporter failed to stop on server 10.210.1.116.
You can log in to 10.210.1.116 and run this manually:

systemctl daemon-reload && systemctl stop node_exporter-9100.service

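to test whether the service can be stopped.
If it can, run tiup cluster stop {clustername} again to finish shutting down the cluster. The tiup stop command is idempotent, so re-running it is safe.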
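The manual stop and retry can be sketched as below. This mirrors the "wait for port 9100 to be stopped" check in the error output but is not TiUP's own code. `wait_port_closed` is a hypothetical helper, and it assumes a bash build with `/dev/tcp` support and no firewall silently dropping packets (which would make the connect attempt hang instead of fail):

```shell
#!/usr/bin/env bash
# Return 0 once nothing accepts TCP connections on HOST:PORT, or 1 after
# roughly TIMEOUT seconds. Uses bash's built-in /dev/tcp redirection.
# (wait_port_closed is a hypothetical helper name, not a TiUP command.)
wait_port_closed() {
  local host="$1" port="$2" timeout="${3:-120}" waited=0
  # The subshell opens and immediately drops a connection; it succeeds
  # only while something is still listening on the port.
  while (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Usage, after stopping the unit on 10.210.1.116:
#   sudo systemctl daemon-reload && sudo systemctl stop node_exporter-9100.service
#   wait_port_closed 10.210.1.116 9100 30 && tiup cluster stop tidbcluster
```

If the function times out even though the systemd unit is stopped, some other process is holding the port.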

If something goes wrong,
you can inspect the systemd logs for the unit with journalctl -u node_exporter-9100.service

If the systemd commands above hang, check whether you are hitting this systemd bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1408315

jimmyjan0824 · November 3, 2020, 03:35

Hi 北京大爷,

Thanks for the help! Problem solved.

北京大爷 · November 3, 2020, 06:36

Hacker_3A379Z9B · June 1, 2021, 12:45

I ran into the same problem; the root cause still needs to be analyzed.
Version: 5.0.1

Node log:
retrying of unary invoker failed
error = "rpc error: code = NotFound desc = etcdserver: requested lease not found"
