TiUP stopping component node_exporter failed

Hi advisors,

TiDB version: v4.0.8
TiUP version: v1.2.3

I recently deployed a TiDB cluster on GCP following the officially recommended Dev specs. The specs are as follows:


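Deploying and starting the cluster both work without errors, but stopping the cluster through TiUP produces the error below.
(I have confirmed that SELinux and the firewall are disabled on every node.)

P.S. This did not happen when I previously deployed on local VMs.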

[tidb@dev-tidb-tidb1 ~]$ tiup cluster stop tidbcluster
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.2.3/tiup-cluster stop tidbcluster

  • [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidbcluster/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidbcluster/ssh/id_rsa.pub
  • [Parallel] - UserSSH: user=tidb, host=10.210.1.116
  • [Parallel] - UserSSH: user=tidb, host=10.210.1.114
  • [Parallel] - UserSSH: user=tidb, host=10.210.1.115
  • [Parallel] - UserSSH: user=tidb, host=10.210.1.116
  • [Parallel] - UserSSH: user=tidb, host=10.210.1.111
  • [Parallel] - UserSSH: user=tidb, host=10.210.1.116
  • [Parallel] - UserSSH: user=tidb, host=10.210.1.112
  • [Parallel] - UserSSH: user=tidb, host=10.210.1.113
  • [ Serial ] - StopCluster
    Stopping component alertmanager
    Stopping instance 10.210.1.116
    Stop alertmanager 10.210.1.116:9093 success
    Stopping component grafana
    Stopping instance 10.210.1.116
    Stop grafana 10.210.1.116:3000 success
    Stopping component prometheus
    Stopping instance 10.210.1.116
    Stop prometheus 10.210.1.116:9090 success
    Stopping component node_exporter
    retry error: operation timed out after 2m0s
    ** prometheus 10.210.1.116:9090 failed to stop: timed out waiting for port 9100 to be stopped after 2m0s**

Error: prometheus 10.210.1.116:9090 failed to stop: timed out waiting for port 9100 to be stopped after 2m0s: timed out waiting for port 9100 to be stopped after 2m0s

Verbose debug logs has been written to /home/tidb/logs/tiup-cluster-debug-2020-11-03-09-10-24.log.
Error: run /home/tidb/.tiup/components/cluster/v1.2.3/tiup-cluster (wd:/home/tidb/.tiup/data/SFEgypc) failed: exit status 1

The debug log file is attached:
tiup-cluster-debug-2020-11-03-09-10-24.log (110.6 KB)

node_exporter.log from the Grafana host:
https://drive.google.com/file/d/1s0n8HgsBHqv902HCQVFievSDe6FrnvBm/view?usp=sharing

The main error I see:
time="2020-11-03T09:08:12+08:00" level=fatal msg="listen tcp :9100: bind: address already in use" source="node_exporter.go:114"

The topology.yml file is attached:
topology.yml (4.5 KB)

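Hi advisors,

I have found the root cause: by default, our MIS team automatically starts a node_exporter Docker container when provisioning machines, and it conflicts with TiDB's node_exporter on port 9100. After stopping that Docker container, everything works normally.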
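
To confirm a conflict like this one, it helps to see what already holds port 9100 on the affected host. The snippet below is only a rough sketch: `listeners_on_port` is a hypothetical helper name (not part of TiUP), and it assumes the usual `ss` column layout in which the fourth field is the local address and port.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ss -lntp` output on stdin and print only the
# LISTEN lines whose local address (4th column) ends in ":PORT".
listeners_on_port() {
  local port="$1"
  awk -v suffix=":${port}" '
    $1 == "LISTEN" &&
    substr($4, length($4) - length(suffix) + 1) == suffix
  '
}

# Usage on the affected host (run as root to see the owning process):
#   ss -lntp | listeners_on_port 9100
# To look for a container that publishes the same port:
#   docker ps --filter "publish=9100"
```

Note that `docker ps --filter "publish=9100"` only lists containers that publish the port; a container running with host networking would show up in the `ss` output instead.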

Judging from the error message, node_exporter on server 10.210.1.116 failed to stop.
You can log in to 10.210.1.116 and run the following manually:

systemctl daemon-reload && systemctl stop node_exporter-9100.service

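Then check whether the service stops.
If it does, run tiup cluster stop {clustername} again to finish stopping the cluster. The tiup stop command is idempotent, so rerunning it is safe.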

If something goes wrong,
check the systemd logs for the unit with journalctl -u node_exporter-9100.service

If the systemd commands above hang, check whether you are hitting this systemd-related bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1408315
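
The manual steps above can be wrapped into one small script. This is a minimal sketch rather than an official TiUP procedure: `stop_and_report` is a hypothetical function name, and the unit name is the one quoted in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reload systemd, try to stop the unit, and print the
# most recent journal entries if the stop fails.
stop_and_report() {
  local unit="$1"
  if systemctl daemon-reload && systemctl stop "$unit"; then
    echo "stopped: $unit"
  else
    echo "failed to stop: $unit" >&2
    # Show the last 50 journal lines for the unit to help diagnose the failure.
    journalctl -u "$unit" -n 50 --no-pager
    return 1
  fi
}

# Usage on 10.210.1.116 (as root):
#   stop_and_report node_exporter-9100.service
# If it succeeds, rerun the stop from the control machine:
#   tiup cluster stop {clustername}
```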

Advisor 北京大爺,

Thanks for the help! Problem solved.

:call_me_hand::call_me_hand::call_me_hand:

I ran into the same problem; the root cause still needs to be analyzed.
Version: 5.0.1

Node log:
retrying of unary invoker failed
error = "rpc error:code = notFound desc = etcdserver: requested lease not found"
