TiDB v5.0.1: tiup reports "failed to scale in: cannot find node id 'ip:port' in topology" during scale-in

【Overview】During scale-in, tiup reports: failed to scale in: cannot find node id 'ip:port' in topology

【Background】Multiple nodes are scaled in from a shell script. The script executes:
tiup cluster scale-in -y tidb --node 172.24.0.22:2379
tiup cluster scale-in -y tidb --node 172.24.0.5:2379
tiup cluster scale-in -y tidb --node 172.24.0.22:4000
tiup cluster scale-in -y tidb --node 172.24.0.5:4000

【Symptom】The scale-in fails with the following error:

Starting component cluster: /home/tidb/.tiup/components/cluster/v1.4.1/tiup-cluster scale-in -y tidb --node 172.24.0.5:4000

  • [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb/ssh/id_rsa.pub
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.12
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.20
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.20
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.23
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.21
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.23
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.5
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.21
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.22
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.15
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.12
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.19
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.17
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.12
  • [ Serial ] - ClusterOperate: operation=ScaleInOperation, options={Roles:[] Nodes:[172.24.0.5:4000] Force:false SSHTimeout:5 OptTimeout:120 APITimeout:300 IgnoreConfigCheck:false NativeSSH:false SSHType: CleanupData:false CleanupLog:false RetainDataRoles:[] RetainDataNodes:[] Operation:StartOperation}
    Stopping component tidb
    Stopping instance 172.24.0.5
    Failed to stop tidb-4000.service: Unit tidb-4000.service not loaded.
    Stop tidb 172.24.0.5:4000 success
    Destroying component tidb
    Destroying instance 172.24.0.5
    Destroy 172.24.0.5 success
  • Destroy tidb paths: [/data/tidb-deploy/tidb-4000/log /data/tidb-deploy/tidb-4000 /etc/systemd/system/tidb-4000.service]
    Stopping component node_exporter
    Stopping component blackbox_exporter
    Failed to stop blackbox_exporter-9115.service: Unit blackbox_exporter-9115.service not loaded.
Destroying monitored 172.24.0.5
Destroying instance 172.24.0.5
172.24.0.5 failed to destroy blackbox exportoer: timed out waiting for port 9115 to be stopped after 2m0s

Error: failed to scale in: failed to destroy monitor: 172.24.0.5 failed to destroy blackbox exportoer: timed out waiting for port 9115 to be stopped after 2m0s: timed out waiting for port 9115 to be stopped after 2m0s

Verbose debug logs has been written to /home/tidb/.tiup/logs/tiup-cluster-debug-2021-06-16-17-57-20.log.
Error: run /home/tidb/.tiup/components/cluster/v1.4.1/tiup-cluster (wd:/home/tidb/.tiup/data/SaURssu) failed: exit status 1
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.4.1/tiup-cluster scale-in -y tidb --node 172.24.0.22:4000

  • [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb/ssh/id_rsa.pub
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.12
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.19
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.20
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.23
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.20
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.21
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.23
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.12
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.12
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.21
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.15
  • [Parallel] - UserSSH: user=tidb, host=172.24.0.17
  • [ Serial ] - ClusterOperate: operation=ScaleInOperation, options={Roles:[] Nodes:[172.24.0.22:4000] Force:false SSHTimeout:5 OptTimeout:120 APITimeout:300 IgnoreConfigCheck:false NativeSSH:false SSHType: CleanupData:false CleanupLog:false RetainDataRoles:[] RetainDataNodes:[] Operation:StartOperation}

Error: failed to scale in: cannot find node id '172.24.0.22:4000' in topology

Verbose debug logs has been written to /home/tidb/.tiup/logs/tiup-cluster-debug-2021-06-16-17-57-20.log.
Error: run /home/tidb/.tiup/components/cluster/v1.4.1/tiup-cluster (wd:/home/tidb/.tiup/data/SaUSPp0) failed: exit status 1
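
Note that the two failures above are different. The first scale-in (172.24.0.5:4000) failed on a monitor-destroy timeout: tiup waited 2m0s for port 9115 (blackbox_exporter) to be released. The second (172.24.0.22:4000) hit the topology error. A generic way to check what still holds port 9115 on the target host (standard Linux commands, not taken from this thread):

ss -lntp | grep -w 9115          # show which process still listens on 9115
pkill -f blackbox_exporter       # assumed process name; verify with the ss output first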

【Business Impact】The scale-in fails.

【TiDB Version】TiDB v5.0.1

【Attachments】

  1. TiUP Cluster Display output (normal):
    Starting component cluster: /home/tidb/.tiup/components/cluster/v1.4.1/tiup-cluster display tidb
    Cluster type: tidb
    Cluster name: tidb
    Cluster version: v5.0.1
    SSH type: builtin
    Dashboard URL: http://172.24.0.20:2379/dashboard
ID                 Role          Host         Ports        OS/Arch       Status   Data Dir                            Deploy Dir
--                 ----          ----         -----        -------       ------   --------                            ----------
172.24.0.12:9093   alertmanager  172.24.0.12  9093/9094    linux/x86_64  Up       /data/tidb-data/alertmanager-9093   /data/tidb-deploy/alertmanager-9093
172.24.0.12:3000   grafana       172.24.0.12  3000         linux/x86_64  Up       -                                   /data/tidb-deploy/grafana-3000
172.24.0.20:2379   pd            172.24.0.20  2379/2380    linux/x86_64  Up|L|UI  /data/tidb-data/pd-2379             /data/tidb-deploy/pd-2379
172.24.0.21:2379   pd            172.24.0.21  2379/2380    linux/x86_64  Up       /data/tidb-data/pd-2379             /data/tidb-deploy/pd-2379
172.24.0.23:2379   pd            172.24.0.23  2379/2380    linux/x86_64  Up       /data/tidb-data/pd-2379             /data/tidb-deploy/pd-2379
172.24.0.12:9090   prometheus    172.24.0.12  9090         linux/x86_64  Up       /data/tidb-data/prometheus-9090     /data/tidb-deploy/prometheus-9090
172.24.0.20:4000   tidb          172.24.0.20  4000/10080   linux/x86_64  Up       -                                   /data/tidb-deploy/tidb-4000
172.24.0.21:4000   tidb          172.24.0.21  4000/10080   linux/x86_64  Up       -                                   /data/tidb-deploy/tidb-4000
172.24.0.23:4000   tidb          172.24.0.23  4000/10080   linux/x86_64  Up       -                                   /data/tidb-deploy/tidb-4000
172.24.0.15:20160  tikv          172.24.0.15  20160/20180  linux/x86_64  Up       /data/tidb-data/tikv-20160          /data/tidb-deploy/tikv-20160
172.24.0.17:20160  tikv          172.24.0.17  20160/20180  linux/x86_64  Up       /data/tidb-data/tikv-20160          /data/tidb-deploy/tikv-20160
172.24.0.19:20160  tikv          172.24.0.19  20160/20180  linux/x86_64  Up       /data/tidb-data/tikv-20160          /data/tidb-deploy/tikv-20160
Total nodes: 12

  2. TiUP Cluster Edit Config output (normal)

That is exactly what the error message says: the node cannot be found. Check your ip:port; it does not appear in the display output shown above.

Hi, the symptom is as you describe, but before the scale-in, TiUP Cluster Display showed every ip:port and the whole cluster was healthy; the "cannot find ip:port" error only appeared during the scale-in itself. One more precondition: TiDB and PD run on the same hosts, so the scale-in removes TiDB and PD at the same time. The scale-in shell script is:
tiup cluster scale-in -y tidb --node 172.24.0.22:2379 #pd
tiup cluster scale-in -y tidb --node 172.24.0.5:2379 #pd
tiup cluster scale-in -y tidb --node 172.24.0.22:4000 #tidb
tiup cluster scale-in -y tidb --node 172.24.0.5:4000 #tidb
Could running it this way cause the problem?

Scale-in commands finish quickly (TiKV takes a bit longer), so I don't quite see why they need to be in a script. Also, I didn't fully understand the problem above.

They are in a script because this is tied to a front end: when someone scales in certain nodes from the front end, the scale-in script is generated dynamically via etcd + confd. The problem above is that when several nodes are scaled in at once, some of the scale-in commands fail with: failed to scale in: cannot find node id 'ip:port' in topology.
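
One way to narrow the window for this (a sketch; per the TiUP documentation, the --node flag accepts a comma-separated list) is to have the generator emit a single scale-in command covering all target nodes, so one tiup invocation reads and updates the topology exactly once:

tiup cluster scale-in -y tidb --node 172.24.0.22:2379,172.24.0.5:2379,172.24.0.22:4000,172.24.0.5:4000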

Ah, got it.

yong, do you have any other ideas for solving this?
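
A defensive variant of the generated script may also help (a hedged sketch only; the cluster name and node list mirror the commands above): check the live topology before each scale-in and skip nodes that are already gone, instead of failing on them.

#!/bin/bash
# Hypothetical wrapper: scale in each node only if it still appears in the
# topology reported by tiup cluster display (the ID column starts each row).
CLUSTER=tidb
NODES="172.24.0.22:2379 172.24.0.5:2379 172.24.0.22:4000 172.24.0.5:4000"

for node in $NODES; do
  if tiup cluster display "$CLUSTER" | grep -q "^${node} "; then
    tiup cluster scale-in -y "$CLUSTER" --node "$node"
  else
    echo "skipping ${node}: not in current topology"
  fi
done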
