Scale-out of a PD node reports an error: how to resolve it?

TiDBer_oqrCNpbV · March 26, 2025, 14:41

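[TiDB environment] Production
[TiDB version] v6.1.0
[Operating system] CentOS 7
[Deployment method] Deployed directly on machines
[Cluster data volume]
[Number of cluster nodes]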

Generate config prometheus → 10.173.17.4:9090 … Error
Generate config grafana → 10.173.17.4:3000 … Error
Generate config alertmanager → 10.173.17.4:9093 … Error

Error: init config failed: 10.173.17.4:9090: transfer from /root/.tiup/storage/cluster/clusters/tidb-iap/config-cache/prometheus-10.173.17.4-9090.service to /tmp/prometheus_d21b6d81-b2f7-4d71-9ed7-60228725b874.service failed: failed to scp /root/.tiup/storage/cluster/clusters/tidb-iap/config-cache/prometheus-10.173.17.4-9090.service to tidb@10.173.17.4:/tmp/prometheus_d21b6d81-b2f7-4d71-9ed7-60228725b874.service: ssh: handshake failed: read tcp 10.173.17.4:59468->10.173.17.4:22: read: connection reset by peer

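While scaling out a new PD node, the final step (updating and starting the monitoring components) failed with the error above. The cluster status shows that the PD node itself was scaled out successfully, and I have since fixed the scp problem. How can I resume and run the remaining steps?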

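Because the earlier attempt failed, I tried to scale in this PD node and scale it out again, but hit a new error. The command I ran was ./bin/tiup cluster scale-in tidb-iap --node 10.173.191.94:2379. The node is now in the Down state but cannot be removed. How can I fix this?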
Stopping component pd
Stopping instance 10.173.191.94
Stop pd 10.173.191.94:2379 success
Destroying component pd
Destroying instance 10.173.191.94
Destroy 10.173.191.94 success

Destroy pd paths: [/home/data/tidb-deploy/pd-2379/log /home/data/tidb-deploy/pd-2379 /etc/systemd/system/pd-2379.service /home/data/tidb-data/pd-2379]
Stopping component node_exporter
Stopping instance 10.173.191.94

Error: failed to destroy: failed to stop monitor: failed to stop: 10.173.191.94 node_exporter-9100.service, please check the instance’s log() for more detail.: timed out waiting for port 9100 to be stopped after 2m0s

小龙虾爱大龙虾 · March 27, 2025, 00:06

Test ssh to the target machine, then check the logs on the target machine.

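tidb菜鸟一只 · March 27, 2025, 00:57

So ssh was broken at first and was fixed later, but now the stop step fails? Just scale-in this node directly. If that does not work, add --force, then scale-out again.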

乡在人间 · March 27, 2025, 01:37

Force the scale-in of this node with scale-in --force, then scale-out again, and see whether that works.
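
The forced scale-in followed by a fresh scale-out suggested in the replies above could be sketched roughly as follows. The cluster name and node ID come from the original post; `scale-out.yaml` is a hypothetical name for the topology file, and `tiup` is assumed to be on the PATH (the original poster used `./bin/tiup`). The function only prints the commands so they can be reviewed first, since `--force` removes the node from the topology without waiting for its services to stop cleanly.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the tiup commands for a forced scale-in followed by a
# new scale-out. Nothing is executed here; run the printed lines by hand.

print_rescale_plan() {
  local cluster="$1" node="$2" topology="$3"
  # 1. Force-remove the node that is stuck in the Down state.
  echo "tiup cluster scale-in ${cluster} --node ${node} --force"
  # 2. Check that the node no longer appears in the cluster topology.
  echo "tiup cluster display ${cluster}"
  # 3. Add the PD node again (the topology file name is a hypothetical placeholder).
  echo "tiup cluster scale-out ${cluster} ${topology}"
}

print_rescale_plan tidb-iap 10.173.191.94:2379 scale-out.yaml
```

Running the printed lines one at a time, and checking the `display` output before the final step, keeps the forced removal and the re-add as separate, inspectable actions.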

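清风明月 · March 27, 2025, 02:07

Set up passwordless ssh login again and redo the scale-in and scale-out. It is also best to open all the required firewall ports, or to turn the firewall off.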

wluckdog · March 27, 2025, 02:24

From the control node, run ssh-copy-id -i ~/.ssh/id_rsa.pub 10.173.17.4 to send the public key to host 10.173.17.4.

TiDBer_oqrCNpbV · March 27, 2025, 07:04

The SSH mutual-trust issue is resolved, but after resolving it the scale-in failed.

TiDBer_oqrCNpbV · March 27, 2025, 08:14

[root@10.173.17.4 ~]$ scp /root/.tiup/storage/cluster/clusters/tidb-iap/config-cache/prometheus-10.173.17.4-9090.service tidb@10.173.17.4:/tmp/prometheus_3756f0f7-4b8c-4830-b501-dc2de7372f27.service
prometheus-10.173.17.4-9090.service 100% 437 1.1MB/s 00:00

Running it manually succeeds, but the tiup command reports this error:
Error: init config failed: 10.173.17.4:9090: transfer from /root/.tiup/storage/cluster/clusters/tidb-iap/config-cache/prometheus-10.173.17.4-9090.service to /tmp/prometheus_3756f0f7-4b8c-4830-b501-dc2de7372f27.service failed: failed to scp /root/.tiup/storage/cluster/clusters/tidb-iap/config-cache/prometheus-10.173.17.4-9090.service to tidb@10.173.17.4:/tmp/prometheus_3756f0f7-4b8c-4830-b501-dc2de7372f27.service: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

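tidb菜鸟一只 · March 28, 2025, 00:46

Did you install as the tidb user or as root? You ran the manual scp as root, but the command connects as the tidb user. Have you set up passwordless login for the tidb user?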
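
One way to check the point raised above is to reproduce the login non-interactively as the `tidb` user instead of relying on root's own key. The sketch below only prints the command. It assumes that tiup keeps a per-cluster private key at `ssh/id_rsa` under the cluster directory shown in the error message; that key path is an assumption about tiup's layout and should be verified on the control machine. `BatchMode=yes` disables password prompts, so the command succeeds only if public-key authentication works, which is the method the `[none publickey]` error says was rejected.

```shell
#!/usr/bin/env bash
# Sketch: build a non-interactive SSH check for the deploy user.
# The key path passed in below is an assumed tiup layout, not confirmed by the thread.

ssh_check_cmd() {
  local key="$1" user="$2" host="$3"
  # BatchMode=yes: fail instead of prompting for a password.
  # "true" is a no-op remote command; only the exit status matters.
  echo "ssh -i ${key} -o BatchMode=yes -o ConnectTimeout=5 ${user}@${host} true"
}

ssh_check_cmd /root/.tiup/storage/cluster/clusters/tidb-iap/ssh/id_rsa tidb 10.173.17.4
```

If the printed command fails when run on the control machine, the matching public key is probably missing from the `tidb` user's `authorized_keys` on 10.173.17.4, which would explain why a manual scp using root's default key succeeds while tiup does not.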

Denis · March 28, 2025, 00:49

Please post the scale-out command you ran.

dba远航 · March 28, 2025, 00:53

There is a problem with scp. Try running scp to 10.173.17.4 manually.