After TiDB nodes are scaled in, PD keeps trying to connect to the removed nodes


[TiDB Version]
v4.0.10

[Problem Description]

The node tidb3:4001 was scaled in with tiup cluster scale-in exam_test --node tidb3:4001, and tidb1:4000 with tiup cluster scale-in exam_test --node tidb1:4000. Judging from the PD log, however, PD is still trying to connect to these two removed TiDB nodes (and has been for more than a day).
The PD log looks like this:

[2021/02/23 11:40:12.075 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=tidb1:4000] [interval=2s] [error="dial tcp 172.18.102.69:4000: connect: connection refused"]
[2021/02/23 11:40:12.075 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=tidb3:4001] [interval=2s] [error="dial tcp 172.18.102.71:4001: connect: connection refused"]
[2021/02/23 11:40:12.082 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=tidb1:10080] [interval=2s] [error="dial tcp 172.18.102.69:10080: connect: connection refused"]
[2021/02/23 11:40:12.082 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=tidb3:10081] [interval=2s] [error="dial tcp 172.18.102.71:10081: connect: connection refused"]
[2021/02/23 11:40:14.075 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=tidb1:4000] [interval=2s] [error="dial tcp 172.18.102.69:4000: connect: connection refused"]
[2021/02/23 11:40:14.075 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=tidb3:4001] [interval=2s] [error="dial tcp 172.18.102.71:4001: connect: connection refused"]
[2021/02/23 11:40:14.082 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=tidb1:10080] [interval=2s] [error="dial tcp 172.18.102.69:10080: connect: connection refused"]
[2021/02/23 11:40:14.082 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=tidb3:10081] [interval=2s] [error="dial tcp 172.18.102.71:10081: connect: connection refused"]

tiup cluster display exam_test shows the cluster status as normal:

ID           Role          Host   Ports        OS/Arch       Status   Data Dir                              Deploy Dir
--           ----          ----   -----        -------       ------   --------                              ----------
tidb3:9093   alertmanager  tidb3  9093/9094    linux/x86_64  Up       /tidb/tikv/alertmanager-9093          /tidb/tidb-deploy/alertmanager-9093
tidb3:3000   grafana       tidb3  3000         linux/x86_64  Up       -                                     /tidb/tidb-deploy/grafana-3000
tidb1:2379   pd            tidb1  2379/2380    linux/x86_64  Up|L|UI  /tidb/tikv/pd-2379                    /tidb/tidb-deploy/pd-2379
tidb2:9090   prometheus    tidb2  9090         linux/x86_64  Up       /tidb/tikv/prometheus-9090            /tidb/tidb-deploy/prometheus-9090
tidb2:4000   tidb          tidb2  4000/10080   linux/x86_64  Up       -                                     /tidb/tidb-deploy/tidb-4000
tidb3:4000   tidb          tidb3  4000/10080   linux/x86_64  Up       -                                     /tidb/tidb-deploy/tidb-4000
tidb3:4002   tidb          tidb3  4002/10082   linux/x86_64  Up       -                                     /tidb/tidb-deploy/tidb-4002
tidb1:20160  tikv          tidb1  20160/20180  linux/x86_64  Up       /tidb/tikv/tikv-20160                 /tidb/tidb-deploy/tikv-20160
tidb2:20160  tikv          tidb2  20160/20180  linux/x86_64  Up       /tidb/deploy/install/data/tikv-20160  /tidb/deploy/install/deploy/tikv-20160
tidb3:20160  tikv          tidb3  20160/20180  linux/x86_64  Up       /tidb/tikv/tikv-20160                 /tidb/deploy/tikv-20160

Which step might have gone wrong here? After taking TiDB nodes offline with scale-in, are there any additional steps that still need to be done?
The host configuration is as follows:

172.18.102.69 tidb1
172.18.102.70 tidb2
172.18.102.71 tidb3


1. When you scaled in, did it report success? Something like a Scaled cluster xxx in successfully message;
2. Looking at the cluster topology, there is only one PD node. It is recommended to scale out two more PD nodes so the PD cluster is highly available; see the sketch below.
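As a rough sketch only (the file name scale-out-pd.yaml is just an example, and the host choice of tidb2 and tidb3 with default ports is an assumption), the extra PD nodes could be declared in a small scale-out topology:

pd_servers:
  - host: tidb2
  - host: tidb3

and then applied with tiup cluster scale-out exam_test scale-out-pd.yaml.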

  1. scale-in did report success, with the Scaled cluster exam_test in successfully message.
  2. This is a newly built test environment and I am verifying scale-out/scale-in, so PD has not been set up as a multi-node cluster yet.

Could you restart the cluster and see whether the PD log still reports these errors?
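A restart via tiup would be something along the lines of (cluster name as above):

tiup cluster restart exam_test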

The errors are still there after the restart. I have now scaled in one more node, and after that scale-in succeeded PD also keeps trying to connect to the newly removed node.

[2021/02/23 17:23:07.228 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=tidb1:10080] [interval=2s] [error="dial tcp 172.18.102.69:10080: connect: connection refused"]
[2021/02/23 17:23:07.228 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=tidb3:10081] [interval=2s] [error="dial tcp 172.18.102.71:10081: connect: connection refused"]
[2021/02/23 17:23:07.228 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=tidb2:10080] [interval=2s] [error="dial tcp 172.18.102.70:10080: connect: connection refused"]
[2021/02/23 17:23:07.229 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=tidb1:4000] [interval=2s] [error="dial tcp 172.18.102.69:4000: connect: connection refused"]
[2021/02/23 17:23:07.229 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=tidb3:4001] [interval=2s] [error="dial tcp 172.18.102.71:4001: connect: connection refused"]
[2021/02/23 17:23:07.229 +08:00] [WARN] [proxy.go:181] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=tidb2:4000] [interval=2s] [error="dial tcp 172.18.102.70:4000: connect: connection refused"]

Where can I find the connection information PD keeps for the TiDB nodes, and is there a way to clean it up manually?

Could you check whether the host running PD is low on disk space? Also, please check whether the TiKV node logs show any errors.
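For example, something along these lines (the log path assumes the default log directory under the TiKV deploy dir shown in the display output above; adjust it to your layout):

df -h /tidb
grep -iE 'error|warn' /tidb/tidb-deploy/tikv-20160/log/tikv.log | tail -n 50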

There is plenty of free space, and the logs of all three TiKV nodes have looked normal for quite a while now.

Please check whether tiup --version and tiup cluster --version report the latest version, v1.3.2. If not, upgrade the tiup components first, then reload the cluster and see whether things go back to normal.
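The version check, upgrade, and reload would look roughly like this (cluster name as above):

tiup --version
tiup cluster --version
tiup update --self
tiup update cluster
tiup cluster reload exam_test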

The version was 1.3.1. After upgrading to 1.3.2 and reloading, the same problem is still there.

I tested on a cluster with the same topology and could not reproduce your problem. Please double-check that the firewall and SELinux are turned off on every node and that the hosts settings are all correct.
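On each node, checks along these lines should be sufficient (generic Linux commands; a firewalld/systemd-based distribution is assumed, adjust as needed):

systemctl status firewalld
getenforce
cat /etc/hosts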

SELinux is disabled and communication between the nodes is normal. My initial deployment topology contained only a single node, and all the later nodes were added through scale-out. Could that be related? The initial configuration is as follows:

global:
  user: tidb
  ssh_port: 22
  deploy_dir: /tidb/tidb-deploy
  data_dir: /tidb/tikv

pd_servers:
  - host: tidb1

tidb_servers:
  - host: tidb1

tikv_servers:
  - host: tidb1

The default number of Region replicas in the cluster is 3, so configuring only one TiKV node in the initial topology is a problem: Regions will be short of replicas. It is recommended to redeploy the cluster following the official documentation and then test scaling out and in again.
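As a sketch only (host names reuse the ones above and the directories follow your existing layout), an initial topology that satisfies the default 3-replica setting would include at least three TiKV nodes:

global:
  user: tidb
  ssh_port: 22
  deploy_dir: /tidb/tidb-deploy
  data_dir: /tidb/tikv

pd_servers:
  - host: tidb1

tidb_servers:
  - host: tidb1

tikv_servers:
  - host: tidb1
  - host: tidb2
  - host: tidb3

The current replica setting can also be confirmed with pd-ctl, e.g. something like tiup ctl pd -u http://tidb1:2379 config show replication (max-replicas defaults to 3).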