tiup PD scale-out fails: there is a member that has not joined successfully

I scaled out 3 PD nodes with tiup, but they will not start (TiDB and TiKV were previously scaled out successfully on these same 3 machines):
TiDB Cluster: recomsystem-cluster
TiDB Version: v4.0.0-rc
ID Role Host Ports Status Data Dir Deploy Dir


10.152.89.21:8249 drainer 10.152.89.21 8249 Up /search/data/tidb/deploy/data.drainer /search/data/tidb/deploy
10.152.89.20:2379 pd 10.152.89.20 2379/2380 Healthy|L /search/data/tidb/deploy/data.pd /search/data/tidb/deploy
10.152.89.21:2379 pd 10.152.89.21 2379/2380 Healthy /search/data/tidb/deploy/data.pd /search/data/tidb/deploy
10.152.96.44:2379 pd 10.152.96.44 2379/2380 Healthy /search/data/tidb/deploy/data.pd /search/data/tidb/deploy
10.160.16.115:2379 pd 10.160.16.115 2379/2380 Down /search/data/tidb/deploy/pd/data /search/data/tidb/deploy/pd
10.160.25.108:2379 pd 10.160.25.108 2379/2380 Down /search/data/tidb/deploy/pd/data /search/data/tidb/deploy/pd
10.160.68.20:2379 pd 10.160.68.20 2379/2380 Down /search/data/tidb/deploy/pd/data /search/data/tidb/deploy/pd
10.152.89.20:8250 pump 10.152.89.20 8250 Up /search/data/tidb/deploy/data.pump /search/data/tidb/deploy
10.152.89.21:8250 pump 10.152.89.21 8250 Up /search/data/tidb/deploy/data.pump /search/data/tidb/deploy
10.152.96.44:8250 pump 10.152.96.44 8250 Up /search/data/tidb/deploy/data.pump /search/data/tidb/deploy
10.152.89.20:4000 tidb 10.152.89.20 4000/10080 Up - /search/data/tidb/deploy
10.152.89.21:4000 tidb 10.152.89.21 4000/10080 Up - /search/data/tidb/deploy
10.160.25.108:4000 tidb 10.160.25.108 4000/10080 Up - /search/data/tidb/deploy/tidb
10.160.68.20:4000 tidb 10.160.68.20 4000/10080 Up - /search/data/tidb/deploy/tidb
10.152.89.21:20160 tikv 10.152.89.21 20160/20180 Up /search/data/tidb/deploy/data /search/data/tidb/deploy
10.152.96.44:20160 tikv 10.152.96.44 20160/20180 Up /search/data/tidb/deploy/data /search/data/tidb/deploy
10.152.96.45:20160 tikv 10.152.96.45 20160/20180 Up /search/data/tidb/deploy/data /search/data/tidb/deploy
10.160.16.115:20160 tikv 10.160.16.115 20160/20180 Up /search/data/tidb/deploy/tikv/data /search/data/tidb/deploy/tikv
10.160.25.108:20160 tikv 10.160.25.108 20160/20180 Up /search/data/tidb/deploy/tikv/data /search/data/tidb/deploy/tikv
10.160.68.20:20160 tikv 10.160.68.20 20160/20180 Up /search/data/tidb/deploy/tikv/data /search/data/tidb/deploy/tikv

I then ran scripts/run_pd.sh on the target machine and got the following output:

[2020/05/14 01:01:38.046 +08:00] [INFO] [util.go:51] ["Welcome to Placement Driver (PD)"]
[2020/05/14 01:01:38.047 +08:00] [INFO] [util.go:52] [PD] [release-version=v4.0.0-rc]
[2020/05/14 01:01:38.047 +08:00] [INFO] [util.go:53] [PD] [git-hash=6f06805f3b0070107fcb4af68b2fc224dee0714d]
[2020/05/14 01:01:38.047 +08:00] [INFO] [util.go:54] [PD] [git-branch=heads/refs/tags/v4.0.0-rc]
[2020/05/14 01:01:38.047 +08:00] [INFO] [util.go:55] [PD] [utc-build-time="2020-04-08 07:49:10"]
[2020/05/14 01:01:38.047 +08:00] [INFO] [metricutil.go:81] ["disable Prometheus push client"]
2020/05/14 01:01:38.047 grpclog.go:45: [info] parsed scheme: "endpoint"
2020/05/14 01:01:38.047 grpclog.go:45: [info] ccResolverWrapper: sending new addresses to cc: [{http://10.152.89.20:2379 0 } {http://10.152.89.21:2379 0 } {http://10.152.96.44:2379 0 }]
[2020/05/14 01:01:38.057 +08:00] [FATAL] [main.go:93] ["join meet error"] [error="there is a member that has not joined successfully"] [stack="github.com/pingcap/log.Fatal
\t/home/jenkins/agent/workspace/build_pd_multi_branch_v4.0.0-rc/go/pkg/mod/github.com/pingcap/log@v0.0.0-20200117041106-d28c14d3b1cd/global.go:59
main.main
\t/home/jenkins/agent/workspace/build_pd_multi_branch_v4.0.0-rc/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:93
runtime.main
\t/usr/local/go/src/runtime/proc.go:203
"]

The scripts/run_pd.sh script is as follows:

#!/bin/bash
set -e

# WARNING: This file was auto-generated. Do not edit!
# All your edit might be overwritten!
DEPLOY_DIR=/search/data/tidb/deploy/pd

cd "${DEPLOY_DIR}" || exit 1
exec bin/pd-server \
    --name="pd_hbhly_68_20" \
    --client-urls="http://10.160.68.20:2379" \
    --advertise-client-urls="http://10.160.68.20:2379" \
    --peer-urls="http://10.160.68.20:2380" \
    --advertise-peer-urls="http://10.160.68.20:2380" \
    --data-dir="/search/data/tidb/deploy/pd/data" \
    --join="http://10.152.89.20:2379,http://10.152.89.21:2379,http://10.152.96.44:2379" \
    --log-file="/search/data/tidb/deploy/pd/log/pd.log" 2>> "/search/data/tidb/deploy/pd/log/pd_stderr.log"

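Can the network segment of the newly added nodes reach the segment the current PD nodes are on? Is there a firewall in between? Are the ports open?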

Yes, they can reach each other. I have already scaled out TiDB and TiKV successfully on these 3 machines.

OK, another colleague will help you with this. Please post the conclusion here afterwards, thanks.

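Solved: I used pd-ctl to delete the abnormal members, made sure every remaining member reported health as true, cleaned the data directory on the target servers, and retried.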
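For anyone looking for the concrete commands, the fix above can be sketched roughly as below. This is only a sketch under assumptions, not the exact commands that were run: it assumes `pd-ctl` is on the PATH, uses one of the healthy PD addresses from the listing above, and deletes by member ID because a member that never finished joining usually shows up in the `member` output without a name (pd-ctl also has `member delete name <name>`). The `run` helper and the `DRY_RUN` variable are hypothetical names introduced here so the commands are printed rather than executed by default.

```shell
# Dry run by default: commands are printed, not executed. Set DRY_RUN=0 to run them.
PD_CTL="${PD_CTL:-pd-ctl}"                                # assumption: pd-ctl is on PATH
PD_ADDR="${PD_ADDR:-http://10.152.89.20:2379}"            # a healthy PD from the listing above
DATA_DIR="${DATA_DIR:-/search/data/tidb/deploy/pd/data}"  # data dir of the PD that failed to join

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: remove_stale_member <member_id>
# Run from any host that can reach PD_ADDR.
remove_stale_member() {
  run "$PD_CTL" -u "$PD_ADDR" member                                      # locate the stale entry and its ID
  run "$PD_CTL" -u "$PD_ADDR" member delete id "${1:?member id required}" # remove it from the cluster
  run "$PD_CTL" -u "$PD_ADDR" health                                      # every entry should now report health: true
}

# Run on each target host whose PD failed to join.
clean_pd_data_dir() {
  run rm -rf "${DATA_DIR:?}"
}
```

Call `remove_stale_member` once per stale entry with the ID shown by `member`, run `clean_pd_data_dir` on each of the three new hosts, and then retry starting the new PD nodes (for example with `tiup cluster start recomsystem-cluster -R pd`, if your tiup version supports the role filter).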

:love_you_gesture:

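PD's scale-out and scale-in logic requires every member of the PD cluster to be healthy before a join or delete operation can proceed. If any member is not, you get the error shown above. We will add a note about this to the PD scaling documentation. Thanks for the feedback ~ :pray: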
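The precondition described above can be checked before retrying a join. The sketch below assumes that `pd-ctl health` prints a JSON array in which each member carries a `"health": true` or `"health": false` field; `all_pd_members_healthy` is a hypothetical helper name, and it only inspects what `health` reports, so it is still worth looking at the `member` output for entries with an empty name.

```shell
# Reads the JSON printed by `pd-ctl health` on stdin.
# Succeeds only if at least one member is listed and none reports "health": false.
all_pd_members_healthy() {
  report="$(cat)"
  if printf '%s\n' "$report" | grep -Eq '"health"[[:space:]]*:[[:space:]]*false'; then
    return 1
  fi
  printf '%s\n' "$report" | grep -Eq '"health"[[:space:]]*:[[:space:]]*true'
}
```

A possible use is `pd-ctl -u http://10.152.89.20:2379 health | all_pd_members_healthy && echo "all PD members healthy"`, run before starting the new PD nodes again.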

How exactly do I do this? I am a beginner and would appreciate an answer.
