TiUP PD scale-out failed; cluster cannot start

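longge93 · July 6, 2020, 07:25

I used TiUP to scale out PD. On the first attempt the new PD nodes were given the same name, so the scale-out failed. After I changed the names, the scale-out failed again. Checking the configuration, I found that nothing was rolled back after the first failure, and as a result the TiDB cluster can no longer start.

PD-related section shown by edit-config: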

pd_servers:
- host: 192.168.30.30
  ssh_port: 22
  imported: true
  name: pd_localhost
  client_port: 2379
  peer_port: 2380
  deploy_dir: /data1/deploy
  data_dir: /data1/deploy/data.pd
  log_dir: /data1/deploy/log
  arch: amd64
  os: linux
- host: 192.168.30.31
  ssh_port: 22
  imported: true
  name: pd_localhost2
  client_port: 2379
  peer_port: 2380
  deploy_dir: /data1/pd
  data_dir: /data1/pd/data.pd
  log_dir: /data1/pd/log
  arch: amd64
  os: linux
- host: 192.168.30.32
  ssh_port: 22
  imported: true
  name: pd_localhost3
  client_port: 2379
  peer_port: 2380
  deploy_dir: /data1/pd
  data_dir: /data1/pd/data.pd
  log_dir: /data1/pd/log
  arch: amd64
  os: linux
- host: 192.168.30.31
  ssh_port: 22
  imported: true
  name: pd_localhost
  client_port: 2379
  peer_port: 2380
  deploy_dir: /data1/deploy
  data_dir: /data1/deploy/data.pd
  log_dir: /data1/deploy/log
  arch: amd64
  os: linux
- host: 192.168.30.32
  ssh_port: 22
  imported: true
  name: pd_localhost
  client_port: 2379
  peer_port: 2380
  deploy_dir: /data1/deploy
  data_dir: /data1/deploy/data.pd
  log_dir: /data1/deploy/log
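
In the pd_servers listing above, 192.168.30.31:2379 and 192.168.30.32:2379 each appear twice, and the name pd_localhost appears three times. A minimal Python sketch of a check that would flag such collisions before a start is given below. It assumes the pd_servers section has already been loaded into a list of dicts (for example with a YAML parser); the helper name find_pd_conflicts is hypothetical and is not part of TiUP.

```python
from collections import Counter


def find_pd_conflicts(pd_servers):
    """Return (duplicate_names, duplicate_endpoints) for a pd_servers list.

    Hypothetical helper for illustration; TiUP does not ship this function.
    """
    names = Counter(s["name"] for s in pd_servers)
    endpoints = Counter((s["host"], s["client_port"]) for s in pd_servers)
    dup_names = sorted(n for n, c in names.items() if c > 1)
    dup_endpoints = sorted(e for e, c in endpoints.items() if c > 1)
    return dup_names, dup_endpoints


# Reduced to the fields the check needs; values copied from the listing above.
pd_servers = [
    {"host": "192.168.30.30", "client_port": 2379, "name": "pd_localhost"},
    {"host": "192.168.30.31", "client_port": 2379, "name": "pd_localhost2"},
    {"host": "192.168.30.32", "client_port": 2379, "name": "pd_localhost3"},
    {"host": "192.168.30.31", "client_port": 2379, "name": "pd_localhost"},
    {"host": "192.168.30.32", "client_port": 2379, "name": "pd_localhost"},
]

dup_names, dup_endpoints = find_pd_conflicts(pd_servers)
print("duplicate names:", dup_names)
print("duplicate endpoints:", dup_endpoints)
```

Any non-empty result means the stored topology describes more PD instances than actually exist on the hosts.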

Error when starting TiDB:

[tidb@localhost tiup-yaml]$ tiup cluster start test-cluster
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.0.7/tiup-cluster start test-cluster
Starting cluster test-cluster…

[ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/test-cluster/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/test-cluster/ssh/id_rsa.pub
[Parallel] - UserSSH: user=tidb, host=192.168.30.30
[Parallel] - UserSSH: user=tidb, host=192.168.30.32
[Parallel] - UserSSH: user=tidb, host=192.168.30.32
[Parallel] - UserSSH: user=tidb, host=192.168.30.31
[Parallel] - UserSSH: user=tidb, host=192.168.30.30
[Parallel] - UserSSH: user=tidb, host=192.168.30.31
[Parallel] - UserSSH: user=tidb, host=192.168.30.31
[Parallel] - UserSSH: user=tidb, host=192.168.30.33
[Parallel] - UserSSH: user=tidb, host=192.168.30.34
[Parallel] - UserSSH: user=tidb, host=192.168.30.32
[Parallel] - UserSSH: user=tidb, host=192.168.30.30
[Parallel] - UserSSH: user=tidb, host=192.168.30.30
[Parallel] - UserSSH: user=tidb, host=192.168.30.30
[ Serial ] - ClusterOperate: operation=StartOperation, options={Roles:[] Nodes:[] Force:false SSHTimeout:5 OptTimeout:60 APITimeout:300 IgnoreConfigCheck:false RetainDataRoles:[] RetainDataNodes:[]}
Starting component pd
Starting instance pd 192.168.30.32:2379
Starting instance pd 192.168.30.31:2379
Starting instance pd 192.168.30.31:2379
Starting instance pd 192.168.30.32:2379
Starting instance pd 192.168.30.30:2379
Failed to start pd-2379.service: Unit pd-2379.service not found.

Failed to start pd-2379.service: Unit pd-2379.service not found.

Start pd 192.168.30.30:2379 success

retry error: operation timed out after 1m0s
pd 192.168.30.32:2379 failed to start: timed out waiting for port 2379 to be started after 1m0s, please check the log of the instance
retry error: operation timed out after 1m0s
pd 192.168.30.32:2379 failed to start: timed out waiting for port 2379 to be started after 1m0s, please check the log of the instance

Error: failed to start: failed to start pd: failed to start: pd 192.168.30.31:2379: executor.ssh.execute_failed: Failed to execute command over SSH for 'tidb@192.168.30.31:22' {ssh_stderr: Failed to start pd-2379.service: Unit pd-2379.service not found.
, ssh_stdout: , ssh_command: export LANG=C; PATH=$PATH:/usr/bin:/usr/sbin sudo -H -u root bash -c "systemctl daemon-reload && systemctl start pd-2379.service && systemctl enable pd-2379.service"}, cause: Process exited with status 5

Verbose debug logs has been written to /home/tidb/tiup-yaml/logs/tiup-cluster-debug-2020-07-06-15-23-45.log.
Error: run /home/tidb/.tiup/components/cluster/v1.0.7/tiup-cluster (wd:/home/tidb/.tiup/data/S3wY60d) failed: exit status 1
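
The log above shows the start operation giving up after waiting 1m0s for port 2379 to open. To check by hand whether a PD client port is reachable, a probe along the following lines could be used. This is only an illustrative sketch and is not how TiUP implements its own wait; the function name wait_for_port and its parameters are assumptions made for this example.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection succeeds before `timeout` seconds pass.
    Hypothetical helper for illustration only.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, wait_for_port("192.168.30.32", 2379, timeout=5) would report whether anything is listening on that PD endpoint within five seconds. A False result only says the port is closed; the reason still has to come from pd.log and pd_stderr.log on the instance.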

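yilong · July 6, 2020, 07:38

1. Please show the current output of tiup cluster display <cluster-name>.
2. Please upload the log /home/tidb/tiup-yaml/logs/tiup-cluster-debug-2020-07-06-15-23-45.log.
3. Are the PD nodes on 31 and 32 listed twice? Were they scaled out at the same time?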

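longge93 · July 6, 2020, 07:43

tiup-cluster-debug-2020-07-06-15-23-45.log (62.9 KB)

3. Yes, 31 and 32 were scaled out at the same time. On the first attempt both had the name pd_localhost, so it failed. I then changed the names and tried again, and after that the cluster would not start.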

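yilong · July 6, 2020, 07:49

Sorry, some things are still unclear to me. Could you share the exact command you used for the scale-out? Thanks. It looks like the same IP addresses and ports were used, but with different names?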

longge93 · July 6, 2020, 07:52

First, here is an updated screenshot of the display output; the one I posted above was wrong.

First scale-out attempt:
pd_servers:
- host: 192.168.30.31
  ssh_port: 22
  imported: true
  name: pd_localhost
  client_port: 2379
  peer_port: 2380
  deploy_dir: /data1/pd
  data_dir: /data1/pd/data.pd
  log_dir: /data1/pd/log
  arch: amd64
  os: linux
- host: 192.168.30.32
  ssh_port: 22
  imported: true
  name: pd_localhost
  client_port: 2379
  peer_port: 2380
  deploy_dir: /data1/pd
  data_dir: /data1/pd/data.pd
  log_dir: /data1/pd/log
  arch: amd64
  os: linux

That failed, so I changed the name values in the YAML and scaled out again:

pd_servers:
- host: 192.168.30.31
  ssh_port: 22
  imported: true
  name: pd_localhost2
  client_port: 2379
  peer_port: 2380
  deploy_dir: /data1/pd
  data_dir: /data1/pd/data.pd
  log_dir: /data1/pd/log
  arch: amd64
  os: linux
- host: 192.168.30.32
  ssh_port: 22
  imported: true
  name: pd_localhost3
  client_port: 2379
  peer_port: 2380
  deploy_dir: /data1/pd
  data_dir: /data1/pd/data.pd
  log_dir: /data1/pd/log
  arch: amd64
  os: linux

The command I used was the standard TiUP scale-out command:

tiup cluster scale-out <cluster-name> <the .yaml file above>

longge93 · July 6, 2020, 08:32

The entire TiDB cluster cannot start right now. Is there any emergency workaround?

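AstroProfundis · July 6, 2020, 08:59

Hello, you could try using edit-config to delete the incorrect nodes added by the failed scale-out (that is, restore the topology to what it was before the first failed scale-out). Then try starting the cluster, and afterwards scale out again with a correct configuration.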

longge93 · July 6, 2020, 09:21

After changing the configuration, the cluster still will not start; TiKV reports a timeout. The TiKV startup log is as follows:

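AstroProfundis · July 6, 2020, 09:26

Hello, you can use the reload subcommand to update TiKV's startup configuration and then try starting again.

One more question: was the original scale-out done with tiup cluster scale-out, or by adding nodes through tiup cluster edit-config?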

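longge93 · July 6, 2020, 09:29

I added the nodes with tiup cluster scale-out. Looking at TiKV, it cannot connect to the original PD node (not one of the scaled-out ones), and that original PD node will not start. Reloading the original PD node reports the following error:

Reloading the TiKV nodes just keeps waiting for the PD leader, because PD has not started successfully.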

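AstroProfundis · July 6, 2020, 09:35

The situation now is that incorrect PD nodes have been added to the topology, so they must be removed before the cluster can start. A possible approach:

1. Use edit-config to delete the incorrect PD nodes, restoring the topology to its state before the failed scale-out.
2. Use the reload command to update the PD list in each component's startup script (mainly PD and TiKV). Omit the -R flag so that the whole cluster's configuration is reloaded.
3. Try starting the cluster with start.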
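
To make the intended result of the edit-config step concrete, the sketch below filters a pd_servers list down to the entries that existed before the failed scale-out. It is illustrative only: edit-config is an interactive editor, the function name restore_pd_topology is hypothetical, and the entries are reduced copies of the pd_servers listing from the first post.

```python
def restore_pd_topology(pd_servers, failed_hosts):
    """Return pd_servers without entries on hosts added by the failed scale-out.

    Hypothetical helper; in practice the entries are deleted by hand in edit-config.
    """
    return [s for s in pd_servers if s["host"] not in failed_hosts]


# Reduced copies of the entries shown in the first post.
current = [
    {"host": "192.168.30.30", "name": "pd_localhost"},
    {"host": "192.168.30.31", "name": "pd_localhost2"},
    {"host": "192.168.30.32", "name": "pd_localhost3"},
    {"host": "192.168.30.31", "name": "pd_localhost"},
    {"host": "192.168.30.32", "name": "pd_localhost"},
]

restored = restore_pd_topology(
    current, failed_hosts={"192.168.30.31", "192.168.30.32"}
)
print(restored)
```

Only the original PD on 192.168.30.30 remains, which is the topology the cluster ran with before the scale-out.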

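longge93 · July 6, 2020, 09:38

Without the -R flag, the first component reloaded is TiFlash, which TiUP then tries to start. PD and the other nodes are not up yet at that point, so it fails and the reload is aborted.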

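AstroProfundis · July 6, 2020, 09:41

In that case, reload and start role by role with -R, in the order PD -> TiKV -> TiDB, finishing one component before moving on to the next (reload automatically performs a restart). Once all of them succeed, reload the whole cluster once more to make sure the other components' configuration is also up to date.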
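
The ordering described here can be written down as a small Python sketch that only builds the command lines and does not run them. The cluster name test-cluster is taken from the log earlier in the thread; the role names pd, tikv and tidb passed to -R are assumed to match the roles in this cluster, and the helper name reload_plan is hypothetical.

```python
def reload_plan(cluster, roles=("pd", "tikv", "tidb")):
    """Build per-role reload commands, followed by one whole-cluster reload.

    Hypothetical helper; it returns argument lists and executes nothing.
    """
    plan = [["tiup", "cluster", "reload", cluster, "-R", role] for role in roles]
    # Final pass without -R so the remaining components pick up the new config.
    plan.append(["tiup", "cluster", "reload", cluster])
    return plan


for cmd in reload_plan("test-cluster"):
    print(" ".join(cmd))
```

Each argument list could be passed to subprocess.run(cmd, check=True) so that a failed step stops the sequence before the next role is touched.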

longge93 · July 6, 2020, 09:44

Reloading the PD node fails with the error in the first blue box. Starting the PD node hangs at the point shown in the second blue box.

AstroProfundis · July 6, 2020, 09:46

Could you share the pd_servers section as you currently see it in edit-config?

longge93 · July 6, 2020, 09:47

pd_servers:
- host: 192.168.30.30
  ssh_port: 22
  imported: true
  name: pd_localhost
  client_port: 2379
  peer_port: 2380
  deploy_dir: /data1/deploy
  data_dir: /data1/deploy/data.pd
  log_dir: /data1/deploy/log
  arch: amd64
  os: linux

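This is the original configuration from before the scale-out, when the cluster was running stably. It is what the topology looks like right now.

This scale-out was meant for production. Luckily I tried it in the test environment first, otherwise it would have been embarrassing.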

AstroProfundis · July 6, 2020, 09:55

Could you provide the logs of the PD on 192.168.30.30? Judging from the configuration they should be under /data1/deploy/log. Please upload both pd.log and pd_stderr.log.

longge93 · July 6, 2020, 09:58

pd_stderr.log (4.6 MB)

pd.log is too large, so I will just post a screenshot.

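AstroProfundis · July 6, 2020, 10:13

Could you check whether a pd-server process is running on 192.168.30.32? If so, could you provide that process's log files?
(If it exists, do not kill it yet; it may be needed to confirm the state of PD.)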

longge93 · July 6, 2020, 10:16

There is one pd process. It is the one from the second scale-out, after I changed the name in the YAML.

Log files:

pd.log (443.5 KB) pd_stderr.log (10.7 KB)