[FAQ] pd name conflict when scaling out a new PD on a host that already runs PD; pd.log reports: failed to open directory

【Problem Clarification】

  • In the tidb-ansible era, pd name was an internal parameter with a fixed default and could not be customized. Now that TiUP is becoming the mainstream operations tool, it exposes pd name as a configurable option.
  • tiup version: v0.6.0
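If every PD instance falls back to the same derived default name, a second PD scaled out into the cluster joins under a duplicate name. A minimal sketch of that collision (the fallback rule below is an illustration, not TiUP's or PD's actual naming logic):

```python
# Hypothetical sketch of a duplicate-name collision; the fallback rule
# is illustrative only, not TiUP's or PD's actual code.
def derive_pd_name(explicit_name=None, default="pd"):
    """Use the user-supplied name if present, else a shared default."""
    return explicit_name if explicit_name is not None else default

existing_members = {derive_pd_name()}  # first PD joined under the default name
new_member = derive_pd_name()          # scaled-out PD with no `name` configured

# Both instances resolve to the same name, so the join is rejected.
assert new_member in existing_members
```

Setting `name` explicitly in the topology breaks the collision, which is exactly what the solution below does.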

【Problem Reproduction】

1

Create the scale-out topology file

global:
  user: tidb
  ssh_port: 22
  deploy_dir: /home/tidb/lqh-clusters/root_test/deploy02
  data_dir: /home/tidb/lqh-clusters/root_test/data02
pd_servers:
  - host: 172.16.5.169
    ssh_port: 22
    client_port: 52379
    peer_port: 52380
server_configs:
  pd:
    replication.enable-placement-rules: true

2

Full execution log: tiup cluster scale-out root_test conf/pd-scale-out.yaml

[tidb@node5169 qihang.li]$ tiup cluster scale-out root_test conf/pd-scale-out.yaml
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v0.6.0/cluster scale-out root_test conf/pd-scale-out.yaml
Please confirm your topology:
TiDB Cluster: root_test
TiDB Version: v4.0.0-rc.1
Type  Host          Ports        Directories
----  ----          -----        -----------
pd    172.16.5.169  52379/52380  /home/tidb/lqh-clusters/root_test/deploy02/pd-52379,/home/tidb/lqh-clusters/root_test/data02/pd-52379
Attention:
    1. If the topology is not what you expected, check your yaml file.
    2. Please confirm there is no port/directory conflicts in same host.
Do you want to continue? [y/N]:  y
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/root_test/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/root_test/ssh/id_rsa.pub


  - Download node_exporter:v0.17.0 ... Done
+ [ Serial ] - UserSSH: user=tidb, host=172.16.5.169
+ [ Serial ] - Mkdir: host=172.16.5.169, directories='/home/tidb/lqh-clusters/root_test/deploy02/pd-52379','/home/tidb/lqh-clusters/root_test/data02/pd-52379','/home/tidb/lqh-clusters/root_test/deploy02/pd-52379/log','/home/tidb/lqh-clusters/root_test/deploy02/pd-52379/bin','/home/tidb/lqh-clusters/root_test/deploy02/pd-52379/conf','/home/tidb/lqh-clusters/root_test/deploy02/pd-52379/scripts'
+ [ Serial ] - CopyComponent: component=pd, version=v4.0.0-rc.1, remote=172.16.5.169:/home/tidb/lqh-clusters/root_test/deploy02/pd-52379
+ [ Serial ] - ScaleConfig: cluster=root_test, user=tidb, host=172.16.5.169, service=pd-52379.service, deploy_dir=/home/tidb/lqh-clusters/root_test/deploy02/pd-52379, data_dir=/home/tidb/lqh-clusters/root_test/data02/pd-52379, log_dir=/home/tidb/lqh-clusters/root_test/deploy02/pd-52379/log, cache_dir=
script path: /home/tidb/.tiup/storage/cluster/clusters/root_test/config/run_pd_172.16.5.169_52379.sh
script path: /home/tidb/.tiup/components/cluster/v0.6.0/templates/scripts/run_pd_scale.sh.tpl
+ [Parallel] - UserSSH: user=tidb, host=172.16.5.169
+ [Parallel] - UserSSH: user=tidb, host=172.16.5.169
+ [Parallel] - UserSSH: user=tidb, host=172.16.5.169
+ [Parallel] - UserSSH: user=tidb, host=172.16.5.169
+ [Parallel] - UserSSH: user=tidb, host=172.16.5.171
+ [Parallel] - UserSSH: user=tidb, host=172.16.5.169
+ [Parallel] - UserSSH: user=tidb, host=172.16.5.142
+ [Parallel] - UserSSH: user=tidb, host=172.16.5.169
+ [ Serial ] - ClusterOperate: operation=StartOperation, options={Roles:[] Nodes:[] Force:false Timeout:0}
Starting component pd
        Starting instance pd 172.16.5.169:12379
        Start pd 172.16.5.169:12379 success
Starting component node_exporter
        Starting instance 172.16.5.169
        Start 172.16.5.169 success
Starting component blackbox_exporter
        Starting instance 172.16.5.169
        Start 172.16.5.169 success
Starting component tikv
        Starting instance tikv 172.16.5.171:30163
        Starting instance tikv 172.16.5.169:30161
        Starting instance tikv 172.16.5.169:30162
        Start tikv 172.16.5.171:30163 success
        Start tikv 172.16.5.169:30162 success
        Start tikv 172.16.5.169:30161 success
Starting component node_exporter
        Starting instance 172.16.5.171
        Start 172.16.5.171 success
Starting component blackbox_exporter
        Starting instance 172.16.5.171
        Start 172.16.5.171 success
Starting component tidb
        Starting instance tidb 172.16.5.169:34000
        Start tidb 172.16.5.169:34000 success
Starting component tiflash
        Starting instance tiflash 172.16.5.142:29000
        Start tiflash 172.16.5.142:29000 success
Starting component node_exporter
        Starting instance 172.16.5.142
        Start 172.16.5.142 success
Starting component blackbox_exporter
        Starting instance 172.16.5.142
        Start 172.16.5.142 success
Starting component prometheus
        Starting instance prometheus 172.16.5.169:19090
        Start prometheus 172.16.5.169:19090 success
Starting component grafana
        Starting instance grafana 172.16.5.169:13000
        Start grafana 172.16.5.169:13000 success
Checking service state of pd
        172.16.5.169       Active: active (running) since 四 2020-05-07 22:38:47 CST; 21h ago
Checking service state of tikv
        172.16.5.171       Active: active (running) since 三 2020-05-06 10:36:52 CST; 2 days ago
        172.16.5.169       Active: active (running) since 四 2020-05-07 22:36:05 CST; 21h ago
        172.16.5.169       Active: active (running) since 四 2020-05-07 22:35:31 CST; 21h ago
Checking service state of tidb
        172.16.5.169       Active: active (running) since 三 2020-05-06 10:37:05 CST; 2 days ago
Checking service state of tiflash
        172.16.5.142       Active: active (running) since 三 2020-05-06 10:37:15 CST; 2 days ago
Checking service state of prometheus
        172.16.5.169       Active: active (running) since 三 2020-05-06 10:37:28 CST; 2 days ago
Checking service state of grafana
        172.16.5.169       Active: active (running) since 三 2020-05-06 10:37:40 CST; 2 days ago
+ [Parallel] - UserSSH: user=tidb, host=172.16.5.169
+ [ Serial ] - save meta
+ [ Serial ] - ClusterOperate: operation=StartOperation, options={Roles:[] Nodes:[] Force:false Timeout:0}
Starting component pd
        Starting instance pd 172.16.5.169:52379
        pd 172.16.5.169:52379 failed to start: timed out waiting for port 52379 to be started after 1m0s, please check the log of the instance

Error: failed to start: failed to start pd:     pd 172.16.5.169:52379 failed to start: timed out waiting for port 52379 to be started after 1m0s, please check the log of the instance: timed out waiting for port 52379 to be started after 1m0s

Verbose debug logs has been written to /home/tidb/qihang.li/logs/tiup-cluster-debug-2020-05-08-19-44-12.log.
Error: run `/home/tidb/.tiup/components/cluster/v0.6.0/cluster` (wd:/home/tidb/.tiup/data/RyOcHwQ) failed: exit status 1

3

pd.log error:

[tidb@node5169 qihang.li]$ less /home/tidb/lqh-clusters/root_test/deploy02/pd-52379/log/pd.log
[2020/05/08 19:43:09.692 +08:00] [INFO] [util.go:49] ["Welcome to Placement Driver (PD)"]
[2020/05/08 19:43:09.692 +08:00] [INFO] [util.go:50] [PD] [release-version=v4.0.0-rc.1]
[2020/05/08 19:43:09.692 +08:00] [INFO] [util.go:51] [PD] [git-hash=31dae220db6294f2dc2ec0df330892fe76e59edc]
[2020/05/08 19:43:09.692 +08:00] [INFO] [util.go:52] [PD] [git-branch=heads/refs/tags/v4.0.0-rc.1]
[2020/05/08 19:43:09.692 +08:00] [INFO] [util.go:53] [PD] [utc-build-time="2020-04-28 11:56:11"]
[2020/05/08 19:43:09.693 +08:00] [INFO] [metricutil.go:81] ["disable Prometheus push client"]
[2020/05/08 19:43:09.693 +08:00] [ERROR] [join.go:213] ["failed to open directory"] [error="open /home/tidb/lqh-clusters/root_test/data02/pd-52379/member: no such file or directory"]
[2020/05/08 19:43:09.696 +08:00] [FATAL] [main.go:93] ["join meet error"] [error="missing data or join a duplicated pd"] [stack="github.com/pingcap/log.Fatal\n\t/home/jenkins/agent/workspace/uild_pd_multi_branch_v4.0.0-rc.1/go/pkg/mod/github.com/pingcap/log@v0.0.0-20200117041106-d28c14d3b1cd/global.go:59\nmain.main\n\t/home/jenkins/agent/workspace/uild_pd_multi_branch_v4.0.0-rc.1/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:93\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"]
[2020/05/08 19:43:24.839 +08:00] [INFO] [util.go:49] ["Welcome to Placement Driver (PD)"]
[2020/05/08 19:43:24.839 +08:00] [INFO] [util.go:50] [PD] [release-version=v4.0.0-rc.1]
[2020/05/08 19:43:24.839 +08:00] [INFO] [util.go:51] [PD] [git-hash=31dae220db6294f2dc2ec0df330892fe76e59edc]
[2020/05/08 19:43:24.839 +08:00] [INFO] [util.go:52] [PD] [git-branch=heads/refs/tags/v4.0.0-rc.1]
[2020/05/08 19:43:24.839 +08:00] [INFO] [util.go:53] [PD] [utc-build-time="2020-04-28 11:56:11"]
[2020/05/08 19:43:24.839 +08:00] [INFO] [metricutil.go:81] ["disable Prometheus push client"]
[2020/05/08 19:43:24.839 +08:00] [ERROR] [join.go:213] ["failed to open directory"] [error="open /home/tidb/lqh-clusters/root_test/data02/pd-52379/member: no such file or directory"]
2020/05/08 19:43:24.839 grpclog.go:45: [info] parsed scheme: "endpoint"
2020/05/08 19:43:24.840 grpclog.go:45: [info] ccResolverWrapper: sending new addresses to cc: [{http://172.16.5.169:12379 0 }]
[2020/05/08 19:43:24.842 +08:00] [FATAL] [main.go:93] ["join meet error"] [error="missing data or join a duplicated pd"] [stack="github.com/pingcap/log.Fatal\n\t/home/jenkins/agent/workspace/uild_pd_multi_branch_v4.0.0-rc.1/go/p...skipping
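The two log entries fit together: the join flow first looks for the `member` directory under the data dir; it is absent for the brand-new instance, and because the member's name already exists in the cluster, startup aborts with "missing data or join a duplicated pd". A simplified sketch of that decision (not PD's actual implementation):

```python
import os

def join_precheck(data_dir, my_name, existing_names):
    """Simplified sketch of the join decision visible in pd.log above;
    not PD's actual code."""
    member_dir = os.path.join(data_dir, "member")
    if not os.path.isdir(member_dir):      # -> "failed to open directory"
        if my_name in existing_names:      # -> "join a duplicated pd"
            raise RuntimeError("missing data or join a duplicated pd")
        return "join as a brand-new member"
    return "restart of an already-joined member"
```

Since the new instance had no member data yet and its name collided with the PD already running on the host, the fatal branch was taken.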

【Solution】

Explicitly assign a unique name to the new PD instance in the scale-out topology, so it cannot conflict with the PD already running on the same host:

pd_servers:
  - host: 10.0.1.4
    # ssh_port: 22
    name: "pd-1"
    # client_port: 2379
    # peer_port: 2380
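Before re-running the scale-out, it can help to pre-check that every incoming PD carries a name no existing member already uses. A hypothetical pre-flight check (not a TiUP feature):

```python
# Hypothetical pre-flight check (not part of TiUP): ensure every PD in the
# scale-out topology has a name that no existing member already uses.
def check_unique_pd_names(existing_pds, incoming_pds):
    """Each entry is a dict like {'host': '10.0.1.4', 'name': 'pd-1'}."""
    seen = {pd.get("name") for pd in existing_pds if pd.get("name")}
    for pd in incoming_pds:
        name = pd.get("name")
        if not name or name in seen:
            raise ValueError(
                f"PD on {pd['host']} needs a unique `name`, got {name!r}")
        seen.add(name)
    return True
```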

【Classic Cases】