TiUP scale-in/scale-out test: PD scale-in succeeds, PD scale-out fails

【TiDB Version】

16:13:54 test> select tidb_version()\G
*************************** 1. row ***************************
tidb_version(): Release Version: v5.0.0
Edition: Community
Git Commit Hash: bdac0885cd11bdf571aad9353bfc24e13554b91c
Git Branch: heads/refs/tags/v5.0.0
UTC Build Time: 2021-04-06 16:36:29
GoVersion: go1.13
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
1 row in set (0.00 sec)

16:13:57 test> 

【Problem Description】
I was testing PD scale-in/scale-out in a TiDB test environment. Scaling in works fine, but the subsequent scale-out fails.
The error output is as follows:

[root@flinkslave2 test]# tiup update cluster
component cluster version v1.4.1 is already installed
Updated successfully!
[root@flinkslave2 test]# tiup cluster scale-out tidb-test scale-out.yaml
Starting component `cluster`: /root/.tiup/components/cluster/v1.4.1/tiup-cluster scale-out tidb-test scale-out.yaml

Error: Deploy directory overlaps to another instance (spec.deploy.dir_overlap)

The directory you specified in the topology file is:
  Directory: monitor data directory /vdb/test/data/monitored-9600
  Component: pd 10.228.131.85

It overlaps to another instance:
  Other Directory: monitor log directory /vdb/test/data/monitored-9600/log
  Other Component: pd 10.228.131.85

Please modify the topology file and try again.
Error: run `/root/.tiup/components/cluster/v1.4.1/tiup-cluster` (wd:/root/.tiup/data/SVISgBk) failed: exit status 1
[root@flinkslave2 test]# 


[root@flinkslave2 test]# cat scale-out.yaml 
pd_servers:
  - host: 10.228.131.95   
    ssh_port: 22
    client_port: 2379
    peer_port: 2380
    deploy_dir: /vdb/test/deploy/pd-2379
    data_dir: /vdb/test/data/pd-2379
  - host: 10.228.131.68
    ssh_port: 22
    client_port: 2379
    peer_port: 2380
    deploy_dir: /vdb/test/deploy/pd-2379
    data_dir: /vdb/test/data/pd-2379
[root@flinkslave2 test]# 

I referred to this thread: 扩容tiflash报错 (TiFlash scale-out error), but it did not resolve the issue.
Cluster deployment:

[root@flinkslave2 test]# tiup cluster display tidb-test
Starting component `cluster`: /root/.tiup/components/cluster/v1.4.1/tiup-cluster display tidb-test
Cluster type:       tidb
Cluster name:       tidb-test
Cluster version:    v5.0.0
SSH type:           builtin
Dashboard URL:      http://10.228.131.85:2379/dashboard
ID                   Role          Host           Ports        OS/Arch       Status   Data Dir                          Deploy Dir
--                   ----          ----           -----        -------       ------   --------                          ----------
10.228.131.68:9093   alertmanager  10.228.131.68  9093/9094    linux/x86_64  Up       /vdb/test/data/alertmanager-9093  /vdb/test/deploy/alertmanager-9093
10.228.131.68:8300   cdc           10.228.131.68  8300         linux/x86_64  Up       -                                 /vdb/test/deploy/cdc-8300
10.228.131.85:8300   cdc           10.228.131.85  8300         linux/x86_64  Up       -                                 /vdb/test/deploy/cdc-8300
10.228.131.95:8300   cdc           10.228.131.95  8300         linux/x86_64  Up       -                                 /vdb/test/deploy/cdc-8300
10.228.131.68:3000   grafana       10.228.131.68  3000         linux/x86_64  Up       -                                 /vdb/test/deploy/grafana-3000
10.228.131.85:2379   pd            10.228.131.85  2379/2380    linux/x86_64  Up|L|UI  /vdb/test/data/pd-2379            /vdb/test/deploy/pd-2379
10.228.131.68:9090   prometheus    10.228.131.68  9090         linux/x86_64  Up       /vdb/test/data/prometheus-9090    /vdb/test/deploy/prometheus-9090
10.228.131.68:4001   tidb          10.228.131.68  4001/10081   linux/x86_64  Up       -                                 /vdb/test/deploy/tidb-4001
10.228.131.85:4001   tidb          10.228.131.85  4001/10081   linux/x86_64  Up       -                                 /vdb/test/deploy/tidb-4001
10.228.131.95:4001   tidb          10.228.131.95  4001/10081   linux/x86_64  Up       -                                 /vdb/test/deploy/tidb-4001
10.228.131.71:20160  tikv          10.228.131.71  20160/20180  linux/x86_64  Up       /vdb/test/data/tikv-20160         /vdb/test/deploy/tikv-20160
10.228.131.72:20160  tikv          10.228.131.72  20160/20180  linux/x86_64  Up       /vdb/test/data/tikv-20160         /vdb/test/deploy/tikv-20160
10.228.131.75:20160  tikv          10.228.131.75  20160/20180  linux/x86_64  Up       /vdb/test/data/tikv-20160         /vdb/test/deploy/tikv-20160
Total nodes: 13
[root@flinkslave2 test]#


Looking at the error, these two directories overlap (/vdb/test/data/monitored-9600 and /vdb/test/data/monitored-9600/log), and since scale-out also deploys the monitoring agents on the new hosts, this makes scale-out fail for every component. I don't know how to change the monitored settings in the configuration file: running tiup cluster edit-config tidb-test reports that monitored.LogDir cannot be modified:

The directory you specified in the topology file is:
  Directory: monitor data directory /vdb/test/data/monitored-9600    <<<<<<<<<<
  Component: pd 10.228.131.85

It overlaps to another instance:
  Other Directory: monitor log directory /vdb/test/data/monitored-9600/log   <<<<<<<<<<
  Other Component: pd 10.228.131.85
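
Since edit-config refuses to touch these fields, the monitored section can at least be inspected directly in the cluster meta file (assuming the default TiUP storage path; the full file is shown further below):

[root@flinkslave2 ~]# grep -A 5 'monitored:' /root/.tiup/storage/cluster/clusters/tidb-test/meta.yaml
  monitored:
    node_exporter_port: 9600
    blackbox_exporter_port: 9115
    deploy_dir: /vdb/test/data/monitored-9600
    data_dir: /vdb/test/data/monitored-9600
    log_dir: /vdb/test/data/monitored-9600/log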

After some searching, this appears to be the PR that introduced the directory overlap check:
https://github.com/pingcap/tiup/pull/1093

Please share the cluster configuration file and the output of tiup --version.

[root@flinkslave2 ~]# tiup --version
1.4.1 tiup
Go Version: go1.16.3
Git Ref: v1.4.1
GitHash: cd19b75b6418f627d121d43d4b1e41af673526cf
[root@flinkslave2 ~]#
Configuration file:

[root@flinkslave2 tidb-test]# cat /root/.tiup/storage/cluster/clusters/tidb-test/meta.yaml 
user: tidb
tidb_version: v5.0.0
last_ops_ver: |-
  1.4.1 tiup
  Go Version: go1.16.3
  Git Ref: v1.4.1
  GitHash: cd19b75b6418f627d121d43d4b1e41af673526cf
topology:
  global:
    user: tidb
    ssh_port: 22
    ssh_type: builtin
    deploy_dir: /vdb/test/deploy
    data_dir: /vdb/test/data
    os: linux
    arch: amd64
  monitored:
    node_exporter_port: 9600
    blackbox_exporter_port: 9115
    deploy_dir: /vdb/test/data/monitored-9600
    data_dir: /vdb/test/data/monitored-9600
    log_dir: /vdb/test/data/monitored-9600/log
  server_configs:
    tidb:
      oom-action: cancel
      oom-use-tmp-storage: false
    tikv:
      readpool.coprocessor.use-unified-pool: true
      readpool.storage.use-unified-pool: false
      server.grpc-compression-type: gzip
    pd:
      replication.enable-placement-rules: true
      replication.location-labels:
      - zone
      - rack
      - host
      replication.strictly-match-label: true
      schedule.leader-schedule-limit: 4
      schedule.region-schedule-limit: 1024
      schedule.replica-schedule-limit: 64
      schedule.tolerant-size-ratio: 20.0
    tiflash:
      logger.level: info
    tiflash-learner:
      log-level: info
    pump: {}
    drainer: {}
    cdc: {}
  tidb_servers:
  - host: 10.228.131.68
    ssh_port: 22
    port: 4001
    status_port: 10081
    deploy_dir: /vdb/test/deploy/tidb-4001
    arch: amd64
    os: linux
  - host: 10.228.131.95
    ssh_port: 22
    port: 4001
    status_port: 10081
    deploy_dir: /vdb/test/deploy/tidb-4001
    arch: amd64
    os: linux
  - host: 10.228.131.85
    ssh_port: 22
    port: 4001
    status_port: 10081
    deploy_dir: /vdb/test/deploy/tidb-4001
    arch: amd64
    os: linux
  tikv_servers:
  - host: 10.228.131.72
    ssh_port: 22
    port: 20160
    status_port: 20180
    deploy_dir: /vdb/test/deploy/tikv-20160
    data_dir: /vdb/test/data/tikv-20160
    config:
      server.labels:
        host: host1
        rack: rack1
        zone: zone1
    arch: amd64
    os: linux
  - host: 10.228.131.75
    ssh_port: 22
    port: 20160
    status_port: 20180
    deploy_dir: /vdb/test/deploy/tikv-20160
    data_dir: /vdb/test/data/tikv-20160
    config:
      server.labels:
        host: host2
        rack: rack1
        zone: zone1
    arch: amd64
    os: linux
  - host: 10.228.131.71
    ssh_port: 22
    port: 20160
    status_port: 20180
    deploy_dir: /vdb/test/deploy/tikv-20160
    data_dir: /vdb/test/data/tikv-20160
    config:
      server.labels:
        host: host3
        rack: rack3
        zone: zone2
    arch: amd64
    os: linux
  tiflash_servers: []
  pd_servers:
  - host: 10.228.131.85
    ssh_port: 22
    name: pd-10.228.131.85-2379
    client_port: 2379
    peer_port: 2380
    deploy_dir: /vdb/test/deploy/pd-2379
    data_dir: /vdb/test/data/pd-2379
    arch: amd64
    os: linux
  cdc_servers:
  - host: 10.228.131.68
    ssh_port: 22
    port: 8300
    deploy_dir: /vdb/test/deploy/cdc-8300
    log_dir: /vdb/test/deploy/cdc-8300/log
    arch: amd64
    os: linux
  - host: 10.228.131.85
    ssh_port: 22
    port: 8300
    deploy_dir: /vdb/test/deploy/cdc-8300
    log_dir: /vdb/test/deploy/cdc-8300/log
    arch: amd64
    os: linux
  - host: 10.228.131.95
    ssh_port: 22
    port: 8300
    deploy_dir: /vdb/test/deploy/cdc-8300
    log_dir: /vdb/test/deploy/cdc-8300/log
    arch: amd64
    os: linux
  monitoring_servers:
  - host: 10.228.131.68
    ssh_port: 22
    port: 9090
    deploy_dir: /vdb/test/deploy/prometheus-9090
    data_dir: /vdb/test/data/prometheus-9090
    external_alertmanagers: []
    arch: amd64
    os: linux
    rule_dir: /vdb/test/rule
  grafana_servers:
  - host: 10.228.131.68
    ssh_port: 22
    port: 3000
    deploy_dir: /vdb/test/deploy/grafana-3000
    arch: amd64
    os: linux
    username: admin
    password: admin
    anonymous_enable: false
    root_url: ""
    domain: ""
  alertmanager_servers:
  - host: 10.228.131.68
    ssh_port: 22
    web_port: 9093
    cluster_port: 9094
    deploy_dir: /vdb/test/deploy/alertmanager-9093
    data_dir: /vdb/test/data/alertmanager-9093
    arch: amd64
    os: linux
    config_file: /vdb/test/alertmanager.yml
[root@flinkslave2 tidb-test]# 

The monitored deploy_dir and data_dir paths are exactly the same (and log_dir is nested inside data_dir), which will indeed cause problems during scale-out. You can follow the SOP below to adjust the monitored data_dir path:
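
For reference, a non-overlapping monitored layout might look like the sketch below. The paths are illustrative only, reusing this cluster's existing /vdb/test/deploy and /vdb/test/data convention and mirroring TiUP's default of keeping logs under the deploy directory:

  monitored:
    node_exporter_port: 9600
    blackbox_exporter_port: 9115
    deploy_dir: /vdb/test/deploy/monitored-9600
    data_dir: /vdb/test/data/monitored-9600
    log_dir: /vdb/test/deploy/monitored-9600/log

With this layout the monitor data directory no longer contains the monitor log directory, which is exactly what the spec.deploy.dir_overlap check was flagging.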

Thanks, I made the changes following the document. Components can now be scaled out normally.
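
That is, the same commands shown earlier can be re-run to confirm (output omitted):

[root@flinkslave2 test]# tiup cluster scale-out tidb-test scale-out.yaml
[root@flinkslave2 test]# tiup cluster display tidb-test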

OK. If you run into other issues, feel free to open a new topic.
