TiKV scale-out failure

Cluster version: v4.0.8
The cluster already had three TiKV nodes. I scaled out two more, then scaled those two back in; they stayed in Pending Offline with no change after waiting overnight.
Early the next morning I ran the forced scale-in command: tiup cluster scale-in prod-cluster --node 192.168.4.18:20160 --force

After the forced scale-in, I ran the scale-out command again, but it failed with:

  • [ Serial ] - UserSSH: user=tidb, host=192.168.4.2
  • [ Serial ] - Mkdir: host=192.168.4.18, directories='/data1/deploy','/data1/deploy/log','/data1/deploy/bin','/data1/deploy/conf','/data1/deploy/scripts'
  • [ Serial ] - Mkdir: host=192.168.4.2, directories='/data1/deploy','/data1/deploy/log','/data1/deploy/bin','/data1/deploy/conf','/data1/deploy/scripts'
  • [ Serial ] - Mkdir: host=192.168.4.2, directories='/data1/deploy/data'
    • Copy node_exporter -> 192.168.4.18 … Done
  • [Parallel] - UserSSH: user=tidb, host=192.168.4.18
  • [Parallel] - UserSSH: user=tidb, host=192.168.4.2
  • [ Serial ] - Save meta
  • [ Serial ] - StartCluster
    Starting component tikv
    Starting instance tikv 192.168.4.18:20160
    Starting instance tikv 192.168.4.2:20160
    retry error: operation timed out after 2m0s
    tikv 192.168.4.2:20160 failed to start: timed out waiting for port 20160 to be started after 2m0s, please check the log of the instance
    retry error: operation timed out after 2m0s
    tikv 192.168.4.18:20160 failed to start: timed out waiting for port 20160 to be started after 2m0s, please check the log of the instance

Error: failed to start tikv: tikv 192.168.4.2:20160 failed to start: timed out waiting for port 20160 to be started after 2m0s, please check the log of the instance: timed out waiting for port 20160 to be started after 2m0s

Verbose debug logs has been written to /home/tidb/logs/tiup-cluster-debug-2020-12-03-11-26-33.log.
Error: run /home/tidb/.tiup/components/cluster/v1.2.3/tiup-cluster (wd:/home/tidb/.tiup/data/SI4f0HA) failed: exit status 1
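(The failure message says to check the log of the instance; on each new node that would be something like the commands below, assuming the /data1/deploy layout from the Mkdir steps above; adjust the path if the deploy directory differs.)

ssh tidb@192.168.4.2 'tail -n 200 /data1/deploy/log/tikv.log'
ssh tidb@192.168.4.18 'tail -n 200 /data1/deploy/log/tikv.log'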

I rebooted the two new TiKV machines and ran the scale-out command again from the control machine, which reported:
[tidb@i-7xug7kg6 ~]$ tiup cluster scale-out prod-cluster scale-out.yaml
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.2.3/tiup-cluster scale-out prod-cluster scale-out.yaml

Error: port conflict for '20160' between 'tikv_servers:192.168.4.2.port' and 'tikv_servers:192.168.4.2.port'

Verbose debug logs has been written to /home/tidb/logs/tiup-cluster-debug-2020-12-03-11-40-43.log.
Error: run /home/tidb/.tiup/components/cluster/v1.2.3/tiup-cluster (wd:/home/tidb/.tiup/data/SI4j6nG) failed: exit status 1

  1. Please run tiup cluster display and share the current state of the cluster (example below).
  2. Please also describe in detail how the earlier scale-out and scale-in were done.
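For reference, a minimal form of that check, using the cluster name that appears in this thread:

tiup cluster display prod-cluster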

As shown in the screenshot, 4.2 and 4.18 are the two new TiKV nodes I tried to add last night.
1. First, I ran the scale-out command:
tiup cluster scale-out prod-cluster scale-out.yaml
2. Both nodes came online normally, but I had pointed the deployment paths at the wrong disk. I noticed this about 15 minutes later and ran the scale-in commands:
tiup cluster scale-in prod-cluster --node 192.168.4.2:20160

tiup cluster scale-in prod-cluster --node 192.168.4.18:20160
After that, the two nodes kept showing Pending Offline, and were still in that state the next morning.
3. So in the morning I ran the forced scale-in commands:
tiup cluster scale-in prod-cluster --node 192.168.4.2:20160 --force

tiup cluster scale-in prod-cluster --node 192.168.4.18:20160 --force

4. After correcting the deployment paths in scale-out.yaml (see the sketch after this list), I ran the scale-out again:
tiup cluster scale-out prod-cluster scale-out.yaml
and hit the error mentioned above.
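For context, the tikv_servers section of scale-out.yaml looks roughly like the sketch below; the status_port value and the exact directory layout are assumptions based on the defaults and on the /data1/deploy paths in the log above, not copied from my real file:

tikv_servers:
  - host: 192.168.4.2
    port: 20160
    status_port: 20180
    deploy_dir: /data1/deploy
    data_dir: /data1/deploy/data
  - host: 192.168.4.18
    port: 20160
    status_port: 20180
    deploy_dir: /data1/deploy
    data_dir: /data1/deploy/data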

This is fairly urgent, it's a production environment :joy:

Also attaching the tikv.log from one of the TiKV machines.

The application's error log keeps flooding with: java.sql.SQLException: Region is unavailable

  1. Run pd-ctl store to check the current state of the stores (example below).
  2. Please also provide the tikv.log from the two machines that are currently offline.
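For reference, with tiup that store check can be run as below (PD endpoint taken from the commands later in this thread):

tiup ctl pd -u http://192.168.4.6:2379 store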

Many thanks to the official team for the patient guidance. The system has recovered; here are the steps I took:

  1. Disable scheduling (record the original configuration values first):

[tidb@i-7xug7kg6 ~]$ tiup ctl pd -u http://192.168.4.6:2379 config show all | grep schedule-limit
"leader-schedule-limit": 4,
"region-schedule-limit": 64,
"replica-schedule-limit": 64,
"merge-schedule-limit": 8,
"hot-region-schedule-limit": 4,

  2. Run the commands that disable scheduling:
    pd-ctl config set region-schedule-limit 0 -u http://{pd_ip}:{pd_port}
    pd-ctl config set replica-schedule-limit 0 -u http://{pd_ip}:{pd_port}
    pd-ctl config set leader-schedule-limit 0 -u http://{pd_ip}:{pd_port}
    pd-ctl config set merge-schedule-limit 0 -u http://{pd_ip}:{pd_port}
    pd-ctl config set hot-region-schedule-limit 0 -u http://{pd_ip}:{pd_port}

tiup ctl pd -u http://192.168.4.6:2379 config set region-schedule-limit 0

tiup ctl pd -u http://192.168.4.6:2379 config set replica-schedule-limit 0

tiup ctl pd -u http://192.168.4.6:2379 config set leader-schedule-limit 0

tiup ctl pd -u http://192.168.4.6:2379 config set merge-schedule-limit 0

tiup ctl pd -u http://192.168.4.6:2379 config set hot-region-schedule-limit 0
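To confirm that all five limits now read 0, the same grep from step 1 can be rerun:

tiup ctl pd -u http://192.168.4.6:2379 config show all | grep schedule-limit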

3. Stop all TiKV instances:
tiup cluster stop prod-cluster -R tikv

4. Remove the failed stores from the problem regions:
tikv-ctl --db {/path/to/tikv-data}/db unsafe-recover remove-fail-stores -s {store_id} --all-regions

./tikv-ctl --db /data1/deploy/data/db unsafe-recover remove-fail-stores -s 208430 --all-regions
./tikv-ctl --db /data1/deploy/data/db unsafe-recover remove-fail-stores -s 208431 --all-regions
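For context, 208430 and 208431 are presumably the store IDs PD had assigned to the two force-removed TiKV stores. A sketch of how to look store IDs up, assuming jq is available (jq is not part of the original steps):

# print the id, address and state of every store PD knows about
tiup ctl pd -u http://192.168.4.6:2379 store | jq -r '.stores[] | "\(.store.id)  \(.store.address)  \(.store.state_name)"'

As I understand the documented unsafe-recover procedure, remove-fail-stores is run against the data directory of each surviving TiKV instance while that instance is stopped, which is why step 3 stops all TiKV first.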

5. Start the TiKV nodes one at a time:
tiup cluster start {cluster_name} -R tikv -N {IP:port}

tiup cluster start prod-cluster -R tikv -N 192.168.4.9:20160
tiup cluster start prod-cluster -R tikv -N 192.168.4.10:20160
tiup cluster start prod-cluster -R tikv -N 192.168.4.11:20160
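Between starts it is worth confirming that the instance just started shows Up before moving on; display accepts the same -R filter as start:

tiup cluster display prod-cluster -R tikv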

  6. Re-enable scheduling (replace {value} with the configuration values recorded earlier):
    pd-ctl config set region-schedule-limit {value} -u http://{pd_ip}:{pd_port}
    pd-ctl config set replica-schedule-limit {value} -u http://{pd_ip}:{pd_port}
    pd-ctl config set leader-schedule-limit {value} -u http://{pd_ip}:{pd_port}
    pd-ctl config set merge-schedule-limit {value} -u http://{pd_ip}:{pd_port}
    pd-ctl config set hot-region-schedule-limit {value} -u http://{pd_ip}:{pd_port}

tiup ctl pd -u http://192.168.4.6:2379 config set region-schedule-limit 64

tiup ctl pd -u http://192.168.4.6:2379 config set replica-schedule-limit 64

tiup ctl pd -u http://192.168.4.6:2379 config set leader-schedule-limit 4

tiup ctl pd -u http://192.168.4.6:2379 config set merge-schedule-limit 8

tiup ctl pd -u http://192.168.4.6:2379 config set hot-region-schedule-limit 4

7. Clean up the scaled-out machines:
tiup cluster prune {cluster_name}

tiup cluster prune prod-cluster

8. Run the normal scale-out again:
tiup cluster scale-out prod-cluster scale-out.yaml

A few more words:
When scaling in TiKV nodes, handle them one at a time: wait for one to finish before starting the next, and avoid --force (forced removal) whenever you can.
The root cause of this problem, in the official team's own words:
"Two stores went down, so PD received no heartbeats from them and they were effectively already in the down state. Regions that had lost two replicas could neither serve requests nor be scheduled away, and the missing replicas could not be replenished."
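One practical way to follow that advice: after scaling in a node, wait until PD shows its store as Tombstone before touching the next one. Using store 208430 purely as an illustration:

tiup ctl pd -u http://192.168.4.6:2379 store 208430 | grep state_name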


:+1::+1::+1: