TiKV scale-out failure

Many thanks to the PingCAP folks for their patient guidance. The system has now recovered; here are the steps I took.

1. Disable scheduling (first record the original config values so they can be restored in step 6):

[tidb@i-7xug7kg6 ~]$ tiup ctl pd -u http://192.168.4.6:2379 config show all | grep schedule-limit
"leader-schedule-limit": 4,
"region-schedule-limit": 64,
"replica-schedule-limit": 64,
"merge-schedule-limit": 8,
"hot-region-schedule-limit": 4,
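
To make step 6 easier, it can also help to dump the full PD config to a file so the original values are kept verbatim (the file name here is just an example):

tiup ctl pd -u http://192.168.4.6:2379 config show all > pd-config-backup.json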

2. Run the commands to disable scheduling (generic form first, then the actual commands used):
    pd-ctl config set region-schedule-limit 0 -u http://{pd_ip}:{pd_port}
    pd-ctl config set replica-schedule-limit 0 -u http://{pd_ip}:{pd_port}
    pd-ctl config set leader-schedule-limit 0 -u http://{pd_ip}:{pd_port}
    pd-ctl config set merge-schedule-limit 0 -u http://{pd_ip}:{pd_port}
    pd-ctl config set hot-region-schedule-limit 0 -u http://{pd_ip}:{pd_port}

tiup ctl pd -u http://192.168.4.6:2379 config set region-schedule-limit 0

tiup ctl pd -u http://192.168.4.6:2379 config set replica-schedule-limit 0

tiup ctl pd -u http://192.168.4.6:2379 config set leader-schedule-limit 0

tiup ctl pd -u http://192.168.4.6:2379 config set merge-schedule-limit 0

tiup ctl pd -u http://192.168.4.6:2379 config set hot-region-schedule-limit 0
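
It doesn't hurt to re-run the command from step 1 and confirm all five limits now show 0 before touching TiKV:

tiup ctl pd -u http://192.168.4.6:2379 config show all | grep schedule-limit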

3. Stop all TiKV instances
tiup cluster stop prod-cluster -R tikv
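
Before moving on, make sure every TiKV instance really is stopped; tiup cluster display shows the status of each node (all tikv entries should report Down):

tiup cluster display prod-cluster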

4. Remove the failed stores from the problem Regions (generic form first, then the actual commands used)
tikv-ctl --db {/path/to/tikv-data}/db unsafe-recover remove-fail-stores -s {store_id} --all-regions

./tikv-ctl --db /data1/deploy/data/db unsafe-recover remove-fail-stores -s 208430 --all-regions
./tikv-ctl --db /data1/deploy/data/db unsafe-recover remove-fail-stores -s 208431 --all-regions
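
Note that tikv-ctl works on the local data directory, so this step is repeated on every surviving TiKV node while its tikv-server process is stopped (which is why step 3 stops all TiKV first). Here 208430 and 208431 are the store IDs of the two failed stores; if you need to look them up, they can be read from PD by checking which stores are in the Down/Offline state:

tiup ctl pd -u http://192.168.4.6:2379 store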

5. Start TiKV node by node
tiup cluster start {cluster_name} -R tikv -N {IP:port}

tiup cluster start prod-cluster -R tikv -N 192.168.4.9:20160
tiup cluster start prod-cluster -R tikv -N 192.168.4.10:20160
tiup cluster start prod-cluster -R tikv -N 192.168.4.11:20160
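
After each node comes back, the surviving stores should show as Up in PD again before starting the next one; the same store command from above works for that check:

tiup ctl pd -u http://192.168.4.6:2379 store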

6. Re-enable scheduling (replace {value} with the original config values recorded in step 1):
    pd-ctl config set region-schedule-limit {value} -u http://{pd_ip}:{pd_port}
    pd-ctl config set replica-schedule-limit {value} -u http://{pd_ip}:{pd_port}
    pd-ctl config set leader-schedule-limit {value} -u http://{pd_ip}:{pd_port}
    pd-ctl config set merge-schedule-limit {value} -u http://{pd_ip}:{pd_port}
    pd-ctl config set hot-region-schedule-limit {value} -u http://{pd_ip}:{pd_port}

tiup ctl pd -u http://192.168.4.6:2379 config set region-schedule-limit 64

tiup ctl pd -u http://192.168.4.6:2379 config set replica-schedule-limit 64

tiup ctl pd -u http://192.168.4.6:2379 config set leader-schedule-limit 4

tiup ctl pd -u http://192.168.4.6:2379 config set merge-schedule-limit 8

tiup ctl pd -u http://192.168.4.6:2379 config set hot-region-schedule-limit 4
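
Once scheduling is back on, PD starts replenishing the replicas that were removed in step 4; the remaining miss-peer Regions and the running operators can be watched with:

tiup ctl pd -u http://192.168.4.6:2379 region check miss-peer
tiup ctl pd -u http://192.168.4.6:2379 operator show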

7. Clean up the machines from the failed scale-out
tiup cluster prune {cluster_name}

tiup cluster prune prod-cluster
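
prune destroys the instances that are already in the Tombstone state, so after it finishes the nodes from the failed scale-out should no longer appear in the topology:

tiup cluster display prod-cluster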

8. Perform the normal scale-out
tiup cluster scale-out prod-cluster scale-out.yaml
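
For completeness, the scale-out.yaml only needs to describe the new nodes. A minimal sketch for one new TiKV instance (the host and directories below are placeholders, not my actual values):

tikv_servers:
  - host: {new_tikv_ip}
    port: 20160
    status_port: 20180
    deploy_dir: /data1/deploy
    data_dir: /data1/deploy/data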

A few extra notes
When taking TiKV nodes offline, do it one node at a time: wait until one has finished successfully before starting the next, and avoid --force (forced removal) whenever possible.
The root cause, in the words of the PingCAP engineer:
"Because two of them went down, PD stopped receiving their heartbeats, so they were in effect already in the Down state. Then those Regions lost two replicas, and that part of the data could not serve requests, could not be scheduled away, and could not be replenished."
In other words, with the default 3 replicas, losing two means those Regions no longer have a Raft majority, so they can neither serve traffic nor recover through normal scheduling.
