Many thanks to the official support folks for their patient guidance. The system has been restored; here are the steps I took to recover it.
1. Disable scheduling (record the original config values first):
[tidb@i-7xug7kg6 ~]$ tiup ctl pd -u http://192.168.4.6:2379 config show all | grep schedule-limit
"leader-schedule-limit": 4,
"region-schedule-limit": 64,
"replica-schedule-limit": 64,
"merge-schedule-limit": 8,
"hot-region-schedule-limit": 4,
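For reference, the store IDs of the failed TiKV nodes (208430 and 208431 in step 4 below) can be found by listing the stores and looking for the ones stuck in the Down/Offline state; the grep filter here is just an illustration of the fields to look at:
tiup ctl pd -u http://192.168.4.6:2379 store | grep -E '"id"|"address"|"state_name"'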
2. Run the commands that disable scheduling. The generic form is:
pd-ctl config set region-schedule-limit 0 -u http://{pd_ip}:{pd_port}
pd-ctl config set replica-schedule-limit 0 -u http://{pd_ip}:{pd_port}
pd-ctl config set leader-schedule-limit 0 -u http://{pd_ip}:{pd_port}
pd-ctl config set merge-schedule-limit 0 -u http://{pd_ip}:{pd_port}
What I actually ran:
tiup ctl pd -u http://192.168.4.6:2379 config set region-schedule-limit 0
tiup ctl pd -u http://192.168.4.6:2379 config set replica-schedule-limit 0
tiup ctl pd -u http://192.168.4.6:2379 config set leader-schedule-limit 0
tiup ctl pd -u http://192.168.4.6:2379 config set merge-schedule-limit 0
tiup ctl pd -u http://192.168.4.6:2379 config set hot-region-schedule-limit 0
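Before moving on, it is worth re-running the check from step 1 to confirm every limit now reads 0:
tiup ctl pd -u http://192.168.4.6:2379 config show all | grep schedule-limit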
3. Stop all TiKV nodes:
tiup cluster stop prod-cluster -R tikv
4. Remove the problematic regions (generic form first, then the commands actually run):
tikv-ctl --db {/path/to/tikv-data}/db unsafe-recover remove-fail-stores -s 208430 --all-regions
./tikv-ctl --db /data1/deploy/data/db unsafe-recover remove-fail-stores -s 208430 --all-regions
./tikv-ctl --db /data1/deploy/data/db unsafe-recover remove-fail-stores -s 208431 --all-regions
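My understanding from the tikv-ctl docs (worth double-checking for your version) is that unsafe-recover remove-fail-stores has to be run locally against the data directory of every surviving TiKV node while TiKV is stopped, and that -s accepts a comma-separated list of store IDs, so the two commands above could also be combined, for example:
./tikv-ctl --db /data1/deploy/data/db unsafe-recover remove-fail-stores -s 208430,208431 --all-regions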
5. Start the TiKV nodes one by one:
tiup cluster start {cluster_name} -R tikv -N {IP:port}
tiup cluster start prod-cluster -R tikv -N 192.168.4.9:20160
tiup cluster start prod-cluster -R tikv -N 192.168.4.10:20160
tiup cluster start prod-cluster -R tikv -N 192.168.4.11:20160
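After each node starts (and again at the end), a quick status check confirms the TiKV instances are reported as Up before continuing:
tiup cluster display prod-cluster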
6. Re-enable scheduling (replace {value} with the original config values recorded in step 1). The generic form is:
pd-ctl config set region-schedule-limit {value} -u http://{pd_ip}:{pd_port}
pd-ctl config set replica-schedule-limit {value} -u http://{pd_ip}:{pd_port}
pd-ctl config set leader-schedule-limit {value} -u http://{pd_ip}:{pd_port}
pd-ctl config set merge-schedule-limit {value} -u http://{pd_ip}:{pd_port}
What I actually ran:
tiup ctl pd -u http://192.168.4.6:2379 config set region-schedule-limit 64
tiup ctl pd -u http://192.168.4.6:2379 config set replica-schedule-limit 64
tiup ctl pd -u http://192.168.4.6:2379 config set leader-schedule-limit 4
tiup ctl pd -u http://192.168.4.6:2379 config set merge-schedule-limit 8
tiup ctl pd -u http://192.168.4.6:2379 config set hot-region-schedule-limit 4
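Once scheduling is back on, the replica repair progress can be watched with pd-ctl's region check subcommands (a suggestion on my side, not part of the original steps); the counts should drop toward zero as replicas are replenished:
tiup ctl pd -u http://192.168.4.6:2379 region check miss-peer
tiup ctl pd -u http://192.168.4.6:2379 region check down-peer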
7. Clean up the previously scaled-out machines (tiup cluster prune removes instances in Tombstone state):
tiup cluster prune {cluster_name}
tiup cluster prune prod-cluster
8. Perform a normal scale-out:
tiup cluster scale-out prod-cluster scale-out.yaml
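After the scale-out, the new stores can be watched until they show as Up and start accumulating regions (again just a suggested sanity check):
tiup ctl pd -u http://192.168.4.6:2379 store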
A few extra notes:
When taking TiKV nodes offline, do it one node at a time: wait for one to finish successfully before starting the next, and avoid --force (forced removal) whenever possible.
The root cause, in the words of the official support engineer:
"Because two of them went down, PD received no heartbeats from them, so they were effectively already in the down state. Then the regions lost two replicas, so they could not serve requests, could not be scheduled away, and could not have the replicas replenished."
In other words, with the default 3 replicas per region, losing 2 leaves the Raft group without a quorum, so the remaining replica can neither serve the region nor repair it on its own, which is why unsafe-recover was needed.