Many thanks to the official support folks for their patient guidance. The system has been restored; here are the steps I took to recover it.
1. Disable scheduling (record the original config values first):
[tidb@i-7xug7kg6 ~]$ tiup ctl pd -u http://192.168.4.6:2379 config show all | grep schedule-limit
"leader-schedule-limit": 4,
"region-schedule-limit": 64,
"replica-schedule-limit": 64,
"merge-schedule-limit": 8,
"hot-region-schedule-limit": 4,
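For reference, the store IDs of the failed TiKV nodes (208430 and 208431 in step 4 below) can be found by listing the stores and looking for the ones stuck in the Down/Offline state; the grep filter here is just an illustration of the fields to look at:
tiup ctl pd -u http://192.168.4.6:2379 store | grep -E '"id"|"address"|"state_name"'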
2. Run the commands that disable scheduling. The generic form is:
pd-ctl config set region-schedule-limit 0 -u http://{pd_ip}:{pd_port}
pd-ctl config set replica-schedule-limit 0 -u http://{pd_ip}:{pd_port}
pd-ctl config set leader-schedule-limit 0 -u http://{pd_ip}:{pd_port}
pd-ctl config set merge-schedule-limit 0 -u http://{pd_ip}:{pd_port}
What I actually ran:
tiup ctl pd -u http://192.168.4.6:2379 config set region-schedule-limit 0
tiup ctl pd -u http://192.168.4.6:2379 config set replica-schedule-limit 0
tiup ctl pd -u http://192.168.4.6:2379 config set leader-schedule-limit 0
tiup ctl pd -u http://192.168.4.6:2379 config set merge-schedule-limit 0
tiup ctl pd -u http://192.168.4.6:2379 config set hot-region-schedule-limit 0
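Before moving on, it is worth re-running the check from step 1 to confirm every limit now reads 0:
tiup ctl pd -u http://192.168.4.6:2379 config show all | grep schedule-limit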
3. Stop all TiKV nodes:
tiup cluster stop prod-cluster -R tikv
4. Remove the problematic regions (generic form first, then the commands actually run):
tikv-ctl --db {/path/to/tikv-data}/db unsafe-recover remove-fail-stores -s 208430 --all-regions
./tikv-ctl --db /data1/deploy/data/db unsafe-recover remove-fail-stores -s 208430 --all-regions
./tikv-ctl --db /data1/deploy/data/db unsafe-recover remove-fail-stores -s 208431 --all-regions
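My understanding from the tikv-ctl docs (worth double-checking for your version) is that unsafe-recover remove-fail-stores has to be run locally against the data directory of every surviving TiKV node while TiKV is stopped, and that -s accepts a comma-separated list of store IDs, so the two commands above could also be combined, for example:
./tikv-ctl --db /data1/deploy/data/db unsafe-recover remove-fail-stores -s 208430,208431 --all-regions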
5. Start the TiKV nodes one by one:
tiup cluster start {cluster_name} -R tikv -N {IP:port}
tiup cluster start prod-cluster -R tikv -N 192.168.4.9:20160
tiup cluster start prod-cluster -R tikv -N 192.168.4.10:20160
tiup cluster start prod-cluster -R tikv -N 192.168.4.11:20160
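After each node starts (and again at the end), a quick status check confirms the TiKV instances are reported as Up before continuing:
tiup cluster display prod-cluster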
6. Re-enable scheduling (replace {value} with the original config values recorded in step 1). The generic form is:
pd-ctl config set region-schedule-limit {value} -u http://{pd_ip}:{pd_port}
pd-ctl config set replica-schedule-limit {value} -u http://{pd_ip}:{pd_port}
pd-ctl config set leader-schedule-limit {value} -u http://{pd_ip}:{pd_port}
pd-ctl config set merge-schedule-limit {value} -u http://{pd_ip}:{pd_port}
What I actually ran:
tiup ctl pd -u http://192.168.4.6:2379 config set region-schedule-limit 64
tiup ctl pd -u http://192.168.4.6:2379 config set replica-schedule-limit 64
tiup ctl pd -u http://192.168.4.6:2379 config set leader-schedule-limit 4
tiup ctl pd -u http://192.168.4.6:2379 config set merge-schedule-limit 8
tiup ctl pd -u http://192.168.4.6:2379 config set hot-region-schedule-limit 4
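Once scheduling is back on, the replica repair progress can be watched with pd-ctl's region check subcommands (a suggestion on my side, not part of the original steps); the counts should drop toward zero as replicas are replenished:
tiup ctl pd -u http://192.168.4.6:2379 region check miss-peer
tiup ctl pd -u http://192.168.4.6:2379 region check down-peer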
7. Clean up the previously scaled-out machines (tiup cluster prune removes instances in Tombstone state):
tiup cluster prune {cluster_name}
tiup cluster prune prod-cluster
8. Perform a normal scale-out:
tiup cluster scale-out prod-cluster scale-out.yaml
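After the scale-out, the new stores can be watched until they show as Up and start accumulating regions (again just a suggested sanity check):
tiup ctl pd -u http://192.168.4.6:2379 store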
A few extra notes:
When taking TiKV nodes offline, do it one node at a time: wait for one to finish successfully before starting the next, and avoid --force (forced removal) whenever possible.
The root cause, in the words of the official support engineer:
"Because two of them went down, PD received no heartbeats from them, so they were effectively already in the down state. Then the regions lost two replicas, so they could not serve requests, could not be scheduled away, and could not have the replicas replenished."
In other words, with the default 3 replicas per region, losing 2 leaves the Raft group without a quorum, so the remaining replica can neither serve the region nor repair it on its own, which is why unsafe-recover was needed.