Regions are not balanced after scaling out TiKV

It looks like the scheduler isn't doing any work. Can you check the config?
tiup ctl:v5.1.1 pd config show schedule --pd <PD_IP>:2379
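It can also help to confirm that the schedulers are enabled and actually producing operators. A quick sketch using standard pd-ctl subcommands, with the same placeholder PD address as above:

tiup ctl:v5.1.1 pd scheduler show --pd <PD_IP>:2379   # schedulers currently registered
tiup ctl:v5.1.1 pd operator show --pd <PD_IP>:2379    # operators PD is generating right now
tiup ctl:v5.1.1 pd store --pd <PD_IP>:2379            # per-store region count / score / available space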

Sure, please take a look:

{
  "max-snapshot-count": 5,
  "max-pending-peer-count": 32,
  "max-merge-region-size": 10,
  "max-merge-region-keys": 50000,
  "split-merge-interval": "1h0m0s",
  "enable-one-way-merge": "false",
  "enable-cross-table-merge": "true",
  "patrol-region-interval": "100ms",
  "max-store-down-time": "1h0m0s",
  "leader-schedule-limit": 64,
  "leader-schedule-policy": "size",
  "region-schedule-limit": 4096,
  "replica-schedule-limit": 64,
  "merge-schedule-limit": 8,
  "hot-region-schedule-limit": 4,
  "hot-region-cache-hits-threshold": 3,
  "store-limit": {
    "1": {
      "add-peer": 1000,
      "remove-peer": 1000
    },
    "1004812736": {
      "add-peer": 1000,
      "remove-peer": 1000
    },
    "1005346961": {
      "add-peer": 1000,
      "remove-peer": 1000
    },
    "1006624325": {
      "add-peer": 1000,
      "remove-peer": 1000
    },
    "1031573974": {
      "add-peer": 1000,
      "remove-peer": 1000
    },
    "1343129094": {
      "add-peer": 1000,
      "remove-peer": 1000
    },
    "1343129301": {
      "add-peer": 1000,
      "remove-peer": 1000
    },
    "20001": {
      "add-peer": 1000,
      "remove-peer": 1000
    }
  },
  "tolerant-size-ratio": 5,
  "low-space-ratio": 0.8,
  "high-space-ratio": 0.6,
  "region-score-formula-version": "",
  "scheduler-max-waiting-operator": 3,
  "enable-remove-down-replica": "true",
  "enable-replace-offline-replica": "true",
  "enable-make-up-replica": "true",
  "enable-remove-extra-replica": "true",
  "enable-location-replacement": "true",
  "enable-debug-metrics": "false",
  "enable-joint-consensus": "true",
  "schedulers-v2": [
    {
      "type": "balance-region",
      "args": null,
      "disable": false,
      "args-payload": ""
    },
    {
      "type": "balance-leader",
      "args": null,
      "disable": false,
      "args-payload": ""
    },
    {
      "type": "hot-region",
      "args": null,
      "disable": false,
      "args-payload": ""
    },
    {
      "type": "label",
      "args": null,
      "disable": false,
      "args-payload": ""
    },
    {
      "type": "evict-leader",
      "args": [
        "1031573974"
      ],
      "disable": false,
      "args-payload": ""
    }
  ],
  "schedulers-payload": null,
  "store-limit-mode": "manual"
}

This setting affects how quickly replicaChecker can make up replicas. Try changing it back to the default 10ms.
Also, please check whether the pd_no_store_for_making_replica alert has been triggered.
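For reference, the setting can be changed online through pd-ctl, for example (same placeholder PD address as above):

tiup ctl:v5.1.1 pd config set patrol-region-interval 10ms --pd <PD_IP>:2379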

1. I've already changed the value to 10ms, but it doesn't seem to have taken effect:
"patrol-region-interval": "10ms"

2. The alert doesn't seem to have been triggered in monitoring.

3. For now I've written a script that manually schedules Regions onto the other two TiKV nodes that still have plenty of free space, but that feels like treating the symptom rather than the root cause (a sketch of one such move is below).
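For illustration, a single manual Region move via pd-ctl looks roughly like this (placeholder region/store IDs, not the actual script):

tiup ctl:v5.1.1 pd operator add transfer-peer <region_id> <from_store_id> <to_store_id> --pd <PD_IP>:2379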

1. Regions keep increasing on the node that is being taken offline.

2. The Regions I just manually scheduled to the newly added TiKV nodes have moved back again :joy:

Node 24 probably came back Up because the cluster restarted it automatically. Can you check the tikv.log on that node and see whether it keeps rolling-restarting? The proper procedure should be to take node 24 offline, wait for its state to go Offline and then Tombstone, and only remove it after that.
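For reference, the usual scale-in flow looks roughly like this (placeholder cluster name, node address, and store id; adjust to your topology):

tiup cluster scale-in <cluster-name> --node <IP_of_24>:20160       # marks the store Offline in PD
tiup ctl:v5.1.1 pd store <store_id_of_24> --pd <PD_IP>:2379        # watch state_name: Offline -> Tombstone
tiup cluster prune <cluster-name>                                  # clean up the Tombstone store afterwards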

After I manually moved all of node 24's Regions to other nodes, it became Tombstone. It feels like PD scheduling has stopped working.

It doesn't move Regions onto 24 by itself anymore? That's a bit odd. Haven't you already taken 24 down?

Right, I can't even see the process with ps anymore.

You could try adjusting the Region weight; the free space differs too much between the TiKV nodes.
https://docs.pingcap.com/zh/tidb/stable/pd-scheduling-best-practices#负载均衡
For example:
To set the leader weight of the store with store id 1 to 5 and its Region weight to 10:

>> store weight 1 5 10
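In non-interactive form, using the same tiup ctl style as earlier in this thread, that would be roughly:

tiup ctl:v5.1.1 pd store weight 1 5 10 --pd <PD_IP>:2379
tiup ctl:v5.1.1 pd store 1 --pd <PD_IP>:2379    # verify leader_weight / region_weight afterwards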