v4.0.3 单机双实例tikv重启之后,leader region不均衡

15:50分左右

有可能是 reload 的 evict leader 的 scheduler 没有正常清理掉,再看下 pd scheduler 下面的监控?

我看了下,没有evict leader scheduler

问题重新描述一遍:

tidb 版本:

v4.0.4

1、在使用tiup reload之后,会造成leader region不均衡;重新用tiup restart 之后,leader region开始均衡

2、单机多实例,打上了label,压测的时候发现leader region不是很均衡;貌似hot region也不是太均衡

下面有 filter target 和 filter source 的面板,麻烦也截一下

filter target 麻烦截全一点

或者麻烦导一份完整的 pd 监控吧

filter targe太多了:差不多就这样

filter target 有 balance leader 相关的吗

没看到,应该是没有

pd-ctl config show all 的结果麻烦也贴一下吧

另外看到您这边有 tiflash, 可以看下 pd-ctl config placement-rules show 的结果

» config show
{
  "replication": {
    "enable-placement-rules": "true",
    "location-labels": "host",
    "max-replicas": 3,
    "strictly-match-label": "false"
  },
  "schedule": {
    "enable-cross-table-merge": "false",
    "enable-debug-metrics": "false",
    "enable-location-replacement": "true",
    "enable-make-up-replica": "true",
    "enable-one-way-merge": "false",
    "enable-remove-down-replica": "true",
    "enable-remove-extra-replica": "true",
    "enable-replace-offline-replica": "true",
    "high-space-ratio": 0.7,
    "hot-region-cache-hits-threshold": 3,
    "hot-region-schedule-limit": 64,
    "leader-schedule-limit": 16,
    "leader-schedule-policy": "count",
    "low-space-ratio": 0.8,
    "max-merge-region-keys": 200000,
    "max-merge-region-size": 20,
    "max-pending-peer-count": 16,
    "max-snapshot-count": 3,
    "max-store-down-time": "30m0s",
    "merge-schedule-limit": 8,
    "patrol-region-interval": "100ms",
    "region-schedule-limit": 2048,
    "replica-schedule-limit": 256,
    "scheduler-max-waiting-operator": 5,
    "split-merge-interval": "1h0m0s",
    "store-limit-mode": "manual",
    "tolerant-size-ratio": 0
  }
}
» config placement-rules show
[
  {
    "group_id": "pd",
    "id": "default",
    "start_key": "",
    "end_key": "",
    "role": "voter",
    "count": 3,
    "location_labels": [
      "host"
    ]
  }
]

请问您这是测试环境吗?

目前是压测环境

是否可以使用 v4.0.2,目前推断是 v4.0.3 引入的修改导致 balance-leader 没有正常工作。

可以使用

意思是v4.0.3就有这个问题么?

是的,建议测试的话可以先使用 v4.0.2