Uneven TiKV region distribution

To get help faster, please provide the information below; a clear problem description gets resolved sooner:
【TiDB Environment】3 TiDB, 3 PD, 8 TiKV; all with the official default configuration. Data disks: TiDB 500 GB, PD 200 GB, TiKV 350 GB, all SSD.
【Overview】Two TiKV nodes hold far more data than the other nodes.
【Background】Brought a new TiKV online today; two TiKV nodes are still in the process of going offline.
【Symptom】Replication jobs hit `Region is unavailable` errors.
【Business Impact】Because two TiKV nodes are being decommissioned, two of the remaining TiKV nodes are approaching 300 GB of disk usage and we're afraid they'll fill up. The other TiKV nodes still have plenty of free space, and the newly added node's usage is low as well.
【TiDB Version】2.1.9
【Attachments】

  1. TiKV disk-usage information
  2. PD dashboard information
  3. TiDB-Overview monitoring
  • Logs from the relevant components (covering 1 hour before and after the issue)

【TiDB Version】2.1.9
This version is getting on in years…

So this can only serve as a reference.

I removed this scheduler. Remove worked, but add does not.

This version is very old; is upgrading an option?

I'd like to, but my boss won't allow it. There's no rollback mechanism, so we don't dare upgrade rashly.
These node changes were forced by hardware problems; otherwise we wouldn't have touched the cluster at all.

  1. Check those two nodes: which files in the data directory are taking up the most space?
  2. You took two nodes offline first. Were they taken down at the same time? Why decommission them; did you hit a failure?
  3. And why bring a new node online afterwards?

/data/tidb/deploy/data/db/ contains many large *.sst files, plus some prefixed with `old`.
One node was taken offline on Monday and hasn't finished decommissioning yet. On Tuesday problems started: one TiKV kept restarting and caused running jobs to fail. I stopped that TiKV yesterday, but it was restarting again today, so I took it offline and brought up a new node in its place.

The most frequent error is `Region is unavailable [try again later]`.

  1. Grab the store information with pd-ctl.
  2. Why is the TiKV capacity gap so large? Are 300 GB and 2 TB stores mixed in one cluster? You can refer to the post above and try setting region-weight and leader-weight to reduce the load on those two TiKV nodes.
{
  "count": 10,
  "stores": [
    {
      "store": {
        "id": 817018,
        "address": "10.1.195.200:20160",
        "version": "2.1.9",
        "state_name": "Up"
      },
      "status": {
        "capacity": "1.7 TiB",
        "available": "1.4 TiB",
        "leader_count": 38933,
        "leader_weight": 1,
        "leader_score": 720301,
        "leader_size": 720301,
        "region_count": 80037,
        "region_weight": 1,
        "region_score": 1414001,
        "region_size": 1414001,
        "start_ts": "2021-12-27T16:01:22+08:00",
        "last_heartbeat_ts": "2021-12-29T16:48:54.784309628+08:00",
        "uptime": "48h47m32.784309628s"
      }
    },
    {
      "store": {
        "id": 19287218,
        "address": "10.5.62.65:20160",
        "version": "2.1.9",
        "state_name": "Up"
      },
      "status": {
        "capacity": "492 GiB",
        "available": "173 GiB",
        "leader_count": 10696,
        "leader_weight": 1,
        "leader_score": 839141,
        "leader_size": 839141,
        "region_count": 98238,
        "region_weight": 1,
        "region_score": 264819958.37087202,
        "region_size": 2958753,
        "start_ts": "2021-12-29T15:12:38+08:00",
        "last_heartbeat_ts": "2021-12-29T16:48:52.750797041+08:00",
        "uptime": "1h36m14.750797041s"
      }
    },
    {
      "store": {
        "id": 22890551,
        "address": "10.5.62.73:20160",
        "version": "2.1.9",
        "state_name": "Up"
      },
      "status": {
        "capacity": "344 GiB",
        "available": "264 GiB",
        "leader_count": 44243,
        "leader_weight": 1,
        "leader_score": 720392,
        "leader_size": 720392,
        "region_count": 61642,
        "region_weight": 1,
        "region_score": 1087342,
        "region_size": 1087342,
        "start_ts": "2021-12-29T09:51:12+08:00",
        "last_heartbeat_ts": "2021-12-29T16:48:51.316407249+08:00",
        "uptime": "6h57m39.316407249s"
      }
    },
    {
      "store": {
        "id": 15001145,
        "address": "10.5.62.63:20160",
        "version": "2.1.9",
        "state_name": "Up"
      },
      "status": {
        "capacity": "344 GiB",
        "available": "197 GiB",
        "leader_count": 39030,
        "leader_weight": 1,
        "leader_score": 720421,
        "leader_size": 720421,
        "region_count": 81406,
        "region_weight": 1,
        "region_score": 1448211,
        "region_size": 1448211,
        "start_ts": "2021-12-27T16:04:55+08:00",
        "last_heartbeat_ts": "2021-12-29T16:48:46.509116741+08:00",
        "uptime": "48h43m51.509116741s"
      }
    },
    {
      "store": {
        "id": 22270353,
        "address": "10.5.62.68:20160",
        "version": "2.1.9",
        "state_name": "Up"
      },
      "status": {
        "capacity": "492 GiB",
        "available": "186 GiB",
        "leader_count": 15095,
        "leader_weight": 1,
        "leader_score": 839265,
        "leader_size": 839265,
        "region_count": 99621,
        "region_weight": 1,
        "region_score": 123749955.77292633,
        "region_size": 3013695,
        "start_ts": "2021-12-28T19:24:39+08:00",
        "last_heartbeat_ts": "2021-12-29T16:48:54.016901966+08:00",
        "uptime": "21h24m15.016901966s"
      }
    },
    {
      "store": {
        "id": 22497342,
        "address": "10.5.62.71:20160",
        "version": "2.1.9",
        "state_name": "Up"
      },
      "status": {
        "capacity": "344 GiB",
        "available": "240 GiB",
        "leader_count": 38318,
        "leader_weight": 1,
        "leader_score": 720422,
        "leader_size": 720422,
        "region_count": 79439,
        "region_weight": 1,
        "region_score": 1443872,
        "region_size": 1443872,
        "start_ts": "2021-12-27T16:18:43+08:00",
        "last_heartbeat_ts": "2021-12-29T16:48:54.035914364+08:00",
        "uptime": "48h30m11.035914364s"
      }
    },
    {
      "store": {
        "id": 22634393,
        "address": "10.5.62.72:20160",
        "state": 1,
        "version": "2.1.9",
        "state_name": "Offline"
      },
      "status": {
        "leader_weight": 1,
        "region_count": 77809,
        "region_weight": 1,
        "region_score": 1780723,
        "region_size": 1780723,
        "start_ts": "1970-01-01T08:00:00+08:00"
      }
    },
    {
      "store": {
        "id": 23006281,
        "address": "10.5.62.74:20160",
        "version": "2.1.9",
        "state_name": "Up"
      },
      "status": {
        "capacity": "344 GiB",
        "available": "336 GiB",
        "leader_count": 2988,
        "leader_weight": 2,
        "leader_score": 17622.5,
        "leader_size": 35245,
        "region_count": 3011,
        "region_weight": 3,
        "region_score": 12049.333333333334,
        "region_size": 36148,
        "start_ts": "2021-12-29T10:09:02+08:00",
        "last_heartbeat_ts": "2021-12-29T16:48:54.661365284+08:00",
        "uptime": "6h39m52.661365284s"
      }
    },
    {
      "store": {
        "id": 4,
        "address": "10.1.195.64:20160",
        "state": 1,
        "version": "2.1.9",
        "state_name": "Offline"
      },
      "status": {
        "leader_weight": 1,
        "region_count": 15961,
        "region_weight": 1,
        "region_score": 222232,
        "region_size": 222232,
        "start_ts": "1970-01-01T08:00:00+08:00"
      }
    },
    {
      "store": {
        "id": 15001144,
        "address": "10.5.62.64:20160",
        "version": "2.1.9",
        "state_name": "Up"
      },
      "status": {
        "capacity": "344 GiB",
        "available": "205 GiB",
        "leader_count": 36180,
        "leader_weight": 1,
        "leader_score": 720375,
        "leader_size": 720375,
        "region_count": 80779,
        "region_weight": 1,
        "region_score": 1468812,
        "region_size": 1468812,
        "start_ts": "2021-12-27T16:08:27+08:00",
        "last_heartbeat_ts": "2021-12-29T16:48:51.635729338+08:00",
        "uptime": "48h40m24.635729338s"
      }
    }
  ]
}
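As a sanity check on the weight question below: in PD, a store's basic score is its size divided by its weight, so a higher weight lowers the score and attracts more regions. The figures for the new store (id 23006281) in the output above bear this out:

```python
# In PD, balance scheduling compares store scores, and (ignoring the
# low-space amplification applied to nearly full stores) a score is
# simply size / weight. A larger weight therefore lowers the score,
# making the store a more attractive scheduling target, i.e. a higher
# weight means higher scheduling priority.

def score(size, weight):
    """Basic PD score: size divided by weight."""
    return size / weight

# Figures copied from the pd-ctl store output above (store id 23006281).
print(score(35245, 2))   # leader_score  -> 17622.5
print(score(36148, 3))   # region_score  -> 12049.333333333334
```

This also hints at what is happening on the two 492 GiB stores at the top: their region_score values (in the hundreds of millions) are far above size/weight, which is PD inflating the score of stores that are low on free space so the scheduler moves regions off them.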

Higher weight means higher priority, right?
Everything currently has the default weight of 1; I set the new node's weights to 2 and 3, but it still feels very slow.

The node added on Monday and the one added today are both syncing data very slowly; it feels like everything is still being sent to the two stores at the top of the list. I've already expanded the disks on those two, otherwise they would have filled up.
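For reference, weights are set per store through pd-ctl's `store weight` subcommand, which takes the store id, leader weight, and region weight in that order. A tiny helper that only assembles the command line, assuming pd-ctl's non-interactive `-d` mode; the PD endpoint is a placeholder, while the store id and weights are the ones from this thread:

```python
# Build a pd-ctl invocation that sets a store's scheduling weights.
# pd-ctl subcommand syntax: store weight <store_id> <leader_weight> <region_weight>
# The PD endpoint below is a placeholder, not an address from this cluster.

def store_weight_cmd(pd_addr, store_id, leader_weight, region_weight):
    return [
        "pd-ctl", "-u", pd_addr, "-d",
        "store", "weight", str(store_id),
        str(leader_weight), str(region_weight),
    ]

# New store from the list above, with the weights mentioned in this post.
cmd = store_weight_cmd("http://127.0.0.1:2379", 23006281, 2, 3)
print(" ".join(cmd))
```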

Try adjusting the parameters per https://docs.pingcap.com/zh/tidb/v2.1/pd-scheduling-best-practices . Besides the scenario in your screenshot, look at the other scenarios too; your problem touches several of them.

OK.
How do I add the scheduler below back? Is it a problem that I removed it by mistake?

One more question: why does the leader-missing metric show so many missing leaders? With three replicas per region, a new leader should be re-elected whenever a replica goes Down or Offline.

  1. I tested this: newer versions can add the scheduler back, but the available commands in 2.1 don't appear to support adding it. Without it, regions can't be balanced. Why did you remove the balance scheduler in the first place?
  2. Speed up the nodes being scaled in and out so the regions rebalance as soon as possible.
  3. Don't mix 2 TB and 300 GB stores; otherwise adjust the weights appropriately in advance.
  4. If the offline process can't finish for a long time, check tikv.log for error messages.
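On point 2, the speed of decommissioning and rebalancing is governed by PD's schedule limits, which can be raised at runtime with pd-ctl's `config set`. A sketch that only assembles those commands; the key names are real PD config items, but the numeric values are illustrative, not tuned recommendations:

```python
# PD's scheduling concurrency can be raised at runtime via pd-ctl:
#   config set <key> <value>
# replica-schedule-limit mainly controls how fast offline stores are
# drained and replicas repaired; region-/leader-schedule-limit control
# balancing speed. The values below are illustrative only.

def config_set_cmds(limits):
    return [["pd-ctl", "-d", "config", "set", key, str(value)]
            for key, value in limits.items()]

for cmd in config_set_cmds({
    "replica-schedule-limit": 16,
    "region-schedule-limit": 8,
    "leader-schedule-limit": 8,
}):
    print(" ".join(cmd))
```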

1. At the time I thought balance-region-scheduler wasn't taking effect, so I removed it. It could be removed but not re-created.
2. I've already increased some of the limit values.
3. We'll decommission the 2 TB instances later.
4. I contacted your colleagues; it looks OK now.
5. We'll plan the upgrade going forward.