TiKV scale-out succeeded, but data has not rebalanced automatically after several days

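To speed things up, please provide as much background as possible when asking; clearly described questions are answered first. Please try to include the following: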

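  • OS version & kernel version:
  • TiDB version: v3.0.0-beta.1-133-g27a56180b
  • Disk model: Alibaba Cloud local SSD
  • Cluster node distribution: tidb:4, pd:3, tikv:7
  • Data volume & region count & replica count: 2.9T 45466 3
  • Cluster QPS, .999-Duration, read/write ratio: 2k select:1min r1:w5
  • Problem description (what I did): I added 2 TiKV nodes. After the scale-out the data did not rebalance automatically, and the region count on the two new TiKV stores is still 0.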

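The store information was not captured in full. If the scale-out succeeded, the store list should show the node information of 4 TiKV instances.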

{
  "count": 7,
  "stores": [
    {
      "store": {
        "id": 128620,
        "address": "10.x.x.98:20160",
        "version": "3.0.0-rc.2",
        "state_name": "Up"
      },
      "status": {
        "capacity": "880 GiB",
        "available": "315 GiB",
        "leader_count": 9278,
        "leader_weight": 1,
        "leader_score": 689583,
        "leader_size": 689583,
        "region_count": 25956,
        "region_weight": 1,
        "region_score": 226454055.6613083,
        "region_size": 1959724,
        "start_ts": "2019-06-03T15:09:20+08:00",
        "last_heartbeat_ts": "2019-09-06T14:04:20.814441853+08:00",
        "uptime": "2278h55m0.814441853s"
      }
    },
    {
      "store": {
        "id": 1005501,
        "address": "10.s.s.210:20160",
        "version": "3.0.0-rc.2",
        "state_name": "Up"
      },
      "status": {
        "capacity": "1.7 TiB",
        "available": "969 GiB",
        "leader_count": 8638,
        "leader_weight": 1,
        "leader_score": 689926,
        "leader_size": 689926,
        "region_count": 27219,
        "region_weight": 1,
        "region_score": 2051580,
        "region_size": 2051580,
        "start_ts": "2019-06-03T15:12:01+08:00",
        "last_heartbeat_ts": "2019-09-06T14:04:17.379300614+08:00",
        "uptime": "2278h52m16.379300614s"
      }
    },
    {
      "store": {
        "id": 1087486,
        "address": "10.s.s.26:20160",
        "version": "3.0.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "819 GiB",
        "available": "818 GiB",
        "leader_weight": 1,
        "region_weight": 1,
        "start_ts": "2019-09-04T18:39:47+08:00",
        "last_heartbeat_ts": "2019-09-06T14:04:23.608842339+08:00",
        "uptime": "43h24m36.608842339s"
      }
    },
    {
      "store": {
        "id": 1087487,
        "address": "10.s.s.27:20160",
        "version": "3.0.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "819 GiB",
        "available": "818 GiB",
        "leader_weight": 1,
        "region_weight": 1,
        "start_ts": "2019-09-04T18:39:47+08:00",
        "last_heartbeat_ts": "2019-09-06T14:04:24.663803365+08:00",
        "uptime": "43h24m37.663803365s"
      }
    },
    {
      "store": {
        "id": 1,
        "address": "10.s.s.88:20160",
        "version": "3.0.0-rc.2",
        "state_name": "Up"
      },
      "status": {
        "capacity": "880 GiB",
        "available": "225 GiB",
        "leader_count": 8946,
        "leader_weight": 1,
        "leader_score": 689817,
        "leader_size": 689817,
        "region_count": 27631,
        "region_weight": 1,
        "region_score": 775024609.2387075,
        "region_size": 2096274,
        "start_ts": "2019-06-03T15:00:26+08:00",
        "last_heartbeat_ts": "2019-09-06T14:04:24.440627017+08:00",
        "uptime": "2279h3m58.440627017s"
      }
    },
    {
      "store": {
        "id": 4,
        "address": "10.s.s.89:20160",
        "version": "3.0.0-rc.2",
        "state_name": "Up"
      },
      "status": {
        "capacity": "880 GiB",
        "available": "230 GiB",
        "leader_count": 9032,
        "leader_weight": 1,
        "leader_score": 689351,
        "leader_size": 689351,
        "region_count": 27456,
        "region_weight": 1,
        "region_score": 745101077.5505033,
        "region_size": 2096212,
        "start_ts": "2019-06-03T15:03:38+08:00",
        "last_heartbeat_ts": "2019-09-06T14:04:19.714337232+08:00",
        "uptime": "2279h0m41.714337232s"
      }
    },
    {
      "store": {
        "id": 5,
        "address": "10.s.s.90:20160",
        "version": "3.0.0-rc.2",
        "state_name": "Up"
      },
      "status": {
        "capacity": "880 GiB",
        "available": "224 GiB",
        "leader_count": 9576,
        "leader_weight": 1,
        "leader_score": 689488,
        "leader_size": 689488,
        "region_count": 28148,
        "region_weight": 1,
        "region_score": 778774023.4445243,
        "region_size": 2140705,
        "start_ts": "2019-06-03T15:06:19+08:00",
        "last_heartbeat_ts": "2019-09-06T14:04:22.709312097+08:00",
        "uptime": "2278h58m3.709312097s"
      }
    }
  ]
}
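
A quick way to spot stores that have not received any data is to scan the `store` output for entries without regions. The following is only a minimal sketch: the function name `stores_without_regions` is made up for illustration, and it assumes that a missing `region_count` field means 0, which is how the two new stores appear in the output above.

```python
import json


def stores_without_regions(store_output: str) -> list[tuple[int, str]]:
    """Return (store id, address) for every store that reports no regions."""
    result = []
    for entry in json.loads(store_output)["stores"]:
        # Assumption: a missing region_count means 0, as with the two
        # newly added stores in the output above.
        if entry.get("status", {}).get("region_count", 0) == 0:
            result.append((entry["store"]["id"], entry["store"]["address"]))
    return result


# Trimmed sample containing two of the stores from the output above.
sample = json.dumps({
    "count": 2,
    "stores": [
        {"store": {"id": 5, "address": "10.s.s.90:20160"},
         "status": {"region_count": 28148}},
        {"store": {"id": 1087486, "address": "10.s.s.26:20160"},
         "status": {"capacity": "819 GiB"}},
    ],
})

print(stores_without_regions(sample))
```

Run against the full output, this would list both new stores (1087486 and 1087487).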

I scaled out with tidb-ansible and did not notice anything abnormal during the process.

Please use pd-ctl to share the output of scheduler show.

» scheduler show 
[
  "balance-region-scheduler",
  "balance-leader-scheduler",
  "balance-hot-region-scheduler",
  "label-scheduler"
]
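
The output above can also be checked mechanically. This is a small sketch, not an official tool: the `EXPECTED` set is my own assumption about which schedulers matter for moving regions and leaders onto new stores, and `missing_schedulers` is a hypothetical helper name.

```python
import json

# Assumption: these two schedulers are the ones needed for regions and
# leaders to be moved onto newly added stores.
EXPECTED = {"balance-region-scheduler", "balance-leader-scheduler"}


def missing_schedulers(show_output: str) -> set[str]:
    """Return the expected schedulers that are absent from `scheduler show`."""
    return EXPECTED - set(json.loads(show_output))


# The `scheduler show` output posted above.
output = """[
  "balance-region-scheduler",
  "balance-leader-scheduler",
  "balance-hot-region-scheduler",
  "label-scheduler"
]"""

print(missing_schedulers(output))
```

For the output in this thread nothing is missing, so the schedulers themselves do not look like the cause.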

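You can refer to this thread and check the scheduling status in monitoring: Tikv节点region分布不均 (uneven region distribution across TiKV nodes).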

image

The region size panel in monitoring does not show the 2 newly added machines.

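I suggest checking whether the node_exporter process is running on the newly added nodes. You can also manually schedule a region to a new node and see what happens.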

#pd-ctl -u http://10.x.x.91:2379 -d  operator add transfer-leader 103120 1087486
[500]:"region has no voter in store 1087486"'
pd-ctl -u http://10.x.x.91:2379 -d  operator add transfer-peer 103120 5 1087486
[tidb@aliyun-10-43-101-96 ~]$ pd-ctl -u http://10.x.x.91:2379 -d operator show admin                         
[
  "admin-move-peer (kind:leader,region,admin, region:103120(498,113), createAt:2019-09-07 11:45:28.407925757 +0800 CST m=+8283002.030301353, currentStep:2, steps:[add learner peer 1089644 on store 1087486 promote learner peer 1089644 on store 1087486 to voter transfer leader from store 5 to store 128620 remove peer on store 5]) "
]
pd-ctl -u http://10.43.101.91:2379 -d  operator add transfer-leader 103120 1087486 

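I also transferred a leader to the newly added store, and it moved over normally. However, the data still does not seem to be rebalancing automatically.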

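I think I found the problem: the newly added TiKV nodes are still on the old version, TiKV 3.0.0-beta.1. (The cluster was upgraded from TiKV 3.0.0-beta.1 to TiKV 3.0.0-rc.2, and I used the wrong version for this scale-out.)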
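
A mismatch like this is easy to miss in a long `store` listing. As a rough sketch, the stores can be grouped by their `version` field; `group_by_version` is a hypothetical helper, and the list below is copied by hand from the store output earlier in the thread.

```python
from collections import defaultdict


def group_by_version(stores: list[dict]) -> dict[str, list[str]]:
    """Group store addresses by the TiKV version each store reports."""
    groups = defaultdict(list)
    for store in stores:
        groups[store["version"]].append(store["address"])
    return dict(groups)


# Address and version of each store, taken from the output posted above.
stores = [
    {"address": "10.x.x.98:20160", "version": "3.0.0-rc.2"},
    {"address": "10.s.s.210:20160", "version": "3.0.0-rc.2"},
    {"address": "10.s.s.26:20160", "version": "3.0.0-beta.1"},
    {"address": "10.s.s.27:20160", "version": "3.0.0-beta.1"},
    {"address": "10.s.s.88:20160", "version": "3.0.0-rc.2"},
    {"address": "10.s.s.89:20160", "version": "3.0.0-rc.2"},
    {"address": "10.s.s.90:20160", "version": "3.0.0-rc.2"},
]

groups = group_by_version(stores)
if len(groups) > 1:
    # More than one version in the cluster: print who runs what.
    for version, addresses in sorted(groups.items()):
        print(version, addresses)
```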

After upgrading them to TiKV 3.0.0-rc.2, the data still does not rebalance automatically. :sweat_smile:

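I suggest checking that all TiKV versions are consistent. You can also go back to the monitoring mentioned above and check whether scheduling is working normally.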

How is it going now? Has the data rebalanced?

image

It has been a long time, and only a small part of the data has been balanced over so far.
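
To put a number on how far the cluster is from balanced, one option is to compare each store's `region_count` with the cluster mean. This is only a rough indicator: the `store` output also carries a `region_score` per store, which presumably is what the balancer works from rather than raw counts. The helper name `region_deviation` is made up, and the counts below come from the store output posted at the start of the thread, not from the current state.

```python
def region_deviation(region_counts: dict[str, int]) -> dict[str, float]:
    """Map each store to (its region count / cluster mean) - 1."""
    mean = sum(region_counts.values()) / len(region_counts)
    if mean == 0:
        return {addr: 0.0 for addr in region_counts}
    return {addr: count / mean - 1 for addr, count in region_counts.items()}


# region_count per store from the earlier output; the two new stores
# reported no regions, which is treated as 0 here.
counts = {
    "10.x.x.98:20160": 25956,
    "10.s.s.210:20160": 27219,
    "10.s.s.26:20160": 0,
    "10.s.s.27:20160": 0,
    "10.s.s.88:20160": 27631,
    "10.s.s.89:20160": 27456,
    "10.s.s.90:20160": 28148,
}

for addr, dev in region_deviation(counts).items():
    # Negative values mean the store holds fewer regions than the mean.
    print(f"{addr}: {dev:+.0%}")
```

Re-running this with fresh counts every few hours would show whether the new stores are actually catching up.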

"version": "3.0.0-beta.1",

"version": "3.0.0-rc.2",

Why does your TiKV cluster have two different versions?

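The cluster was originally installed as 3.0.0-beta.1 and later upgraded to 3.0.0-rc.2. When I added the two machines this time, I used the wrong version, so the new machines came up on the old 3.0.0-beta.1. I then upgraded those two new machines separately to 3.0.0-rc.2.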

With the versions now consistent, does the data still fail to rebalance?
