After cluster scale-out and scale-in, the new machine's Region count matches but its Store Size does not; is this normal?

【TiDB version】4.0.8

【Problem description】After scaling the cluster out and then in, the new machine has the same Region count as the other machines, but its Store Size is different.

Steps taken:
1. First scaled out the cluster with tiup, deploying the TiDB, TiKV, and PD services onto the new machine.
2. Then manually scaled in the services on the old machine, in the order PD -> TiDB -> TiKV.
3. Some time later, the monitoring page showed that the Region count was identical across the three machines, but the store size was not: the new machine's store size is roughly half that of the old machines.

Is this situation normal?


You can export the monitoring data from the PD dashboard for us to take a look. How to export the monitoring data: https://metricstool.pingcap.com/

Thanks for taking a look at this.
tidb-cluster-PD_2021-02-22T10_44_49.013Z.json (1.5 MB)

Could you share the output of running the store command with pd-ctl? Region scheduling is normally driven by the region score. The monitoring does show three nodes with the same region score, but one node's score is different; it is not clear whether this is just a display issue in the monitoring, so it needs to be confirmed.
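For reference, one way to get that output is sketched below. It assumes pd-ctl is available through tiup, and <pd-address> is a placeholder for one of your actual PD endpoints:

tiup ctl pd -u http://<pd-address>:2379 store
# or, if you use the standalone pd-ctl binary:
pd-ctl -u http://<pd-address>:2379 store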

Also, could you describe the scale-out and scale-in steps you followed?

Here is the situation: the .96 machine also had a TiFlash service deployed on it earlier, and the store with the different score has the TiFlash port in its address, so I suspect that is the reason.

The steps I took were:
1. First the scale-out: defined scale-out.yaml, then ran tiup cluster scale-out tidb-cluster scale-out.yaml
tidb_servers:
  - host: 172.21.23.136
tikv_servers:
  - host: 172.21.23.136
pd_servers:
  - host: 172.21.23.136

2. After the scale-out reported "Scaled cluster tidb-cluster in successfully", ran display to confirm all services were in Up state, then started the scale-in.
3. Scaled in the services on the old machine one by one:
tiup cluster scale-in tidb-cluster --node 172.21.23.96:2379   (scale in PD)
tiup cluster scale-in tidb-cluster --node 172.21.23.96:4000   (scale in TiDB)
tiup cluster scale-in tidb-cluster --node 172.21.23.96:20160  (scale in TiKV)
tiup cluster scale-in tidb-cluster --node 172.21.23.96:9000   (scale in TiFlash)
4. After the TiKV scale-in, display showed the node in Tombstone state; following the documentation, I then ran tiup cluster prune tidb-cluster.
5. TiFlash has stayed in Pending Offline state the whole time.

So I suspect that a manual scale-in should follow a particular order? But I did not see any note about this in the documentation.

Output of running store in pd-ctl:

{
  "count": 4,
  "stores": [
    {
      "store": {
        "id": 1,
        "address": "172.21.23.138:20160",
        "labels": [
          {
            "key": "host",
            "value": "kv-host-138"
          }
        ],
        "version": "4.0.8",
        "status_address": "0.0.0.0:20180",
        "git_hash": "83091173e960e5a0f5f417e921a0801d2f6635ae",
        "start_timestamp": 1611222452,
        "deploy_path": "/data/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1614043787479839137,
        "state_name": "Up"
      },
      "status": {
        "capacity": "251.9GiB",
        "available": "218.1GiB",
        "used_size": "1.585GiB",
        "leader_count": 194,
        "leader_weight": 1,
        "leader_score": 194,
        "leader_size": 2155,
        "region_count": 564,
        "region_weight": 1,
        "region_score": 6027,
        "region_size": 6027,
        "start_ts": "2021-01-21T17:47:32+08:00",
        "last_heartbeat_ts": "2021-02-23T09:29:47.479839137+08:00",
        "uptime": "783h42m15.479839137s"
      }
    },
    {
      "store": {
        "id": 5,
        "address": "172.21.23.128:20160",
        "labels": [
          {
            "key": "host",
            "value": "kv-host-128"
          }
        ],
        "version": "4.0.8",
        "status_address": "0.0.0.0:20180",
        "git_hash": "83091173e960e5a0f5f417e921a0801d2f6635ae",
        "start_timestamp": 1611222432,
        "deploy_path": "/data/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1614043780657840226,
        "state_name": "Up"
      },
      "status": {
        "capacity": "251.9GiB",
        "available": "202.6GiB",
        "used_size": "1.614GiB",
        "leader_count": 186,
        "leader_weight": 1,
        "leader_score": 186,
        "leader_size": 2211,
        "region_count": 564,
        "region_weight": 1,
        "region_score": 6027,
        "region_size": 6027,
        "start_ts": "2021-01-21T17:47:12+08:00",
        "last_heartbeat_ts": "2021-02-23T09:29:40.657840226+08:00",
        "uptime": "783h42m28.657840226s"
      }
    },
    {
      "store": {
        "id": 88,
        "address": "172.21.23.96:3930",
        "state": 1,
        "labels": [
          {
            "key": "engine",
            "value": "tiflash"
          }
        ],
        "version": "v4.0.8",
        "peer_address": "172.21.23.96:20170",
        "status_address": "172.21.23.96:20292",
        "git_hash": "f0a78d93e440dac7c7935ea7e67c656b1bb5f913",
        "start_timestamp": 1611222351,
        "deploy_path": "/data/tidb-deploy/tiflash-9000/bin/tiflash",
        "last_heartbeat": 1614043788032387822,
        "state_name": "Offline"
      },
      "status": {
        "capacity": "196.7GiB",
        "available": "196.6GiB",
        "used_size": "167.8MiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 66,
        "region_weight": 1,
        "region_score": 1237,
        "region_size": 1237,
        "start_ts": "2021-01-21T17:45:51+08:00",
        "last_heartbeat_ts": "2021-02-23T09:29:48.032387822+08:00",
        "uptime": "783h43m57.032387822s"
      }
    },
    {
      "store": {
        "id": 4191,
        "address": "172.21.23.136:20160",
        "labels": [
          {
            "key": "host",
            "value": "kv-host-136"
          }
        ],
        "version": "4.0.8",
        "status_address": "0.0.0.0:20180",
        "git_hash": "83091173e960e5a0f5f417e921a0801d2f6635ae",
        "start_timestamp": 1613982253,
        "deploy_path": "/data/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1614043790057614330,
        "state_name": "Up"
      },
      "status": {
        "capacity": "251.9GiB",
        "available": "248.2GiB",
        "used_size": "914.6MiB",
        "leader_count": 184,
        "leader_weight": 1,
        "leader_score": 184,
        "leader_size": 1661,
        "region_count": 564,
        "region_weight": 1,
        "region_score": 6027,
        "region_size": 6027,
        "start_ts": "2021-02-22T16:24:13+08:00",
        "last_heartbeat_ts": "2021-02-23T09:29:50.05761433+08:00",
        "uptime": "17h5m37.05761433s"
      }
    }
  ]
}

You can first follow the official documentation to scale in the TiFlash node completely, and then check the Region situation again:
https://docs.pingcap.com/zh/tidb/stable/scale-tidb-using-tiup#缩容-tiflash-节点
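In case it helps, a minimal sketch of the pre-check the documented procedure relies on: before scaling in the last TiFlash node, set every table's TiFlash replica count to 0 and confirm no replicas remain. The host <tidb-address>, user root, and the table name below are placeholders; adjust them to your environment:

# set a table's TiFlash replica count back to 0 (repeat for each table that still has replicas)
mysql -h <tidb-address> -P 4000 -u root -e "ALTER TABLE <db>.<table> SET TIFLASH REPLICA 0;"
# confirm no TiFlash replicas remain before running tiup cluster scale-in
mysql -h <tidb-address> -P 4000 -u root -e "SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT FROM information_schema.tiflash_replica;"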

After scaling it in completely, it is still the same. A few questions:
1. Which stored data is counted in this store size? Can it be measured physically on the machine? (a rough way to check on disk is sketched after this list)
2. Looking at the storage under the data path, the db directory on the new machine is also roughly half the size of the one on the old machines. The old machines previously ran performance tests for a while and the new one did not; could that be related?
3. Does this store size inconsistency have any impact on actual usage?
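For question 1, a rough physical check is to measure the TiKV data directory directly with du. This is only a sketch; the path below is an assumption based on the deploy_path shown in the pd-ctl output, and the real location is whatever data_dir is set to in your topology:

du -sh /data/tidb-data/tikv-20160        # total on-disk size of the TiKV data directory (assumed path)
du -sh /data/tidb-data/tikv-20160/db     # the RocksDB directory holding the store's SST and log files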

Could you check whether the gap in the db directory is between the SST files or the LOG files, i.e. whether the difference mainly comes from the LOG files? If the region count is consistent, the store size difference should have little impact on actual use. This cluster's data volume is still quite small; you could load more data and test whether there is any performance impact.
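A quick way to make that comparison on each machine, again assuming the data directory is /data/tidb-data/tikv-20160 (substitute your own path):

du -ch /data/tidb-data/tikv-20160/db/*.sst | tail -n 1   # combined size of all SST files
ls -lh /data/tidb-data/tikv-20160/db/LOG*                # RocksDB info log files, including rotated ones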

OK, thank you.

🤝🤝🤝

This topic was automatically closed 1 minute after the last reply. New replies are no longer allowed.