TiUP scale-in of TiKV: regions always remain

While scaling in a TiKV node with TiUP, the process got stuck with the last 3 regions remaining. One of those regions looks like this:

{
  "id": 2922,
  "start_key": "7480000000000002FFA65F698000000000FF0000040133333139FF66366238FF2D3230FF38632D3433FF6538FF2D616534332DFF39FF64363436366239FFFF3337383800000000FFFB0380000000000DFFD960000000000000F9",
  "end_key": "7480000000000002FFA65F698000000000FF0000040133653765FF33636539FF2D6535FF64342D3433FF3932FF2D386438322DFF30FF39646562623864FFFF6465306200000000FFFB038000000000D4FFAEED000000000000F9",
  "epoch": {
    "conf_ver": 9,
    "version": 251
  },
  "peers": [
    {
      "id": 2924,
      "store_id": 3
    },
    {
      "id": 2925,
      "store_id": 7
    },
    {
      "id": 2934042,
      "store_id": 49133
    },
    {
      "id": 2934090,
      "store_id": 1,
      "is_learner": true
    }
  ],
  "leader": {
    "id": 2924,
    "store_id": 3
  },
  "written_bytes": 0,
  "read_bytes": 0,
  "written_keys": 0,
  "read_keys": 0,
  "approximate_size": 127,
  "approximate_keys": 1197303
}

1. Why is it stuck, and what should I do about it?

2. What state does is_learner indicate? Each of the 3 stuck regions has one replica in this state.
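
For anyone reproducing this: the regions still pinned to a draining store can be listed directly from PD. A minimal sketch, assuming pd-ctl is run through tiup ctl, with store 49133 and the PD address taken from the store / member output further down this thread:

# list every region that still has a peer on the draining store
tiup ctl pd -u http://192.168.30.30:2379 region store 49133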

Hello,

  1. Please provide the TiDB version: select tidb_version();

As for is_learner: that peer is in the middle of being replenished, because the peer on the store going offline has to be replaced. A learner first receives the region's data and is only later promoted to a voting replica.
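
To see whether PD is actually scheduling anything for such a region, one option is to query PD directly (a sketch; region 2922 is the example above, and the PD address comes from the member output below):

# show the region's current peers, then any pending scheduling operators
tiup ctl pd -u http://192.168.30.30:2379 region 2922
tiup ctl pd -u http://192.168.30.30:2379 operator show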

Release Version: v4.0.0
Edition: Community
Git Commit Hash: 689a6b6439ae7835947fcaccf329a3fc303986cb
Git Branch: heads/refs/tags/v4.0.1
UTC Build Time: 2020-06-12 06:01:55
GoVersion: go1.13
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false

Hello,

You can first try troubleshooting according to this post.

I have already lowered it to 1MB, but the problem has not gone away.

Hello,

  1. Please share the tikv.log of the offline node.
  2. Please also share the output of pd-ctl config show, plus the store / member information (the commands are sketched below).
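
A minimal sketch of collecting all three through tiup ctl (adjust the PD address to your cluster):

# dump the scheduling config and the cluster topology from PD
tiup ctl pd -u http://192.168.30.30:2379 config show
tiup ctl pd -u http://192.168.30.30:2379 store
tiup ctl pd -u http://192.168.30.30:2379 member
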
  1. store

{
  "count": 6,
  "stores": [
    {
      "store": {
        "id": 1,
        "address": "192.168.30.31:20160",
        "version": "4.0.1",
        "status_address": "192.168.30.31:20180",
        "git_hash": "78d7a854026962669ceb2ee0ac343a5e88faa310",
        "start_timestamp": 1592206538,
        "last_heartbeat": 1592213940089237899,
        "state_name": "Up"
      },
      "status": {
        "capacity": "360.8GiB",
        "available": "245.6GiB",
        "used_size": "100.2GiB",
        "leader_count": 1680,
        "leader_weight": 1,
        "leader_score": 1680,
        "leader_size": 111993,
        "region_count": 5026,
        "region_weight": 1,
        "region_score": 336445,
        "region_size": 336445,
        "start_ts": "2020-06-15T15:35:38+08:00",
        "last_heartbeat_ts": "2020-06-15T17:39:00.089237899+08:00",
        "uptime": "2h3m22.089237899s"
      }
    },
    {
      "store": {
        "id": 3,
        "address": "192.168.30.32:20160",
        "version": "4.0.1",
        "status_address": "192.168.30.32:20180",
        "git_hash": "78d7a854026962669ceb2ee0ac343a5e88faa310",
        "start_timestamp": 1592206557,
        "last_heartbeat": 1592213939630323871,
        "state_name": "Up"
      },
      "status": {
        "capacity": "360.8GiB",
        "available": "245.6GiB",
        "used_size": "100.3GiB",
        "leader_count": 1675,
        "leader_weight": 1,
        "leader_score": 1675,
        "leader_size": 110847,
        "region_count": 5026,
        "region_weight": 1,
        "region_score": 336445,
        "region_size": 336445,
        "start_ts": "2020-06-15T15:35:57+08:00",
        "last_heartbeat_ts": "2020-06-15T17:38:59.630323871+08:00",
        "uptime": "2h3m2.630323871s"
      }
    },
    {
      "store": {
        "id": 7,
        "address": "192.168.30.33:20160",
        "version": "4.0.1",
        "status_address": "192.168.30.33:20180",
        "git_hash": "78d7a854026962669ceb2ee0ac343a5e88faa310",
        "start_timestamp": 1592206576,
        "last_heartbeat": 1592213937391690705,
        "state_name": "Up"
      },
      "status": {
        "capacity": "364.2GiB",
        "available": "238.3GiB",
        "used_size": "99.89GiB",
        "leader_count": 1671,
        "leader_weight": 1,
        "leader_score": 1671,
        "leader_size": 113605,
        "region_count": 5026,
        "region_weight": 1,
        "region_score": 336445,
        "region_size": 336445,
        "start_ts": "2020-06-15T15:36:16+08:00",
        "last_heartbeat_ts": "2020-06-15T17:38:57.391690705+08:00",
        "uptime": "2h2m41.391690705s"
      }
    },
    {
      "store": {
        "id": 49133,
        "address": "192.168.30.34:20160",
        "state": 1,
        "version": "4.0.1",
        "status_address": "192.168.30.34:20180",
        "git_hash": "78d7a854026962669ceb2ee0ac343a5e88faa310",
        "start_timestamp": 1592206577,
        "deploy_path": "/data1/deploy/bin",
        "last_heartbeat": 1592213938375478552,
        "state_name": "Offline"
      },
      "status": {
        "capacity": "395.1GiB",
        "available": "330.6GiB",
        "used_size": "101.4MiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 3,
        "region_weight": 1,
        "region_score": 270,
        "region_size": 270,
        "start_ts": "2020-06-15T15:36:17+08:00",
        "last_heartbeat_ts": "2020-06-15T17:38:58.375478552+08:00",
        "uptime": "2h2m41.375478552s"
      }
    },
    {
      "store": {
        "id": 2653512,
        "address": "192.168.30.34:3930",
        "labels": [
          {
            "key": "engine",
            "value": "tiflash"
          }
        ],
        "version": "v4.0.1",
        "peer_address": "192.168.30.34:20170",
        "status_address": "192.168.30.34:20292",
        "git_hash": "e134030e694906860b3d5b8729092351258e256c",
        "start_timestamp": 1592206469,
        "deploy_path": "/data1/tiflash/tiflash-deploy/bin/tiflash",
        "last_heartbeat": 1592213940373688043,
        "state_name": "Up"
      },
      "status": {
        "capacity": "395.1GiB",
        "available": "310.5GiB",
        "used_size": "4.634GiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 121,
        "region_weight": 1,
        "region_score": 11110,
        "region_size": 11110,
        "start_ts": "2020-06-15T15:34:29+08:00",
        "last_heartbeat_ts": "2020-06-15T17:39:00.373688043+08:00",
        "uptime": "2h4m31.373688043s"
      }
    },
    {
      "store": {
        "id": 2869654,
        "address": "192.168.30.30:3930",
        "labels": [
          {
            "key": "engine",
            "value": "tiflash"
          }
        ],
        "version": "v4.0.1",
        "peer_address": "192.168.30.30:20170",
        "status_address": "192.168.30.30:20292",
        "git_hash": "e134030e694906860b3d5b8729092351258e256c",
        "start_timestamp": 1592206477,
        "deploy_path": "/data1/tiflash/tiflash-deploy/bin/tiflash",
        "last_heartbeat": 1592213936775959629,
        "state_name": "Up"
      },
      "status": {
        "capacity": "771.6GiB",
        "available": "599GiB",
        "used_size": "6.079GiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 123,
        "region_weight": 1,
        "region_score": 11234,
        "region_size": 11234,
        "start_ts": "2020-06-15T15:34:37+08:00",
        "last_heartbeat_ts": "2020-06-15T17:38:56.775959629+08:00",
        "uptime": "2h4m19.775959629s"
      }
    }
  ]
}


member:

{
  "header": {
    "cluster_id": 6775469463473339511
  },
  "members": [
    {
      "name": "pd_localhost",
      "member_id": 8842615892144546616,
      "peer_urls": [
        "http://192.168.30.30:2380"
      ],
      "client_urls": [
        "http://192.168.30.30:2379"
      ],
      "deploy_path": "/data1/deploy/bin",
      "binary_version": "v4.0.1",
      "git_hash": "30f0b014b7ff3cd1b5f041bf7ce73448dc0d0fe8"
    }
  ],
  "leader": {
    "name": "pd_localhost",
    "member_id": 8842615892144546616,
    "peer_urls": [
      "http://192.168.30.30:2380"
    ],
    "client_urls": [
      "http://192.168.30.30:2379"
    ]
  },
  "etcd_leader": {
    "name": "pd_localhost",
    "member_id": 8842615892144546616,
    "peer_urls": [
      "http://192.168.30.30:2380"
    ],
    "client_urls": [
      "http://192.168.30.30:2379"
    ],
    "deploy_path": "/data1/deploy/bin",
    "binary_version": "v4.0.1",
    "git_hash": "30f0b014b7ff3cd1b5f041bf7ce73448dc0d0fe8"
  }
}


  1. tikv.log

tikv.log (49.5 KB)

Hello,

Could you trim tikv.log down to the portion after the most recent "Welcome" banner? We need to confirm the effective value of raft-max-size-per-msg. Also, please upload the raft-max-size-per-msg item under server_configs - tikv in edit-config.
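
One way to cut the log at the last startup (a minimal sketch, assuming GNU coreutils; TiKV prints a "Welcome" banner on every start):

# keep only what follows the most recent "Welcome" line in tikv.log
tac tikv.log | sed '/Welcome/q' | tac > tikv-latest.log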

The edit-config output is shown in the screenshot below.

[screenshot: edit-config output]

There are too many logs; I'll have to script a search for the Welcome line.

Hello,

raft-max-size-per-msg defaults to 1M, so setting it to 1MB does not actually lower it. Please double-check.


What is the recommended value? I set this 1M myself; in earlier versions I saw it was 12M.

Hello,

Check in tikv.log whether raft-max-size-per-msg has taken effect with the value 1M, and confirm. We will pass it along.
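
A quick way to check the effective value (a sketch; TiKV dumps its full running configuration as a single JSON line shortly after the startup banner):

# find the effective raft-max-size-per-msg in the startup config dump
grep -m 1 'raft-max-size-per-msg' tikv.log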

In the TiKV log, this parameter is indeed 1M.

Hello.

server_configs:
  tikv:
    raftstore.raft-max-size-per-msg: 0.5MB

Please try again with the configuration above and see whether it takes effect.

After applying it with tiup edit-config, I checked the TiKV startup log and the parameter is still 1M, even though tiup edit-config now shows 0.5.

Hello,

Let's confirm the exact steps you ran (spelled out in full below):

  1. edit-config cluster-name
  2. reload -R tikv
  3. Check tikv.log: the parameter value still reads 1M?
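
For reference, the full commands would look like this (a sketch; test-cluster is the cluster name given in the reply below):

# edit server_configs.tikv, then reload only the tikv role
tiup cluster edit-config test-cluster
tiup cluster reload test-cluster -R tikv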

For step 2 I used:

tiup cluster reload test-cluster

Is that a problem? I assumed a full reload would be more thorough.

Everything else matches what you listed. It really is the case that after the change, TiKV and the configuration disagree.

Doing it that way is fine, it just takes longer; a full reload is not in any way "more thorough".

We have already reported the slow region migration issue to the development team. Your problem looks similar to the post below; for now, please follow the method provided there. We will follow up in this thread as soon as there is any progress.

Manual migration works; I have already done it once.
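
For readers: the manual migration mentioned here is typically driven with pd-ctl operators. A sketch using region 2922 and offline store 49133 from the output above (verify the IDs against your own cluster first):

# manually schedule the stuck peer off the offline store
tiup ctl pd -u http://192.168.30.30:2379 operator add remove-peer 2922 49133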

Hello,

Just to confirm: did you run add-peer and remove-peer while the store on the offline node still showed 4 peers for the region, one of them a learner peer?