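We needed to adjust the cluster's nodes for business reasons, so we scaled out first and then scaled in, but the process is slow.
We have adjusted the relevant schedule limits.
The number of operators generated by the schedulers has gone up, but most of them pile up in the check stage
and very few reach finish.
Checking IO Util on the nodes, one node is at 100%, while most of the others, including the nodes being scaled out or in, are very low.
How should we tune this?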
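First, let me confirm the steps so far:
1. Both TiKV scale-out and scale-in were performed on the cluster.
2. The scale-in (or scale-out) was started before the preceding scale-out (or scale-in) had finished.
Please collect the following information: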
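The steps were:
First we ran tiup cluster scale-out. While data was still rebalancing (not yet stable, with the region count on the new nodes still rising), we ran scale-in. The stores being removed are still in the Offline state.
Store information: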
{
“count”: 11,
“stores”: [
{
“store”: {
“id”: 38833310,
“address”: “10.12.5.147:20160”,
“state”: 1,
“version”: “4.0.13”,
“status_address”: “10.12.5.147:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625578102,
“deploy_path”: “/home/tidb/deploy/bin”,
“last_heartbeat”: 1625733012081441916,
“state_name”: “Offline”
},
“status”: {
“capacity”: “5.952TiB”,
“available”: “3.508TiB”,
“used_size”: “2.119TiB”,
“leader_count”: 32381,
“leader_weight”: 2,
“leader_score”: 16190.5,
“leader_size”: 2481181,
“region_count”: 76687,
“region_weight”: 2,
“region_score”: 3032644.5,
“region_size”: 6065289,
“start_ts”: “2021-07-06T13:28:22Z”,
“last_heartbeat_ts”: “2021-07-08T08:30:12.081441916Z”,
“uptime”: “43h1m50.081441916s”
}
},
{
“store”: {
“id”: 534172210,
“address”: “10.12.5.32:20160”,
“version”: “4.0.13”,
“status_address”: “10.12.5.32:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625607928,
“deploy_path”: “/home/tidb/deploy/tikv-20180/bin”,
“last_heartbeat”: 1625733012528123203,
“state_name”: “Up”
},
“status”: {
“capacity”: “3.444TiB”,
“available”: “2.765TiB”,
“used_size”: “687.2GiB”,
“leader_count”: 8585,
“leader_weight”: 1,
“leader_score”: 8585,
“leader_size”: 621878,
“region_count”: 21849,
“region_weight”: 1,
“region_score”: 1628949,
“region_size”: 1628949,
“start_ts”: “2021-07-06T21:45:28Z”,
“last_heartbeat_ts”: “2021-07-08T08:30:12.528123203Z”,
“uptime”: “34h44m44.528123203s”
}
},
{
“store”: {
“id”: 534402022,
“address”: “10.12.5.35:20160”,
“version”: “4.0.13”,
“status_address”: “10.12.5.35:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625756847,
“deploy_path”: “/home/tidb/deploy/tikv-20180/bin”,
“last_heartbeat”: 1625733013980007630,
“state_name”: “Up”
},
“status”: {
“capacity”: “3.444TiB”,
“available”: “3.401TiB”,
“used_size”: “36.02GiB”,
“leader_count”: 5,
“leader_weight”: 1,
“leader_score”: 5,
“leader_size”: 169,
“region_count”: 1195,
“region_weight”: 1,
“region_score”: 85777,
“region_size”: 85777,
“receiving_snap_count”: 9,
“start_ts”: “2021-07-08T15:07:27Z”,
“last_heartbeat_ts”: “2021-07-08T08:30:13.98000763Z”
}
},
{
“store”: {
“id”: 534414223,
“address”: “10.12.5.36:20160”,
“version”: “4.0.13”,
“status_address”: “10.12.5.36:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625727955,
“deploy_path”: “/home/tidb/deploy/tikv-20180/bin”,
“last_heartbeat”: 1625733015933843135,
“state_name”: “Up”
},
“status”: {
“capacity”: “3.444TiB”,
“available”: “3.402TiB”,
“used_size”: “35.55GiB”,
“leader_count”: 4,
“leader_weight”: 1,
“leader_score”: 4,
“leader_size”: 340,
“region_count”: 1118,
“region_weight”: 1,
“region_score”: 85988,
“region_size”: 85988,
“receiving_snap_count”: 11,
“start_ts”: “2021-07-08T07:05:55Z”,
“last_heartbeat_ts”: “2021-07-08T08:30:15.933843135Z”,
“uptime”: “1h24m20.933843135s”
}
},
{
“store”: {
“id”: 24480822,
“address”: “10.12.5.239:20160”,
“version”: “4.0.13”,
“status_address”: “10.12.5.239:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625578218,
“deploy_path”: “/home/tidb/deploy/bin”,
“last_heartbeat”: 1625733007820334618,
“state_name”: “Up”
},
“status”: {
“capacity”: “5.952TiB”,
“available”: “3.596TiB”,
“used_size”: “2.178TiB”,
“leader_count”: 37199,
“leader_weight”: 2,
“leader_score”: 18599.5,
“leader_size”: 2905123,
“region_count”: 78598,
“region_weight”: 2,
“region_score”: 3090818,
“region_size”: 6181636,
“sending_snap_count”: 2,
“start_ts”: “2021-07-06T13:30:18Z”,
“last_heartbeat_ts”: “2021-07-08T08:30:07.820334618Z”,
“uptime”: “42h59m49.820334618s”
}
},
{
“store”: {
“id”: 24590972,
“address”: “10.12.5.240:20160”,
“version”: “4.0.13”,
“status_address”: “10.12.5.240:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625578262,
“deploy_path”: “/home/tidb/deploy/bin”,
“last_heartbeat”: 1625733009002060111,
“state_name”: “Up”
},
“status”: {
“capacity”: “5.952TiB”,
“available”: “3.731TiB”,
“used_size”: “1.989TiB”,
“leader_count”: 35719,
“leader_weight”: 2,
“leader_score”: 17859.5,
“leader_size”: 2847451,
“region_count”: 78202,
“region_weight”: 2,
“region_score”: 3091605.5,
“region_size”: 6183211,
“sending_snap_count”: 22,
“start_ts”: “2021-07-06T13:31:02Z”,
“last_heartbeat_ts”: “2021-07-08T08:30:09.002060111Z”,
“uptime”: “42h59m7.002060111s”
}
},
{
“store”: {
“id”: 262397455,
“address”: “10.12.5.13:20160”,
“version”: “4.0.13”,
“status_address”: “10.12.5.13:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625607134,
“deploy_path”: “/home/tidb/deploy/bin”,
“last_heartbeat”: 1625733017177920922,
“state_name”: “Up”
},
“status”: {
“capacity”: “5.952TiB”,
“available”: “4.808TiB”,
“used_size”: “1.075TiB”,
“leader_count”: 17176,
“leader_weight”: 1,
“leader_score”: 17176,
“leader_size”: 1363247,
“region_count”: 40622,
“region_weight”: 1,
“region_score”: 3090918,
“region_size”: 3090918,
“start_ts”: “2021-07-06T21:32:14Z”,
“last_heartbeat_ts”: “2021-07-08T08:30:17.177920922Z”,
“uptime”: “34h58m3.177920922s”
}
},
{
“store”: {
“id”: 268391998,
“address”: “10.12.5.119:20160”,
“state”: 1,
“version”: “4.0.13”,
“status_address”: “10.12.5.119:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625578469,
“deploy_path”: “/home/tidb/deploy/bin”,
“last_heartbeat”: 1625733017625156934,
“state_name”: “Offline”
},
“status”: {
“capacity”: “320TiB”,
“available”: “282.5TiB”,
“used_size”: “151.2GiB”,
“leader_count”: 1468,
“leader_weight”: 1,
“leader_score”: 1468,
“leader_size”: 116899,
“region_count”: 4193,
“region_weight”: 1,
“region_score”: 364585,
“region_size”: 364585,
“start_ts”: “2021-07-06T13:34:29Z”,
“last_heartbeat_ts”: “2021-07-08T08:30:17.625156934Z”,
“uptime”: “42h55m48.625156934s”
}
},
{
“store”: {
“id”: 534172077,
“address”: “10.12.5.31:20160”,
“version”: “4.0.13”,
“status_address”: “10.12.5.31:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625607322,
“deploy_path”: “/home/tidb/deploy/tikv-20180/bin”,
“last_heartbeat”: 1625733009731723576,
“state_name”: “Up”
},
“status”: {
“capacity”: “3.444TiB”,
“available”: “2.761TiB”,
“used_size”: “691GiB”,
“leader_count”: 8581,
“leader_weight”: 1,
“leader_score”: 8581,
“leader_size”: 665465,
“region_count”: 21680,
“region_weight”: 1,
“region_score”: 1628862,
“region_size”: 1628862,
“start_ts”: “2021-07-06T21:35:22Z”,
“last_heartbeat_ts”: “2021-07-08T08:30:09.731723576Z”,
“uptime”: “34h54m47.731723576s”
}
},
{
“store”: {
“id”: 534204559,
“address”: “10.12.5.34:20160”,
“version”: “4.0.13”,
“status_address”: “10.12.5.34:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625625018,
“deploy_path”: “/home/tidb/deploy/tikv-20180/bin”,
“last_heartbeat”: 1625733012129945017,
“state_name”: “Up”
},
“status”: {
“capacity”: “3.444TiB”,
“available”: “2.784TiB”,
“used_size”: “667.5GiB”,
“leader_count”: 107,
“leader_weight”: 1,
“leader_score”: 107,
“leader_size”: 6943,
“region_count”: 21712,
“region_weight”: 1,
“region_score”: 1568104,
“region_size”: 1568104,
“start_ts”: “2021-07-07T02:30:18Z”,
“last_heartbeat_ts”: “2021-07-08T08:30:12.129945017Z”,
“uptime”: “29h59m54.129945017s”
}
},
{
“store”: {
“id”: 24478148,
“address”: “10.12.5.236:20160”,
“version”: “4.0.13”,
“status_address”: “10.12.5.236:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625578141,
“deploy_path”: “/home/tidb/deploy/bin”,
“last_heartbeat”: 1625733008702933060,
“state_name”: “Up”
},
“status”: {
“capacity”: “5.952TiB”,
“available”: “3.057TiB”,
“used_size”: “2.191TiB”,
“leader_count”: 58,
“leader_weight”: 2,
“leader_score”: 29,
“leader_size”: 5824,
“region_count”: 78199,
“region_weight”: 2,
“region_score”: 3091112.5,
“region_size”: 6182225,
“start_ts”: “2021-07-06T13:29:01Z”,
“last_heartbeat_ts”: “2021-07-08T08:30:08.70293306Z”,
“uptime”: “43h1m7.70293306s”
}
}
]
}
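A dump like the one above is easier to compare as one row per store. The following is a minimal Python sketch, not part of pd-ctl (the function names are hypothetical), that parses saved `pd-ctl store` output and lists Offline stores first. Because the forum turned the JSON quotes into typographic “ ” quotes, the sketch converts them back before parsing.

```python
import json


def load_pd_json(text: str):
    """Parse pd-ctl JSON output, undoing the typographic quotes a forum paste adds."""
    return json.loads(text.replace("\u201c", '"').replace("\u201d", '"'))


def summarize_stores(text: str) -> list:
    """Return one compact row per store from saved `pd-ctl store` output."""
    rows = []
    for entry in load_pd_json(text)["stores"]:
        store, status = entry["store"], entry["status"]
        rows.append({
            "id": store["id"],
            "address": store["address"],
            "state": store.get("state_name", "Unknown"),
            "regions": status.get("region_count", 0),
            "leaders": status.get("leader_count", 0),
            "capacity": status.get("capacity", "?"),
        })
    # Offline stores first, then the remaining stores by region count (largest first)
    rows.sort(key=lambda r: (r["state"] != "Offline", -r["regions"]))
    return rows


def print_summary(rows: list) -> None:
    for r in rows:
        print(f"{r['id']:>10} {r['address']:<20} {r['state']:<8} "
              f"regions={r['regions']:<6} leaders={r['leaders']:<6} capacity={r['capacity']}")
```

For example, `print_summary(summarize_stores(open("stores.json").read()))`, where `stores.json` is a hypothetical file holding the dump above.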
Config information:
{
“client-urls”: “http://0.0.0.0:2379”,
“peer-urls”: “http://0.0.0.0:2380”,
“advertise-client-urls”: “http://10.12.5.118:2379”,
“advertise-peer-urls”: “http://10.12.5.118:2380”,
“name”: “pd-5”,
“data-dir”: “/data/deploy/install/data/pd-2379”,
“force-new-cluster”: false,
“enable-grpc-gateway”: true,
“initial-cluster”: “pd_pd3=http://10.12.5.115:2380,pd-1=http://10.12.5.116:2380,pd-5=http://10.12.5.118:2380”,
“initial-cluster-state”: “new”,
“initial-cluster-token”: “pd-cluster”,
“join”: “”,
“lease”: 3,
“log”: {
“level”: “”,
“format”: “text”,
“disable-timestamp”: false,
“file”: {
“filename”: “/data/deploy/install/log/pd-2379/pd.log”,
“max-size”: 300,
“max-days”: 0,
“max-backups”: 0
},
“development”: false,
“disable-caller”: false,
“disable-stacktrace”: false,
“disable-error-verbose”: true,
“sampling”: null
},
“tso-save-interval”: “3s”,
“metric”: {
“job”: “pd-5”,
“address”: “”,
“interval”: “15s”
},
“schedule”: {
“max-snapshot-count”: 16,
“max-pending-peer-count”: 64,
“max-merge-region-size”: 20,
“max-merge-region-keys”: 200000,
“split-merge-interval”: “1h0m0s”,
“enable-one-way-merge”: “false”,
“enable-cross-table-merge”: “false”,
“patrol-region-interval”: “20ms”,
“max-store-down-time”: “30m0s”,
“leader-schedule-limit”: 16,
“leader-schedule-policy”: “count”,
“region-schedule-limit”: 100,
“replica-schedule-limit”: 64,
“merge-schedule-limit”: 8,
“hot-region-schedule-limit”: 8,
“hot-region-cache-hits-threshold”: 3,
“store-limit”: {
“24478148”: {
“add-peer”: 15,
“remove-peer”: 15
},
“24480822”: {
“add-peer”: 100,
“remove-peer”: 100
},
“24590972”: {
“add-peer”: 15,
“remove-peer”: 15
},
“256634687”: {
“add-peer”: 15,
“remove-peer”: 15
},
“262397455”: {
“add-peer”: 15,
“remove-peer”: 15
},
“268391998”: {
“add-peer”: 100,
“remove-peer”: 100
},
“38546296”: {
“add-peer”: 15,
“remove-peer”: 15
},
“38833310”: {
“add-peer”: 15,
“remove-peer”: 100000000
},
“534172077”: {
“add-peer”: 15,
“remove-peer”: 15
},
“534172210”: {
“add-peer”: 15,
“remove-peer”: 15
},
“534204559”: {
“add-peer”: 100,
“remove-peer”: 100
},
“534402022”: {
“add-peer”: 15,
“remove-peer”: 15
},
“534414223”: {
“add-peer”: 15,
“remove-peer”: 15
}
},
“tolerant-size-ratio”: 5,
“low-space-ratio”: 0.8,
“high-space-ratio”: 0.6,
“scheduler-max-waiting-operator”: 3,
“enable-remove-down-replica”: “true”,
“enable-replace-offline-replica”: “true”,
“enable-make-up-replica”: “true”,
“enable-remove-extra-replica”: “true”,
“enable-location-replacement”: “true”,
“enable-debug-metrics”: “false”,
“schedulers-v2”: [
{
“type”: “balance-region”,
“args”: null,
“disable”: false,
“args-payload”: “”
},
{
“type”: “balance-leader”,
“args”: null,
“disable”: false,
“args-payload”: “”
},
{
“type”: “hot-region”,
“args”: null,
“disable”: false,
“args-payload”: “”
},
{
“type”: “label”,
“args”: null,
“disable”: false,
“args-payload”: “”
},
{
“type”: “evict-leader”,
“args”: [
“268391998”
],
“disable”: false,
“args-payload”: “”
},
{
“type”: “evict-leader”,
“args”: [
“24478148”
],
“disable”: false,
“args-payload”: “”
}
],
“schedulers-payload”: {
“balance-hot-region-scheduler”: null,
“balance-leader-scheduler”: {
“name”: “balance-leader-scheduler”,
“ranges”: [
{
“end-key”: “”,
“start-key”: “”
}
]
},
“balance-region-scheduler”: {
“name”: “balance-region-scheduler”,
“ranges”: [
{
“end-key”: “”,
“start-key”: “”
}
]
},
“evict-leader-scheduler”: {
“store-id-ranges”: {
“24478148”: [
{
“end-key”: “”,
“start-key”: “”
}
]
}
},
“label-scheduler”: {
“name”: “label-scheduler”,
“ranges”: [
{
“end-key”: “”,
“start-key”: “”
}
]
}
},
“store-limit-mode”: “manual”
},
“replication”: {
“max-replicas”: 3,
“location-labels”: “”,
“strictly-match-label”: “false”,
“enable-placement-rules”: “false”
},
“pd-server”: {
“use-region-storage”: “true”,
“max-gap-reset-ts”: “24h0m0s”,
“key-type”: “table”,
“runtime-services”: “”,
“metric-storage”: “http://10.12.5.232:9090”,
“dashboard-address”: “http://10.12.5.115:2379”,
“trace-region-flow”: “true”
},
“cluster-version”: “4.0.13”,
“quota-backend-bytes”: “8GiB”,
“auto-compaction-mode”: “periodic”,
“auto-compaction-retention-v2”: “1h”,
“TickInterval”: “500ms”,
“ElectionInterval”: “3s”,
“PreVote”: true,
“security”: {
“cacert-path”: “”,
“cert-path”: “”,
“key-path”: “”,
“cert-allowed-cn”: null
},
“label-property”: {},
“WarningMsgs”: null,
“DisableStrictReconfigCheck”: false,
“HeartbeatStreamBindInterval”: “1m0s”,
“LeaderPriorityCheckInterval”: “1m0s”,
“dashboard”: {
“tidb-cacert-path”: “”,
“tidb-cert-path”: “”,
“tidb-key-path”: “”,
“public-path-prefix”: “”,
“internal-proxy”: false,
“enable-telemetry”: true,
“enable-experimental”: false
},
“replication-mode”: {
“replication-mode”: “majority”,
“dr-auto-sync”: {
“label-key”: “”,
“primary”: “”,
“dr”: “”,
“primary-replicas”: 0,
“dr-replicas”: 0,
“wait-store-timeout”: “1m0s”,
“wait-sync-timeout”: “1m0s”
}
},
“enable-redact-log”: false
}
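The `store-limit` section above mixes several rates, and it is easy to miss an outlier by eye. Here is a minimal Python sketch with a hypothetical function name. It assumes 15 is the PD v4.0 default store limit, which matches most entries in this config. The sketch flags non-default entries and entries for store IDs that no longer appear in the `pd-ctl store` output.

```python
# Assumption: 15 is PD v4.0's default store limit (peers added/removed per
# minute per store); most stores in the config above use this value.
DEFAULT_STORE_LIMIT = 15


def review_store_limits(store_limit: dict, live_store_ids: set,
                        default: int = DEFAULT_STORE_LIMIT) -> dict:
    """Flag store-limit entries that differ from the default, and entries for
    store IDs that no longer appear in the `pd-ctl store` output."""
    non_default = {}
    for store_id, limits in store_limit.items():
        diffs = {kind: rate for kind, rate in limits.items() if rate != default}
        if diffs:
            non_default[store_id] = diffs
    stale = sorted(sid for sid in store_limit if int(sid) not in live_store_ids)
    return {"non_default": non_default, "stale": stale}


# A few entries copied from the "store-limit" section of the config above
example = review_store_limits(
    {"24478148": {"add-peer": 15, "remove-peer": 15},
     "38833310": {"add-peer": 15, "remove-peer": 100000000},
     "38546296": {"add-peer": 15, "remove-peer": 15}},
    live_store_ids={24478148, 38833310},
)
print(example)
```

Run against the full `store-limit` map and the 11 store IDs in the dump above, the sketch would flag four stores as non-default. Stores 24480822, 268391998 and 534204559 are at 100/100, and 38833310 has remove-peer set to 100000000. It would also list 256634687 and 38546296 as entries for stores that are not in the current store list.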
Monitoring data:
test-cluster-PD_2021-07-08T08_32_46.407Z.json (3.1 MB)
In the pd-ctl store output above there are two records with store ID 268391998, one Offline and one Up:
{
“store”: {
“id”: 268391998,
“address”: “10.12.5.35:20160”,
“version”: “4.0.13”,
“status_address”: “10.12.5.35:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625756847,
“deploy_path”: “/home/tidb/deploy/tikv-20180/bin”,
“last_heartbeat”: 1625733013980007630,
“state_name”: “Up”
},
“status”: {
“capacity”: “3.444TiB”,
“available”: “3.401TiB”,
“used_size”: “36.02GiB”,
“leader_count”: 5,
“leader_weight”: 1,
“leader_score”: 5,
“leader_size”: 169,
“region_count”: 1195,
“region_weight”: 1,
“region_score”: 85777,
“region_size”: 85777,
“receiving_snap_count”: 9,
“start_ts”: “2021-07-08T15:07:27Z”,
“last_heartbeat_ts”: “2021-07-08T08:30:13.98000763Z”
}
}
{
“store”: {
“id”: 268391998,
“address”: “10.12.5.119:20160”,
“state”: 1,
“version”: “4.0.13”,
“status_address”: “10.12.5.119:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625578469,
“deploy_path”: “/home/tidb/deploy/bin”,
“last_heartbeat”: 1625733017625156934,
“state_name”: “Offline”
},
“status”: {
“capacity”: “320TiB”,
“available”: “282.5TiB”,
“used_size”: “151.2GiB”,
“leader_count”: 1468,
“leader_weight”: 1,
“leader_score”: 1468,
“leader_size”: 116899,
“region_count”: 4193,
“region_weight”: 1,
“region_score”: 364585,
“region_size”: 364585,
“start_ts”: “2021-07-06T13:34:29Z”,
“last_heartbeat_ts”: “2021-07-08T08:30:17.625156934Z”,
“uptime”: “42h55m48.625156934s”
}
}
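Two records sharing one ID, as above, are easy to overlook in a long dump. A minimal Python sketch (hypothetical function name) checks the parsed `stores` list from `pd-ctl store`. It reports any store ID that appears with more than one address.

```python
from collections import defaultdict


def duplicate_store_ids(store_entries: list) -> dict:
    """Given the "stores" list from `pd-ctl store`, return every store ID that
    is reported with more than one address, mapped to those addresses."""
    addresses = defaultdict(set)
    for entry in store_entries:
        store = entry["store"]
        addresses[store["id"]].add(store["address"])
    return {sid: sorted(addrs) for sid, addrs in addresses.items() if len(addrs) > 1}


# The two records quoted above, reduced to the fields the check needs
records = [
    {"store": {"id": 268391998, "address": "10.12.5.35:20160", "state_name": "Up"}},
    {"store": {"id": 268391998, "address": "10.12.5.119:20160", "state_name": "Offline"}},
]
print(duplicate_store_ids(records))
```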
Please confirm that PD has not been through a pd-recover operation, and send the output of pd-ctl store 268391998.
The stores were newly scaled out; no pd-recover has been performed.
store 268391998:
{
“store”: {
“id”: 268391998,
“address”: “10.12.5.119:20160”,
“state”: 1,
“version”: “4.0.13”,
“status_address”: “10.12.5.119:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625578469,
“deploy_path”: “/home/tidb/deploy/bin”,
“last_heartbeat”: 1625736040016426811,
“state_name”: “Offline”
},
“status”: {
“capacity”: “320TiB”,
“available”: “282.4TiB”,
“used_size”: “146.7GiB”,
“leader_count”: 1456,
“leader_weight”: 1,
“leader_score”: 1456,
“leader_size”: 116226,
“region_count”: 4177,
“region_weight”: 1,
“region_score”: 363338,
“region_size”: 363338,
“start_ts”: “2021-07-06T13:34:29Z”,
“last_heartbeat_ts”: “2021-07-08T09:20:40.016426811Z”,
“uptime”: “43h46m11.016426811s”
}
}
The newly added nodes are:
10.12.5.31:534172077
10.12.5.32:534172210
10.12.5.34:534204559
10.12.5.35:534402022
10.12.5.36:534414223
We don't know why the two store IDs were identical earlier; they differ now. One of them, 10.12.5.35, is one of the newly added nodes.
The current store information:
{
“count”: 11,
“stores”: [
{
“store”: {
“id”: 24480822,
“address”: “10.12.5.239:20160”,
“version”: “4.0.13”,
“status_address”: “10.12.5.239:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625578218,
“deploy_path”: “/home/tidb/deploy/bin”,
“last_heartbeat”: 1625736062911807453,
“state_name”: “Up”
},
“status”: {
“capacity”: “5.952TiB”,
“available”: “3.602TiB”,
“used_size”: “2.171TiB”,
“leader_count”: 37483,
“leader_weight”: 2,
“leader_score”: 18741.5,
“leader_size”: 2924735,
“region_count”: 78287,
“region_weight”: 2,
“region_score”: 3080474,
“region_size”: 6160948,
“sending_snap_count”: 4,
“start_ts”: “2021-07-06T13:30:18Z”,
“last_heartbeat_ts”: “2021-07-08T09:21:02.911807453Z”,
“uptime”: “43h50m44.911807453s”
}
},
{
“store”: {
“id”: 24590972,
“address”: “10.12.5.240:20160”,
“version”: “4.0.13”,
“status_address”: “10.12.5.240:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625578262,
“deploy_path”: “/home/tidb/deploy/bin”,
“last_heartbeat”: 1625736065021513483,
“state_name”: “Up”
},
“status”: {
“capacity”: “5.952TiB”,
“available”: “3.737TiB”,
“used_size”: “1.983TiB”,
“leader_count”: 35974,
“leader_weight”: 2,
“leader_score”: 17987,
“leader_size”: 2865883,
“region_count”: 77953,
“region_weight”: 2,
“region_score”: 3080931.5,
“region_size”: 6161863,
“sending_snap_count”: 5,
“start_ts”: “2021-07-06T13:31:02Z”,
“last_heartbeat_ts”: “2021-07-08T09:21:05.021513483Z”,
“uptime”: “43h50m3.021513483s”
}
},
{
“store”: {
“id”: 38833310,
“address”: “10.12.5.147:20160”,
“state”: 1,
“version”: “4.0.13”,
“status_address”: “10.12.5.147:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625578102,
“deploy_path”: “/home/tidb/deploy/bin”,
“last_heartbeat”: 1625736065716564707,
“state_name”: “Offline”
},
“status”: {
“capacity”: “5.952TiB”,
“available”: “3.542TiB”,
“used_size”: “2.084TiB”,
“leader_count”: 31612,
“leader_weight”: 2,
“leader_score”: 15806,
“leader_size”: 2426506,
“region_count”: 75262,
“region_weight”: 2,
“region_score”: 2976809,
“region_size”: 5953618,
“start_ts”: “2021-07-06T13:28:22Z”,
“last_heartbeat_ts”: “2021-07-08T09:21:05.716564707Z”,
“uptime”: “43h52m43.716564707s”
}
},
{
“store”: {
“id”: 534172210,
“address”: “10.12.5.32:20160”,
“version”: “4.0.13”,
“status_address”: “10.12.5.32:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625607928,
“deploy_path”: “/home/tidb/deploy/tikv-20180/bin”,
“last_heartbeat”: 1625736062883413381,
“state_name”: “Up”
},
“status”: {
“capacity”: “3.444TiB”,
“available”: “2.763TiB”,
“used_size”: “689.1GiB”,
“leader_count”: 8636,
“leader_weight”: 1,
“leader_score”: 8636,
“leader_size”: 624934,
“region_count”: 21907,
“region_weight”: 1,
“region_score”: 1633540,
“region_size”: 1633540,
“sending_snap_count”: 1,
“receiving_snap_count”: 1,
“start_ts”: “2021-07-06T21:45:28Z”,
“last_heartbeat_ts”: “2021-07-08T09:21:02.883413381Z”,
“uptime”: “35h35m34.883413381s”
}
},
{
“store”: {
“id”: 534402022,
“address”: “10.12.5.35:20160”,
“version”: “4.0.13”,
“status_address”: “10.12.5.35:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625756847,
“deploy_path”: “/home/tidb/deploy/tikv-20180/bin”,
“last_heartbeat”: 1625736064321908515,
“state_name”: “Up”
},
“status”: {
“capacity”: “3.444TiB”,
“available”: “3.38TiB”,
“used_size”: “58.16GiB”,
“leader_count”: 6,
“leader_weight”: 1,
“leader_score”: 6,
“leader_size”: 264,
“region_count”: 1990,
“region_weight”: 1,
“region_score”: 137747,
“region_size”: 137747,
“receiving_snap_count”: 2,
“start_ts”: “2021-07-08T15:07:27Z”,
“last_heartbeat_ts”: “2021-07-08T09:21:04.321908515Z”
}
},
{
“store”: {
“id”: 534414223,
“address”: “10.12.5.36:20160”,
“version”: “4.0.13”,
“status_address”: “10.12.5.36:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625727955,
“deploy_path”: “/home/tidb/deploy/tikv-20180/bin”,
“last_heartbeat”: 1625736066362915007,
“state_name”: “Up”
},
“status”: {
“capacity”: “3.444TiB”,
“available”: “3.381TiB”,
“used_size”: “57.22GiB”,
“leader_count”: 7,
“leader_weight”: 1,
“leader_score”: 7,
“leader_size”: 418,
“region_count”: 1788,
“region_weight”: 1,
“region_score”: 137692,
“region_size”: 137692,
“receiving_snap_count”: 1,
“start_ts”: “2021-07-08T07:05:55Z”,
“last_heartbeat_ts”: “2021-07-08T09:21:06.362915007Z”,
“uptime”: “2h15m11.362915007s”
}
},
{
“store”: {
“id”: 24478148,
“address”: “10.12.5.236:20160”,
“version”: “4.0.13”,
“status_address”: “10.12.5.236:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625578141,
“deploy_path”: “/home/tidb/deploy/bin”,
“last_heartbeat”: 1625736059064503213,
“state_name”: “Up”
},
“status”: {
“capacity”: “5.952TiB”,
“available”: “3.062TiB”,
“used_size”: “2.186TiB”,
“leader_count”: 58,
“leader_weight”: 2,
“leader_score”: 29,
“leader_size”: 5824,
“region_count”: 77969,
“region_weight”: 2,
“region_score”: 3080723.5,
“region_size”: 6161447,
“start_ts”: “2021-07-06T13:29:01Z”,
“last_heartbeat_ts”: “2021-07-08T09:20:59.064503213Z”,
“uptime”: “43h51m58.064503213s”
}
},
{
“store”: {
“id”: 262397455,
“address”: “10.12.5.13:20160”,
“version”: “4.0.13”,
“status_address”: “10.12.5.13:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625607134,
“deploy_path”: “/home/tidb/deploy/bin”,
“last_heartbeat”: 1625736067518892976,
“state_name”: “Up”
},
“status”: {
“capacity”: “5.952TiB”,
“available”: “4.812TiB”,
“used_size”: “1.072TiB”,
“leader_count”: 17276,
“leader_weight”: 1,
“leader_score”: 17276,
“leader_size”: 1370137,
“region_count”: 40471,
“region_weight”: 1,
“region_score”: 3080919,
“region_size”: 3080919,
“start_ts”: “2021-07-06T21:32:14Z”,
“last_heartbeat_ts”: “2021-07-08T09:21:07.518892976Z”,
“uptime”: “35h48m53.518892976s”
}
},
{
“store”: {
“id”: 268391998,
“address”: “10.12.5.119:20160”,
“state”: 1,
“version”: “4.0.13”,
“status_address”: “10.12.5.119:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625578469,
“deploy_path”: “/home/tidb/deploy/bin”,
“last_heartbeat”: 1625736060017727753,
“state_name”: “Offline”
},
“status”: {
“capacity”: “320TiB”,
“available”: “282.4TiB”,
“used_size”: “146.7GiB”,
“leader_count”: 1456,
“leader_weight”: 1,
“leader_score”: 1456,
“leader_size”: 116226,
“region_count”: 4177,
“region_weight”: 1,
“region_score”: 363338,
“region_size”: 363338,
“start_ts”: “2021-07-06T13:34:29Z”,
“last_heartbeat_ts”: “2021-07-08T09:21:00.017727753Z”,
“uptime”: “43h46m31.017727753s”
}
},
{
“store”: {
“id”: 534172077,
“address”: “10.12.5.31:20160”,
“version”: “4.0.13”,
“status_address”: “10.12.5.31:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625607322,
“deploy_path”: “/home/tidb/deploy/tikv-20180/bin”,
“last_heartbeat”: 1625736060074100471,
“state_name”: “Up”
},
“status”: {
“capacity”: “3.444TiB”,
“available”: “2.759TiB”,
“used_size”: “693GiB”,
“leader_count”: 8630,
“leader_weight”: 1,
“leader_score”: 8630,
“leader_size”: 668938,
“region_count”: 21728,
“region_weight”: 1,
“region_score”: 1632767,
“region_size”: 1632767,
“start_ts”: “2021-07-06T21:35:22Z”,
“last_heartbeat_ts”: “2021-07-08T09:21:00.074100471Z”,
“uptime”: “35h45m38.074100471s”
}
},
{
“store”: {
“id”: 534204559,
“address”: “10.12.5.34:20160”,
“version”: “4.0.13”,
“status_address”: “10.12.5.34:20180”,
“git_hash”: “a448d617f79ddf545be73931525bb41af0f790f3”,
“start_timestamp”: 1625625018,
“deploy_path”: “/home/tidb/deploy/tikv-20180/bin”,
“last_heartbeat”: 1625736062714283266,
“state_name”: “Up”
},
“status”: {
“capacity”: “3.444TiB”,
“available”: “2.758TiB”,
“used_size”: “694.3GiB”,
“leader_count”: 145,
“leader_weight”: 1,
“leader_score”: 145,
“leader_size”: 9278,
“region_count”: 22518,
“region_weight”: 1,
“region_score”: 1631772,
“region_size”: 1631772,
“receiving_snap_count”: 1,
“start_ts”: “2021-07-07T02:30:18Z”,
“last_heartbeat_ts”: “2021-07-08T09:21:02.714283266Z”,
“uptime”: “30h50m44.714283266s”
}
}
]
}
To summarize:
Stores being scaled in (going offline):
10.12.5.147:38833310
10.12.5.119:268391998
Stores being scaled out (coming online):
10.12.5.31:534172077
10.12.5.32:534172210
10.12.5.34:534204559
10.12.5.35:534402022
10.12.5.36:534414223
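The two store dumps in this thread were taken about 51 minutes apart. Together they give a rough scale-in rate for each offline store. The following minimal Python sketch uses hypothetical function names. It reads `region_count` and `last_heartbeat_ts` from both snapshots and extrapolates linearly. That assumes the rate stays constant, which it will not do exactly once limits are changed.

```python
from datetime import datetime


def parse_pd_ts(ts: str) -> datetime:
    """Parse PD's last_heartbeat_ts, dropping the nanosecond fraction."""
    return datetime.strptime(ts[:19], "%Y-%m-%dT%H:%M:%S")


def scale_in_eta(before: tuple, after: tuple):
    """before/after are (region_count, last_heartbeat_ts) snapshots of one
    Offline store. Returns (regions moved per minute, hours remaining), or
    None if the count did not drop. Assumes a constant rate."""
    (count0, ts0), (count1, ts1) = before, after
    minutes = (parse_pd_ts(ts1) - parse_pd_ts(ts0)).total_seconds() / 60
    moved = count0 - count1
    if moved <= 0 or minutes <= 0:
        return None
    rate = moved / minutes
    return rate, count1 / rate / 60


# Values for the two Offline stores, taken from the two pd-ctl store dumps above
for store_id, before, after in [
    (38833310, (76687, "2021-07-08T08:30:12.081441916Z"),
               (75262, "2021-07-08T09:21:05.716564707Z")),
    (268391998, (4193, "2021-07-08T08:30:17.625156934Z"),
                (4177, "2021-07-08T09:21:00.017727753Z")),
]:
    rate, hours = scale_in_eta(before, after)
    print(f"store {store_id}: {rate:.2f} regions/min, ~{hours:.0f} h left")
```

Under the constant-rate assumption, store 38833310 moves roughly 28 regions per minute, which leaves about 45 hours. Store 268391998 moves only about 0.3 regions per minute, which would take over nine days.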
The cluster currently has stores both coming online and going offline, so PD scheduling may contend between the two. Please collect the following:
>> scheduler config evict-leader-scheduler
>> operator show leader
>> operator show region
Hi, I'm a colleague of the original poster. Here is the output of the three commands above:
scheduler config evict-leader-scheduler
scheduler commands
Usage:
scheduler [command]
Available Commands:
add add a scheduler
remove remove a scheduler
show show schedulers
Use "help scheduler [command] " for more information about a command.
operator show leader: leader.txt (17.0 KB)
operator show region: region.txt (43.5 KB)
The operator region and operator leader output show that most operators are replace-offline-replica, i.e. scheduling for taking stores offline, but they are progressing slowly.
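Tallying a long operator list by kind makes the replace-offline-replica share easy to quantify. Below is a minimal Python sketch with a hypothetical function name. It assumes that `operator show` prints a JSON array of operator description strings, each starting with the operator name (for example `replace-offline-replica {...}`). That format is an assumption, not something confirmed in this thread.

```python
import json
from collections import Counter


def count_operator_kinds(operator_show_output: str) -> Counter:
    """Count operators by the name that starts each operator description.

    Assumes a JSON array of strings such as
    "replace-offline-replica {...} (kind:region,replica, ...)"; an empty
    result printed as null is treated as no operators.
    """
    text = operator_show_output.replace("\u201c", '"').replace("\u201d", '"')
    operators = json.loads(text) or []
    return Counter(op.strip().split(" ", 1)[0] for op in operators if op.strip())
```

For instance, `count_operator_kinds(open("region.txt").read()).most_common()` would list the operator names in region.txt by frequency, if the file matches the assumed format.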
The PD monitoring data provided yesterday shows roughly 17,000+ empty regions in the cluster. Please evaluate the following steps:
1. (Already done, no need to repeat) Remove the evict-leader scheduler on store 24478148 (which is not being taken offline) and add an evict-leader scheduler on store 38833310. Reference:
https://docs.pingcap.com/zh/tidb/v4.0/pd-control#scheduler-show--add--remove--pause--resume--config
https://asktug.com/t/topic/94879
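2. Lower the scheduling limits related to taking stores offline and raise those related to region merge, to speed up merging the empty regions (reference below). The cluster does not appear to have any TiFlash nodes, but check the placement-rules setting before making this change: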
https://docs.pingcap.com/zh/tidb/v4.0/pd-scheduling-best-practices#region-merge-速度慢
3. Once the empty regions are mostly merged, reset the merge-related parameters to their defaults.
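4. Raise replica-schedule-limit, the parameter that governs scheduling for offline stores (currently 64; once region merge is done, increase it moderately from there). The store-limit settings are also uneven across stores: keep the default store-limit on the original nodes and moderately raise it on the new and offline nodes (the remove-peer limit on store 38833310 is 100000000, which is far too large).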
5. Once the offline stores reach the Tombstone state, reset replica-schedule-limit and the other parameters to their defaults.
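Note: raising scheduling parameters affects the response time of the whole cluster to some degree. We strongly recommend making these changes during off-peak hours and reverting to the defaults before peak hours, to limit the impact on the business.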
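The five steps above map onto a handful of pd-ctl commands typed at the `>>` prompt. The Python sketch below only builds those command strings so the order of the phases is explicit. It does not run anything. As far as we know, the command forms follow the v4.0 pd-control docs referenced above. The function, its parameters and all numeric values are hypothetical placeholders, not recommendations, so check each command against your pd-ctl version first.

```python
def tuning_plan(evict_remove, evict_add, offline_ids, new_ids, baseline, *,
                merge_limit, replica_low, replica_high, store_rate,
                default_store_limit=15):
    """Return pd-ctl command strings grouped by the five numbered steps.

    baseline holds the values to restore afterwards, e.g. the ones in the
    config shown earlier; every numeric argument is a caller-chosen placeholder.
    """
    return {
        # 1. Stop evicting leaders from a store that stays; evict from one that leaves
        "step1": [f"scheduler remove evict-leader-scheduler-{evict_remove}",
                  f"scheduler add evict-leader-scheduler {evict_add}"],
        # 2. Check enable-placement-rules, then favour merge over offline replica moves
        "step2": ["config show replication",
                  f"config set replica-schedule-limit {replica_low}",
                  f"config set merge-schedule-limit {merge_limit}"],
        # 3. Empty regions mostly merged: restore the merge limit
        "step3": [f"config set merge-schedule-limit {baseline['merge-schedule-limit']}"],
        # 4. Speed up scale-in: raise the replica limit and per-store limits
        #    on the offline and newly added stores only
        "step4": [f"config set replica-schedule-limit {replica_high}"]
                 + [f"store limit {sid} {store_rate}" for sid in offline_ids + new_ids],
        # 5. Offline stores are Tombstone: restore the replica limit and the
        #    new stores' limits
        "step5": [f"config set replica-schedule-limit {baseline['replica-schedule-limit']}"]
                 + [f"store limit {sid} {default_store_limit}" for sid in new_ids],
    }
```

Keeping the phases separate makes it easier to apply each one only during off-peak hours and to revert it before peak traffic, as the note above advises.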
Also, out of curiosity: the store information shows an unusually large capacity (320TiB) for store 268391998. Could you tell us what type of disk it is on?
As shown below, the scale-in of the .147 node (10.12.5.147) has been stuck for a while; its region count is not going down.
region store 38833310
{
“count”: 13,
“regions”: [
{
“id”: 248409811,
“start_key”: “748000000000029DFF3600000000000000F8”,
“end_key”: “748000000000029DFF3F00000000000000F8”,
“epoch”: {
“conf_ver”: 79451,
“version”: 65200
},
“peers”: [
{
“id”: 299906990,
“store_id”: 38833310
},
{
“id”: 514464857,
“store_id”: 24478148
},
{
“id”: 531390089,
“store_id”: 262397455
},
{
“id”: 533807673,
“store_id”: 24590972,
“is_learner”: true
}
],
“leader”: {
“id”: 299906990,
“store_id”: 38833310
},
“down_peers”: [
{
“peer”: {
“id”: 533807673,
“store_id”: 24590972,
“is_learner”: true
},
“down_seconds”: 25979
}
],
“pending_peers”: [
{
“id”: 533807673,
“store_id”: 24590972,
“is_learner”: true
}
],
“written_bytes”: 0,
“read_bytes”: 0,
“written_keys”: 0,
“read_keys”: 0,
“approximate_size”: 37,
“approximate_keys”: 0
},
{
“id”: 9321369,
“start_key”: “7480000000000004FFBA00000000000000F8”,
“end_key”: “7480000000000004FFD200000000000000F8”,
“epoch”: {
“conf_ver”: 1449,
“version”: 7577
},
“peers”: [
{
“id”: 262213308,
“store_id”: 38833310
},
{
“id”: 319914998,
“store_id”: 262397455
},
{
“id”: 322225697,
“store_id”: 24590972,
“is_learner”: true
}
],
“leader”: {
“id”: 319914998,
“store_id”: 262397455
},
“down_peers”: [
{
“peer”: {
“id”: 322225697,
“store_id”: 24590972,
“is_learner”: true
},
“down_seconds”: 308
}
],
“pending_peers”: [
{
“id”: 322225697,
“store_id”: 24590972,
“is_learner”: true
}
],
“written_bytes”: 0,
“read_bytes”: 0,
“written_keys”: 0,
“read_keys”: 0,
“approximate_size”: 1,
“approximate_keys”: 0
},
{
“id”: 10531674,
“start_key”: “748000000000001FFFA300000000000000F8”,
“end_key”: “7480000000000022FF4100000000000000F8”,
“epoch”: {
“conf_ver”: 1481,
“version”: 10095
},
“peers”: [
{
“id”: 264065756,
“store_id”: 262397455
},
{
“id”: 268649089,
“store_id”: 38833310
},
{
“id”: 326611569,
“store_id”: 24590972,
“is_learner”: true
}
],
“leader”: {
“id”: 264065756,
“store_id”: 262397455
},
“down_peers”: [
{
“peer”: {
“id”: 326611569,
“store_id”: 24590972,
“is_learner”: true
},
“down_seconds”: 300
}
],
“pending_peers”: [
{
“id”: 326611569,
“store_id”: 24590972,
“is_learner”: true
}
],
“written_bytes”: 0,
“read_bytes”: 0,
“written_keys”: 0,
“read_keys”: 0,
“approximate_size”: 1,
“approximate_keys”: 0
},
{
“id”: 10570162,
“start_key”: “7480000000000024FF4E00000000000000F8”,
“end_key”: “7480000000000027FFC300000000000000F8”,
“epoch”: {
“conf_ver”: 1437,
“version”: 10552
},
“peers”: [
{
“id”: 260587441,
“store_id”: 38833310
},
{
“id”: 318339335,
“store_id”: 262397455
},
{
“id”: 322137698,
“store_id”: 24590972,
“is_learner”: true
}
],
“leader”: {
“id”: 318339335,
“store_id”: 262397455
},
“down_peers”: [
{
“peer”: {
“id”: 322137698,
“store_id”: 24590972,
“is_learner”: true
},
“down_seconds”: 304
}
],
“pending_peers”: [
{
“id”: 322137698,
“store_id”: 24590972,
“is_learner”: true
}
],
“written_bytes”: 0,
“read_bytes”: 0,
“written_keys”: 0,
“read_keys”: 0,
“approximate_size”: 1,
“approximate_keys”: 0
},
{
“id”: 10577712,
“start_key”: “7480000000000027FFC300000000000000F8”,
“end_key”: “7480000000000029FFD600000000000000F8”,
“epoch”: {
“conf_ver”: 1479,
“version”: 10729
},
“peers”: [
{
“id”: 297424880,
“store_id”: 24480822
},
{
“id”: 298924143,
“store_id”: 38833310
},
{
“id”: 319912825,
“store_id”: 24590972,
“is_learner”: true
}
],
“leader”: {
“id”: 297424880,
“store_id”: 24480822
},
“down_peers”: [
{
“peer”: {
“id”: 319912825,
“store_id”: 24590972,
“is_learner”: true
},
“down_seconds”: 342
}
],
“pending_peers”: [
{
“id”: 319912825,
“store_id”: 24590972,
“is_learner”: true
}
],
“written_bytes”: 0,
“read_bytes”: 0,
“written_keys”: 0,
“read_keys”: 0,
“approximate_size”: 1,
“approximate_keys”: 0
},
{
“id”: 12214715,
“start_key”: “7480000000000040FF6700000000000000F8”,
“end_key”: “7480000000000048FF2000000000000000F8”,
“epoch”: {
“conf_ver”: 1491,
“version”: 13315
},
“peers”: [
{
“id”: 297946494,
“store_id”: 38833310
},
{
“id”: 310494211,
“store_id”: 262397455
},
{
“id”: 319816817,
“store_id”: 24590972,
“is_learner”: true
}
],
“leader”: {
“id”: 310494211,
“store_id”: 262397455
},
“down_peers”: [
{
“peer”: {
“id”: 319816817,
“store_id”: 24590972,
“is_learner”: true
},
“down_seconds”: 13525
}
],
“pending_peers”: [
{
“id”: 319816817,
“store_id”: 24590972,
“is_learner”: true
}
],
“written_bytes”: 0,
“read_bytes”: 0,
“written_keys”: 0,
“read_keys”: 0,
“approximate_size”: 1,
“approximate_keys”: 0
},
{
“id”: 23560887,
“start_key”: “7480000000000174FF5600000000000000F8”,
“end_key”: “7480000000000174FF5C00000000000000F8”,
“epoch”: {
“conf_ver”: 6970,
“version”: 38842
},
“peers”: [
{
“id”: 257026531,
“store_id”: 38833310
},
{
“id”: 320239368,
“store_id”: 24480822
},
{
“id”: 326350353,
“store_id”: 24590972,
“is_learner”: true
}
],
“leader”: {
“id”: 320239368,
“store_id”: 24480822
},
“down_peers”: [
{
“peer”: {
“id”: 326350353,
“store_id”: 24590972,
“is_learner”: true
},
“down_seconds”: 327
}
],
“pending_peers”: [
{
“id”: 326350353,
“store_id”: 24590972,
“is_learner”: true
}
],
“written_bytes”: 0,
“read_bytes”: 0,
“written_keys”: 0,
“read_keys”: 0,
“approximate_size”: 1,
“approximate_keys”: 0
},
{
“id”: 23757219,
“start_key”: “7480000000000179FFD800000000000000F8”,
“end_key”: “7480000000000179FFDE00000000000000F8”,
“epoch”: {
“conf_ver”: 7926,
“version”: 39312
},
“peers”: [
{
“id”: 257919297,
“store_id”: 38833310
},
{
“id”: 319109362,
“store_id”: 24480822
},
{
“id”: 518804123,
“store_id”: 262397455
},
{
“id”: 533458278,
“store_id”: 24590972,
“is_learner”: true
}
],
“leader”: {
“id”: 319109362,
“store_id”: 24480822
},
“down_peers”: [
{
“peer”: {
“id”: 533458278,
“store_id”: 24590972,
“is_learner”: true
},
“down_seconds”: 328
}
],
“pending_peers”: [
{
“id”: 533458278,
“store_id”: 24590972,
“is_learner”: true
}
],
“written_bytes”: 0,
“read_bytes”: 0,
“written_keys”: 0,
“read_keys”: 0,
“approximate_size”: 1,
“approximate_keys”: 0
},
{
“id”: 80237503,
“start_key”: “74800000000001E8FFCB00000000000000F8”,
“end_key”: “74800000000001E8FFD700000000000000F8”,
“epoch”: {
“conf_ver”: 18635,
“version”: 48749
},
“peers”: [
{
“id”: 256901413,
“store_id”: 24480822
},
{
“id”: 298923235,
“store_id”: 38833310
},
{
“id”: 320209350,
“store_id”: 24590972,
“is_learner”: true
}
],
“leader”: {
“id”: 256901413,
“store_id”: 24480822
},
“down_peers”: [
{
“peer”: {
“id”: 320209350,
“store_id”: 24590972,
“is_learner”: true
},
“down_seconds”: 362
}
],
“pending_peers”: [
{
“id”: 320209350,
“store_id”: 24590972,
“is_learner”: true
}
],
“written_bytes”: 0,
“read_bytes”: 0,
“written_keys”: 0,
“read_keys”: 0,
“approximate_size”: 40,
“approximate_keys”: 40960
},
{
“id”: 80245844,
“start_key”: “74800000000001E9FF8500000000000000F8”,
“end_key”: “74800000000001E9FF8E00000000000000F8”,
“epoch”: {
“conf_ver”: 18637,
“version”: 48809
},
“peers”: [
{
“id”: 246312297,
“store_id”: 38833310
},
{
“id”: 515784436,
“store_id”: 24480822
},
{
“id”: 529761379,
“store_id”: 24478148
},
{
“id”: 533454378,
“store_id”: 24590972,
“is_learner”: true
}
],
“leader”: {
“id”: 515784436,
“store_id”: 24480822
},
“down_peers”: [
{
“peer”: {
“id”: 533454378,
“store_id”: 24590972,
“is_learner”: true
},
“down_seconds”: 266221
}
],
“pending_peers”: [
{
“id”: 533454378,
“store_id”: 24590972,
“is_learner”: true
}
],
“written_bytes”: 0,
“read_bytes”: 0,
“written_keys”: 0,
“read_keys”: 0,
“approximate_size”: 28,
“approximate_keys”: 0
},
{
“id”: 80660222,
“start_key”: “748000000000020AFF9B00000000000000F8”,
“end_key”: “748000000000020AFFB000000000000000F8”,
“epoch”: {
“conf_ver”: 18623,
“version”: 52142
},
“peers”: [
{
“id”: 258296254,
“store_id”: 38833310
},
{
“id”: 319804898,
“store_id”: 24480822
},
{
“id”: 326369783,
“store_id”: 24590972,
“is_learner”: true
}
],
“leader”: {
“id”: 319804898,
“store_id”: 24480822
},
“down_peers”: [
{
“peer”: {
“id”: 326369783,
“store_id”: 24590972,
“is_learner”: true
},
“down_seconds”: 346
}
],
“pending_peers”: [
{
“id”: 326369783,
“store_id”: 24590972,
“is_learner”: true
}
],
“written_bytes”: 0,
“read_bytes”: 0,
“written_keys”: 0,
“read_keys”: 0,
“approximate_size”: 59,
“approximate_keys”: 42933
},
{
“id”: 177385916,
“start_key”: “7480000000000234FFB500000000000000F8”,
“end_key”: “7480000000000234FFC100000000000000F8”,
“epoch”: {
“conf_ver”: 31031,
“version”: 55882
},
“peers”: [
{
“id”: 247994505,
“store_id”: 24480822
},
{
“id”: 297362637,
“store_id”: 38833310
},
{
“id”: 320241481,
“store_id”: 24590972,
“is_learner”: true
}
],
“leader”: {
“id”: 247994505,
“store_id”: 24480822
},
“down_peers”: [
{
“peer”: {
“id”: 320241481,
“store_id”: 24590972,
“is_learner”: true
},
“down_seconds”: 344
}
],
“pending_peers”: [
{
“id”: 320241481,
“store_id”: 24590972,
“is_learner”: true
}
],
“written_bytes”: 0,
“read_bytes”: 0,
“written_keys”: 0,
“read_keys”: 0,
“approximate_size”: 39,
“approximate_keys”: 0
},
{
“id”: 249205334,
“start_key”: “74800000000002B1FF585F728000000000FF2D2E260000000000FA”,
“end_key”: “74800000000002B1FF585F728000000000FF3394730000000000FA”,
“epoch”: {
“conf_ver”: 91458,
“version”: 66935
},
“peers”: [
{
“id”: 249317519,
“store_id”: 24480822
},
{
“id”: 298207941,
“store_id”: 38833310
},
{
“id”: 326500160,
“store_id”: 24590972,
“is_learner”: true
}
],
“leader”: {
“id”: 249317519,
“store_id”: 24480822
},
“down_peers”: [
{
“peer”: {
“id”: 326500160,
“store_id”: 24590972,
“is_learner”: true
},
“down_seconds”: 349
}
],
“pending_peers”: [
{
“id”: 326500160,
“store_id”: 24590972,
“is_learner”: true
}
],
“written_bytes”: 0,
“read_bytes”: 0,
“written_keys”: 0,
“read_keys”: 0,
“approximate_size”: 58,
“approximate_keys”: 419724
}
]
}
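The output above shows 13 regions left on store 38833310. In every one of them, the learner on store 24590972 (10.12.5.240) is listed as both pending and down, and the longest has been down for 266221 seconds (about 74 hours). Seven of the 13 regions have an approximate_size of 1 or less. That pattern may mean the learner on 24590972 is not catching up, rather than a problem with 38833310 itself, but that is only an inference from this one output. Below is a minimal Python sketch (hypothetical function name) that produces this summary from saved `region store <id>` output. It assumes PD treats a region as empty when approximate_size is at most 1 MiB.

```python
import json
from collections import Counter

# Assumption: PD treats a region as empty when approximate_size <= 1 (MiB).
EMPTY_REGION_SIZE_MB = 1


def diagnose_regions(region_store_output: str) -> dict:
    """Summarize saved `region store <id>` output from pd-ctl."""
    text = region_store_output.replace("\u201c", '"').replace("\u201d", '"')
    regions = json.loads(text).get("regions") or []
    pending_by_store = Counter()
    pending_learners = 0
    max_down_seconds = 0
    for region in regions:
        for peer in region.get("pending_peers") or []:
            pending_by_store[peer["store_id"]] += 1
            pending_learners += bool(peer.get("is_learner"))
        for down in region.get("down_peers") or []:
            max_down_seconds = max(max_down_seconds, down.get("down_seconds", 0))
    empty = sum(1 for r in regions
                if r.get("approximate_size", 0) <= EMPTY_REGION_SIZE_MB)
    return {
        "regions": len(regions),
        "pending_by_store": dict(pending_by_store),
        "pending_learners": pending_learners,
        "max_down_seconds": max_down_seconds,
        "empty_regions": empty,
    }
```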
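PD's region information did not match the monitoring, so we manually transferred the regions until PD listed none of them,
but the monitoring still shows regions remaining.
Has the problem been resolved? Did the speed improve after adjusting these parameters?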