TiKV: CPU spikes on two random nodes

To improve efficiency, please provide the following information when asking a question; a clearly described problem gets a faster response.

  • [TiDB version]: 4.0
  • [Problem description]: With only about 10 connections, CPU usage on two TiKV nodes intermittently spikes to 1000%, making the cluster unavailable.

After checking the cluster monitoring, we found the Unified Read Pool latency is very high, and the logs show leader re-elections happening. What could the problem be?

Hi, could you check whether tidb duration, QPS, and slow queries show an obvious increase?


All of them increased, at around 9:10.

Please use the following method to provide the monitoring PDFs for tikv-detail and tikv-trouble-shooting, thanks for your cooperation.

Open the Grafana monitoring page; pressing d and then shift+e expands all monitoring panels.

(1) Install this Chrome extension: https://chrome.google.com/webstore/detail/full-page-screen-capture/fdpohaocaechififmbbbbbknoalclacl

(2) With the mouse focused on the dashboard, press ? to show all shortcuts; press d and then E to expand all rows' panels, then wait a while for the page to finish loading.

(3) Use the full-page-screen-capture extension to capture and save the screenshot.

https://space.dingtalk.com/s/gwHOAiKzIgLOLUeCBQPaACAwNjNjZGJmOGU1M2M0ZThmYmNmYmRiMzIzOWMwNzg2MA password: tWeL
https://space.dingtalk.com/s/gwHOAiKzIwLOLUeCBQPaACA2NDI1OTM0Y2IzYzk0ZTYwODcxNDNjMGU4OGYzN2NhZQ password: Bm2c
The files are too large to upload here, so these are DingTalk drive links. Could you please help look into the problem?

Could you upload a copy to Baidu Netdisk instead? DingTalk isn't very convenient on our side.

Link: https://pan.baidu.com/s/11evrvMUAcPEF8Ngi1W5j4A
Extraction code: hhy9

Here's the Baidu link; please help take a look.

[monitoring screenshot]
In the monitoring, 1.102 has no region leaders, so when requests come in, the other two TiKV nodes may be taking all the load and computation. Please run the commands below to check the cluster's current scheduling status, and paste back the text output.

pd-ctl scheduler show
pd-ctl config show all

config show all
{
  "client-urls": "http://192.168.1.107:2379",
  "peer-urls": "http://192.168.1.107:2380",
  "advertise-client-urls": "http://192.168.1.107:2379",
  "advertise-peer-urls": "http://192.168.1.107:2380",
  "name": "pd-192.168.1.107-2379",
  "data-dir": "/rhzy/tidb/tidb-data/pd-2379",
  "force-new-cluster": false,
  "enable-grpc-gateway": true,
  "initial-cluster": "pd-192.168.1.106-2379=http://192.168.1.106:2380,pd-192.168.1.107-2379=http://192.168.1.107:2380,pd-192.168.1.108-2379=http://192.168.1.108:2380",
  "initial-cluster-state": "new",
  "join": "",
  "lease": 3,
  "log": {
    "level": "",
    "format": "text",
    "disable-timestamp": false,
    "file": {
      "filename": "/rhzy/tidb/log/pd.log",
      "max-size": 300,
      "max-days": 0,
      "max-backups": 0
    },
    "development": false,
    "disable-caller": false,
    "disable-stacktrace": false,
    "disable-error-verbose": true,
    "sampling": null
  },
  "tso-save-interval": "3s",
  "metric": {
    "job": "pd-192.168.1.107-2379",
    "address": "",
    "interval": "15s"
  },
  "schedule": {
    "max-snapshot-count": 3,
    "max-pending-peer-count": 16,
    "max-merge-region-size": 20,
    "max-merge-region-keys": 200000,
    "split-merge-interval": "1h0m0s",
    "enable-one-way-merge": "false",
    "enable-cross-table-merge": "false",
    "patrol-region-interval": "100ms",
    "max-store-down-time": "30m0s",
    "leader-schedule-limit": 4,
    "leader-schedule-policy": "count",
    "region-schedule-limit": 2048,
    "replica-schedule-limit": 64,
    "merge-schedule-limit": 8,
    "hot-region-schedule-limit": 4,
    "hot-region-cache-hits-threshold": 3,
    "store-balance-rate": 15,
    "tolerant-size-ratio": 0,
    "low-space-ratio": 0.8,
    "high-space-ratio": 0.7,
    "scheduler-max-waiting-operator": 5,
    "enable-remove-down-replica": "true",
    "enable-replace-offline-replica": "true",
    "enable-make-up-replica": "true",
    "enable-remove-extra-replica": "true",
    "enable-location-replacement": "true",
    "enable-debug-metrics": "false",
    "schedulers-v2": [
      {
        "type": "balance-region",
        "args": null,
        "disable": false,
        "args-payload": ""
      },
      {
        "type": "balance-leader",
        "args": null,
        "disable": false,
        "args-payload": ""
      },
      {
        "type": "hot-region",
        "args": null,
        "disable": false,
        "args-payload": ""
      },
      {
        "type": "label",
        "args": null,
        "disable": false,
        "args-payload": ""
      },
      {
        "type": "evict-leader",
        "args": [
          "1"
        ],
        "disable": false,
        "args-payload": ""
      }
    ],
    "schedulers-payload": {
      "balance-hot-region-scheduler": "null",
      "balance-leader-scheduler": "{\"name\":\"balance-leader-scheduler\",\"ranges\":[{\"start-key\":\"\",\"end-key\":\"\"}]}",
      "balance-region-scheduler": "{\"name\":\"balance-region-scheduler\",\"ranges\":[{\"start-key\":\"\",\"end-key\":\"\"}]}",
      "evict-leader-scheduler": "{\"store-id-ranges\":{\"1\":[{\"start-key\":\"\",\"end-key\":\"\"}]}}",
      "label-scheduler": "{\"name\":\"label-scheduler\",\"ranges\":[{\"start-key\":\"\",\"end-key\":\"\"}]}"
    },
    "store-limit-mode": "manual"
  },
  "replication": {
    "max-replicas": 3,
    "location-labels": "",
    "strictly-match-label": "false",
    "enable-placement-rules": "true"
  },
  "pd-server": {
    "use-region-storage": "true",
    "max-gap-reset-ts": "24h0m0s",
    "key-type": "table",
    "runtime-services": "",
    "metric-storage": "http://192.168.1.99:9090",
    "dashboard-address": "http://192.168.1.107:2379"
  },
  "cluster-version": "4.0.0",
  "quota-backend-bytes": "8GiB",
  "auto-compaction-mode": "periodic",
  "auto-compaction-retention-v2": "1h",
  "TickInterval": "500ms",
  "ElectionInterval": "3s",
  "PreVote": true,
  "security": {
    "cacert-path": "",
    "cert-path": "",
    "key-path": "",
    "cert-allowed-cn": null
  },
  "label-property": {},
  "WarningMsgs": null,
  "DisableStrictReconfigCheck": false,
  "HeartbeatStreamBindInterval": "1m0s",
  "LeaderPriorityCheckInterval": "1m0s",
  "dashboard": {
    "tidb_cacert_path": "",
    "tidb_cert_path": "",
    "tidb_key_path": "",
    "public_path_prefix": "/dashboard"
  },
  "replication-mode": {
    "replication-mode": "majority",
    "dr-auto-sync": {
      "label-key": "",
      "primary": "",
      "dr": "",
      "primary-replicas": 0,
      "dr-replicas": 0,
      "wait-store-timeout": "1m0s",
      "wait-sync-timeout": "1m0s"
    }
  }
}
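Note the enabled `evict-leader` entry in the `schedulers-v2` list above, with `args: ["1"]`: it keeps all region leaders off store 1. The targets can be pulled out programmatically; a minimal Python sketch (using a trimmed, illustrative sample of the config output above):

```python
import json

# Trimmed `pd-ctl config show all` sample: only the scheduler section,
# with the field names as in the PD 4.0 output above.
config = json.loads("""
{
  "schedule": {
    "schedulers-v2": [
      {"type": "balance-region", "args": null, "disable": false},
      {"type": "balance-leader", "args": null, "disable": false},
      {"type": "hot-region", "args": null, "disable": false},
      {"type": "label", "args": null, "disable": false},
      {"type": "evict-leader", "args": ["1"], "disable": false}
    ]
  }
}
""")

# An enabled evict-leader scheduler moves all region leaders off the
# listed store ids -- here store 1, which is 192.168.1.102.
evicted = [s["args"] for s in config["schedule"]["schedulers-v2"]
           if s["type"] == "evict-leader" and not s["disable"]]
print("evict-leader targets:", evicted)
```

This explains the zero-leader store seen in the monitoring: one of the three TiKV nodes serves no leader traffic, so the other two absorb all of it.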

scheduler show
[
  "balance-hot-region-scheduler",
  "balance-leader-scheduler",
  "balance-region-scheduler",
  "evict-leader-scheduler",
  "label-scheduler"
]

pd-ctl scheduler remove evict-leader-scheduler

Observe for a few minutes and check via pd-ctl store whether the leader counts are balanced.

It's balanced now, though CPU usage on all three TiKV nodes is fairly high.

store
{
  "count": 4,
  "stores": [
    {
      "store": {
        "id": 1,
        "address": "192.168.1.102:20160",
        "version": "4.0.0",
        "status_address": "192.168.1.102:20180",
        "git_hash": "198a2cea01734ce8f46d55a29708f123f9133944",
        "start_timestamp": 1594276572,
        "deploy_path": "/rhzy/tidb/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1594283619138186664,
        "state_name": "Up"
      },
      "status": {
        "capacity": "688.9GiB",
        "available": "672.4GiB",
        "used_size": "10.7GiB",
        "leader_count": 578,
        "leader_weight": 1,
        "leader_score": 578,
        "leader_size": 16646,
        "region_count": 1745,
        "region_weight": 1,
        "region_score": 49451,
        "region_size": 49451,
        "start_ts": "2020-07-09T14:36:12+08:00",
        "last_heartbeat_ts": "2020-07-09T16:33:39.138186664+08:00",
        "uptime": "1h57m27.138186664s"
      }
    },
    {
      "store": {
        "id": 4,
        "address": "192.168.1.100:20160",
        "version": "4.0.0",
        "status_address": "192.168.1.100:20180",
        "git_hash": "198a2cea01734ce8f46d55a29708f123f9133944",
        "start_timestamp": 1594276575,
        "deploy_path": "/rhzy/tidb/tidb-deploy/tikv/bin",
        "last_heartbeat": 1594283614391452681,
        "state_name": "Up"
      },
      "status": {
        "capacity": "688.9GiB",
        "available": "672.3GiB",
        "used_size": "10.65GiB",
        "leader_count": 587,
        "leader_weight": 1,
        "leader_score": 587,
        "leader_size": 15919,
        "region_count": 1745,
        "region_weight": 1,
        "region_score": 49451,
        "region_size": 49451,
        "start_ts": "2020-07-09T14:36:15+08:00",
        "last_heartbeat_ts": "2020-07-09T16:33:34.391452681+08:00",
        "uptime": "1h57m19.391452681s"
      }
    },
    {
      "store": {
        "id": 5,
        "address": "192.168.1.101:20160",
        "version": "4.0.0",
        "status_address": "192.168.1.101:20180",
        "git_hash": "198a2cea01734ce8f46d55a29708f123f9133944",
        "start_timestamp": 1594276573,
        "deploy_path": "/rhzy/tidb/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1594283619692393027,
        "state_name": "Up"
      },
      "status": {
        "capacity": "688.9GiB",
        "available": "672.4GiB",
        "used_size": "10.52GiB",
        "leader_count": 580,
        "leader_weight": 1,
        "leader_score": 580,
        "leader_size": 16886,
        "region_count": 1745,
        "region_weight": 1,
        "region_score": 49451,
        "region_size": 49451,
        "start_ts": "2020-07-09T14:36:13+08:00",
        "last_heartbeat_ts": "2020-07-09T16:33:39.692393027+08:00",
        "uptime": "1h57m26.692393027s"
      }
    },
    {
      "store": {
        "id": 46,
        "address": "192.168.1.109:3930",
        "labels": [
          {
            "key": "engine",
            "value": "tiflash"
          }
        ],
        "version": "v4.0.0",
        "peer_address": "192.168.1.109:20170",
        "status_address": "192.168.1.109:20292",
        "git_hash": "c51c2c5c18860aaef3b5853f24f8e9cefea167eb",
        "start_timestamp": 1594276593,
        "deploy_path": "/rhzy/tidb/tidb-deploy/tiflash/bin/tiflash",
        "last_heartbeat": 1594283614270239386,
        "state_name": "Up"
      },
      "status": {
        "capacity": "79.46GiB",
        "available": "75.04GiB",
        "used_size": "857.4MiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 75,
        "region_weight": 1,
        "region_score": 6809,
        "region_size": 6809,
        "start_ts": "2020-07-09T14:36:33+08:00",
        "last_heartbeat_ts": "2020-07-09T16:33:34.270239386+08:00",
        "uptime": "1h57m1.270239386s"
      }
    }
  ]
}
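The leader distribution in a `pd-ctl store` dump like the one above can be checked programmatically. A minimal sketch (the `leader_counts` helper and the trimmed sample are illustrative, mirroring the leader counts shown; the TiFlash store carries the `engine: tiflash` label and holds no leaders, so it is skipped):

```python
import json

def leader_counts(stores):
    """Return {address: leader_count} for TiKV stores, skipping TiFlash."""
    counts = {}
    for item in stores["stores"]:
        labels = item["store"].get("labels", [])
        if any(l["key"] == "engine" and l["value"] == "tiflash" for l in labels):
            continue  # TiFlash replicas never hold region leaders
        counts[item["store"]["address"]] = item["status"]["leader_count"]
    return counts

# Trimmed sample mirroring the `pd-ctl store` output above.
sample = """
{
  "count": 4,
  "stores": [
    {"store": {"id": 1, "address": "192.168.1.102:20160"},
     "status": {"leader_count": 578}},
    {"store": {"id": 4, "address": "192.168.1.100:20160"},
     "status": {"leader_count": 587}},
    {"store": {"id": 5, "address": "192.168.1.101:20160"},
     "status": {"leader_count": 580}},
    {"store": {"id": 46, "address": "192.168.1.109:3930",
               "labels": [{"key": "engine", "value": "tiflash"}]},
     "status": {"leader_count": 0}}
  ]
}
"""

counts = leader_counts(json.loads(sample))
spread = max(counts.values()) - min(counts.values())
print(counts, "spread:", spread)
```

A small spread (here 587 − 578 = 9) confirms that removing the evict-leader scheduler rebalanced the leaders across the three TiKV nodes.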

Keep observing for a while, and consider what this "high" actually means: does it fit your current business pattern? If this is a business peak, check whether it matches the expectations from your earlier testing.

You can observe for another 30 minutes or so; if the problem persists, also upload the monitoring from that period. We may need screenshots of tikv-detail and the tidb and overview dashboards.

CPU usage on the three TiKV nodes is around 500%. Give me a moment to upload the monitoring, thanks!

ok~

Link: https://pan.baidu.com/s/1qvkopnHUp0itTOduiYlbqw
Extraction code: pnft
These are the latest monitoring graphs; please take a look.

I only see the TiDB screenshots; could you also upload tikv-detail?

Link: https://pan.baidu.com/s/11AalAQcyFkJLzA1UTuc0uA
Extraction code: 7xd5

TiKV CPU usage is balanced now, correct? If it's balanced, that means all the TiKV resources are being used.
If you feel benchmark performance is still insufficient, please open a new topic and we'll take a look; mixing multiple issues in one thread makes it hard to reference later. Thanks for your understanding.