Business runs slowly when a large number of processes access the same database

【Overview】Scenario + problem summary
The scenario is as described in the title.

Since read/write performance is a broad topic, I went through a lot of community material and answers, and the issue looks very much like a hotspot problem, as shown in the figure below.


To address this, I looked up related solutions and made the following adjustments (exact commands sketched below):
1. Raised the hot-region scheduling limit: config set hot-region-schedule-limit 8
2. Executed set @@tidb_replica_read = 'follower' in the sessions of the processes accessing the database, hoping to reduce read pressure
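For reference, a minimal sketch of the two adjustments above; the PD address is a placeholder rather than a value from this cluster:

# raise the number of concurrent hot-region scheduling operators in PD
pd-ctl -u http://<pd-ip>:2379 config set hot-region-schedule-limit 8

-- in each client session, allow reads to be served by follower replicas
SET @@tidb_replica_read = 'follower';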

However, neither change had any noticeable effect; the dashboard still shows extremely high query latency.

From the slow query log, the cause appears to be that read/write pressure is concentrated on the .239 node, causing requests to queue up there.
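For context, a minimal SQL sketch for checking which TiKV instance the slow statements land on, using TiDB's built-in slow query table (the 0.2s cutoff is just an illustrative threshold; cop_proc_addr/cop_wait_addr are only populated for coprocessor requests):

-- recent slow statements and the TiKV address that processed them
SELECT time, query_time, process_time, wait_time, cop_proc_addr, cop_wait_addr, query
FROM information_schema.slow_query
WHERE query_time > 0.2
ORDER BY query_time DESC
LIMIT 20;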

Attached is the TiKV-Details dashboard data from Grafana:
test-cluster-TiKV-Details_2021-07-01T03_57_43.911Z.json.zip (996.6 KB)

Could you please help analyze what to optimize next? It is fairly urgent. Thanks.

【Business impact】
Read/write speed is too slow to meet our requirements.
【TiDB version】
4.0.13


Do you have the execution plans for the key SQL statements?

The workload mainly consists of the following two kinds of statements.
1. select vout from tx_info_2 where txhash = 'xx'; where txhash is the primary key. The plan is as follows:
id | task | estRows | operator info | actRows | execution info | memory | disk
Point_Get_1 | root | 1 | table:tx_info_2, index:PRIMARY(txhash) | 1 | time:493.3ms, loops:2, Get:{num_rpc:2, total_time:493.2ms} | N/A | N/A

2. UPDATE tx_info_2 SET vout = 'YY' WHERE txhash = 'XX'; The plan is as follows:
id | task | estRows | operator info | actRows | execution info | memory | disk
Update_2 | root | 0 | N/A | 0 | time:325.7ms, loops:1, lock_keys: {time:325ms, region:2, keys:2, lock_rpc:324.941443ms, rpc_count:2, retry_count:1} | 2.73 KB | N/A
└─Point_Get_1 | root | 1 | table:tx_info_2, index:PRIMARY(txhash), lock | 1 | time:325.5ms, loops:2, Get:{num_rpc:1, total_time:480.9µs} | N/A | N/A

The plans may be hard to read in text form, so screenshots are attached below.
1. [screenshot of plan 1]
2. [screenshot of plan 2]

It looks like there is a hotspot? You can refer to https://docs.pingcap.com/zh/tidb/stable/troubleshoot-hot-spot-issues

I have read that document as well, but we cannot rebuild the table, and the current table is not small either, so the approaches there do not apply to us.

Can you identify the hot regions? If the hotspot is on data regions, a manual split or a leader transfer can relieve it; if only a few regions are hot, split those particularly hot regions; if a single KV node is hot, or several hot regions sit on it, try transferring leaders away from it. A rough command sketch follows.
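A minimal pd-ctl sketch of these steps; the PD address, region ID, and store ID are placeholders rather than values from this cluster:

# list the regions with the highest read / write traffic
pd-ctl -u http://<pd-ip>:2379 region topread 10
pd-ctl -u http://<pd-ip>:2379 region topwrite 10

# split a particularly hot region
pd-ctl -u http://<pd-ip>:2379 operator add split-region <region_id> --policy=approximate

# move a hot region's leader to a less busy store
pd-ctl -u http://<pd-ip>:2379 operator add transfer-leader <region_id> <target_store_id>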

Hi, our workload covers all the data in the table, so it is hard to single out a few regions. From the monitoring we can see that one KV node's IO is at 100% while several others are very low. Is there anything we can do to optimize for this situation?

Could you export the TiKV monitoring data for the recent abnormal time window so we can analyze it? See https://metricstool.pingcap.com/

The Overview-TiKV panel is as follows:
test-cluster-Overview_2021-07-01T06_33_17.941Z.json (1.4 MB)

Attached is the store information:
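(For reference, this kind of output is presumably dumped with pd-ctl's store subcommand, roughly as below; the PD address is a placeholder.)

# dump the status of all TiKV stores known to PD
pd-ctl -u http://<pd-ip>:2379 store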

{
  "count": 6,
  "stores": [
    {
      "store": {
        "id": 24478148,
        "address": "10.12.5.236:20160",
        "version": "4.0.13",
        "status_address": "10.12.5.236:20180",
        "git_hash": "a448d617f79ddf545be73931525bb41af0f790f3",
        "start_timestamp": 1625060753,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1625150052300593698,
        "state_name": "Up"
      },
      "status": {
        "capacity": "5.952TiB",
        "available": "2.929TiB",
        "used_size": "2.336TiB",
        "leader_count": 28173,
        "leader_weight": 2,
        "leader_score": 14086.5,
        "leader_size": 2287717,
        "region_count": 81160,
        "region_weight": 2,
        "region_score": 3279733.5,
        "region_size": 6559467,
        "sending_snap_count": 3,
        "start_ts": "2021-06-30T13:45:53Z",
        "last_heartbeat_ts": "2021-07-01T14:34:12.300593698Z",
        "uptime": "24h48m19.300593698s"
      }
    },
    {
      "store": {
        "id": 24480822,
        "address": "10.12.5.239:20160",
        "version": "4.0.13",
        "status_address": "10.12.5.239:20180",
        "git_hash": "a448d617f79ddf545be73931525bb41af0f790f3",
        "start_timestamp": 1624981287,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1625150058184653393,
        "state_name": "Up"
      },
      "status": {
        "capacity": "5.952TiB",
        "available": "3.34TiB",
        "used_size": "2.456TiB",
        "leader_count": 28175,
        "leader_weight": 2,
        "leader_score": 14087.5,
        "leader_size": 2119216,
        "region_count": 92540,
        "region_weight": 2,
        "region_score": 3593884,
        "region_size": 7187768,
        "sending_snap_count": 2,
        "receiving_snap_count": 1,
        "start_ts": "2021-06-29T15:41:27Z",
        "last_heartbeat_ts": "2021-07-01T14:34:18.184653393Z",
        "uptime": "46h52m51.184653393s"
      }
    },
    {
      "store": {
        "id": 24590972,
        "address": "10.12.5.240:20160",
        "version": "4.0.13",
        "status_address": "10.12.5.240:20180",
        "git_hash": "a448d617f79ddf545be73931525bb41af0f790f3",
        "start_timestamp": 1624981319,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1625150052511248317,
        "state_name": "Up"
      },
      "status": {
        "capacity": "5.952TiB",
        "available": "3.732TiB",
        "used_size": "2.002TiB",
        "leader_count": 28177,
        "leader_weight": 2,
        "leader_score": 14088.5,
        "leader_size": 2257348,
        "region_count": 79045,
        "region_weight": 2,
        "region_score": 3137037,
        "region_size": 6274074,
        "start_ts": "2021-06-29T15:41:59Z",
        "last_heartbeat_ts": "2021-07-01T14:34:12.511248317Z",
        "uptime": "46h52m13.511248317s"
      }
    },
    {
      "store": {
        "id": 38833310,
        "address": "10.12.5.147:20160",
        "version": "4.0.13",
        "status_address": "10.12.5.147:20180",
        "git_hash": "a448d617f79ddf545be73931525bb41af0f790f3",
        "start_timestamp": 1624981158,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1625150055810678172,
        "state_name": "Up"
      },
      "status": {
        "capacity": "5.952TiB",
        "available": "3.538TiB",
        "used_size": "2.103TiB",
        "leader_count": 28177,
        "leader_weight": 2,
        "leader_score": 14088.5,
        "leader_size": 2185865,
        "region_count": 76711,
        "region_weight": 2,
        "region_score": 3037499.5,
        "region_size": 6074999,
        "sending_snap_count": 1,
        "start_ts": "2021-06-29T15:39:18Z",
        "last_heartbeat_ts": "2021-07-01T14:34:15.810678172Z",
        "uptime": "46h54m57.810678172s"
      }
    },
    {
      "store": {
        "id": 262397455,
        "address": "10.12.5.13:20160",
        "version": "4.0.13",
        "status_address": "10.12.5.13:20180",
        "git_hash": "a448d617f79ddf545be73931525bb41af0f790f3",
        "start_timestamp": 1625010184,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1625150057591985664,
        "state_name": "Up"
      },
      "status": {
        "capacity": "5.952TiB",
        "available": "4.672TiB",
        "used_size": "1.222TiB",
        "leader_count": 14089,
        "leader_weight": 1,
        "leader_score": 14089,
        "leader_size": 1137581,
        "region_count": 46467,
        "region_weight": 1,
        "region_score": 3532790,
        "region_size": 3532790,
        "sending_snap_count": 1,
        "receiving_snap_count": 1,
        "start_ts": "2021-06-29T23:43:04Z",
        "last_heartbeat_ts": "2021-07-01T14:34:17.591985664Z",
        "uptime": "38h51m13.591985664s"
      }
    },
    {
      "store": {
        "id": 268391998,
        "address": "10.12.5.119:20160",
        "version": "4.0.13",
        "status_address": "10.12.5.119:20180",
        "git_hash": "a448d617f79ddf545be73931525bb41af0f790f3",
        "start_timestamp": 1624981378,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1625150059176263312,
        "state_name": "Up"
      },
      "status": {
        "capacity": "320TiB",
        "available": "281.7TiB",
        "used_size": "1.221TiB",
        "leader_count": 14091,
        "leader_weight": 1,
        "leader_score": 14091,
        "leader_size": 1028339,
        "region_count": 46785,
        "region_weight": 1,
        "region_score": 3532367,
        "region_size": 3532367,
        "start_ts": "2021-06-29T15:42:58Z",
        "last_heartbeat_ts": "2021-07-01T14:34:19.176263312Z",
        "uptime": "46h51m21.176263312s"
      }
    }
  ]
}

Sorry, I was not clear above: what we need is the TiKV-Details monitoring data.

The TiKV-Details data was already provided in the first post.

Please help confirm whether any other process on the KV node with the very high IO is reading or writing TiKV's data directory.
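A minimal sketch of one way to check this on that node; the data directory path is a placeholder for the actual TiKV deploy path:

# list processes that currently hold files open under the TiKV data directory
# (+D recurses into the directory and can be slow on large trees)
sudo lsof +D /path/to/tikv/data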


From the monitoring, TiKV's own read/write traffic is not high, so I suspect some other process is writing frequently and keeping IO maxed out, which is affecting TiDB.

Would switching TiKV to write its logs asynchronously bring any improvement?
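If this refers to Raft log fsync behavior (an assumption on my part about what 'asynchronous log writing' means here), the change in a TiKV 4.0 cluster would look roughly like the sketch below; note that disabling sync-log trades crash safety for IO:

# edit the cluster configuration (cluster name is a placeholder)
tiup cluster edit-config <cluster-name>
# under server_configs -> tikv, set:
#   raftstore.sync-log: false   # default is true; false risks data loss on power failure
tiup cluster reload <cluster-name> -R tikv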

There should not be any other processes.

Please take a look at the iotop output.
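For reference, something like the following on the hot node shows which processes are actually generating the IO:

# show only processes doing IO, aggregated per process, with accumulated totals
sudo iotop -o -P -a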

It looks like compaction, but it is odd that disk IO is saturated. Has this node's IO always been this high? Have there been frequent delete or update operations recently?

It is not at 100% all the time, but Grafana shows it has stayed high for a while. There are some drop and create operations, but not many. The compaction rate-limit parameter has also been tuned down to a 30 MB/s rate.
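Assuming the parameter referred to here is RocksDB's compaction rate limiter (rocksdb.rate-bytes-per-sec; the post does not name it, so this is an assumption), the adjustment would look roughly like:

# edit the cluster configuration (cluster name is a placeholder)
tiup cluster edit-config <cluster-name>
# under server_configs -> tikv, set:
#   rocksdb.rate-bytes-per-sec: "30MB"   # throttles compaction/flush write throughput
tiup cluster reload <cluster-name> -R tikv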

Could you check the Query Summary section of the TiDB dashboard to see what statement types make up the recent QPS?

test-cluster-TiDB-Summary_2021-07-01T07_34_24.009Z.json (490.5 KB)