TiKV IO utilization is maxed out: how do I find out what TiKV is doing?

TiDB_C罗 · November 23, 2023, 07:42

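[TiDB environment] Production / Test / PoC
[TiDB version]
[Steps to reproduce] What operations led to the problem
[Problem encountered: symptoms and impact]
[Resource configuration]
IO utilization is close to 100%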

tidb

slowlog

gc

grpc

rocksdb compact

Scheduling tasks

TiDB_C罗 · November 23, 2023, 07:43

I turned off GC:

 set global tidb_gc_enable=off;

but IO utilization still shows no sign of dropping.

像风一样的男子 · November 23, 2023, 07:45

Is overall database latency high? Are there many slow queries?

TiDB_C罗 · November 23, 2023, 07:46

The duration and slow-log graphs don't line up in time with the TiKV IO graph.

zhanggame1 · November 23, 2023, 07:47

Take a look at the actual IO throughput.

像风一样的男子 · November 23, 2023, 07:51

The Dashboard has a Top SQL page that shows SQL resource usage on each node.

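wzf0072 · November 23, 2023, 07:51

Check the slow queries in the Dashboard to see which SQL statements the system is running, and whether a large table is going through analyze table.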

TiDB_C罗 · November 23, 2023, 07:52

I've ruled out analyze; I set a begin/end execution window for it.

TiDB_C罗 · November 23, 2023, 07:55

Output of perf top

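小龙虾爱大龙虾 · November 23, 2023, 09:04

In my experience, that metric is generally not worth watching anymore; on SSDs it is often pegged at 100%. It's better to focus on metrics such as throughput, IOPS, and read/write response time.
To see what kind of IO TiKV is writing, check TiKV-Details -> IO Breakdown.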
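
As a rough illustration of the metrics listed above, the sketch below derives IOPS, throughput, and average response time from two snapshots of cumulative per-device counters like those in /proc/diskstats. The DiskSample type, the summarize function, and the sample numbers are all made up for this example; it is only meant to show the arithmetic, not how any particular dashboard computes its panels.

```python
from dataclasses import dataclass

SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


@dataclass
class DiskSample:
    """Cumulative counters for one device at one instant (hypothetical type)."""
    timestamp_s: float     # when the sample was taken, in seconds
    reads_completed: int
    sectors_read: int
    read_time_ms: int      # total milliseconds spent on reads
    writes_completed: int
    sectors_written: int
    write_time_ms: int     # total milliseconds spent on writes


def summarize(prev: DiskSample, curr: DiskSample) -> dict:
    """Turn two cumulative samples into per-second rates and mean latencies."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")

    reads = curr.reads_completed - prev.reads_completed
    writes = curr.writes_completed - prev.writes_completed
    read_bytes = (curr.sectors_read - prev.sectors_read) * SECTOR_BYTES
    write_bytes = (curr.sectors_written - prev.sectors_written) * SECTOR_BYTES
    read_ms = curr.read_time_ms - prev.read_time_ms
    write_ms = curr.write_time_ms - prev.write_time_ms

    return {
        "read_iops": reads / elapsed,
        "write_iops": writes / elapsed,
        "read_mb_s": read_bytes / elapsed / 1e6,
        "write_mb_s": write_bytes / elapsed / 1e6,
        # Mean time per completed request; 0.0 when nothing completed.
        "read_latency_ms": read_ms / reads if reads else 0.0,
        "write_latency_ms": write_ms / writes if writes else 0.0,
    }


# Made-up samples taken 10 seconds apart.
before = DiskSample(0.0, 1000, 80000, 500, 2000, 160000, 4000)
after = DiskSample(10.0, 6000, 480000, 3000, 22000, 1760000, 24000)
stats = summarize(before, after)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

If the latencies and throughput computed this way stay flat while utilization sits at 100%, the disk is probably not the bottleneck.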

h5n1 · November 23, 2023, 09:21

QPS looks like it went up, but strangely latency actually went down.

裤衩儿飞上天 · November 23, 2023, 09:26

Is this node running any other services?

路在何chu · November 23, 2023, 09:27

Check the disk latency; IO util is sometimes inaccurate.

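buptzhoutian · November 23, 2023, 09:52

Even if this stays at 100% for a long time, that alone does not mean the disk is "saturated".

There is a dedicated dashboard for disk metrics called Disk-Performance. It also has a Disk IO Utilization panel, and the panel comes with an official description: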

Shows disk Utilization as percent of the time when there was at least one IO request in flight. It is designed to match utilization available in iostat tool. It is not very good measure of true IO Capacity Utilization. Consider looking at IO latency and Disk Load Graphs instead.

This panel uses the node_exporter metric node_disk_io_time_seconds_total; the expression TiDB uses is

rate(node_disk_io_time_seconds_total[$interval])

The result corresponds to the %util column of the iostat tool, which the man page describes as follows:

man iostat

%util
    Percentage of elapsed time during which I/O requests were
    issued to the device (bandwidth utilization for the device).
    Device saturation occurs when this value is close to 100%
    for devices serving requests serially. But for devices
    serving requests in parallel, such as RAID arrays and
    modern SSDs, this number does not reflect their
    performance limits.
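
To make the man page's point concrete, here is a small sketch that computes a %util-style figure (the share of time with at least one request in flight) from a list of request intervals, alongside the average number of requests in flight. The busy_fraction and avg_in_flight helpers and the two synthetic workloads are assumptions for illustration; this is not how iostat itself is implemented.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_ms, end_ms) of one IO request


def busy_fraction(requests: List[Interval], window_ms: int) -> float:
    """Share of the window during which at least one request was in flight."""
    busy = 0
    cur_start = cur_end = None
    for start, end in sorted(requests):
        if cur_end is None or start > cur_end:
            # A gap: close the previous busy stretch and open a new one.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent request: extend the current stretch.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy / window_ms


def avg_in_flight(requests: List[Interval], window_ms: int) -> float:
    """Average number of requests in flight over the window."""
    return sum(end - start for start, end in requests) / window_ms


WINDOW_MS = 1000
# Ten back-to-back 100 ms requests: one request in flight at all times.
serial = [(i * 100, (i + 1) * 100) for i in range(10)]
# The same pattern issued on eight lanes at once: eight requests in flight.
parallel = serial * 8

for name, workload in (("serial", serial), ("parallel", parallel)):
    util = busy_fraction(workload, WINDOW_MS)
    depth = avg_in_flight(workload, WINDOW_MS)
    print(f"{name}: util={util:.0%} avg_in_flight={depth:.1f}")
```

Both workloads report the same utilization even though the second one completes eight times as many requests, which is why a utilization figure alone cannot tell you how close a parallel device is to its limit.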

路在何chu · November 23, 2023, 10:34

We use AWS disks here, and IO util is pegged at 100% pretty much all the time.

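小龙虾爱大龙虾 · November 23, 2023, 10:45

That's normal too. Once QPS goes up, the P999 line is no longer enough to show how the slower SQL statements are doing, but those slow statements may not have gone away.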

dba远航 · November 25, 2023, 00:06

Check what the top few IO consumers are doing.

zhanggame1 · November 25, 2023, 15:19

This metric isn't very accurate; what matters is whether IO is actually the bottleneck.

随缘天空 · November 26, 2023, 03:09

Try setting the raftstore.sync-log parameter to false and see whether it affects the IO situation.

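Kongdom · November 26, 2023, 13:33

1. Check the slow queries in the Dashboard.
2. Run show full processlist to see the current sessions.
3. This is what we often run into with our mechanical hard disks: their IO read/write capacity is low to begin with, so they get maxed out frequently.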