How to find the specific SQL causing heavy KV load when the "TiKV async request write duration seconds more than 1s" alert fires

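lemontree8801 · September 24, 2019, 07:28

To help us respond efficiently, please provide as much background as possible when asking a question; clearly described problems are answered first. Please try to include the following:

OS version & kernel version: 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
TiDB version: 3.0.3-ga
Disk model: meets requirements
Cluster topology: mixed deployment. TiDB and PD share the same machines; TiKV is deployed separately. (18+3+40)
Data volume & region count & replica count: 18.6T & 2081548 & 3
Cluster QPS, .999-Duration, read/write ratio: 16.2K & 119ms~8s (varies)
Problem description (what I did): The alert "TiKV async request write duration seconds more than 1s" is firing, and the value keeps growing. According to the wiki at https://pingcap.com/docs-cn/v3.0/reference/alert-rules/#tikv_channel_full_total , this is caused by excessive load on TiKV. Is it possible to pinpoint the specific SQL that is causing the load? I went to the TiKV node that raised the alert, searched its log for the slow_query keyword, found the table_id for the relevant time window, and then checked the slow_log on every TiDB instance, but the results do not match what I expected: there are only one or two simple queries. Is there a more detailed procedure for handling this alert?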

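龙雪刚-PingCAP · September 24, 2019, 08:29

It is not necessarily caused by one particular kind of SQL. Following the TiKV_channel_full_total description, check the following:

Look at the Raft Propose monitoring and check whether the alerting TiKV node is noticeably higher than the other TiKV nodes. If so, there is a hotspot on this TiKV, and you need to check whether hotspot scheduling is working properly.
Look at the Raft IO monitoring and check whether latency has risen. High latency suggests the disk may be a bottleneck. One mitigation that helps but is not very safe is setting sync-log to false.
Look at the Raft Process monitoring and check whether tick duration is high. If so, add raft-base-tick-interval = "2s" under the [raftstore] configuration section.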
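
The first check is a comparison across instances, which can be sketched in a few lines. This is only an illustrative sketch and not part of the original reply: the instance names and proposal rates are made-up sample values, and the "2x the median" threshold is an arbitrary assumption. In practice the numbers would be read off the Raft Propose panels.

```python
from statistics import median

def find_hot_instances(rates, factor=2.0):
    """Return instances whose proposal rate exceeds factor x the cluster median.

    rates:  dict mapping a TiKV instance name to its write-proposal rate.
    factor: hypothetical threshold; tune it for your own cluster.
    """
    mid = median(rates.values())
    return sorted(name for name, rate in rates.items() if rate > factor * mid)

# Made-up sample values, for illustration only.
sample = {"tikv-1": 1200.0, "tikv-2": 1100.0, "tikv-3": 5400.0, "tikv-4": 1300.0}
print(find_hot_instances(sample))  # prints ['tikv-3']
```

An instance flagged this way is only a candidate hotspot; whether hotspot scheduling is keeping up still has to be checked separately.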

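lemontree8801 · September 24, 2019, 09:19

1. For the first point, which dashboard under Propose should I use to judge whether there is a hotspot? 3. That dashboard has nothing containing the keyword "tick duration". Could you explain these three methods in more detail?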

龙雪刚-PingCAP · September 24, 2019, 09:46

1) Focus on read proposals / write proposals:

Raft proposals per ready: the number of proposals across all Regions within one mio tick
Raft read/write proposals: the number of proposals of each type
Raft read proposals per server: the number of read proposals initiated by each TiKV instance
Raft write proposals per server: the number of write proposals initiated by each TiKV instance
Propose wait duration: the wait time of each proposal
Propose wait duration per server: the wait time of each proposal on each TiKV instance
Raft log speed: the rate at which peers propose logs

2) Full name of the monitoring item: Process tick duration per server

lemontree8801 · September 25, 2019, 01:02

Has the Process tick duration per server monitoring item been renamed to Process ready duration per server in version 3.0.3?

lemontree8801 · September 25, 2019, 01:26

Also, the TiKV node's log contains quite a few transaction conflict errors like ["txn conflicts"] [err="Txn(Mvcc(TxnLockNotFound { start_ts: 411389510136366235, commit_ts: 411389510935905339, key: [116, 128, 0, 0, 0, 0, 1, 103, 29, 95, 105, 128, 0, 0, 0, 0, 0, 0, 7, 4, 25, 164, 49, 64, 4, 6, 83, 240, 3, 128, 0, 0, 23, 74, 106, 44, 138] }))"]. Is there a way to trace these back to the table or the SQL being executed?

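不懂就问 · September 25, 2019, 05:13

Take a look at the table with table_id = 91933: curl http://{TiDBIP}:10080/schema?table_id=91933 . Decoding the information you posted shows that this is the table involved. In addition, a transaction conflict is logged together with a conn_id; search tidb.log for that ID to find the SQL statement. You can also check the slow log, which may have a record as well.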
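
For anyone who wants to script the lookup instead of running curl by hand, here is a minimal sketch. The host 127.0.0.1 is a placeholder, the function names are hypothetical, and the sketch assumes the status port is the 10080 shown in the curl command and that the endpoint responds with JSON.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def schema_url(tidb_host, table_id, status_port=10080):
    """Build the TiDB status-port URL used in the curl command above."""
    query = urlencode({"table_id": table_id})
    return f"http://{tidb_host}:{status_port}/schema?{query}"

def fetch_table_schema(tidb_host, table_id, status_port=10080, timeout=5):
    """Fetch the table schema; assumes the endpoint returns a JSON document."""
    url = schema_url(tidb_host, table_id, status_port)
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # 127.0.0.1 is a placeholder; replace it with a real TiDB address.
    print(schema_url("127.0.0.1", 91933))
```

Calling fetch_table_schema against a live TiDB instance should return the schema information for the table, from which the database and table names can be read.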

lemontree8801 · September 25, 2019, 10:43

How did you get this table_id?

不懂就问 · September 25, 2019, 11:09

By decoding the key.

lemontree8801 · September 25, 2019, 11:09

Could you share the tool or method?

不懂就问 · September 25, 2019, 11:11

https://github.com/disksing/mok
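
For reference, the table_id can also be recovered by hand from the key bytes in the log. The sketch below is not how mok is implemented; it is a minimal illustration that assumes the usual TiDB key layout: the byte 't', an 8-byte table ID, then '_r' (row) or '_i' (index), then an 8-byte row or index ID, with each integer stored big-endian with its sign bit flipped. It handles only raw table keys like the one logged above, and it leaves the trailing index-value bytes undecoded. The function names are hypothetical.

```python
SIGN_MASK = 0x8000000000000000

def decode_int(buf):
    """Decode 8 bytes (big-endian, sign bit flipped) into a signed 64-bit int."""
    u = int.from_bytes(buf, "big") ^ SIGN_MASK
    return u - (1 << 64) if u & SIGN_MASK else u

def decode_table_key(key):
    """Split a raw TiDB table key into table ID, key kind, and row/index ID."""
    key = bytes(key)
    if len(key) < 19 or key[0:1] != b"t":
        raise ValueError("not a raw table key")
    kinds = {b"_r": "row", b"_i": "index"}
    tag = key[9:11]
    if tag not in kinds:
        raise ValueError("unknown key kind")
    return {
        "table_id": decode_int(key[1:9]),
        "kind": kinds[tag],
        "id": decode_int(key[11:19]),
        "rest": key[19:],  # encoded index values, left undecoded here
    }

# Key bytes copied from the TxnLockNotFound log line above.
logged_key = [116, 128, 0, 0, 0, 0, 1, 103, 29, 95, 105, 128, 0, 0, 0, 0, 0, 0,
              7, 4, 25, 164, 49, 64, 4, 6, 83, 240, 3, 128, 0, 0, 23, 74, 106,
              44, 138]
info = decode_table_key(logged_key)
print(info["table_id"], info["kind"], info["id"])  # prints: 91933 index 7
```

So the conflicting key belongs to an index (index ID 7) of table 91933, which matches the table_id given earlier in the thread.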