tikv写入阻塞导致查询超时

【 TiDB 使用环境】生产环境
【 TiDB 版本】6.1.7
【复现路径】偶发,无法通过固定方式复现
【遇到的问题:问题现象及影响】
业务反馈每天都会出现查询超时(配置了 5s),出现的时间不是很固定
异常语句是固定的,是根据主键查表,每次只返回一条数据

SELECT * FROM `table1` WHERE `uid` = ********* LIMIT 1

结合反馈的超时时间,每次都能在固定的 tikv 上找到对应的日志 WARN,别的 tikv 上也检查过日志,没有出现过一次报错

[2024/01/07 16:39:44.045 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 1.822788231s, send: 0.000000817s, callback: 0s thread: sync-writer"] [takes=1822]

集群的写入负载没有很高(峰值如图大概 700k),唯一的热点 region 出现在另一个表的主键上(NONCLUSTERED表,和 table1 不是同一张表)
主键的写入是顺序的所以没法打散,会写入到固定的 tikv
这里检查过这个热点 region 的 leader 在出现异常的 tikv 上,可能有关联

【资源配置】

【附件:截图/日志/监控】
异常时段的 tikv 日志

[2024/01/07 16:39:40.862 +08:00] [INFO] [apply.rs:1770] ["exec ConfChange"] [epoch="conf_ver: 24666 version: 12599"] [type=RemoveNode] [peer_id=171471415] [region_id=480019]
[2024/01/07 16:39:40.862 +08:00] [INFO] [apply.rs:1878] ["remove peer successfully"] [region="id: 480019 start_key: 7480000000000004FF175F72E000000309FFB32AD70000000000FA end_key: 7480000000000004FF175F72E8000002FCFFA3F83C0000000000FA region_epoch { conf_ver: 24666 version: 12599 } peers { id: 171471152 store_id: 137 } peers { id: 171471173 store_id: 136 } peers { id: 171471362 store_id: 627568 role: Learner } peers { id: 171471415 store_id: 1 }"] [peer="id: 171471362 store_id: 627568 role: Learner"] [peer_id=171471415] [region_id=480019]
[2024/01/07 16:39:40.862 +08:00] [INFO] [raft.rs:2646] ["switched to configuration"] [config="Configuration { voters: Configuration { incoming: Configuration { voters: {171471152, 171471173, 171471415} }, outgoing: Configuration { voters: {} } }, learners: {}, learners_next: {}, auto_leave: false }"] [raft_id=171471415] [region_id=480019]
[2024/01/07 16:39:40.899 +08:00] [INFO] [peer.rs:1172] ["deleting applied snap file"] [snap_file=480019_2231_45044232] [peer_id=171471415] [region_id=480019]
[2024/01/07 16:39:40.899 +08:00] [INFO] [snap.rs:674] ["set_snapshot_meta total cf files count: 4"]
[2024/01/07 16:39:42.220 +08:00] [INFO] [kv.rs:1117] ["call CheckLeader failed"] [address=ipv4:*.*.*.*:50794] [err=Grpc(RemoteStopped)]
[2024/01/07 16:39:44.045 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 1.822788231s, send: 0.000000817s, callback: 0s thread: sync-writer"] [takes=1822]
[2024/01/07 16:39:44.045 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 1.822514695s, send: 0.000019696s, callback: 0s thread: sync-writer"] [takes=1822]
[2024/01/07 16:39:44.045 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 1.8169129910000001s, send: 0.000025976s, callback: 0s thread: sync-writer"] [takes=1816]
[2024/01/07 16:39:44.045 +08:00] [WARN] [store.rs:999] ["[store 1] handle 1 pending peers include 1 ready, 0 entries, 0 messages and 0 snapshots"] [takes=1822]
[2024/01/07 16:39:44.045 +08:00] [WARN] [store.rs:999] ["[store 1] handle 1 pending peers include 1 ready, 0 entries, 0 messages and 0 snapshots"] [takes=1822]
[2024/01/07 16:39:44.045 +08:00] [WARN] [store.rs:999] ["[store 1] handle 1 pending peers include 1 ready, 0 entries, 0 messages and 0 snapshots"] [takes=1816]
[2024/01/07 16:39:44.572 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.195803191s, send: 0.000029982s, callback: 0s thread: sync-writer"] [takes=2195]
[2024/01/07 16:39:44.572 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.194970723s, send: 0.000026482s, callback: 0s thread: sync-writer"] [takes=2194]
[2024/01/07 16:39:44.572 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.195950804s, send: 0.000026311s, callback: 0s thread: sync-writer"] [takes=2195]
[2024/01/07 16:39:44.572 +08:00] [WARN] [store.rs:999] ["[store 1] handle 110 pending peers include 110 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2197]
[2024/01/07 16:39:44.572 +08:00] [WARN] [store.rs:999] ["[store 1] handle 119 pending peers include 119 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2197]
[2024/01/07 16:39:44.572 +08:00] [WARN] [store.rs:999] ["[store 1] handle 108 pending peers include 108 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2197]
[2024/01/07 16:39:45.356 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 1.307429111s, send: 0.000030356s, callback: 0s thread: sync-writer"] [takes=1307]
[2024/01/07 16:39:45.356 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 1.307435468s, send: 0.000028111s, callback: 0s thread: sync-writer"] [takes=1307]
[2024/01/07 16:39:45.356 +08:00] [WARN] [store.rs:999] ["[store 1] handle 175 pending peers include 152 ready, 0 entries, 0 messages and 0 snapshots"] [takes=1311]
[2024/01/07 16:39:45.356 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 1.3087153790000001s, send: 0.000053301s, callback: 0s thread: sync-writer"] [takes=1308]
[2024/01/07 16:39:45.356 +08:00] [WARN] [store.rs:999] ["[store 1] handle 176 pending peers include 155 ready, 0 entries, 0 messages and 0 snapshots"] [takes=1311]
[2024/01/07 16:39:45.356 +08:00] [WARN] [store.rs:999] ["[store 1] handle 155 pending peers include 155 ready, 0 entries, 0 messages and 0 snapshots"] [takes=1311]
[2024/01/07 16:39:47.387 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.812094433s, send: 0.000039904s, callback: 0s thread: sync-writer"] [takes=2812]
[2024/01/07 16:39:47.387 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.810821532s, send: 0.000075818s, callback: 0s thread: sync-writer"] [takes=2810]
[2024/01/07 16:39:47.387 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.812195442s, send: 0.000088624s, callback: 0s thread: sync-writer"] [takes=2812]
[2024/01/07 16:39:47.387 +08:00] [WARN] [store.rs:999] ["[store 1] handle 184 pending peers include 173 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2814]
[2024/01/07 16:39:47.387 +08:00] [WARN] [store.rs:999] ["[store 1] handle 191 pending peers include 191 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2815]
[2024/01/07 16:39:47.387 +08:00] [WARN] [store.rs:999] ["[store 1] handle 188 pending peers include 188 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2815]
[2024/01/07 16:39:48.038 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.679254154s, send: 0.000042336s, callback: 0s thread: sync-writer"] [takes=2679]
[2024/01/07 16:39:48.038 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.676675738s, send: 0.000043835s, callback: 0s thread: sync-writer"] [takes=2676]
[2024/01/07 16:39:48.038 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.678514685s, send: 0.000043762s, callback: 0s thread: sync-writer"] [takes=2678]
[2024/01/07 16:39:48.038 +08:00] [WARN] [store.rs:999] ["[store 1] handle 118 pending peers include 111 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2681]
[2024/01/07 16:39:48.038 +08:00] [WARN] [store.rs:999] ["[store 1] handle 203 pending peers include 200 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2681]
[2024/01/07 16:39:48.038 +08:00] [WARN] [store.rs:999] ["[store 1] handle 214 pending peers include 214 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2681]
[2024/01/07 16:39:50.065 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.67392705s, send: 0.000094879s, callback: 0s thread: sync-writer"] [takes=2673]
[2024/01/07 16:39:50.065 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.672336576s, send: 0.000149155s, callback: 0s thread: sync-writer"] [takes=2672]
[2024/01/07 16:39:50.065 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.671493778s, send: 0.000174822s, callback: 0s thread: sync-writer"] [takes=2671]
[2024/01/07 16:39:50.065 +08:00] [WARN] [store.rs:999] ["[store 1] handle 219 pending peers include 219 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2678]
[2024/01/07 16:39:50.065 +08:00] [WARN] [store.rs:999] ["[store 1] handle 228 pending peers include 228 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2678]
[2024/01/07 16:39:50.065 +08:00] [WARN] [store.rs:999] ["[store 1] handle 224 pending peers include 223 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2678]
[2024/01/07 16:39:50.105 +08:00] [INFO] [kv.rs:1117] ["call CheckLeader failed"] [address=ipv4:*.*.*.*:60840] [err=Grpc(RemoteStopped)]
[2024/01/07 16:39:50.625 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.58315059s, send: 0.000085109s, callback: 0s thread: sync-writer"] [takes=2583]
[2024/01/07 16:39:50.625 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.581864045s, send: 0.000096699s, callback: 0s thread: sync-writer"] [takes=2581]
[2024/01/07 16:39:50.625 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.581480991s, send: 0.000088852s, callback: 0s thread: sync-writer"] [takes=2581]
[2024/01/07 16:39:50.625 +08:00] [WARN] [store.rs:999] ["[store 1] handle 218 pending peers include 217 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2587]
[2024/01/07 16:39:50.625 +08:00] [WARN] [store.rs:999] ["[store 1] handle 232 pending peers include 232 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2587]
[2024/01/07 16:39:50.625 +08:00] [WARN] [store.rs:999] ["[store 1] handle 232 pending peers include 231 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2587]
[2024/01/07 16:39:54.500 +08:00] [INFO] [apply.rs:1396] ["execute admin command"] [command="cmd_type: ChangePeer change_peer { change_type: AddLearnerNode peer { id: 171471418 store_id: 137 role: Learner } }"] [index=44863112] [term=2085] [peer_id=171471331] [region_id=482089]
[2024/01/07 16:39:54.500 +08:00] [INFO] [apply.rs:1770] ["exec ConfChange"] [epoch="conf_ver: 24433 version: 12601"] [type=AddLearner] [peer_id=171471331] [region_id=482089]
[2024/01/07 16:39:54.500 +08:00] [INFO] [apply.rs:1910] ["add learner successfully"] [region="id: 482089 start_key: 7480000000000004FF175F72E80000030AFF21C2CF0000000000FA end_key: 7480000000000004FF175F72F0000002FCFFC7439C0000000000FA region_epoch { conf_ver: 24433 version: 12601 } peers { id: 171471275 store_id: 627568 } peers { id: 171471331 store_id: 1 } peers { id: 171471360 store_id: 136 }"] [peer="id: 171471418 store_id: 137 role: Learner"] [peer_id=171471331] [region_id=482089]
[2024/01/07 16:39:54.500 +08:00] [INFO] [raft.rs:2646] ["switched to configuration"] [config="Configuration { voters: Configuration { incoming: Configuration { voters: {171471331, 171471360, 171471275} }, outgoing: Configuration { voters: {} } }, learners: {171471418}, learners_next: {}, auto_leave: false }"] [raft_id=171471331] [region_id=482089]

偶发的集群查询超时

可以看到都是 store1 报的耗时异常

store1所在 的tikv是不是硬件不行啊

store1 和别的 tikv 配置完全一致的,我们还替换过节点,结果换完以后报错跑到别的 tikv 上了 :joy:

业务超时,SQL超时,可以看慢查询分析
store 1这个写入慢可能确实有问题,可以看下TiKV=》Storage和TiKV=》Raft IO,再看下磁盘监控情况
热力图上这个明亮的线并不是热点,这才565 kb/min,怎么算热点

write_raft: 1.822788231s 问题在这里

磁盘的 io 不是很高的,热力图我截的是唯一的明亮的线了 :joy:
看上去的确是 raft propose 的问题,耗时和日志报错翻了下完全对的上

估计是某一个慢sql 引起的

其实从机器监控来看 唯一堵住的地方就是write_raft,不过几乎就出现在某个节点,这个节点替换后又专业到了另一个节点,怀疑是某个region更新频繁导致,从热点和机器分析都不是很明显,从慢查询角度分析的话,也没有找到可疑的慢sql

发一发 tikv-detail 里面,thread-cpu 的监控,grpc 监控面板看看呢。

主键的表也可以打散, ALTER TABLE t MODIFY COLUMN id BIGINT AUTO_RANDOM(5);AUTO_INCREMENT 切换成 AUTO_RANDOM

异常的节点是 tikv02

异常时段可以参考下面 tikv02 耗时的监控