Cluster responses are very slow; the PD_tidb_handle_requests_duration alert keeps firing

[TiDB Environment] Production
[TiDB Version] v7.1.5
[Cluster Topology]

Cluster version:    v7.1.5
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://192.168.241.59:2379/dashboard
Grafana URL:        http://192.168.241.72:3000
ID                    Role        Host            Ports                            OS/Arch       Status  Data Dir                Deploy Dir
--                    ----        ----            -----                            -------       ------  --------                ----------
192.168.241.71:8300   cdc         192.168.241.71  8300                             linux/x86_64  Up      /disk2/cdc-8300         /home/tidb/deploy/cdc-8300
192.168.241.72:8300   cdc         192.168.241.72  8300                             linux/x86_64  Up      /disk2/cdc-8300         /home/tidb/deploy/cdc-8300
192.168.241.72:3000   grafana     192.168.241.72  3000                             linux/x86_64  Up      -                       /home/tidb/deploy/grafana-3000
192.168.241.59:2379   pd          192.168.241.59  2379/2380                        linux/x86_64  Up|UI   /disk2/pd-2379          /home/tidb/deploy/pd-2379
192.168.241.60:2379   pd          192.168.241.60  2379/2380                        linux/x86_64  Up      /disk2/pd-2379          /home/tidb/deploy/pd-2379
192.168.241.61:2379   pd          192.168.241.61  2379/2380                        linux/x86_64  Up|L    /disk2/pd-2379          /home/tidb/deploy/pd-2379
192.168.241.71:9090   prometheus  192.168.241.71  9090/12020                       linux/x86_64  Up      /disk2/prometheus-9090  /home/tidb/deploy/prometheus-9090
192.168.241.59:4000   tidb        192.168.241.59  4000/10080                       linux/x86_64  Up      -                       /home/tidb/deploy/tidb-4000
192.168.241.60:4000   tidb        192.168.241.60  4000/10080                       linux/x86_64  Up      -                       /home/tidb/deploy/tidb-4000
192.168.241.61:4000   tidb        192.168.241.61  4000/10080                       linux/x86_64  Up      -                       /home/tidb/deploy/tidb-4000
192.168.241.81:4000   tidb        192.168.241.81  4000/10080                       linux/x86_64  Up      -                       /home/tidb/deploy/tidb-4000
192.168.241.85:4000   tidb        192.168.241.85  4000/10080                       linux/x86_64  Up      -                       /home/tidb/deploy/tidb-4000
192.168.241.71:9000   tiflash     192.168.241.71  9000/8123/3930/20170/20292/8234  linux/x86_64  Up      /disk2/tiflash-9000     /home/tidb/deploy/tiflash-9000
192.168.241.71:20160  tikv        192.168.241.71  20160/20180                      linux/x86_64  Up      /disk2/tikv-20160       /home/tidb/deploy/tikv-20160
192.168.241.72:20160  tikv        192.168.241.72  20160/20180                      linux/x86_64  Up      /disk2/tikv-20160       /home/tidb/deploy/tikv-20160
192.168.241.73:20160  tikv        192.168.241.73  20160/20180                      linux/x86_64  Up      /disk2/tikv-20160       /home/tidb/deploy/tikv-20160
192.168.241.74:20160  tikv        192.168.241.74  20160/20180                      linux/x86_64  Up      /disk2/tikv-20160       /home/tidb/deploy/tikv-20160
192.168.241.75:20160  tikv        192.168.241.75  20160/20180                      linux/x86_64  Up      /disk2/tikv-20160       /home/tidb/deploy/tikv-20160
192.168.241.76:20160  tikv        192.168.241.76  20160/20180                      linux/x86_64  Up      /disk2/tikv-20160       /home/tidb/deploy/tikv-20160
192.168.241.83:20160  tikv        192.168.241.83  20160/20180                      linux/x86_64  Up      /disk2/tikv-20160       /home/tidb/deploy/tikv-20160
Total nodes: 20

[Problem: Symptoms and Impact]
Around 4 PM yesterday afternoon we suddenly received a flood of alerts, mainly these two: TiDB_query_duration and PD_tidb_handle_requests_duration.


The TiDB log on the .85 node keeps printing "get timestamp too slow":

$ tailf log/tidb.log  |grep -v INFO
[2025/04/14 12:45:05.789 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=209.251121ms]
[2025/04/14 12:45:05.789 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=38.943349ms]
[2025/04/14 12:45:06.217 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=136.5953ms]
[2025/04/14 12:45:06.217 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=136.692471ms]
[2025/04/14 12:45:06.217 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=132.049231ms]
[2025/04/14 12:45:06.427 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=81.115165ms]
[2025/04/14 12:45:06.427 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=81.152082ms]
[2025/04/14 12:45:06.427 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=81.40483ms]
[2025/04/14 12:45:06.955 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=137.884456ms]
[2025/04/14 12:45:07.164 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=194.533174ms]
[2025/04/14 12:45:07.375 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=187.693764ms]
[2025/04/14 12:45:07.752 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=151.272534ms]
[2025/04/14 12:45:07.752 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=126.892345ms]
[2025/04/14 12:45:07.752 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=155.55664ms]
[2025/04/14 12:45:07.752 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=155.644418ms]
[2025/04/14 12:45:07.752 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=155.697309ms]
[2025/04/14 12:45:07.752 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=155.757283ms]
[2025/04/14 12:45:07.956 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=193.618624ms]
[2025/04/14 12:45:07.956 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=197.917383ms]
[2025/04/14 12:45:07.956 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=89.226347ms]
[2025/04/14 12:45:07.956 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=89.737459ms]
[2025/04/14 12:45:07.956 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=111.226659ms]
[2025/04/14 12:45:08.164 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=199.095257ms]
[2025/04/14 12:45:08.369 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=172.168247ms]
[2025/04/14 12:45:08.369 +08:00] [WARN] [pd.go:156] ["get timestamp too slow"] ["cost time"=169.918419ms]
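To see how bad the TSO latency actually is, the warnings above can be summarized instead of eyeballed. A minimal sketch, assuming the log path and the `"cost time"=...ms` field format shown in the excerpt:

```shell
# Summarize "get timestamp too slow" latencies from the TiDB log.
# Extracts the millisecond value from ["cost time"=...ms] fields and
# prints count, average, and maximum.
grep 'get timestamp too slow' log/tidb.log \
  | sed -n 's/.*"cost time"=\([0-9.]*\)ms.*/\1/p' \
  | awk '{ sum += $1; if ($1 > max) max = $1; n++ }
         END { if (n) printf "count=%d avg=%.1fms max=%.1fms\n", n, sum/n, max }'
```

Values consistently in the 100–200 ms range, as here, are far above the usual sub-millisecond TSO cost and point at PD or the network between TiDB and PD rather than at individual SQL statements.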

The PD leader's log keeps reporting insufficient store disk space. This cluster often runs above 80% disk usage and that was never a problem before, so I don't understand why it caused an anomaly this time.

$ tailf log/pd.log  |grep -v INFO
[2025/04/14 12:45:47.746 +08:00] [WARN] [cluster.go:893] ["store does not have enough disk space"] [store-id=2] [capacity=3937850605568] [available=781844189184]
[2025/04/14 12:45:47.990 +08:00] [WARN] [cluster.go:893] ["store does not have enough disk space"] [store-id=3] [capacity=3937850605568] [available=781035110400]
[2025/04/14 12:45:55.193 +08:00] [WARN] [cluster.go:893] ["store does not have enough disk space"] [store-id=2949399] [capacity=3937850605568] [available=787348148224]
[2025/04/14 12:45:57.746 +08:00] [WARN] [cluster.go:893] ["store does not have enough disk space"] [store-id=2] [capacity=3937850605568] [available=781844279296]
[2025/04/14 12:45:57.992 +08:00] [WARN] [cluster.go:893] ["store does not have enough disk space"] [store-id=3] [capacity=3937850605568] [available=780242685952]
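The byte values PD logs can be turned into a usage ratio. Using the `capacity`/`available` figures for store 2 from the warning above, the store is just past 80% used, and if I read the defaults correctly, PD's `low-space-ratio` defaults to 0.8, so crossing that line is exactly when PD starts emitting this warning and avoids scheduling more replicas to the store:

```shell
# Compute disk usage for a store from the capacity/available byte
# values that PD logs (numbers copied from the warning above).
capacity=3937850605568
available=781844189184
awk -v c="$capacity" -v a="$available" \
  'BEGIN { printf "used=%.1f%% free=%.1f%%\n", (c-a)/c*100, a/c*100 }'
```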

[Other Attachments: Screenshots/Logs/Monitoring]

Is it SQL responses that are slow, or is the database responding slowly while you run background maintenance on it?

TiDB_query_duration; see the screenshot I posted.

Are there any obviously slow SQL statements, ones with very long execution times?

We have always had slow SQL, and a lot of it, but never anything like this.
The main issue is the PD_tidb_handle_requests_duration alert: it swings so widely that I don't really know how to troubleshoot it.

Generally speaking, in almost every database slow SQL is what drags the database into all kinds of trouble. It may just mean slow responses, or it can cause locks or a growing number of connections, until the database ends up unhealthy or unavailable. `show processlist` shows long-running SQL, and with the slow log enabled you can find the statements in the slow log.
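The two checks above can be sketched against TiDB's system tables; `INFORMATION_SCHEMA.CLUSTER_PROCESSLIST` and `INFORMATION_SCHEMA.SLOW_QUERY` exist in v7.1, while the host, user, and thresholds below are placeholders:

```shell
# Statements currently running longer than 10 s, across all TiDB nodes.
mysql -h 192.168.241.59 -P 4000 -u root -p -e "
  SELECT INSTANCE, ID, USER, TIME, LEFT(INFO, 100) AS SQL_TEXT
  FROM INFORMATION_SCHEMA.CLUSTER_PROCESSLIST
  WHERE TIME > 10 ORDER BY TIME DESC;"

# Slowest statements from the last hour of the slow log.
mysql -h 192.168.241.59 -P 4000 -u root -p -e "
  SELECT Time, Query_time, LEFT(Query, 100) AS Query
  FROM INFORMATION_SCHEMA.SLOW_QUERY
  WHERE Time > NOW() - INTERVAL 1 HOUR
  ORDER BY Query_time DESC LIMIT 10;"
```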


These are the official docs' steps for handling this problem:
1. Server load: none of the 3 PD nodes currently shows any obvious load increase.
3. Manually transferred the PD leader to each of the other 2 PD nodes: no effect.
How do I apply step 2?
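For reference, the manual leader switch in step 3 is done with pd-ctl. A sketch against this cluster's PD endpoint; the target member name below is a placeholder, take the real one from the `member` output:

```shell
# List PD members, then hand the PD leadership to a named member.
tiup ctl:v7.1.5 pd -u http://192.168.241.59:2379 member
tiup ctl:v7.1.5 pd -u http://192.168.241.59:2379 member leader transfer pd-192.168.241.60-2379
```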

Take a look at the page below and see whether it helps:
读写延迟增加 (Increased Read and Write Latency) | TiDB 文档中心


Take a look at this one: 专栏 - 关于 PD etcd 空间使用满处理记录 | TiDB 社区 (Column: Notes on handling full PD etcd space | TiDB Community), in case it's a similar problem.
