求助:一条SQL 引起 OOM

为提高效率,提问时请提供以下信息,问题描述清晰可优先响应。

  • 【TiDB 版本】:
  • 【问题描述】:
系统信息
+-----------------------+-----------------------+
|          Host         |        Release        |
+-----------------------+-----------------------+
| tidb3pub.xxxxx.xx.xxx | 3.10.0-514.el7.x86_64 |
| tidb2pub.xxxxx.xx.xxx | 3.10.0-514.el7.x86_64 |
| tidb1pub.xxxxx.xx.xxx | 3.10.0-514.el7.x86_64 |
+-----------------------+-----------------------+
TiDB 集群信息
+---------------------+--------------+------+----+------+
|     TiDB_version    | Clu_replicas | TiDB | PD | TiKV |
+---------------------+--------------+------+----+------+
| 5.7.25-TiDB-v3.0.12 |      3       |  3   | 3  |  3   |
+---------------------+--------------+------+----+------+
集群节点信息
+------------+--------------+
|  Node_IP   | Server_info  |
+------------+--------------+
| instance_1 | pd+tikv+tidb |
| instance_2 | tidb+pd+tikv |
| instance_0 | pd+tikv+tidb |
+------------+--------------+
容量 & region 数量
+---------------------+-----------------+--------------+
| Storage_capacity_GB | Storage_uesd_GB | Region_count |
+---------------------+-----------------+--------------+
|       11002.23      |      105.71     |     5736     |
+---------------------+-----------------+--------------+
QPS
+---------+----------------+-----------------+
| Clu_QPS | Duration_99_MS | Duration_999_MS |
+---------+----------------+-----------------+
|   2.60  |     26.56      |      31.46      |
+---------+----------------+-----------------+
热点 region 信息
+---------+----------+-----------+
|  Store  | Hot_read | Hot_write |
+---------+----------+-----------+
| store-1 |    0     |     6     |
| store-5 |    1     |     6     |
| store-4 |    0     |     6     |
+---------+----------+-----------+
磁盘延迟信息
+---------+------------+-------------+--------------+
|  Device |  Instance  | Read_lat_MS | Write_lat_MS |
+---------+------------+-------------+--------------+
|   dm-0  | instance_2 |     nan     |     0.80     |
|   dm-0  | instance_0 |     nan     |     1.09     |
|   dm-0  | instance_1 |     nan     |     1.33     |
| nvme0n1 | instance_2 |     nan     |     0.08     |
| nvme0n1 | instance_0 |     nan     |     0.07     |
| nvme0n1 | instance_1 |     0.24    |     0.07     |
| nvme1n1 | instance_2 |     nan     |     nan      |
| nvme1n1 | instance_0 |     nan     |     nan      |
| nvme1n1 | instance_1 |     nan     |     nan      |
|   sr0   | instance_2 |     nan     |     nan      |
|   sr0   | instance_0 |     nan     |     nan      |
|   sr0   | instance_1 |     nan     |     nan      |
|   vda   | instance_2 |     nan     |     0.80     |
|   vda   | instance_0 |     nan     |     1.00     |
|   vda   | instance_1 |     nan     |     1.25     |
+---------+------------+-------------+--------------+

参考了文档: 9.2.1 ERROR 2013 (HY000): Lost connection to MySQL server during query 问题的排查方法?

  • log 中是否有 panic
  • dmesg 中是否有 oom,命令: dmesg -T | grep -i oom
  • 长时间没有访问,也会收到这个报错,一般是 tcp 超时导致的,tcp 长时间不用, 会被操作系统 kill。

一条sql引起 oom

sql语句

mysql> use db_sbtest1
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+-----------------------------+
| Tables_in_db_sbtest1 |
+-----------------------------+
| tb_sbtest1      |
+-----------------------------+
1 row in set (0.00 sec)

mysql> select count(*) from tb_sbtest1;
+-----------+
| count(*)  |
+-----------+
| 233729081 |
+-----------+
1 row in set (11.52 sec)

mysql> use db_sbtest2
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+-----------------------+
| Tables_in_db_sbtest2 |
+-----------------------+
| tb_sbtest2         |
+-----------------------+
1 row in set (0.00 sec)

mysql> select count(*) from tb_sbtest2;
+----------+
| count(*) |
+----------+
|      577 |
+----------+
1 row in set (0.04 sec)

mysql> select
    -> ph.date1 as pdate,
    -> s.cname,
    -> ph.pid,
    -> sum(ph.sp) as sp,
    -> avg(ph.sr) as sr
    -> from
    -> (SELECT oid as pid,
    ->           sid,
    ->           data_desc as date1,
    ->           sum(sale_amount) sp,
    ->           avg(sr) sr
    ->    FROM db_sbtest1.tb_sbtest1
    ->    where data_desc>='2019-01-01'
    ->    group by oid,sid,date1) ph
    -> left join
    -> (select si,
    ->         case when cname_cn = 'shanghai' then '上海市'
    ->              when cname_cn = 'suzhou' then '苏州市'
    ->              when cname_cn = 'wuxi' then '无锡市'
    ->              when cname_cn = 'hangzhou' then '杭州市'
    ->              when cname_cn = 'ningbo' then '宁波市'
    ->              when cname_cn = 'shenzhen' then '深圳市'
    ->              else '其他' end as cname
    -> from db_sbtest2.tb_sbtest2) s
    -> on ph.sid = s.si
    -> group by ph.date1,s.cname,ph.pid;
ERROR 2013 (HY000): Lost connection to MySQL server during query

oom信息

root@tidb1pub.xxxxx.xx.xxx# dmesg -T |grep -i oom
[Fri Apr  3 16:36:51 2020] grpc-server-0 invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[Fri Apr  3 16:36:51 2020]  [<ffffffff8118460e>] oom_kill_process+0x24e/0x3c0
[Fri Apr  3 16:36:51 2020]  [<ffffffff811840ad>] ? oom_unkillable_task+0xcd/0x120
[Fri Apr  3 16:36:51 2020] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[Fri Apr  3 16:36:51 2020] pd-server invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[Fri Apr  3 16:36:51 2020]  [<ffffffff8118460e>] oom_kill_process+0x24e/0x3c0
[Fri Apr  3 16:36:51 2020]  [<ffffffff811840ad>] ? oom_unkillable_task+0xcd/0x120
[Fri Apr  3 16:36:51 2020] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[Fri Apr  3 16:36:51 2020] blackbox_export invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[Fri Apr  3 16:36:51 2020]  [<ffffffff8118460e>] oom_kill_process+0x24e/0x3c0
[Fri Apr  3 16:36:51 2020]  [<ffffffff811840ad>] ? oom_unkillable_task+0xcd/0x120
[Fri Apr  3 16:36:51 2020] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[Fri Apr  3 16:36:52 2020] pd-server invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[Fri Apr  3 16:36:52 2020]  [<ffffffff8118460e>] oom_kill_process+0x24e/0x3c0
[Fri Apr  3 16:36:52 2020]  [<ffffffff811840ad>] ? oom_unkillable_task+0xcd/0x120
[Fri Apr  3 16:36:52 2020] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name

您好,

  1. 可以调整 tidb oom-action

  2. 设置 mem-quota-query 参数,调整 oom-action 触发的阈值

  3. 另外处理大表的操作,尽量先评估好。按需扩容 TiKV。

  4. 是否可优化下 SQL 语句?left join 左侧最小原则也适用于 TiDB 哦。