为提高效率,提问时请提供以下信息,问题描述清晰可优先响应。
- 【TiDB 版本】:
- 【问题描述】:
系统信息
+-----------------------+-----------------------+
| Host | Release |
+-----------------------+-----------------------+
| tidb3pub.xxxxx.xx.xxx | 3.10.0-514.el7.x86_64 |
| tidb2pub.xxxxx.xx.xxx | 3.10.0-514.el7.x86_64 |
| tidb1pub.xxxxx.xx.xxx | 3.10.0-514.el7.x86_64 |
+-----------------------+-----------------------+
TiDB 集群信息
+---------------------+--------------+------+----+------+
| TiDB_version | Clu_replicas | TiDB | PD | TiKV |
+---------------------+--------------+------+----+------+
| 5.7.25-TiDB-v3.0.12 | 3 | 3 | 3 | 3 |
+---------------------+--------------+------+----+------+
集群节点信息
+------------+--------------+
| Node_IP | Server_info |
+------------+--------------+
| instance_1 | pd+tikv+tidb |
| instance_2 | tidb+pd+tikv |
| instance_0 | pd+tikv+tidb |
+------------+--------------+
容量 & region 数量
+---------------------+-----------------+--------------+
| Storage_capacity_GB | Storage_uesd_GB | Region_count |
+---------------------+-----------------+--------------+
| 11002.23 | 105.71 | 5736 |
+---------------------+-----------------+--------------+
QPS
+---------+----------------+-----------------+
| Clu_QPS | Duration_99_MS | Duration_999_MS |
+---------+----------------+-----------------+
| 2.60 | 26.56 | 31.46 |
+---------+----------------+-----------------+
热点 region 信息
+---------+----------+-----------+
| Store | Hot_read | Hot_write |
+---------+----------+-----------+
| store-1 | 0 | 6 |
| store-5 | 1 | 6 |
| store-4 | 0 | 6 |
+---------+----------+-----------+
磁盘延迟信息
+---------+------------+-------------+--------------+
| Device | Instance | Read_lat_MS | Write_lat_MS |
+---------+------------+-------------+--------------+
| dm-0 | instance_2 | nan | 0.80 |
| dm-0 | instance_0 | nan | 1.09 |
| dm-0 | instance_1 | nan | 1.33 |
| nvme0n1 | instance_2 | nan | 0.08 |
| nvme0n1 | instance_0 | nan | 0.07 |
| nvme0n1 | instance_1 | 0.24 | 0.07 |
| nvme1n1 | instance_2 | nan | nan |
| nvme1n1 | instance_0 | nan | nan |
| nvme1n1 | instance_1 | nan | nan |
| sr0 | instance_2 | nan | nan |
| sr0 | instance_0 | nan | nan |
| sr0 | instance_1 | nan | nan |
| vda | instance_2 | nan | 0.80 |
| vda | instance_0 | nan | 1.00 |
| vda | instance_1 | nan | 1.25 |
+---------+------------+-------------+--------------+
参考了文档:
9.2.1 ERROR 2013 (HY000): Lost connection to MySQL server during query 问题的排查方法?
- log 中是否有 panic
- dmesg 中是否有 oom,命令:
dmesg -T | grep -i oom
- 长时间没有访问,也会收到这个报错,一般是 tcp 超时导致的,tcp 长时间不用, 会被操作系统 kill。
一条sql引起 oom
sql语句
mysql> use db_sbtest1
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> show tables;
+-----------------------------+
| Tables_in_db_sbtest1 |
+-----------------------------+
| tb_sbtest1 |
+-----------------------------+
1 row in set (0.00 sec)
mysql> select count(*) from tb_sbtest1;
+-----------+
| count(*) |
+-----------+
| 233729081 |
+-----------+
1 row in set (11.52 sec)
mysql> use db_sbtest2
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> show tables;
+-----------------------+
| Tables_in_db_sbtest2 |
+-----------------------+
| tb_sbtest2 |
+-----------------------+
1 row in set (0.00 sec)
mysql> select count(*) from tb_sbtest2;
+----------+
| count(*) |
+----------+
| 577 |
+----------+
1 row in set (0.04 sec)
mysql> select
-> ph.date1 as pdate,
-> s.cname,
-> ph.pid,
-> sum(ph.sp) as sp,
-> avg(ph.sr) as sr
-> from
-> (SELECT oid as pid,
-> sid,
-> data_desc as date1,
-> sum(sale_amount) sp,
-> avg(sr) sr
-> FROM db_sbtest1.tb_sbtest1
-> where data_desc>='2019-01-01'
-> group by oid,sid,date1) ph
-> left join
-> (select si,
-> case when cname_cn = 'shanghai' then '上海市'
-> when cname_cn = 'suzhou' then '苏州市'
-> when cname_cn = 'wuxi' then '无锡市'
-> when cname_cn = 'hangzhou' then '杭州市'
-> when cname_cn = 'ningbo' then '宁波市'
-> when cname_cn = 'shenzhen' then '深圳市'
-> else '其他' end as cname
-> from db_sbtest2.tb_sbtest2) s
-> on ph.sid = s.si
-> group by ph.date1,s.cname,ph.pid;
ERROR 2013 (HY000): Lost connection to MySQL server during query
oom信息
root@tidb1pub.xxxxx.xx.xxx# dmesg -T |grep -i oom
[Fri Apr 3 16:36:51 2020] grpc-server-0 invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[Fri Apr 3 16:36:51 2020] [<ffffffff8118460e>] oom_kill_process+0x24e/0x3c0
[Fri Apr 3 16:36:51 2020] [<ffffffff811840ad>] ? oom_unkillable_task+0xcd/0x120
[Fri Apr 3 16:36:51 2020] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[Fri Apr 3 16:36:51 2020] pd-server invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[Fri Apr 3 16:36:51 2020] [<ffffffff8118460e>] oom_kill_process+0x24e/0x3c0
[Fri Apr 3 16:36:51 2020] [<ffffffff811840ad>] ? oom_unkillable_task+0xcd/0x120
[Fri Apr 3 16:36:51 2020] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[Fri Apr 3 16:36:51 2020] blackbox_export invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[Fri Apr 3 16:36:51 2020] [<ffffffff8118460e>] oom_kill_process+0x24e/0x3c0
[Fri Apr 3 16:36:51 2020] [<ffffffff811840ad>] ? oom_unkillable_task+0xcd/0x120
[Fri Apr 3 16:36:51 2020] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[Fri Apr 3 16:36:52 2020] pd-server invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[Fri Apr 3 16:36:52 2020] [<ffffffff8118460e>] oom_kill_process+0x24e/0x3c0
[Fri Apr 3 16:36:52 2020] [<ffffffff811840ad>] ? oom_unkillable_task+0xcd/0x120
[Fri Apr 3 16:36:52 2020] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name