tidb-5.7.25-TiDB-v4.0.10 查询集群慢日志导致tidb-server内存激增

为提高效率,请提供以下信息,问题描述清晰能够更快得到解决:
【 TiDB 使用环境】
生产

【背景】 做过哪些操作
查询SQL:
SELECT digest,COUNT(1) AS cnt,MIN(QUERY),AVG(query_time) AS avg_query_time,AVG(total_keys) AS avg_keys,AVG(process_time) AS avg_process_time
FROM information_schema.cluster_slow_query WHERE TIME>‘2021-08-01 14:00:00’ AND TIME< ‘2021-08-18 16:00:00’ AND query_time>0.5 GROUP BY digest HAVING CNT>10
执行计划:

【现象】 业务和数据库现象
tidb-server内存激增然后很多应用连接断开异常

【问题】 当前遇到的问题
连接失败
【业务影响】
连接失败
【TiDB 版本】
5.7.25-TiDB-v4.0.10
【应用软件及版本】
xx
【附件】 相关日志及配置信息

  • TiUP Cluster Display 信息
  • TiUP CLuster Edit config 信息

监控(https://metricstool.pingcap.com/)

  • TiDB-Overview Grafana监控
  • TiDB Grafana 监控
  • TiKV Grafana 监控
  • PD Grafana 监控
  • 对应模块日志(包含问题前后 1 小时日志)

若提问为性能优化、故障排查类问题,请下载脚本运行。终端输出的打印结果,请务必全选并复制粘贴上传。

看下你的这个表结构市什么样

CREATE TABLE CLUSTER_SLOW_QUERY (
INSTANCE varchar(64) DEFAULT NULL,
Time timestamp(6) NULL DEFAULT NULL,
Txn_start_ts bigint(20) unsigned DEFAULT NULL,
User varchar(64) DEFAULT NULL,
Host varchar(64) DEFAULT NULL,
Conn_ID bigint(20) unsigned DEFAULT NULL,
Exec_retry_count bigint(20) unsigned DEFAULT NULL,
Exec_retry_time double DEFAULT NULL,
Query_time double DEFAULT NULL,
Parse_time double DEFAULT NULL,
Compile_time double DEFAULT NULL,
Rewrite_time double DEFAULT NULL,
Preproc_subqueries bigint(20) unsigned DEFAULT NULL,
Preproc_subqueries_time double DEFAULT NULL,
Prewrite_time double DEFAULT NULL,
Wait_prewrite_binlog_time double DEFAULT NULL,
Commit_time double DEFAULT NULL,
Get_commit_ts_time double DEFAULT NULL,
Commit_backoff_time double DEFAULT NULL,
Backoff_types varchar(64) DEFAULT NULL,
Resolve_lock_time double DEFAULT NULL,
Local_latch_wait_time double DEFAULT NULL,
Write_keys bigint(22) DEFAULT NULL,
Write_size bigint(22) DEFAULT NULL,
Prewrite_region bigint(22) DEFAULT NULL,
Txn_retry bigint(22) DEFAULT NULL,
Cop_time double DEFAULT NULL,
Process_time double DEFAULT NULL,
Wait_time double DEFAULT NULL,
Backoff_time double DEFAULT NULL,
LockKeys_time double DEFAULT NULL,
Request_count bigint(20) unsigned DEFAULT NULL,
Total_keys bigint(20) unsigned DEFAULT NULL,
Process_keys bigint(20) unsigned DEFAULT NULL,
DB varchar(64) DEFAULT NULL,
Index_names varchar(100) DEFAULT NULL,
Is_internal tinyint(1) DEFAULT NULL,
Digest varchar(64) DEFAULT NULL,
Stats varchar(512) DEFAULT NULL,
Cop_proc_avg double DEFAULT NULL,
Cop_proc_p90 double DEFAULT NULL,
Cop_proc_max double DEFAULT NULL,
Cop_proc_addr varchar(64) DEFAULT NULL,
Cop_wait_avg double DEFAULT NULL,
Cop_wait_p90 double DEFAULT NULL,
Cop_wait_max double DEFAULT NULL,
Cop_wait_addr varchar(64) DEFAULT NULL,
Mem_max bigint(20) DEFAULT NULL,
Disk_max bigint(20) DEFAULT NULL,
KV_total double DEFAULT NULL,
PD_total double DEFAULT NULL,
Backoff_total double DEFAULT NULL,
Write_sql_response_total double DEFAULT NULL,
Backoff_Detail varchar(4096) DEFAULT NULL,
Prepared tinyint(1) DEFAULT NULL,
Succ tinyint(1) DEFAULT NULL,
Plan_from_cache tinyint(1) DEFAULT NULL,
Plan longtext DEFAULT NULL,
Plan_digest varchar(128) DEFAULT NULL,
Prev_stmt longtext DEFAULT NULL,
Query longtext DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;

time没有索引

这个表不是默认的吗?能否给一个正确的ddl结构给我参考下

可能和版本有关。你是什么版本,这个加索引不影响其他的。

CREATE TABLE CLUSTER_SLOW_QUERY (
INSTANCE varchar(64) DEFAULT NULL,
Time timestamp(6) NOT NULL,
Txn_start_ts bigint(20) unsigned DEFAULT NULL,
User varchar(64) DEFAULT NULL,
Host varchar(64) DEFAULT NULL,
Conn_ID bigint(20) unsigned DEFAULT NULL,
Exec_retry_count bigint(20) unsigned DEFAULT NULL,
Exec_retry_time double DEFAULT NULL,
Query_time double DEFAULT NULL,
Parse_time double DEFAULT NULL,
Compile_time double DEFAULT NULL,
Rewrite_time double DEFAULT NULL,
Preproc_subqueries bigint(20) unsigned DEFAULT NULL,
Preproc_subqueries_time double DEFAULT NULL,
Optimize_time double DEFAULT NULL,
Wait_TS double DEFAULT NULL,
Prewrite_time double DEFAULT NULL,
Wait_prewrite_binlog_time double DEFAULT NULL,
Commit_time double DEFAULT NULL,
Get_commit_ts_time double DEFAULT NULL,
Commit_backoff_time double DEFAULT NULL,
Backoff_types varchar(64) DEFAULT NULL,
Resolve_lock_time double DEFAULT NULL,
Local_latch_wait_time double DEFAULT NULL,
Write_keys bigint(22) DEFAULT NULL,
Write_size bigint(22) DEFAULT NULL,
Prewrite_region bigint(22) DEFAULT NULL,
Txn_retry bigint(22) DEFAULT NULL,
Cop_time double DEFAULT NULL,
Process_time double DEFAULT NULL,
Wait_time double DEFAULT NULL,
Backoff_time double DEFAULT NULL,
LockKeys_time double DEFAULT NULL,
Request_count bigint(20) unsigned DEFAULT NULL,
Total_keys bigint(20) unsigned DEFAULT NULL,
Process_keys bigint(20) unsigned DEFAULT NULL,
Rocksdb_delete_skipped_count bigint(20) unsigned DEFAULT NULL,
Rocksdb_key_skipped_count bigint(20) unsigned DEFAULT NULL,
Rocksdb_block_cache_hit_count bigint(20) unsigned DEFAULT NULL,
Rocksdb_block_read_count bigint(20) unsigned DEFAULT NULL,
Rocksdb_block_read_byte bigint(20) unsigned DEFAULT NULL,
DB varchar(64) DEFAULT NULL,
Index_names varchar(100) DEFAULT NULL,
Is_internal tinyint(1) DEFAULT NULL,
Digest varchar(64) DEFAULT NULL,
Stats varchar(512) DEFAULT NULL,
Cop_proc_avg double DEFAULT NULL,
Cop_proc_p90 double DEFAULT NULL,
Cop_proc_max double DEFAULT NULL,
Cop_proc_addr varchar(64) DEFAULT NULL,
Cop_wait_avg double DEFAULT NULL,
Cop_wait_p90 double DEFAULT NULL,
Cop_wait_max double DEFAULT NULL,
Cop_wait_addr varchar(64) DEFAULT NULL,
Mem_max bigint(20) DEFAULT NULL,
Disk_max bigint(20) DEFAULT NULL,
KV_total double DEFAULT NULL,
PD_total double DEFAULT NULL,
Backoff_total double DEFAULT NULL,
Write_sql_response_total double DEFAULT NULL,
Backoff_Detail varchar(4096) DEFAULT NULL,
Prepared tinyint(1) DEFAULT NULL,
Succ tinyint(1) DEFAULT NULL,
Plan_from_cache tinyint(1) DEFAULT NULL,
Plan_from_binding tinyint(1) DEFAULT NULL,
Plan longtext DEFAULT NULL,
Plan_digest varchar(128) DEFAULT NULL,
Prev_stmt longtext DEFAULT NULL,
Query longtext DEFAULT NULL,
PRIMARY KEY (Time) /*T![clustered_index] CLUSTERED */
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;

这是5.1.1版本的表结构

好的。我先加上普通索引看看效果

我们是4.0版本,dba觉得加了索引也没用(耗资源主要在解析slow log文件),所以没法尝试了,只能限制查询范围

如果造成问题的sql是你上面发的,那么加索引应该是有效果的。

不会吧。查询slow_query表的时候,是需要实时解析慢日志文件的,加不加索引还是得解析大量文件,消耗大量tidb-server 内存.

我这边是4.0.13 同样的执行计划,并未出现你说的现象

-- 查询当前集群所有 TiDB 集群的 Slowlog, 适用于 TiDB v4.0.x
SET sql_mode = '';
SET @mintime = '2020-11-01 20:00:00';
SET @maxtime = '2020-11-01 20:30:00';

SELECT
	digest,
	plan_digest,
	sum( Query_time ) AS 'query_responces',
	avg( Query_time ) AS 'avg_query_resp',
	count( 1 ) cnt,
	count( CASE WHEN Backoff_types > '' THEN 1 END ) AS 'Backoff_query_cnt',
	count( CASE WHEN Backoff_types > '' THEN 1 END )/ count( 1 ) * 100 AS 'Backoff_query_pct',
	avg( process_keys ) AS 'avg_process_keys',
	avg( total_keys ) AS 'avg_total_keys',
	sum( process_keys )/ sum( total_keys )* 100 AS 'process_key_pct',
    sum( Wait_time) as 'Wait_time',
    sum( Wait_time) / sum( Query_time )*100 as 'wait_time_pct',
	count( CASE succ WHEN 1 THEN succ END ) AS 'succ_cnt',
	count( CASE succ WHEN 1 THEN succ END )/ count( 1 ) * 100 AS 'succ_pct',
	sum( Mem_max )/ 1024 / 1024 AS 'sum_mem(MB)',
	avg( Mem_max )/ 1024 / 1024 AS 'avg_mem(MB)',
	sum( disk_max )/ 1024 / 1024 AS 'sum_disk(MB)',
	avg( disk_max )/ 1024 / 1024 AS 'avg_disk(MB)',
	min(
	from_unixtime(( Txn_start_ts >> 18 )/ 1000 )) AS 'min_time',
	max( time ) AS 'max_time',
	group_concat( DISTINCT INSTANCE ) AS 'INSTANCE',
	LEFT ( CASE WHEN length( Prev_stmt )> 0 THEN Prev_stmt ELSE QUERY END, 256 ) AS 'query' 
FROM
	information_schema.CLUSTER_SLOW_QUERY
	JOIN (
	SELECT
		digest AS dg,
		min( CASE WHEN Txn_start_ts > 0 THEN from_unixtime(( Txn_start_ts >> 18 )/ 1000 ) ELSE time END ) mintime,
		max( time ) maxtime 
	FROM
		information_schema.CLUSTER_SLOW_QUERY 
	WHERE
		Is_internal = 0 
		AND time >= @mintime 
		AND time < @maxtime 
	GROUP BY
		Digest 
	) o ON o.dg = digest 
	AND time BETWEEN mintime 
	AND maxtime 
GROUP BY
	digest,
	Plan_digest 
ORDER BY
	query_responces DESC 
	LIMIT 50;

此话题已在最后回复的 1 分钟后被自动关闭。不再允许新回复。