HashAgg operator still OOMs even in non-parallel mode

[TiDB environment] Test
[TiDB version] v7.0.0
Because the hashAgg operator cannot spill to disk when running in parallel, its memory usage is uncontrollable and it OOMs easily. In production we want to avoid interruptions as much as possible, so we run hashAgg single-threaded so that it can spill to disk and keep memory usage down. In practice, however, statements using the non-parallel hashAgg still frequently hit OOM.

The rough execution logic of the non-parallel hashAgg operator is as follows:

Take select col1,count(*) from a group by col1. The operator first puts col1 (deduplicated, forming the key) into a map. For each chunk (a batch of rows) flowing in from the child operator, rows are processed one by one: if col1 is already in the map, the aggregate function (count here) is updated; if not, the operator checks whether spilling has already started. Once a spill has begun, every subsequent row whose key is not in the map is written to disk one row at a time. This yields the first batch of keys (and their aggregated results) that made it into the map before the spill started, which is returned to the parent operator. The operator then reads the previously unprocessed rows back from disk one row at a time (which is slow, problem 1) and repeats the steps above. If memory is still insufficient, it spills again (tracked by a spill counter, spillTimes), but at most maxSpillTimes=10 times (why cap the spill count? problem 2). Moreover, each later spill appends to the same temporary spill file without clearing the earlier records, which can make disk usage very large (problem 3).
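A minimal Go sketch of this flow (the types and names here are illustrative stand-ins, not the actual identifiers in executor/aggregate.go):

package main

import "fmt"

// row stands in for one input row: a group-by key plus an aggregated column.
type row struct {
	key string
	val string
}

// aggState is a simplified stand-in for HashAggExec's in-memory state.
type aggState struct {
	groups      map[string]int64 // key -> count(*) partial result
	inSpillMode bool             // set by the memory action once the quota is exceeded
	spilled     []row            // stands in for the on-disk spill file
}

// consume processes one chunk (a batch of rows) from the child operator.
func (s *aggState) consume(chunk []row) {
	for _, r := range chunk {
		if _, ok := s.groups[r.key]; ok {
			s.groups[r.key]++ // key already in the map: aggregate in memory
			continue
		}
		if s.inSpillMode {
			// New key after spilling started: written out row by row,
			// to be re-read and re-processed in the next pass.
			s.spilled = append(s.spilled, r)
			continue
		}
		s.groups[r.key] = 1 // new key, still within quota
	}
}

func main() {
	s := &aggState{groups: map[string]int64{}}
	s.consume([]row{{"a", "x"}, {"b", "y"}})
	s.inSpillMode = true // pretend the memory action fired here
	s.consume([]row{{"a", "x"}, {"c", "z"}})
	fmt.Println(s.groups, len(s.spilled)) // map[a:2 b:1] 1
}

Each pass returns the in-memory groups to the parent, resets the map, and replays the spilled rows through the same loop; every pass may trigger one more spill, up to maxSpillTimes.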

Parameter settings:

mysql> show variables like 'tidb_hashagg%';
+----------------------------------+-------+
| Variable_name                    | Value |
+----------------------------------+-------+
| tidb_hashagg_final_concurrency   | 1     |
| tidb_hashagg_partial_concurrency | 1     |
+----------------------------------+-------+
2 rows in set (0.00 sec)

mysql> show variables like 'tidb_mem_quota_query';
+----------------------+------------+
| Variable_name        | Value      |
+----------------------+------------+
| tidb_mem_quota_query | 1073741824 |
+----------------------+------------+
1 row in set (0.00 sec)

mysql> show variables like 'tidb_enable_rate_limit_action';
+-------------------------------+-------+
| Variable_name                 | Value |
+-------------------------------+-------+
| tidb_enable_rate_limit_action | OFF   |
+-------------------------------+-------+
1 row in set (0.01 sec)
mysql> show variables like '%oom%';
+--------------------------------+--------+
| Variable_name                  | Value  |
+--------------------------------+--------+
| tidb_enable_tmp_storage_on_oom | ON     |
| tidb_mem_oom_action            | CANCEL |
+--------------------------------+--------+
2 rows in set (0.00 sec)

Test:


mysql> explain analyze  select C_NAME,count(C_PHONE) from customer group by c_name;
+----------------------------+-------------+----------+-----------+----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
| id                         | estRows     | actRows  | task      | access object  | execution info                                                                                                                                                                                                                                                                                           | operator info                                                                                                                                       | memory   | disk    |
+----------------------------+-------------+----------+-----------+----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
| Projection_4               | 15000000.00 | 15000000 | root      |                | time:2m6s, loops:14650, RRU:48513.136572, WRU:0.000000, Concurrency:5                                                                                                                                                                                                                                    | tpch100.customer.c_name, Column#9                                                                                                                   | 427.8 KB | N/A     |
| └─HashAgg_7                | 15000000.00 | 15000000 | root      |                | time:2m5.9s, loops:14650                                                                                                                                                                                                                                                                                 | group by:tpch100.customer.c_name, funcs:count(tpch100.customer.c_phone)->Column#9, funcs:firstrow(tpch100.customer.c_name)->tpch100.customer.c_name | 971.6 MB | 1.38 GB |
|   └─TableReader_12         | 15000000.00 | 15000000 | root      |                | time:146.5ms, loops:14675, cop_task: {num: 565, max: 58.6ms, min: 393µs, avg: 12.5ms, p95: 26.4ms, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 5.39s, tot_wait: 65.6ms, rpc_num: 565, rpc_time: 7.04s, copr_cache_hit_ratio: 0.00, build_task_duration: 37.6µs, max_distsql_concurrency: 15}   | data:TableFullScan_11                                                                                                                               | 37.5 MB  | N/A     |
|     └─TableFullScan_11     | 15000000.00 | 15000000 | cop[tikv] | table:customer | tikv_task:{proc max:38ms, min:0s, avg: 8.29ms, p80:15ms, p95:20ms, iters:16879, tasks:565}, scan_detail: {total_process_keys: 15000000, total_process_keys_size: 3052270577, total_keys: 15000565, get_snapshot_time: 22.2ms, rocksdb: {key_skipped_count: 15000000, block: {cache_hit_count: 103777}}}  | keep order:false                                                                                                                                    | N/A      | N/A     |
+----------------------------+-------------+----------+-----------+----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
4 rows in set (2 min 6.19 sec)

mysql> explain analyze  select C_NAME,count(C_PHONE),count(C_ADDRESS) from customer group by c_name;
ERROR 1105 (HY000): Out Of Memory Quota![conn=3978033069293983]

The TiDB log observed during execution:

[2023/05/03 23:17:15.596 +08:00] [INFO] [controller.go:282] ["[resource group controller] create resource group cost controller"] [name=default]
[2023/05/03 23:17:16.310 +08:00] [INFO] [aggregate.go:1949] ["memory exceeds quota, set aggregate mode to spill-mode"] [spillTimes=1] [consumed=165914156] [quota=207374182]
[2023/05/03 23:17:28.915 +08:00] [WARN] [expensivequery.go:118] ["memory exceeds quota"] [cost_time=13.329033457s] [cop_time=0.015257304s] [process_time=6.481480135s] [wait_time=0.047572265s] [request_count=565] [total_keys=14977315] [process_keys=14976768] [num_cop_tasks=565] [process_avg_time=0.011471646s] [process_p90_time=0.024542849s] [process_max_time=0.035438311s] [process_max_addr=192.168.31.201:20160] [wait_avg_time=0.000084198s] [wait_p90_time=0.000135554s] [wait_max_time=0.000884638s] [wait_max_addr=192.168.31.201:20160] [stats=customer:441211688017461249] [conn=3978033069293983] [user=root] [database=tpch100] [table_ids="[369]"] [txn_start_ts=441221653614297089] [mem_max="232134581 Bytes (221.4 MB)"] [sql="explain analyze  select C_NAME,count(C_PHONE),count(C_ADDRESS) from customer group by c_name"]
[2023/05/03 23:17:28.915 +08:00] [ERROR] [projection.go:466] ["projection executor panicked"] [error="Out Of Memory Quota![conn=3978033069293983]"] [stack="github.com/pingcap/tidb/executor.recoveryProjection\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/projection.go:466\ngithub.com/pingcap/tidb/executor.(*projectionInputFetcher).run.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/projection.go:364\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:884\ngithub.com/pingcap/tidb/util/memory.(*PanicOnExceed).Action\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/action.go:148\ngithub.com/pingcap/tidb/util/memory.(*Tracker).Consume.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:437\ngithub.com/pingcap/tidb/util/memory.(*Tracker).Consume\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:478\ngithub.com/pingcap/tidb/executor.(*HashAggExec).getPartialResults\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/aggregate.go:1121\ngithub.com/pingcap/tidb/executor.(*HashAggExec).execute\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/aggregate.go:1045\ngithub.com/pingcap/tidb/executor.(*HashAggExec).unparallelExec\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/aggregate.go:970\ngithub.com/pingcap/tidb/executor.(*HashAggExec).Next\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/aggregate.go:789\ngithub.com/pingcap/tidb/executor.Next\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/executor.go:326\ngithub.com/pingcap/tidb/executor.(*projectionInputFetcher).run\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/projection.go:388"]
[2023/05/03 23:17:29.098 +08:00] [INFO] [conn.go:1151] ["command dispatched failed"] [conn=3978033069293983] [connInfo="id:3978033069293983, addr:192.168.31.200:37786 status:10, collation:utf8mb4_0900_ai_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="explain analyze  select C_NAME,count(C_PHONE),count(C_ADDRESS) from customer group by c_name"] [txn_mode=PESSIMISTIC] [timestamp=441221653614297089] [err="Out Of Memory Quota![conn=3978033069293983]\ngithub.com/pingcap/tidb/executor.recoveryProjection\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/projection.go:464\ngithub.com/pingcap/tidb/executor.(*projectionInputFetcher).run.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/projection.go:364\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:884\ngithub.com/pingcap/tidb/util/memory.(*PanicOnExceed).Action\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/action.go:148\ngithub.com/pingcap/tidb/util/memory.(*Tracker).Consume.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:437\ngithub.com/pingcap/tidb/util/memory.(*Tracker).Consume\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:478\ngithub.com/pingcap/tidb/executor.(*HashAggExec).getPartialResults\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/aggregate.go:1121\ngithub.com/pingcap/tidb/executor.(*HashAggExec).execute\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/aggregate.go:1045\ngithub.com/pingcap/tidb/executor.(*HashAggExec).unparallelExec\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/aggregate.go:970\ngithub.com/pingcap/tidb/executor.(*HashAggExec).Next\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/aggregate.go:789\ngithub.com/pingcap/tidb/executor.Next\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/executor.go:326\ngithub.com/pingcap/tidb/executor.(*projectionInputFetcher).run\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/projection.go:388\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"]

It shows that during hashAgg the spill happened only once: spillTimes=1. By the spill action's definition, the fallback to oom-action=cancel (DefPanicPriority, the lowest priority) should happen only after more than maxSpillTimes=10 spills; otherwise spilling (DefSpillPriority) takes precedence. Relevant code:
https://github.com/pingcap/tidb/blob/7eff8ce629d00cf4ab1eeda82a1ecd5a4e394b02/executor/aggregate.go#L1945

// Action set HashAggExec spill mode.
func (a *AggSpillDiskAction) Action(t *memory.Tracker) {
	// Guarantee that processed data is at least 20% of the threshold, to avoid spilling too frequently.
	if atomic.LoadUint32(&a.e.inSpillMode) == 0 && a.spillTimes < maxSpillTimes && a.e.memTracker.BytesConsumed() >= t.GetBytesLimit()/5 {
		a.spillTimes++
		logutil.BgLogger().Info("memory exceeds quota, set aggregate mode to spill-mode",
			zap.Uint32("spillTimes", a.spillTimes),
			zap.Int64("consumed", t.BytesConsumed()),
			zap.Int64("quota", t.GetBytesLimit()))
		atomic.StoreUint32(&a.e.inSpillMode, 1)
		memory.QueryForceDisk.Add(1)
		return
	}
	// Already in spill mode, the spill cap reached, or less than 20% of the
	// quota consumed: delegate to the next action in the chain, if any.
	if fallback := a.GetFallback(); fallback != nil {
		fallback.Action(t)
	}
}

Given that only one spill happened, why was the query still cancelled (problem 4)?

1. Regarding problem 1: almost all spill paths suffer from this slowness; can it be optimized?
2. Regarding problem 2: why cap the repeated-spill count at maxSpillTimes=10?
3. Regarding problem 3: repeated spills not only amplify problem 1 but also make disk usage balloon; could the spill storage be designed as multiple "segments" so that stale space is reclaimed promptly?
4. Regarding problem 4: even with spilling enabled, queries are still frequently cancelled; what is the main cause, and can it be optimized?
5. Also, is there a plan to add spill support to the parallel hashAgg?

Problem 4 most likely occurs while the hashAgg data is being spilled: the fallback action sees that memory is still insufficient and chooses the lower-priority behavior (here PanicOnExceed, oom-action=cancel), killing the current session.

This generally happens because TiKV keeps streaming data to TiDB, which feeds it into the hashAgg operator. When the operator's memory check detects that memory is insufficient, it starts spilling; but the spill writes one row at a time and may be slower than the rate at which TiKV delivers data. TiDB's pending RPC messages therefore pile up more and more. With the spill still in progress (a.e.inSpillMode=1), the Action takes the degradation path (looking for an action with a lower Priority). Since we set oom-action=cancel (PanicOnExceed), and the in-memory backlog keeps growing, the query ends up being cancelled.

Regarding problem 3, could it be designed like this?
MySQL 8.0's temporary tablespace maintains a pool, initialized with 10 temporary tablespaces (the number can grow on demand), and every MySQL session can draw on this pool. Could TiDB do something similar: when multiple spills occur, store each round's spilled data in a different temporary tablespace, and truncate a tablespace once its spilled data is no longer needed? That avoids putting everything into a single file that is hard to truncate, which keeps file system space from being reclaimed in time.
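A rough sketch of what such a segmented design could look like in Go (spillPool and its methods are hypothetical names, not anything in TiDB today): each spill round writes into its own temporary file, and a round's file is deleted as soon as its rows have been fully re-processed.

package main

import (
	"fmt"
	"os"
)

// spillPool is a hypothetical segmented spill layout: one temporary file per
// spill round instead of appending every round to a single shared file.
type spillPool struct {
	dir      string
	segments []*os.File
}

// newSegment opens a fresh file for the next spill round.
func (p *spillPool) newSegment() (*os.File, error) {
	f, err := os.CreateTemp(p.dir, "hashagg-spill-*")
	if err != nil {
		return nil, err
	}
	p.segments = append(p.segments, f)
	return f, nil
}

// release closes and deletes one segment once its rows have been
// re-processed, returning the disk space immediately instead of holding
// it until the query ends.
func (p *spillPool) release(i int) error {
	f := p.segments[i]
	p.segments[i] = nil
	name := f.Name()
	if err := f.Close(); err != nil {
		return err
	}
	return os.Remove(name)
}

func main() {
	p := &spillPool{dir: os.TempDir()}
	seg, _ := p.newSegment()
	fmt.Fprintln(seg, "rows spilled in round 1") // round 1 writes its own file
	_ = p.release(0)                             // reclaim the space right after the re-read
}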


Problem 1: "The operator then reads the previously unprocessed rows back from disk one row at a time (which is slow, problem 1)"
I went through the code; the file is read back from disk per chunk (1024 rows).

Problem 2: "but at most maxSpillTimes=10 times (why cap the spill count?)"
This is a magic number chosen at implementation time. The reasoning was that each pass processes about 10% of the data, so 10 passes finish the computation; with a higher cap, the concern was that the repeated spill round-trips could tie up resources for a very long time.

Problem 3: "each later spill appends to the same temporary spill file without clearing the earlier records, which can make disk usage very large"
An implementation issue; this can indeed be optimized.

Problem 4: "even with spilling enabled, queries are still frequently cancelled; what is the main cause, and can it be optimized?"

  1. The agg spill was designed with one goal: guarantee that the agg operator's memory does not grow any further.
  2. Because the agg spill cannot make the query's overall memory go down, its trigger point is tidb_mem_quota_query * 80%.
  3. In the TiDB implementation, the control actions triggered at tidb_mem_quota_query * 80% (the soft limit) and at tidb_mem_quota_query (the hard limit) are tracked on two separate chains: the agg spill sits on the ActionForSoftLimit chain, while cancelling the SQL sits on the ActionForHardLimit chain.
  4. Because the actions are on different chains, the agg spill never falls back to the cancel-query action.
  5. The reason a query can spill and still get cancelled is that the SQL's memory consumption keeps growing during the spill, exceeds tidb_mem_quota_query (the hard limit), and triggers the cancel action on the hard-limit chain; see the sketch after this list.
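A minimal sketch of the two independent chains (illustrative only; the real pieces are memory.ActionOnExceed, AggSpillDiskAction, and PanicOnExceed in the TiDB source):

package main

import "fmt"

// action mimics memory.ActionOnExceed: a response to a memory tracker
// crossing a limit.
type action interface {
	act(consumed, limit int64)
}

type spillAction struct{} // lives on the soft-limit chain

func (spillAction) act(consumed, limit int64) {
	fmt.Printf("soft limit (%d >= %d): switch HashAgg to spill mode\n", consumed, limit)
}

type cancelAction struct{} // lives on the hard-limit chain

func (cancelAction) act(consumed, limit int64) {
	fmt.Printf("hard limit (%d >= %d): cancel the query\n", consumed, limit)
}

func main() {
	quota := int64(1 << 30) // tidb_mem_quota_query = 1 GiB
	soft, hard := quota*8/10, quota

	// Two independent chains: spilling never falls back to cancelling,
	// because cancel lives on the other chain.
	softChain := []action{spillAction{}}
	hardChain := []action{cancelAction{}}

	// Crossing 80% of the quota fires the soft-limit chain; spilling caps
	// the agg operator's growth but cannot shrink what the query holds.
	for _, a := range softChain {
		a.act(soft+1, soft)
	}

	// If allocations keep growing past the full quota, the hard-limit
	// chain cancels the query, even though a spill already happened.
	for _, a := range hardChain {
		a.act(hard+1, hard)
	}
}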

Problem 5:
There is a plan, but it has not yet been scheduled for a specific release.

Thanks! One doubt though: the func (l *ListInDisk) GetChunk(chkIdx int) (*Chunk, error) called here does fetch one chunk at a time, but each interaction with the OS is a separate IO per row.
The code that performs the IO:

// ReadFrom reads data of r, deserializes it from the format of diskFormatRow
// into Row.
func (row *rowInDisk) ReadFrom(r io.Reader) (n int64, err error) {
	// Fixed-size header: one int64 length per column.
	b := make([]byte, 8*row.numCol)
	var n1 int
	n1, err = io.ReadFull(r, b)
	n += int64(n1)
	if err != nil {
		return
	}
	row.sizesOfColumns = bytesToI64Slice(b)
	row.cells = make([][]byte, 0, row.numCol)
	for _, size := range row.sizesOfColumns {
		if size == -1 {
			// A size of -1 marks a NULL cell: no payload bytes follow.
			continue
		}
		cell := make([]byte, size)
		row.cells = append(row.cells, cell)
		// One io.ReadFull per column: a row costs numCol small reads.
		n1, err = io.ReadFull(r, cell)
		n += int64(n1)
		if err != nil {
			return
		}
	}
	return
}

OK, judging from this method, the data is indeed read back row by row.
That is probably to avoid correctness issues when reading variable-length data such as strings.

Actually, reading a single row incurs multiple IOs: the columns are read one by one. :zipper_mouth_face:
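If those many small reads are the bottleneck, one conventional mitigation (my assumption, not something this thread confirms TiDB does) would be to wrap the spill file in a buffered reader, so that each per-column io.ReadFull is served from memory rather than a separate syscall:

package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

func main() {
	// Create a stand-in spill file with 1 MiB of dummy data.
	f, err := os.CreateTemp("", "spill-*")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()
	f.Write(make([]byte, 1<<20))
	f.Seek(0, io.SeekStart)

	// A 64 KiB buffer batches the tiny column-sized reads issued by
	// rowInDisk.ReadFrom into a few large sequential reads.
	r := bufio.NewReaderSize(f, 64*1024)

	cell := make([]byte, 16) // one small per-column read
	if _, err := io.ReadFull(r, cell); err != nil {
		panic(err)
	}
	fmt.Println("read", len(cell), "bytes through the buffer")
}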
