HashAgg operator still OOMs even in non-parallel mode

[TiDB environment] Test
[TiDB version] v7.0.0
Because the hashAgg operator cannot spill to disk when running in parallel, its memory usage is uncontrollable and it OOMs easily. In production we want to avoid interruptions as much as possible, so we run hashAgg single-threaded so that it can spill to disk and keep memory usage down. In practice, however, statements using the non-parallel hashAgg still frequently hit OOM.

The rough execution logic of the non-parallel hashAgg operator is as follows:

Take select col1,count(*) from a group by col1. The operator first puts col1 (deduplicated, forming the key) into a map. For each chunk (a batch of rows) flowing in from the child operator, rows are processed one by one: if col1 is already in the map, the aggregate function (count here) is updated; if not, the operator checks whether spilling has already started. Once a spill has begun, every subsequent row whose key is not in the map is written to disk one row at a time. This yields the first batch of keys (and their aggregated results) that made it into the map before the spill started, which is returned to the parent operator. The operator then reads the previously unprocessed rows back from disk one row at a time (which is slow, problem 1) and repeats the steps above. If memory is still insufficient, it spills again (tracked by a spill counter, spillTimes), but at most maxSpillTimes=10 times (why cap the spill count? problem 2). Moreover, each later spill appends to the same temporary spill file without clearing the earlier records, which can make disk usage very large (problem 3).
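A minimal Go sketch of this flow (the types and names here are illustrative stand-ins, not the actual identifiers in executor/aggregate.go):

package main

import "fmt"

// row stands in for one input row: a group-by key plus an aggregated column.
type row struct {
	key string
	val string
}

// aggState is a simplified stand-in for HashAggExec's in-memory state.
type aggState struct {
	groups      map[string]int64 // key -> count(*) partial result
	inSpillMode bool             // set by the memory action once the quota is exceeded
	spilled     []row            // stands in for the on-disk spill file
}

// consume processes one chunk (a batch of rows) from the child operator.
func (s *aggState) consume(chunk []row) {
	for _, r := range chunk {
		if _, ok := s.groups[r.key]; ok {
			s.groups[r.key]++ // key already in the map: aggregate in memory
			continue
		}
		if s.inSpillMode {
			// New key after spilling started: written out row by row,
			// to be re-read and re-processed in the next pass.
			s.spilled = append(s.spilled, r)
			continue
		}
		s.groups[r.key] = 1 // new key, still within quota
	}
}

func main() {
	s := &aggState{groups: map[string]int64{}}
	s.consume([]row{{"a", "x"}, {"b", "y"}})
	s.inSpillMode = true // pretend the memory action fired here
	s.consume([]row{{"a", "x"}, {"c", "z"}})
	fmt.Println(s.groups, len(s.spilled)) // map[a:2 b:1] 1
}

Each pass returns the in-memory groups to the parent, resets the map, and replays the spilled rows through the same loop; every pass may trigger one more spill, up to maxSpillTimes.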

Parameter settings:

mysql> show variables like 'tidb_hashagg%';
+----------------------------------+-------+
| Variable_name                    | Value |
+----------------------------------+-------+
| tidb_hashagg_final_concurrency   | 1     |
| tidb_hashagg_partial_concurrency | 1     |
+----------------------------------+-------+
2 rows in set (0.00 sec)

mysql> show variables like 'tidb_mem_quota_query';
+----------------------+------------+
| Variable_name        | Value      |
+----------------------+------------+
| tidb_mem_quota_query | 1073741824 |
+----------------------+------------+
1 row in set (0.00 sec)

mysql> show variables like 'tidb_enable_rate_limit_action';
+-------------------------------+-------+
| Variable_name                 | Value |
+-------------------------------+-------+
| tidb_enable_rate_limit_action | OFF   |
+-------------------------------+-------+
1 row in set (0.01 sec)
mysql> show variables like '%oom%';
+--------------------------------+--------+
| Variable_name                  | Value  |
+--------------------------------+--------+
| tidb_enable_tmp_storage_on_oom | ON     |
| tidb_mem_oom_action            | CANCEL |
+--------------------------------+--------+
2 rows in set (0.00 sec)

Test:


mysql> explain analyze  select C_NAME,count(C_PHONE) from customer group by c_name;
+----------------------------+-------------+----------+-----------+----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
| id                         | estRows     | actRows  | task      | access object  | execution info                                                                                                                                                                                                                                                                                           | operator info                                                                                                                                       | memory   | disk    |
+----------------------------+-------------+----------+-----------+----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
| Projection_4               | 15000000.00 | 15000000 | root      |                | time:2m6s, loops:14650, RRU:48513.136572, WRU:0.000000, Concurrency:5                                                                                                                                                                                                                                    | tpch100.customer.c_name, Column#9                                                                                                                   | 427.8 KB | N/A     |
| └─HashAgg_7                | 15000000.00 | 15000000 | root      |                | time:2m5.9s, loops:14650                                                                                                                                                                                                                                                                                 | group by:tpch100.customer.c_name, funcs:count(tpch100.customer.c_phone)->Column#9, funcs:firstrow(tpch100.customer.c_name)->tpch100.customer.c_name | 971.6 MB | 1.38 GB |
|   └─TableReader_12         | 15000000.00 | 15000000 | root      |                | time:146.5ms, loops:14675, cop_task: {num: 565, max: 58.6ms, min: 393µs, avg: 12.5ms, p95: 26.4ms, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 5.39s, tot_wait: 65.6ms, rpc_num: 565, rpc_time: 7.04s, copr_cache_hit_ratio: 0.00, build_task_duration: 37.6µs, max_distsql_concurrency: 15}   | data:TableFullScan_11                                                                                                                               | 37.5 MB  | N/A     |
|     └─TableFullScan_11     | 15000000.00 | 15000000 | cop[tikv] | table:customer | tikv_task:{proc max:38ms, min:0s, avg: 8.29ms, p80:15ms, p95:20ms, iters:16879, tasks:565}, scan_detail: {total_process_keys: 15000000, total_process_keys_size: 3052270577, total_keys: 15000565, get_snapshot_time: 22.2ms, rocksdb: {key_skipped_count: 15000000, block: {cache_hit_count: 103777}}}  | keep order:false                                                                                                                                    | N/A      | N/A     |
+----------------------------+-------------+----------+-----------+----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
4 rows in set (2 min 6.19 sec)

mysql> explain analyze  select C_NAME,count(C_PHONE),count(C_ADDRESS) from customer group by c_name;
ERROR 1105 (HY000): Out Of Memory Quota![conn=3978033069293983]

The TiDB log observed during execution:

[2023/05/03 23:17:15.596 +08:00] [INFO] [controller.go:282] ["[resource group controller] create resource group cost controller"] [name=default]
[2023/05/03 23:17:16.310 +08:00] [INFO] [aggregate.go:1949] ["memory exceeds quota, set aggregate mode to spill-mode"] [spillTimes=1] [consumed=165914156] [quota=207374182]
[2023/05/03 23:17:28.915 +08:00] [WARN] [expensivequery.go:118] ["memory exceeds quota"] [cost_time=13.329033457s] [cop_time=0.015257304s] [process_time=6.481480135s] [wait_time=0.047572265s] [request_count=565] [total_keys=14977315] [process_keys=14976768] [num_cop_tasks=565] [process_avg_time=0.011471646s] [process_p90_time=0.024542849s] [process_max_time=0.035438311s] [process_max_addr=192.168.31.201:20160] [wait_avg_time=0.000084198s] [wait_p90_time=0.000135554s] [wait_max_time=0.000884638s] [wait_max_addr=192.168.31.201:20160] [stats=customer:441211688017461249] [conn=3978033069293983] [user=root] [database=tpch100] [table_ids="[369]"] [txn_start_ts=441221653614297089] [mem_max="232134581 Bytes (221.4 MB)"] [sql="explain analyze  select C_NAME,count(C_PHONE),count(C_ADDRESS) from customer group by c_name"]
[2023/05/03 23:17:28.915 +08:00] [ERROR] [projection.go:466] ["projection executor panicked"] [error="Out Of Memory Quota![conn=3978033069293983]"] [stack="github.com/pingcap/tidb/executor.recoveryProjection\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/projection.go:466\ngithub.com/pingcap/tidb/executor.(*projectionInputFetcher).run.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/projection.go:364\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:884\ngithub.com/pingcap/tidb/util/memory.(*PanicOnExceed).Action\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/action.go:148\ngithub.com/pingcap/tidb/util/memory.(*Tracker).Consume.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:437\ngithub.com/pingcap/tidb/util/memory.(*Tracker).Consume\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:478\ngithub.com/pingcap/tidb/executor.(*HashAggExec).getPartialResults\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/aggregate.go:1121\ngithub.com/pingcap/tidb/executor.(*HashAggExec).execute\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/aggregate.go:1045\ngithub.com/pingcap/tidb/executor.(*HashAggExec).unparallelExec\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/aggregate.go:970\ngithub.com/pingcap/tidb/executor.(*HashAggExec).Next\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/aggregate.go:789\ngithub.com/pingcap/tidb/executor.Next\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/executor.go:326\ngithub.com/pingcap/tidb/executor.(*projectionInputFetcher).run\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/projection.go:388"]
[2023/05/03 23:17:29.098 +08:00] [INFO] [conn.go:1151] ["command dispatched failed"] [conn=3978033069293983] [connInfo="id:3978033069293983, addr:192.168.31.200:37786 status:10, collation:utf8mb4_0900_ai_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="explain analyze  select C_NAME,count(C_PHONE),count(C_ADDRESS) from customer group by c_name"] [txn_mode=PESSIMISTIC] [timestamp=441221653614297089] [err="Out Of Memory Quota![conn=3978033069293983]\ngithub.com/pingcap/tidb/executor.recoveryProjection\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/projection.go:464\ngithub.com/pingcap/tidb/executor.(*projectionInputFetcher).run.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/projection.go:364\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:884\ngithub.com/pingcap/tidb/util/memory.(*PanicOnExceed).Action\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/action.go:148\ngithub.com/pingcap/tidb/util/memory.(*Tracker).Consume.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:437\ngithub.com/pingcap/tidb/util/memory.(*Tracker).Consume\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:478\ngithub.com/pingcap/tidb/executor.(*HashAggExec).getPartialResults\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/aggregate.go:1121\ngithub.com/pingcap/tidb/executor.(*HashAggExec).execute\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/aggregate.go:1045\ngithub.com/pingcap/tidb/executor.(*HashAggExec).unparallelExec\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/aggregate.go:970\ngithub.com/pingcap/tidb/executor.(*HashAggExec).Next\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/aggregate.go:789\ngithub.com/pingcap/tidb/executor.Next\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/executor.go:326\ngithub.com/pingcap/tidb/executor.(*projectionInputFetcher).run\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/projection.go:388\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"]

It shows that during hashAgg the spill happened only once: spillTimes=1. By the spill action's definition, the fallback to oom-action=cancel (DefPanicPriority, the lowest priority) should happen only after more than maxSpillTimes=10 spills; otherwise spilling (DefSpillPriority) takes precedence. Relevant code:
https://github.com/pingcap/tidb/blob/7eff8ce629d00cf4ab1eeda82a1ecd5a4e394b02/executor/aggregate.go#L1945

// Action set HashAggExec spill mode.
func (a *AggSpillDiskAction) Action(t *memory.Tracker) {
	// Guarantee that processed data is at least 20% of the threshold, to avoid spilling too frequently.
	if atomic.LoadUint32(&a.e.inSpillMode) == 0 && a.spillTimes < maxSpillTimes && a.e.memTracker.BytesConsumed() >= t.GetBytesLimit()/5 {
		a.spillTimes++
		logutil.BgLogger().Info("memory exceeds quota, set aggregate mode to spill-mode",
			zap.Uint32("spillTimes", a.spillTimes),
			zap.Int64("consumed", t.BytesConsumed()),
			zap.Int64("quota", t.GetBytesLimit()))
		atomic.StoreUint32(&a.e.inSpillMode, 1)
		memory.QueryForceDisk.Add(1)
		return
	}
	// Already in spill mode, the spill cap reached, or less than 20% of the
	// quota consumed: delegate to the next action in the chain, if any.
	if fallback := a.GetFallback(); fallback != nil {
		fallback.Action(t)
	}
}

Given that only one spill happened, why was the query still cancelled (problem 4)?

1. Regarding problem 1: almost all spill paths suffer from this slowness; can it be optimized?
2. Regarding problem 2: why cap the repeated-spill count at maxSpillTimes=10?
3. Regarding problem 3: repeated spills not only amplify problem 1 but also make disk usage balloon; could the spill storage be designed as multiple "segments" so that stale space is reclaimed promptly?
4. Regarding problem 4: even with spilling enabled, queries are still frequently cancelled; what is the main cause, and can it be optimized?
5. Also, is there a plan to add spill support to the parallel hashAgg?

Problem 4 most likely occurs while the hashAgg data is being spilled: the fallback action sees that memory is still insufficient and chooses the lower-priority behavior (here PanicOnExceed, oom-action=cancel), killing the current session.

This generally happens because TiKV keeps streaming data to TiDB, which feeds it into the hashAgg operator. When the operator's memory check detects that memory is insufficient, it starts spilling; but the spill writes one row at a time and may be slower than the rate at which TiKV delivers data. TiDB's pending RPC messages therefore pile up more and more. With the spill still in progress (a.e.inSpillMode=1), the Action takes the degradation path (looking for an action with a lower Priority). Since we set oom-action=cancel (PanicOnExceed), and the in-memory backlog keeps growing, the query ends up being cancelled.

Regarding problem 3, could it be designed like this?
MySQL 8.0's temporary tablespace maintains a pool, initialized with 10 temporary tablespaces (the number can grow on demand), and every MySQL session can draw on this pool. Could TiDB do something similar: when multiple spills occur, store each round's spilled data in a different temporary tablespace, and truncate a tablespace once its spilled data is no longer needed? That avoids putting everything into a single file that is hard to truncate, which keeps file system space from being reclaimed in time.
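A rough sketch of what such a segmented design could look like in Go (spillPool and its methods are hypothetical names, not anything in TiDB today): each spill round writes into its own temporary file, and a round's file is deleted as soon as its rows have been fully re-processed.

package main

import (
	"fmt"
	"os"
)

// spillPool is a hypothetical segmented spill layout: one temporary file per
// spill round instead of appending every round to a single shared file.
type spillPool struct {
	dir      string
	segments []*os.File
}

// newSegment opens a fresh file for the next spill round.
func (p *spillPool) newSegment() (*os.File, error) {
	f, err := os.CreateTemp(p.dir, "hashagg-spill-*")
	if err != nil {
		return nil, err
	}
	p.segments = append(p.segments, f)
	return f, nil
}

// release closes and deletes one segment once its rows have been
// re-processed, returning the disk space immediately instead of holding
// it until the query ends.
func (p *spillPool) release(i int) error {
	f := p.segments[i]
	p.segments[i] = nil
	name := f.Name()
	if err := f.Close(); err != nil {
		return err
	}
	return os.Remove(name)
}

func main() {
	p := &spillPool{dir: os.TempDir()}
	seg, _ := p.newSegment()
	fmt.Fprintln(seg, "rows spilled in round 1") // round 1 writes its own file
	_ = p.release(0)                             // reclaim the space right after the re-read
}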


Problem 1: "The operator then reads the previously unprocessed rows back from disk one row at a time (which is slow, problem 1)"
I went through the code; the file is read back from disk per chunk (1024 rows).

Problem 2: "but at most maxSpillTimes=10 times (why cap the spill count?)"
This is a magic number chosen at implementation time. The reasoning was that each pass processes about 10% of the data, so 10 passes finish the computation; with a higher cap, the concern was that the repeated spill round-trips could tie up resources for a very long time.

Problem 3: "each later spill appends to the same temporary spill file without clearing the earlier records, which can make disk usage very large"
An implementation issue; this can indeed be optimized.

Problem 4: "even with spilling enabled, queries are still frequently cancelled; what is the main cause, and can it be optimized?"

  1. The agg spill was designed with one goal: guarantee that the agg operator's memory does not grow any further.
  2. Because the agg spill cannot make the query's overall memory go down, its trigger point is tidb_mem_quota_query * 80%.
  3. In the TiDB implementation, the control actions triggered at tidb_mem_quota_query * 80% (the soft limit) and at tidb_mem_quota_query (the hard limit) are tracked on two separate chains: the agg spill sits on the ActionForSoftLimit chain, while cancelling the SQL sits on the ActionForHardLimit chain.
  4. Because the actions are on different chains, the agg spill never falls back to the cancel-query action.
  5. The reason a query can spill and still get cancelled is that the SQL's memory consumption keeps growing during the spill, exceeds tidb_mem_quota_query (the hard limit), and triggers the cancel action on the hard-limit chain; see the sketch after this list.
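A minimal sketch of the two independent chains (illustrative only; the real pieces are memory.ActionOnExceed, AggSpillDiskAction, and PanicOnExceed in the TiDB source):

package main

import "fmt"

// action mimics memory.ActionOnExceed: a response to a memory tracker
// crossing a limit.
type action interface {
	act(consumed, limit int64)
}

type spillAction struct{} // lives on the soft-limit chain

func (spillAction) act(consumed, limit int64) {
	fmt.Printf("soft limit (%d >= %d): switch HashAgg to spill mode\n", consumed, limit)
}

type cancelAction struct{} // lives on the hard-limit chain

func (cancelAction) act(consumed, limit int64) {
	fmt.Printf("hard limit (%d >= %d): cancel the query\n", consumed, limit)
}

func main() {
	quota := int64(1 << 30) // tidb_mem_quota_query = 1 GiB
	soft, hard := quota*8/10, quota

	// Two independent chains: spilling never falls back to cancelling,
	// because cancel lives on the other chain.
	softChain := []action{spillAction{}}
	hardChain := []action{cancelAction{}}

	// Crossing 80% of the quota fires the soft-limit chain; spilling caps
	// the agg operator's growth but cannot shrink what the query holds.
	for _, a := range softChain {
		a.act(soft+1, soft)
	}

	// If allocations keep growing past the full quota, the hard-limit
	// chain cancels the query, even though a spill already happened.
	for _, a := range hardChain {
		a.act(hard+1, hard)
	}
}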

Problem 5:
There is a plan, but it has not yet been scheduled for a specific release.

Thanks! One doubt though: the func (l *ListInDisk) GetChunk(chkIdx int) (*Chunk, error) called here does fetch one chunk at a time, but each interaction with the OS is a separate IO per row.
The code that performs the IO:

// ReadFrom reads data of r, deserializes it from the format of diskFormatRow
// into Row.
func (row *rowInDisk) ReadFrom(r io.Reader) (n int64, err error) {
	// Fixed-size header: one int64 length per column.
	b := make([]byte, 8*row.numCol)
	var n1 int
	n1, err = io.ReadFull(r, b)
	n += int64(n1)
	if err != nil {
		return
	}
	row.sizesOfColumns = bytesToI64Slice(b)
	row.cells = make([][]byte, 0, row.numCol)
	for _, size := range row.sizesOfColumns {
		if size == -1 {
			// A size of -1 marks a NULL cell: no payload bytes follow.
			continue
		}
		cell := make([]byte, size)
		row.cells = append(row.cells, cell)
		// One io.ReadFull per column: a row costs numCol small reads.
		n1, err = io.ReadFull(r, cell)
		n += int64(n1)
		if err != nil {
			return
		}
	}
	return
}

OK, judging from this method, the data is indeed read back row by row.
That is probably to avoid correctness issues when reading variable-length data such as strings.

Actually, reading a single row incurs multiple IOs: the columns are read one by one. :zipper_mouth_face:
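If those many small reads are the bottleneck, one conventional mitigation (my assumption, not something this thread confirms TiDB does) would be to wrap the spill file in a buffered reader, so that each per-column io.ReadFull is served from memory rather than a separate syscall:

package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

func main() {
	// Create a stand-in spill file with 1 MiB of dummy data.
	f, err := os.CreateTemp("", "spill-*")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()
	f.Write(make([]byte, 1<<20))
	f.Seek(0, io.SeekStart)

	// A 64 KiB buffer batches the tiny column-sized reads issued by
	// rowInDisk.ReadFrom into a few large sequential reads.
	r := bufio.NewReaderSize(f, 64*1024)

	cell := make([]byte, 16) // one small per-column read
	if _, err := io.ReadFull(r, cell); err != nil {
		panic(err)
	}
	fmt.Println("read", len(cell), "bytes through the buffer")
}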
