HashJoin is prone to OOM when the BuildSide is too large

When the build side of a HashJoin contains a very large number of rows, the query easily hits a SQL-level OOM, even with spill-to-disk enabled.

--Table structures (TPC-H tables)
mysql> show create table customer \G
*************************** 1. row ***************************
       Table: customer
Create Table: CREATE TABLE `customer` (
  `C_CUSTKEY` bigint(20) NOT NULL,
  `C_NAME` varchar(25) NOT NULL,
  `C_ADDRESS` varchar(40) NOT NULL,
  `C_NATIONKEY` bigint(20) NOT NULL,
  `C_PHONE` char(15) NOT NULL,
  `C_ACCTBAL` decimal(15,2) NOT NULL,
  `C_MKTSEGMENT` char(10) NOT NULL,
  `C_COMMENT` varchar(117) NOT NULL,
  PRIMARY KEY (`C_CUSTKEY`) /*T![clustered_index] CLUSTERED */,
  KEY `idx1` (`C_PHONE`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
1 row in set (0.00 sec)

mysql> show create table orders \G
*************************** 1. row ***************************
       Table: orders
Create Table: CREATE TABLE `orders` (
  `O_ORDERKEY` bigint(20) NOT NULL,
  `O_CUSTKEY` bigint(20) NOT NULL,
  `O_ORDERSTATUS` char(1) NOT NULL,
  `O_TOTALPRICE` decimal(15,2) NOT NULL,
  `O_ORDERDATE` date NOT NULL,
  `O_ORDERPRIORITY` char(15) NOT NULL,
  `O_CLERK` char(15) NOT NULL,
  `O_SHIPPRIORITY` bigint(20) NOT NULL,
  `O_COMMENT` varchar(79) NOT NULL,
  PRIMARY KEY (`O_ORDERKEY`) /*T![clustered_index] CLUSTERED */,
  KEY `idx1` (`O_CUSTKEY`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
1 row in set (0.00 sec)

--Row count of customer (15 million)
mysql> select count(*) from customer;
+----------+
| count(*) |
+----------+
| 15000000 |
+----------+
1 row in set (0.57 sec)
--Row count of orders (150 million)
mysql> select count(*) from orders;
+-----------+
| count(*)  |
+-----------+
| 150000000 |
+-----------+
1 row in set (4.34 sec)

--Database version
mysql> select version();
+--------------------+
| version()          |
+--------------------+
| 5.7.25-TiDB-v7.0.0 |
+--------------------+
1 row in set (0.00 sec)

--oom-action behavior
mysql> show variables like 'tidb_mem_oom_action';
+---------------------+--------+
| Variable_name       | Value  |
+---------------------+--------+
| tidb_mem_oom_action | CANCEL |
+---------------------+--------+
1 row in set (0.00 sec)
mysql> show variables like 'tidb_enable_rate_limit_action';
+-------------------------------+-------+
| Variable_name                 | Value |
+-------------------------------+-------+
| tidb_enable_rate_limit_action | OFF   |
+-------------------------------+-------+
1 row in set (0.00 sec)

--Spill-to-disk setting
mysql> show variables like 'tidb_enable_tmp_storage_on_oom';
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| tidb_enable_tmp_storage_on_oom | ON    |
+--------------------------------+-------+
1 row in set (0.00 sec)
--SQL-level memory quota is 1 GB
mysql> show variables like 'tidb_mem_quota_query';
+----------------------+------------+
| Variable_name        | Value      |
+----------------------+------------+
| tidb_mem_quota_query | 1073741824 |
+----------------------+------------+
1 row in set (0.01 sec)


Use the orders table as the build side of the HashJoin:

mysql> select count(distinct O_CUSTKEY) from orders;
+---------------------------+
| count(distinct O_CUSTKEY) |
+---------------------------+
|                   9999832 |
+---------------------------+
1 row in set (23.24 sec)



mysql> explain   select /*+ HASH_JOIN_BUILD(b) */ count(*) from customer a,orders b where a.c_custkey=b.O_custkey and a.C_PHONE='11-746-264-1304';
+-------------------------------+--------------+-----------+--------------------------------+------------------------------------------------------------------------------+
| id                            | estRows      | task      | access object                  | operator info                                                                |
+-------------------------------+--------------+-----------+--------------------------------+------------------------------------------------------------------------------+
| StreamAgg_9                   | 1.00         | root      |                                | funcs:count(1)->Column#18                                                    |
| └─HashJoin_61                 | 15.20        | root      |                                | inner join, equal:[eq(tpch100.customer.c_custkey, tpch100.orders.o_custkey)] |
|   ├─IndexReader_38(Build)     | 150000000.00 | root      |                                | index:IndexFullScan_37                                                       |
|   │ └─IndexFullScan_37        | 150000000.00 | cop[tikv] | table:b, index:idx1(O_CUSTKEY) | keep order:false                                                             |
|   └─IndexReader_34(Probe)     | 1.00         | root      |                                | index:IndexRangeScan_33                                                      |
|     └─IndexRangeScan_33       | 1.00         | cop[tikv] | table:a, index:idx1(C_PHONE)   | range:["11-746-264-1304","11-746-264-1304"], keep order:false                |
+-------------------------------+--------------+-----------+--------------------------------+------------------------------------------------------------------------------+
6 rows in set (0.00 sec)

--Executing the statement always hits Out of Memory Quota! (the join column o_custkey has 9,999,832 distinct values)
mysql> explain analyze  select /*+ HASH_JOIN_BUILD(b) */ count(*) from customer a,orders b where a.c_custkey=b.O_custkey and a.C_PHONE='11-746-264-1304';
ERROR 1105 (HY000): Out Of Memory Quota![conn=3978033069293973]

--Try a join column with far fewer distinct values on the build side to see whether OOM still occurs.
--Here O_ORDERDATE is used as the join column; it has only ~2.4k distinct values.
mysql> select count(distinct O_ORDERDATE) from orders;
+-----------------------------+
| count(distinct O_ORDERDATE) |
+-----------------------------+
|                        2406 |
+-----------------------------+
1 row in set (2.16 sec)

mysql> explain analyze  select /*+ HASH_JOIN_BUILD(b) */ count(*) from customer a,orders b where a.c_custkey=b.O_ORDERDATE and a.C_PHONE='11-746-264-1304';
ERROR 1105 (HY000): Out Of Memory Quota![conn=3978033069293973]
mysql> 

OOM still occurs. So regardless of whether the build-side join key has high or low duplication, there is a risk of OOM.
In the code that builds the hash table on the HashJoin build side, each key and row pointer is inserted into a hashtable; see:
https://github.com/pingcap/tidb/blob/7eff8ce629d00cf4ab1eeda82a1ecd5a4e394b02/executor/hash_table.go#L463

https://github.com/pingcap/tidb/blob/7eff8ce629d00cf4ab1eeda82a1ecd5a4e394b02/executor/hash_table.go#L599

Moreover, from the method func (ht *unsafeHashTable) Put(hashKey uint64, rowPtr chunk.RowPtr) we can see that the lower the key's duplication, the slower building this hashtable becomes (rows sharing a key only append their rowPtr to an existing chain, while each new distinct key adds a new map entry).
This hashtable stores the keys and their corresponding row pointers, and it never spills to disk (the row data itself, kept in a rowContainer, can spill). Each Put of one row adds an

type entry struct {
	ptr  chunk.RowPtr
	next *entry
}

record of 16 bytes (the row pointer plus the next pointer), plus an 8-byte hash-key record per distinct key. So even if we ignore the data returned by kv requests and all caching (i.e. grant the entire 1 GB quota to the hashtable alone), it can only hold between 1024*1024*1024/16 = 67,108,864 row pointers (when keys are heavily duplicated, about 67.1 million rows) and 1024*1024*1024/24 = 44,739,242 (when keys are almost all distinct, about 44.7 million rows).
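The capacity arithmetic can be checked directly (assuming, as above, that the whole 1 GiB quota is devoted to the hashtable):

```go
package main

import "fmt"

func main() {
	// Per-row cost in the build-side hashtable (64-bit): a 16-byte
	// entry (row pointer + next pointer); each distinct key adds a
	// further 8-byte hash key.
	quota := 1 << 30        // 1 GiB statement-level quota
	fmt.Println(quota / 16) // keys heavily duplicated: 67108864 (~67.1M rows)
	fmt.Println(quota / 24) // keys nearly unique: 44739242 (~44.7M rows)
}
```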

To verify this estimate, vary the build-side row count (key = O_ORDERDATE, only ~2.4k distinct values) across 80M, 70M, and 60M rows and observe whether OOM occurs:

mysql> explain analyze  select /*+ HASH_JOIN_BUILD(b) */ count(*) from customer a,(select * from orders limit 80000000) b where a.c_custkey=b.O_ORDERDATE and a.C_PHONE='11-746-264-1304';
ERROR 1105 (HY000): Out Of Memory Quota![conn=3978033069293973]
mysql> explain analyze  select /*+ HASH_JOIN_BUILD(b) */ count(*) from customer a,(select * from orders limit 70000000) b where a.c_custkey=b.O_ORDERDATE and a.C_PHONE='11-746-264-1304';
ERROR 1105 (HY000): Out Of Memory Quota![conn=3978033069293973]
mysql> 
mysql> explain analyze  select /*+ HASH_JOIN_BUILD(b) */ count(*) from customer a,(select * from orders limit 60000000) b where a.c_custkey=b.O_ORDERDATE and a.C_PHONE='11-746-264-1304';
+------------------------------------+-------------+----------+-----------+------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------+-----------+---------+
| id                                 | estRows     | actRows  | task      | access object                | execution info                                                                                                                                                                                                                                                                                                                                                                                                   | operator info                                                 | memory    | disk    |
+------------------------------------+-------------+----------+-----------+------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------+-----------+---------+
| HashAgg_14                         | 1.00        | 1        | root      |                              | time:16.2s, loops:2, RRU:149686.360502, WRU:0.000000, partial_worker:{wall_time:16.161443238s, concurrency:5, task_num:0, tot_wait:1m20.806052767s, tot_exec:0s, tot_time:1m20.806055402s, max:16.161212139s, p95:16.161212139s}, final_worker:{wall_time:16.161945916s, concurrency:5, task_num:0, tot_wait:1m20.806094127s, tot_exec:1.556µs, tot_time:1m20.806097792s, max:16.161223448s, p95:16.161223448s}  | funcs:count(1)->Column#18                                     | 6.15 KB   | N/A     |
| └─HashJoin_16                      | 24937.66    | 0        | root      |                              | time:16.2s, loops:1, build_hash_table:{total:16.2s, fetch:4.74s, build:11.4s}, probe:{concurrency:5, total:1m20.8s, max:16.2s, probe:13.1µs, fetch:1m20.8s}                                                                                                                                                                                                                                                      | inner join, equal:[eq(Column#19, Column#20)]                  | 1019.3 MB | 1.34 GB |
|   ├─Projection_20(Build)           | 60000000.00 | 60000000 | root      |                              | time:6.85s, loops:58730, Concurrency:5                                                                                                                                                                                                                                                                                                                                                                           | cast(tpch100.orders.o_orderdate, double BINARY)->Column#20    | 87.8 KB   | N/A     |
|   │ └─Limit_21                     | 60000000.00 | 60000000 | root      |                              | time:340.6ms, loops:58730                                                                                                                                                                                                                                                                                                                                                                                        | offset:0, count:60000000                                      | N/A       | N/A     |
|   │   └─TableReader_25             | 60000000.00 | 60000965 | root      |                              | time:283.2ms, loops:58729, cop_task: {num: 2151, max: 108.6ms, min: 363.8µs, avg: 20.1ms, p95: 51.2ms, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 29.7s, tot_wait: 821.9ms, rpc_num: 2151, rpc_time: 43.1s, copr_cache_hit_ratio: 0.00, build_task_duration: 277.5µs, max_distsql_concurrency: 15}                                                                                                    | data:Limit_24                                                 | 6.14 MB   | N/A     |
|   │     └─Limit_24                 | 60000000.00 | 60004037 | cop[tikv] |                              | tikv_task:{proc max:98ms, min:0s, avg: 12.6ms, p80:22ms, p95:36ms, iters:67116, tasks:2151}, scan_detail: {total_process_keys: 60004037, total_process_keys_size: 9114363007, total_keys: 60006188, get_snapshot_time: 313.8ms, rocksdb: {key_skipped_count: 60004037, block: {cache_hit_count: 318616}}}                                                                                                        | offset:0, count:60000000                                      | N/A       | N/A     |
|   │       └─TableFullScan_23       | 60000000.00 | 60004037 | cop[tikv] | table:orders                 | tikv_task:{proc max:98ms, min:0s, avg: 12.6ms, p80:22ms, p95:36ms, iters:67116, tasks:2151}                                                                                                                                                                                                                                                                                                                      | keep order:false                                              | N/A       | N/A     |
|   └─Projection_17(Probe)           | 1.00        | 1        | root      |                              | time:2.07ms, loops:2, Concurrency:OFF                                                                                                                                                                                                                                                                                                                                                                            | cast(tpch100.customer.c_custkey, double BINARY)->Column#19    | 8.86 KB   | N/A     |
|     └─IndexReader_19               | 1.00        | 1        | root      |                              | time:1.97ms, loops:2, cop_task: {num: 1, max: 2.27ms, proc_keys: 0, tot_proc: 2.21µs, tot_wait: 518.6µs, rpc_num: 1, rpc_time: 2.25ms, copr_cache_hit_ratio: 1.00, build_task_duration: 8.62µs, max_distsql_concurrency: 1}                                                                                                                                                                                      | index:IndexRangeScan_18                                       | 298 Bytes | N/A     |
|       └─IndexRangeScan_18          | 1.00        | 1        | cop[tikv] | table:a, index:idx1(C_PHONE) | tikv_task:{time:12ms, loops:1}, scan_detail: {get_snapshot_time: 252.4µs, rocksdb: {block: {}}}                                                                                                                                                                                                                                                                                                                  | range:["11-746-264-1304","11-746-264-1304"], keep order:false | N/A       | N/A     |
+------------------------------------+-------------+----------+-----------+------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------+-----------+---------+
10 rows in set (16.32 sec)

This matches the estimate above fairly well. So with the current hash-join operator, an overly large build side easily exhausts the statement-level memory quota and triggers an oom-kill (query cancel).
Could this be enhanced: if the build side's row count from statistics is too large (larger than the statement-level memory quota times some factor), switch to another algorithm that is somewhat slower but supports a much larger build side (for example, partitioning both build and probe sides as other databases do, completing the computation batch by batch)?
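The partitioned approach suggested here can be sketched as a toy Grace-style hash join (hypothetical row type, with in-memory partitions; a real implementation would spill each partition to disk so that only one build partition's hash table is resident at a time):

```go
package main

import "fmt"

// row is a hypothetical build/probe record keyed by an int join column.
type row struct {
	key int
	val string
}

// graceHashJoin partitions both inputs by the same hash function, then
// joins one partition at a time, so the in-memory hash table only ever
// holds about 1/numPart of the build side.
func graceHashJoin(build, probe []row, numPart int) [][2]row {
	// Phase 1: partition both sides identically (disk-backed in a real
	// implementation; plain slices here).
	bp := make([][]row, numPart)
	pp := make([][]row, numPart)
	for _, r := range build {
		i := r.key % numPart
		bp[i] = append(bp[i], r)
	}
	for _, r := range probe {
		i := r.key % numPart
		pp[i] = append(pp[i], r)
	}
	// Phase 2: build a small hash table per partition and probe it.
	var out [][2]row
	for i := 0; i < numPart; i++ {
		ht := make(map[int][]row, len(bp[i]))
		for _, b := range bp[i] {
			ht[b.key] = append(ht[b.key], b)
		}
		for _, p := range pp[i] {
			for _, b := range ht[p.key] {
				out = append(out, [2]row{b, p})
			}
		}
	}
	return out
}

func main() {
	build := []row{{1, "b1"}, {2, "b2"}, {2, "b2x"}}
	probe := []row{{2, "p2"}, {3, "p3"}}
	fmt.Println(len(graceHashJoin(build, probe, 4))) // 2 matches for key=2
}
```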

To summarize the issues you raised:

  1. The probe side has only 1 row.
  2. As the build side grows, hash join starts to spill to disk.
  3. But even with spilling, as the build side keeps growing, the query can still exceed its memory quota and be cancelled.

I captured a heap profile during the run, and it confirms your analysis: after spilling starts, the memory growth comes from the map (hashtable) that stores the key-to-row-pointer mapping, which is what eventually gets the query cancelled.

Could this be enhanced: if the build side's row count from statistics is too large (larger than the statement-level memory quota times some factor), switch to another algorithm that is somewhat slower but supports a much larger build side (for example, partitioning both build and probe sides as other databases do, completing the computation batch by batch)?

That is a feasible direction. Such an adaptive approach would also depend on how accurately the optimizer estimates the cardinality of the build side.

One additional note: since v6.5.0 we have supported instance-level memory control. The default tidb_mem_quota_query of 1 GB is fairly conservative; in real-world workloads, with the instance-level memory limit in place as a backstop, we tend to raise tidb_mem_quota_query where necessary.

