When the build side of a HashJoin contains a very large number of rows, the query can easily hit the SQL-level memory quota (OOM), even with spill-to-disk enabled.
--Table structure (TPC-H tables)
mysql> show create table customer \G
*************************** 1. row ***************************
Table: customer
Create Table: CREATE TABLE `customer` (
`C_CUSTKEY` bigint(20) NOT NULL,
`C_NAME` varchar(25) NOT NULL,
`C_ADDRESS` varchar(40) NOT NULL,
`C_NATIONKEY` bigint(20) NOT NULL,
`C_PHONE` char(15) NOT NULL,
`C_ACCTBAL` decimal(15,2) NOT NULL,
`C_MKTSEGMENT` char(10) NOT NULL,
`C_COMMENT` varchar(117) NOT NULL,
PRIMARY KEY (`C_CUSTKEY`) /*T![clustered_index] CLUSTERED */,
KEY `idx1` (`C_PHONE`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
1 row in set (0.00 sec)
mysql> show create table orders \G
*************************** 1. row ***************************
Table: orders
Create Table: CREATE TABLE `orders` (
`O_ORDERKEY` bigint(20) NOT NULL,
`O_CUSTKEY` bigint(20) NOT NULL,
`O_ORDERSTATUS` char(1) NOT NULL,
`O_TOTALPRICE` decimal(15,2) NOT NULL,
`O_ORDERDATE` date NOT NULL,
`O_ORDERPRIORITY` char(15) NOT NULL,
`O_CLERK` char(15) NOT NULL,
`O_SHIPPRIORITY` bigint(20) NOT NULL,
`O_COMMENT` varchar(79) NOT NULL,
PRIMARY KEY (`O_ORDERKEY`) /*T![clustered_index] CLUSTERED */,
KEY `idx1` (`O_CUSTKEY`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
1 row in set (0.00 sec)
--Row count of customer (15 million)
mysql> select count(*) from customer;
+----------+
| count(*) |
+----------+
| 15000000 |
+----------+
1 row in set (0.57 sec)
--Row count of orders (150 million)
mysql> select count(*) from orders;
+-----------+
| count(*) |
+-----------+
| 150000000 |
+-----------+
1 row in set (4.34 sec)
--Database version
mysql> select version();
+--------------------+
| version() |
+--------------------+
| 5.7.25-TiDB-v7.0.0 |
+--------------------+
1 row in set (0.00 sec)
--oom-action behavior
mysql> show variables like 'tidb_mem_oom_action';
+---------------------+--------+
| Variable_name | Value |
+---------------------+--------+
| tidb_mem_oom_action | CANCEL |
+---------------------+--------+
1 row in set (0.00 sec)
mysql> show variables like 'tidb_enable_rate_limit_action';
+-------------------------------+-------+
| Variable_name | Value |
+-------------------------------+-------+
| tidb_enable_rate_limit_action | OFF |
+-------------------------------+-------+
1 row in set (0.00 sec)
--Spill-to-disk parameter
mysql> show variables like 'tidb_enable_tmp_storage_on_oom';
+--------------------------------+-------+
| Variable_name | Value |
+--------------------------------+-------+
| tidb_enable_tmp_storage_on_oom | ON |
+--------------------------------+-------+
1 row in set (0.00 sec)
--SQL-level memory quota is 1 GB
mysql> show variables like 'tidb_mem_quota_query';
+----------------------+------------+
| Variable_name | Value |
+----------------------+------------+
| tidb_mem_quota_query | 1073741824 |
+----------------------+------------+
1 row in set (0.01 sec)
Use the orders table as the build side of the hashJoin:
mysql> select count(distinct O_CUSTKEY) from orders;
+---------------------------+
| count(distinct O_CUSTKEY) |
+---------------------------+
| 9999832 |
+---------------------------+
1 row in set (23.24 sec)
mysql> explain select /*+ HASH_JOIN_BUILD(b) */ count(*) from customer a,orders b where a.c_custkey=b.O_custkey and a.C_PHONE='11-746-264-1304';
+-------------------------------+--------------+-----------+--------------------------------+------------------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+-------------------------------+--------------+-----------+--------------------------------+------------------------------------------------------------------------------+
| StreamAgg_9 | 1.00 | root | | funcs:count(1)->Column#18 |
| └─HashJoin_61 | 15.20 | root | | inner join, equal:[eq(tpch100.customer.c_custkey, tpch100.orders.o_custkey)] |
| ├─IndexReader_38(Build) | 150000000.00 | root | | index:IndexFullScan_37 |
| │ └─IndexFullScan_37 | 150000000.00 | cop[tikv] | table:b, index:idx1(O_CUSTKEY) | keep order:false |
| └─IndexReader_34(Probe) | 1.00 | root | | index:IndexRangeScan_33 |
| └─IndexRangeScan_33 | 1.00 | cop[tikv] | table:a, index:idx1(C_PHONE) | range:["11-746-264-1304","11-746-264-1304"], keep order:false |
+-------------------------------+--------------+-----------+--------------------------------+------------------------------------------------------------------------------+
6 rows in set (0.00 sec)
--Running this statement always fails with Out of Memory Quota! (the join key o_custkey has 9,999,832 distinct values)
mysql> explain analyze select /*+ HASH_JOIN_BUILD(b) */ count(*) from customer a,orders b where a.c_custkey=b.O_custkey and a.C_PHONE='11-746-264-1304';
ERROR 1105 (HY000): Out Of Memory Quota![conn=3978033069293973]
--Next, pick a join key on the build side with very few distinct values and see whether OOM still occurs.
--Here O_ORDERDATE is used as the join key; it has only ~2.4k distinct values.
mysql> select count(distinct O_ORDERDATE) from orders;
+-----------------------------+
| count(distinct O_ORDERDATE) |
+-----------------------------+
| 2406 |
+-----------------------------+
1 row in set (2.16 sec)
mysql> explain analyze select /*+ HASH_JOIN_BUILD(b) */ count(*) from customer a,orders b where a.c_custkey=b.O_ORDERDATE and a.C_PHONE='11-746-264-1304';
ERROR 1105 (HY000): Out Of Memory Quota![conn=3978033069293973]
mysql>
OOM still occurs. So whether the duplication of the build-side join key is high or low, the risk of OOM remains.
In the code that builds the hash table for the hash join's build side, every key and its row pointer are put into a hashtable; code location:
https://github.com/pingcap/tidb/blob/7eff8ce629d00cf4ab1eeda82a1ecd5a4e394b02/executor/hash_table.go#L463
The func (ht *unsafeHashTable) Put(hashKey uint64, rowPtr chunk.RowPtr) method also shows that the lower the key duplication, the slower building this hash table becomes (row pointers of rows that share a key are chained together under that key).
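The Put path described above can be sketched as follows. This is a simplified model for illustration: the names RowPtr, entry, and hashTable are modeled on hash_table.go but are not TiDB's actual implementation.

```go
package main

import "fmt"

// RowPtr locates one row: 8 bytes (two uint32 indexes).
type RowPtr struct {
	ChkIdx uint32
	RowIdx uint32
}

// entry chains rows sharing a hash key: 16 bytes (8-byte ptr + 8-byte next).
type entry struct {
	ptr  RowPtr
	next *entry
}

type hashTable struct {
	hashMap map[uint64]*entry
	length  int
}

func newHashTable() *hashTable {
	return &hashTable{hashMap: make(map[uint64]*entry)}
}

// Put prepends the row pointer to the entry list for hashKey: rows that
// share a key are chained together, so every Put costs one 16-byte entry.
func (ht *hashTable) Put(hashKey uint64, rowPtr RowPtr) {
	ht.hashMap[hashKey] = &entry{ptr: rowPtr, next: ht.hashMap[hashKey]}
	ht.length++
}

// Get walks the chain to collect all row pointers stored under hashKey.
func (ht *hashTable) Get(hashKey uint64) (rowPtrs []RowPtr) {
	for e := ht.hashMap[hashKey]; e != nil; e = e.next {
		rowPtrs = append(rowPtrs, e.ptr)
	}
	return
}

func main() {
	ht := newHashTable()
	ht.Put(42, RowPtr{ChkIdx: 0, RowIdx: 1})
	ht.Put(42, RowPtr{ChkIdx: 0, RowIdx: 7}) // same key: chained, not replaced
	fmt.Println(len(ht.Get(42)), ht.length)
}
```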
This hashtable stores only keys and the corresponding row pointers, and it never spills to disk; only the row data itself (kept in the rowContainer) has spill behavior. However, every Put of one row appends an entry:

    type entry struct {
        ptr  chunk.RowPtr
        next *entry
    }

that is, a 16-byte row-pointer record, plus one 8-byte deduplicated hash-key record per distinct key. So even ignoring the data coming back from kv requests and any cached data (i.e., handing the entire 1 GB quota to the hashtable alone), it can hold only roughly between 1 GB = 1024*1024*1024 bytes / 16 = 67,108,864 row pointers (~67 million, when keys are heavily duplicated) and 1024*1024*1024 / 24 = 44,739,242 (~44.7 million, when keys are almost all distinct).
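The capacity bounds above can be checked with a quick back-of-the-envelope calculation, assuming the per-entry costs described: 16 bytes per Put entry, plus 8 bytes per distinct hash key.

```go
package main

import "fmt"

func main() {
	const quota = 1 << 30 // 1 GB SQL-level memory quota
	const entrySize = 16  // bytes per row-pointer entry (8-byte RowPtr + 8-byte next)
	const keySize = 8     // bytes per distinct hash key

	// Upper bound: keys heavily duplicated, hash-key overhead amortized away.
	fmt.Println(quota / entrySize) // 67108864 (~67 million rows)
	// Lower bound: keys almost all distinct, every row pays the extra 8 bytes.
	fmt.Println(quota / (entrySize + keySize)) // 44739242 (~44.7 million rows)
}
```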
To verify this, set the build-side row count (key = O_ORDERDATE, very few distinct values) to 80 million, 70 million, and 60 million, and observe whether OOM occurs:
mysql> explain analyze select /*+ HASH_JOIN_BUILD(b) */ count(*) from customer a,(select * from orders limit 80000000) b where a.c_custkey=b.O_ORDERDATE and a.C_PHONE='11-746-264-1304';
ERROR 1105 (HY000): Out Of Memory Quota![conn=3978033069293973]
mysql> explain analyze select /*+ HASH_JOIN_BUILD(b) */ count(*) from customer a,(select * from orders limit 70000000) b where a.c_custkey=b.O_ORDERDATE and a.C_PHONE='11-746-264-1304';
ERROR 1105 (HY000): Out Of Memory Quota![conn=3978033069293973]
mysql>
mysql> explain analyze select /*+ HASH_JOIN_BUILD(b) */ count(*) from customer a,(select * from orders limit 60000000) b where a.c_custkey=b.O_ORDERDATE and a.C_PHONE='11-746-264-1304';
+------------------------------------+-------------+----------+-----------+------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------+-----------+---------+
| id | estRows | actRows | task | access object | execution info | operator info | memory | disk |
+------------------------------------+-------------+----------+-----------+------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------+-----------+---------+
| HashAgg_14 | 1.00 | 1 | root | | time:16.2s, loops:2, RRU:149686.360502, WRU:0.000000, partial_worker:{wall_time:16.161443238s, concurrency:5, task_num:0, tot_wait:1m20.806052767s, tot_exec:0s, tot_time:1m20.806055402s, max:16.161212139s, p95:16.161212139s}, final_worker:{wall_time:16.161945916s, concurrency:5, task_num:0, tot_wait:1m20.806094127s, tot_exec:1.556µs, tot_time:1m20.806097792s, max:16.161223448s, p95:16.161223448s} | funcs:count(1)->Column#18 | 6.15 KB | N/A |
| └─HashJoin_16 | 24937.66 | 0 | root | | time:16.2s, loops:1, build_hash_table:{total:16.2s, fetch:4.74s, build:11.4s}, probe:{concurrency:5, total:1m20.8s, max:16.2s, probe:13.1µs, fetch:1m20.8s} | inner join, equal:[eq(Column#19, Column#20)] | 1019.3 MB | 1.34 GB |
| ├─Projection_20(Build) | 60000000.00 | 60000000 | root | | time:6.85s, loops:58730, Concurrency:5 | cast(tpch100.orders.o_orderdate, double BINARY)->Column#20 | 87.8 KB | N/A |
| │ └─Limit_21 | 60000000.00 | 60000000 | root | | time:340.6ms, loops:58730 | offset:0, count:60000000 | N/A | N/A |
| │ └─TableReader_25 | 60000000.00 | 60000965 | root | | time:283.2ms, loops:58729, cop_task: {num: 2151, max: 108.6ms, min: 363.8µs, avg: 20.1ms, p95: 51.2ms, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 29.7s, tot_wait: 821.9ms, rpc_num: 2151, rpc_time: 43.1s, copr_cache_hit_ratio: 0.00, build_task_duration: 277.5µs, max_distsql_concurrency: 15} | data:Limit_24 | 6.14 MB | N/A |
| │ └─Limit_24 | 60000000.00 | 60004037 | cop[tikv] | | tikv_task:{proc max:98ms, min:0s, avg: 12.6ms, p80:22ms, p95:36ms, iters:67116, tasks:2151}, scan_detail: {total_process_keys: 60004037, total_process_keys_size: 9114363007, total_keys: 60006188, get_snapshot_time: 313.8ms, rocksdb: {key_skipped_count: 60004037, block: {cache_hit_count: 318616}}} | offset:0, count:60000000 | N/A | N/A |
| │ └─TableFullScan_23 | 60000000.00 | 60004037 | cop[tikv] | table:orders | tikv_task:{proc max:98ms, min:0s, avg: 12.6ms, p80:22ms, p95:36ms, iters:67116, tasks:2151} | keep order:false | N/A | N/A |
| └─Projection_17(Probe) | 1.00 | 1 | root | | time:2.07ms, loops:2, Concurrency:OFF | cast(tpch100.customer.c_custkey, double BINARY)->Column#19 | 8.86 KB | N/A |
| └─IndexReader_19 | 1.00 | 1 | root | | time:1.97ms, loops:2, cop_task: {num: 1, max: 2.27ms, proc_keys: 0, tot_proc: 2.21µs, tot_wait: 518.6µs, rpc_num: 1, rpc_time: 2.25ms, copr_cache_hit_ratio: 1.00, build_task_duration: 8.62µs, max_distsql_concurrency: 1} | index:IndexRangeScan_18 | 298 Bytes | N/A |
| └─IndexRangeScan_18 | 1.00 | 1 | cop[tikv] | table:a, index:idx1(C_PHONE) | tikv_task:{time:12ms, loops:1}, scan_detail: {get_snapshot_time: 252.4µs, rocksdb: {block: {}}} | range:["11-746-264-1304","11-746-264-1304"], keep order:false | N/A | N/A |
+------------------------------------+-------------+----------+-----------+------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------+-----------+---------+
10 rows in set (16.32 sec)
The results basically match the estimate: the 60-million-row build succeeds, with the HashJoin peaking at 1019.3 MB of memory, just under the 1 GB quota. So with the current hash_join operator, if the build side has too many rows, the statement-level memory quota is easily exceeded and the query gets oom-killed.
Could this be enhanced? If the build side's row count according to statistics is too large (greater than the statement-level memory quota times some factor), switch to a different algorithm that is somewhat slower but supports a much larger build side, for example the approach used by other databases of partitioning both the build and probe sides and completing the join in batches, one partition pair at a time.
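As a sketch of that idea, here is a minimal in-memory model of a partitioned (Grace-style) hash join. Real implementations write partitions to disk and only build a hash table over one small partition at a time; all names here (Row, graceHashJoin, numPartitions) are illustrative, not TiDB code.

```go
package main

import "fmt"

type Row struct {
	Key int
	Val string
}

const numPartitions = 4

// partition splits rows by a hash of the join key; a real implementation
// would spill each partition to disk instead of keeping it in memory.
func partition(rows []Row) [][]Row {
	parts := make([][]Row, numPartitions)
	for _, r := range rows {
		p := r.Key % numPartitions // toy hash function
		parts[p] = append(parts[p], r)
	}
	return parts
}

// graceHashJoin joins matching partition pairs one at a time, so the peak
// hash-table size is ~1/numPartitions of a single monolithic build.
func graceHashJoin(build, probe []Row) [][2]Row {
	buildParts := partition(build)
	probeParts := partition(probe)
	var out [][2]Row
	for i := 0; i < numPartitions; i++ {
		ht := make(map[int][]Row)
		for _, b := range buildParts[i] {
			ht[b.Key] = append(ht[b.Key], b)
		}
		for _, p := range probeParts[i] {
			for _, b := range ht[p.Key] {
				out = append(out, [2]Row{b, p})
			}
		}
	}
	return out
}

func main() {
	build := []Row{{1, "b1"}, {2, "b2"}, {5, "b5"}}
	probe := []Row{{1, "p1"}, {5, "p5"}, {9, "p9"}}
	for _, m := range graceHashJoin(build, probe) {
		fmt.Println(m[0].Key, m[0].Val, m[1].Val)
	}
}
```

Because rows with equal keys always hash to the same partition, no match is lost; each pass only needs enough memory for one build partition's hash table.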