Semi join executes too slowly when the subquery column has many duplicate values

[TiDB Environment] Testing
[TiDB Version] 5.7.25-TiDB-v7.3.0

mysql> explain analyze select count(*) from customer a where exists (select 1 from customer b where a.C_NATIONKEY=b.C_NATIONKEY);
+-------------------------------+------------+---------+-----------+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------+----------+---------+
| id                            | estRows    | actRows | task      | access object | execution info                                                                                                                                                                                                                                                                                                                                           | operator info                                                                   | memory   | disk    |
+-------------------------------+------------+---------+-----------+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------+----------+---------+
| StreamAgg_11                  | 1.00       | 1       | root      |               | time:1h15m6.9s, loops:2                                                                                                                                                                                                                                                                                                                                  | funcs:count(1)->Column#18                                                       | 8 Bytes  | N/A     |
| └─HashJoin_17                 | 1200000.00 | 1500000 | root      |               | time:1h15m6.9s, loops:1467, build_hash_table:{total:585.7ms, fetch:473.8ms, build:111.9ms}, probe:{concurrency:5, total:6h15m19.3s, max:1h15m6.9s, probe:6h15m14.7s, fetch:4.65s}                                                                                                                                                                        | semi join, equal:[eq(tpch10.customer.c_nationkey, tpch10.customer.c_nationkey)] | 46.3 MB  | 0 Bytes |
|   ├─TableReader_16(Build)     | 1500000.00 | 1500000 | root      |               | time:471.4ms, loops:1468, cop_task: {num: 54, max: 66.6ms, min: 1.34ms, avg: 26.2ms, p95: 62.3ms, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 1.19s, tot_wait: 15ms, rpc_num: 54, rpc_time: 1.41s, copr_cache_hit_ratio: 0.00, distsql_concurrency: 15}                                                                                        | data:TableFullScan_15                                                           | 785.3 KB | N/A     |
|   │ └─TableFullScan_15        | 1500000.00 | 1500000 | cop[tikv] | table:b       | tikv_task:{proc max:61ms, min:0s, avg: 21.9ms, p80:41ms, p95:53ms, iters:1678, tasks:54}, scan_detail: {total_process_keys: 1500000, total_process_keys_size: 305225771, total_keys: 1500054, get_snapshot_time: 1.58ms, rocksdb: {key_skipped_count: 1500000, block: {cache_hit_count: 160, read_count: 4992, read_byte: 112.6 MB, read_time: 57.9ms}}} | keep order:false                                                                | N/A      | N/A     |
|   └─TableReader_14(Probe)     | 1500000.00 | 1500000 | root      |               | time:88.1ms, loops:1468, cop_task: {num: 54, max: 50ms, min: 624.9µs, avg: 13.8ms, p95: 45.9ms, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 71ms, tot_wait: 111ms, rpc_num: 54, rpc_time: 745.2ms, copr_cache_hit_ratio: 0.70, distsql_concurrency: 15}                                                                                        | data:TableFullScan_13                                                           | 1.53 MB  | N/A     |
|     └─TableFullScan_13        | 1500000.00 | 1500000 | cop[tikv] | table:a       | tikv_task:{proc max:61ms, min:0s, avg: 21ms, p80:41ms, p95:53ms, iters:1678, tasks:54}, scan_detail: {total_process_keys: 191231, total_process_keys_size: 38891884, total_keys: 191247, get_snapshot_time: 71.8ms, rocksdb: {key_skipped_count: 191231, block: {cache_hit_count: 670, read_count: 15, read_byte: 1019.5 KB, read_time: 1.6ms}}}         | keep order:false                                                                | N/A      | N/A     |
+-------------------------------+------------+---------+-----------+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------+----------+---------+
6 rows in set (1 hour 15 min 6.93 sec)

The table structure is as follows:

mysql> show create table customer \G
*************************** 1. row ***************************
       Table: customer
Create Table: CREATE TABLE `customer` (
  `C_CUSTKEY` bigint(20) NOT NULL,
  `C_NAME` varchar(25) NOT NULL,
  `C_ADDRESS` varchar(40) NOT NULL,
  `C_NATIONKEY` bigint(20) NOT NULL,
  `C_PHONE` char(15) NOT NULL,
  `C_ACCTBAL` decimal(15,2) NOT NULL,
  `C_MKTSEGMENT` char(10) NOT NULL,
  `C_COMMENT` varchar(117) NOT NULL,
  PRIMARY KEY (`C_CUSTKEY`) /*T![clustered_index] CLUSTERED */,
  KEY `idx2` (`C_PHONE`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
1 row in set (0.01 sec)

mysql> select count(C_NATIONKEY),count(distinct C_NATIONKEY) from customer;
+--------------------+-----------------------------+
| count(C_NATIONKEY) | count(distinct C_NATIONKEY) |
+--------------------+-----------------------------+
|            1500000 |                          25 |
+--------------------+-----------------------------+
1 row in set (0.12 sec)

Why is the semi join this slow? It should normally return a result within a few seconds, shouldn't it?

This SQL runs for an hour? That's a bit extreme.

Take a look at the cluster resource utilization.

I don't follow — what is this statement actually trying to query???

explain analyze select count(*) from customer a where exists (select 1 from customer b where a.C_NATIONKEY=b.C_NATIONKEY);


This is a purely technical discussion of why the semi join is slow; the question is focused on the semi join alone.
Let me describe the overall background of the problem:
In real-world scenarios we often run into subqueries of the form select xx from a where a.col1 in (select b.col1 from b). For this kind of subquery, the TiDB optimizer by default deduplicates b.col1 and then joins it with a.col1. For example:

mysql> explain select count(*) from customer where c_custkey in (select o_custkey from orders);
+----------------------------------+-------------+-----------+----------------+----------------------------------------------------------------------------------------------------+
| id                               | estRows     | task      | access object  | operator info                                                                                      |
+----------------------------------+-------------+-----------+----------------+----------------------------------------------------------------------------------------------------+
| StreamAgg_12                     | 1.00        | root      |                | funcs:count(1)->Column#18                                                                          |
| └─HashJoin_43                    | 1009664.00  | root      |                | inner join, equal:[eq(tpch10.customer.c_custkey, tpch10.orders.o_custkey)]                         |
|   ├─HashAgg_28(Build)            | 1009664.00  | root      |                | group by:tpch10.orders.o_custkey, funcs:firstrow(tpch10.orders.o_custkey)->tpch10.orders.o_custkey |
|   │ └─TableReader_29             | 1009664.00  | root      |                | data:HashAgg_24                                                                                    |
|   │   └─HashAgg_24               | 1009664.00  | cop[tikv] |                | group by:tpch10.orders.o_custkey,                                                                  |
|   │     └─TableFullScan_27       | 15000000.00 | cop[tikv] | table:orders   | keep order:false                                                                                   |
|   └─TableReader_33(Probe)        | 1500000.00  | root      |                | data:TableFullScan_32                                                                              |
|     └─TableFullScan_32           | 1500000.00  | cop[tikv] | table:customer | keep order:false                                                                                   |
+----------------------------------+-------------+-----------+----------------+----------------------------------------------------------------------------------------------------+
8 rows in set (0.00 sec)

If b.col1 has few duplicate values, this aggregation is largely pointless and fairly expensive; I think a direct semi join would be more efficient.
Why does TiDB always aggregate and deduplicate table b? Because of this variable:
https://docs.pingcap.com/zh/tidb/stable/system-variables#tidb_opt_insubq_to_join_and_agg

tidb_opt_insubq_to_join_and_agg

  • Scope: SESSION | GLOBAL
  • Persists to cluster: Yes
  • Type: Boolean
  • Default value: ON
  • This variable controls whether to enable the optimization rule that converts a subquery into join and aggregation.
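
For completeness, the current value can be checked before toggling it; a minimal sketch (nothing beyond a standard SHOW VARIABLES lookup is assumed):

-- check the current value of the rewrite switch in this session
show variables like 'tidb_opt_insubq_to_join_and_agg';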

After turning this variable off, we can see that the query now goes through a semi join:

mysql> set tidb_opt_insubq_to_join_and_agg=OFF;
Query OK, 0 rows affected (0.00 sec)

mysql> explain select count(*) from customer where c_custkey in (select o_custkey from orders);
+-------------------------------+-------------+-----------+----------------+---------------------------------------------------------------------------+
| id                            | estRows     | task      | access object  | operator info                                                             |
+-------------------------------+-------------+-----------+----------------+---------------------------------------------------------------------------+
| StreamAgg_10                  | 1.00        | root      |                | funcs:count(1)->Column#18                                                 |
| └─HashJoin_16                 | 1200000.00  | root      |                | semi join, equal:[eq(tpch10.customer.c_custkey, tpch10.orders.o_custkey)] |
|   ├─TableReader_15(Build)     | 15000000.00 | root      |                | data:TableFullScan_14                                                     |
|   │ └─TableFullScan_14        | 15000000.00 | cop[tikv] | table:orders   | keep order:false                                                          |
|   └─TableReader_13(Probe)     | 1500000.00  | root      |                | data:TableFullScan_12                                                     |
|     └─TableFullScan_12        | 1500000.00  | cop[tikv] | table:customer | keep order:false                                                          |
+-------------------------------+-------------+-----------+----------------+---------------------------------------------------------------------------+
6 rows in set (0.00 sec)

There is a question here: why does the optimizer prefer deduplication by default?
1. If the column in the IN subquery has few duplicate values, deduplication brings little benefit, and the aggregation itself costs more.
2. If the column has many duplicate values, deduplication does help somewhat, but a semi join stops as soon as a probe row finds a match and moves on to the next row (early out), so in my understanding the semi join should not be that slow either.
However, TiDB uses the variable tidb_opt_insubq_to_join_and_agg to control this (hopefully it becomes adaptive later) and by default always deduplicates the IN-subquery column. I suspected that on TiDB the semi join might still perform poorly when there are many duplicates, so checking whether the semi join really is inefficient when the subquery column has many duplicate values became the main point of this post.
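
To make "many duplicate values" concrete: 1,500,000 rows spread over only 25 distinct C_NATIONKEY values means each value repeats roughly 60,000 times on average. A quick way to see the per-value distribution (just a sanity-check query on the same table, not run here):

-- per-value row counts on the build-side join column
select C_NATIONKEY, count(*) as cnt
from customer
group by C_NATIONKEY
order by cnt desc
limit 5;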
1. To avoid bringing multi-table plans into the picture, only the customer table from tpch10 is used for the subquery test:

-- TiDB does not rewrite this kind of subquery against the same table into a single-table query, so one table is enough for the subquery test
select count(*) from customer a where a.C_NATIONKEY in (select C_NATIONKEY from customer b );

2. To avoid the extra effect of tidb_opt_insubq_to_join_and_agg deduplicating the IN subquery by default, and to keep the problem focused on the semi join, the statement above was rewritten with EXISTS so that it goes straight to a semi join (this way the post would not have needed to bring up tidb_opt_insubq_to_join_and_agg at all, though it has been explained here anyway...):

mysql> explain select count(*) from customer a where exists (select 1 from customer b where a.C_NATIONKEY=b.C_NATIONKEY);
+-------------------------------+------------+-----------+---------------+---------------------------------------------------------------------------------+
| id                            | estRows    | task      | access object | operator info                                                                   |
+-------------------------------+------------+-----------+---------------+---------------------------------------------------------------------------------+
| StreamAgg_11                  | 1.00       | root      |               | funcs:count(1)->Column#18                                                       |
| └─HashJoin_17                 | 1200000.00 | root      |               | semi join, equal:[eq(tpch10.customer.c_nationkey, tpch10.customer.c_nationkey)] |
|   ├─TableReader_16(Build)     | 1500000.00 | root      |               | data:TableFullScan_15                                                           |
|   │ └─TableFullScan_15        | 1500000.00 | cop[tikv] | table:b       | keep order:false                                                                |
|   └─TableReader_14(Probe)     | 1500000.00 | root      |               | data:TableFullScan_13                                                           |
|     └─TableFullScan_13        | 1500000.00 | cop[tikv] | table:a       | keep order:false                                                                |
+-------------------------------+------------+-----------+---------------+---------------------------------------------------------------------------------+
6 rows in set (0.00 sec)

That is how the original question above came about; it looks meaningless from a business point of view, but the intent is simply to focus on the operator itself.
At the application level there are plenty of possible optimizations, e.g. deduplicating first when the column has many duplicates, adding an index, and so on, but I would rather the product itself had a stronger safety net, since not every developer is good at performance tuning.
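
As a sketch of those application-level workarounds (not verified against this data set; the index name is made up for illustration):

-- deduplicate the subquery side explicitly so the build side shrinks to ~25 rows
select count(*)
from customer a
where exists (select 1
              from (select distinct C_NATIONKEY from customer) b
              where a.C_NATIONKEY = b.C_NATIONKEY);

-- or add an index on the join column so the optimizer has more plan choices
-- alter table customer add index idx_nationkey (C_NATIONKEY);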


May I ask what data volume (scale factor) this is? I'd like to try it on other databases as well :upside_down_face:

tpch sf=10


We are tracking this semi join execution efficiency problem in https://github.com/pingcap/tidb/issues/47424.


This conversion is a cost-based one; cost-based logical optimization is not supported yet, and automatic selection will be supported later in the cascades framework.

It is rule-based — a slip of the tongue there, I guess?

Sharp eye, man :+1:

