join reorder中inner join和outer join顺序问题

【 TiDB 使用环境】测试
【 TiDB 版本】v7.5.1

对于一条SQL:select * from a inner join b on a.c1=b.id left join c on a.c2=c.id;
如果a join b的结果集大于a join c的结果集,那么优化器会评估执行顺序为:a left join c inner join b ,这常见于b表的关联条件上存在较多重复数据的情况下,但是如果能确定a inner join b的结果集最多不大于count(a) (比如b表关联字段为主键)那么先做a inner b在绝大多数情况下都是最优的
但从测试来看TiDB并没有做这块的优化,测试如下:

测试SQL:

drop table if exists a;
drop table if exists b;
drop table if exists c;

create table a(id int,c1 int,c2 int,primary key (id));
create table b(id int,c1 int,c2 int,primary key (id));
create table c(id int,c1 int,c2 int,primary key (id));

insert into a values (1,1,1),(2,2,2),(3,3,3);
insert into b values (1,1,1),(10,10,10),(11,11,11),(12,12,12);
insert into c values (1,1,1),(2,2,2);

analyze table a,b,c with 1 samplerate;

做如下查询:

select * from a inner join b on a.c1=b.id left join c on a.c2=c.id;

想要的最佳连接顺序为a inner join b left join c,但是查看执行计划,优化器给的顺序为a left join c inner join b:

mysql> show variables like 'tidb_enable_outer_join_reorder';
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| tidb_enable_outer_join_reorder | ON    |
+--------------------------------+-------+
1 row in set (0.00 sec)

mysql> explain select * from a inner join b on a.c1=b.id left join c on a.c2=c.id;
+-----------------------------------+---------+-----------+---------------+---------------------------------------------------------------------------------------------------+
| id                                | estRows | task      | access object | operator info                                                                                     |
+-----------------------------------+---------+-----------+---------------+---------------------------------------------------------------------------------------------------+
| Projection_12                     | 3.00    | root      |               | test.a.id, test.a.c1, test.a.c2, test.b.id, test.b.c1, test.b.c2, test.c.id, test.c.c1, test.c.c2 |
| └─HashJoin_23                     | 3.00    | root      |               | inner join, equal:[eq(test.a.c1, test.b.id)]                                                      |
|   ├─HashJoin_33(Build)            | 3.00    | root      |               | left outer join, equal:[eq(test.a.c2, test.c.id)]                                                 |
|   │ ├─TableReader_39(Build)       | 2.00    | root      |               | data:TableFullScan_38                                                                             |
|   │ │ └─TableFullScan_38          | 2.00    | cop[tikv] | table:c       | keep order:false                                                                                  |
|   │ └─TableReader_37(Probe)       | 3.00    | root      |               | data:Selection_36                                                                                 |
|   │   └─Selection_36              | 3.00    | cop[tikv] |               | not(isnull(test.a.c1))                                                                            |
|   │     └─TableFullScan_35        | 3.00    | cop[tikv] | table:a       | keep order:false                                                                                  |
|   └─TableReader_41(Probe)         | 4.00    | root      |               | data:TableFullScan_40                                                                             |
|     └─TableFullScan_40            | 4.00    | cop[tikv] | table:b       | keep order:false                                                                                  |
+-----------------------------------+---------+-----------+---------------+---------------------------------------------------------------------------------------------------+
10 rows in set (0.00 sec)

mysql> explain select /*+ leading(a,b,c) */ * from a inner join b on a.c1=b.id left join c on a.c2=c.id;
+--------------------------------+---------+-----------+---------------+---------------------------------------------------+
| id                             | estRows | task      | access object | operator info                                     |
+--------------------------------+---------+-----------+---------------+---------------------------------------------------+
| HashJoin_20                    | 3.00    | root      |               | left outer join, equal:[eq(test.a.c2, test.c.id)] |
| ├─TableReader_39(Build)        | 2.00    | root      |               | data:TableFullScan_38                             |
| │ └─TableFullScan_38           | 2.00    | cop[tikv] | table:c       | keep order:false                                  |
| └─HashJoin_32(Probe)           | 3.00    | root      |               | inner join, equal:[eq(test.a.c1, test.b.id)]      |
|   ├─TableReader_35(Build)      | 3.00    | root      |               | data:Selection_34                                 |
|   │ └─Selection_34             | 3.00    | cop[tikv] |               | not(isnull(test.a.c1))                            |
|   │   └─TableFullScan_33       | 3.00    | cop[tikv] | table:a       | keep order:false                                  |
|   └─TableReader_37(Probe)      | 4.00    | root      |               | data:TableFullScan_36                             |
|     └─TableFullScan_36         | 4.00    | cop[tikv] | table:b       | keep order:false                                  |
+--------------------------------+---------+-----------+---------------+---------------------------------------------------+
9 rows in set (0.00 sec)

从实际数据可以看到如果先做a inner join b 那么可以过滤很多无效数据,然后在和c关联得到最终结果。
但是如果a left join c,还是会保留不少于a表总记录数的数据,然后和b关联,整体上要比先做a inner join b多扫描较多数据。

因此是否可以做这样优化:对于a inner join b left join c这种存在多个inner join和outer join的多表关联的情况下,如果能确定a inner join b结果集不高于a,那么应该优先做inner join来更早的过滤更多数据,减少数据扫描?

1 个赞

如果能够确定 a INNER JOIN b 的结果集不会超过 a 表的总记录数,那么先执行这一步连接确实可以有效地减少数据量,尤其是在 b 表的关联字段是主键的情况下。这样的优化可以减少后续 LEFT JOIN c 操作的数据量,从而提高整体的查询效率。

如果能尽早把数据量降下来,再关联其他表,一般还是尽可能在关联之前先降低表数据量。

为什么从这里得到的测试结果是a和b先join,再和cjoin
6.5.3版本

你这创建表后立刻搜集统计信息不准,被系统double了。重新搜集下统计信息。