【 TiDB 使用环境】测试
【 TiDB 版本】v7.5.1
对于一条SQL:select * from a inner join b on a.c1=b.id left join c on a.c2=c.id;
如果a join b的结果集大于a join c的结果集,那么优化器会评估执行顺序为:a left join c inner join b ,这常见于b表的关联条件上存在较多重复数据的情况下,但是如果能确定a inner join b的结果集最多不大于count(a) (比如b表关联字段为主键)那么先做a inner b在绝大多数情况下都是最优的。
但从测试来看TiDB并没有做这块的优化,测试如下:
测试SQL:
drop table if exists a;
drop table if exists b;
drop table if exists c;
create table a(id int,c1 int,c2 int,primary key (id));
create table b(id int,c1 int,c2 int,primary key (id));
create table c(id int,c1 int,c2 int,primary key (id));
insert into a values (1,1,1),(2,2,2),(3,3,3);
insert into b values (1,1,1),(10,10,10),(11,11,11),(12,12,12);
insert into c values (1,1,1),(2,2,2);
analyze table a,b,c with 1 samplerate;
做如下查询:
select * from a inner join b on a.c1=b.id left join c on a.c2=c.id;
想要的最佳连接顺序为a inner join b left join c,但是查看执行计划,优化器给的顺序为a left join c inner join b:
mysql> show variables like 'tidb_enable_outer_join_reorder';
+--------------------------------+-------+
| Variable_name | Value |
+--------------------------------+-------+
| tidb_enable_outer_join_reorder | ON |
+--------------------------------+-------+
1 row in set (0.00 sec)
mysql> explain select * from a inner join b on a.c1=b.id left join c on a.c2=c.id;
+-----------------------------------+---------+-----------+---------------+---------------------------------------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+-----------------------------------+---------+-----------+---------------+---------------------------------------------------------------------------------------------------+
| Projection_12 | 3.00 | root | | test.a.id, test.a.c1, test.a.c2, test.b.id, test.b.c1, test.b.c2, test.c.id, test.c.c1, test.c.c2 |
| └─HashJoin_23 | 3.00 | root | | inner join, equal:[eq(test.a.c1, test.b.id)] |
| ├─HashJoin_33(Build) | 3.00 | root | | left outer join, equal:[eq(test.a.c2, test.c.id)] |
| │ ├─TableReader_39(Build) | 2.00 | root | | data:TableFullScan_38 |
| │ │ └─TableFullScan_38 | 2.00 | cop[tikv] | table:c | keep order:false |
| │ └─TableReader_37(Probe) | 3.00 | root | | data:Selection_36 |
| │ └─Selection_36 | 3.00 | cop[tikv] | | not(isnull(test.a.c1)) |
| │ └─TableFullScan_35 | 3.00 | cop[tikv] | table:a | keep order:false |
| └─TableReader_41(Probe) | 4.00 | root | | data:TableFullScan_40 |
| └─TableFullScan_40 | 4.00 | cop[tikv] | table:b | keep order:false |
+-----------------------------------+---------+-----------+---------------+---------------------------------------------------------------------------------------------------+
10 rows in set (0.00 sec)
mysql> explain select /*+ leading(a,b,c) */ * from a inner join b on a.c1=b.id left join c on a.c2=c.id;
+--------------------------------+---------+-----------+---------------+---------------------------------------------------+
| id | estRows | task | access object | operator info |
+--------------------------------+---------+-----------+---------------+---------------------------------------------------+
| HashJoin_20 | 3.00 | root | | left outer join, equal:[eq(test.a.c2, test.c.id)] |
| ├─TableReader_39(Build) | 2.00 | root | | data:TableFullScan_38 |
| │ └─TableFullScan_38 | 2.00 | cop[tikv] | table:c | keep order:false |
| └─HashJoin_32(Probe) | 3.00 | root | | inner join, equal:[eq(test.a.c1, test.b.id)] |
| ├─TableReader_35(Build) | 3.00 | root | | data:Selection_34 |
| │ └─Selection_34 | 3.00 | cop[tikv] | | not(isnull(test.a.c1)) |
| │ └─TableFullScan_33 | 3.00 | cop[tikv] | table:a | keep order:false |
| └─TableReader_37(Probe) | 4.00 | root | | data:TableFullScan_36 |
| └─TableFullScan_36 | 4.00 | cop[tikv] | table:b | keep order:false |
+--------------------------------+---------+-----------+---------------+---------------------------------------------------+
9 rows in set (0.00 sec)
从实际数据可以看到如果先做a inner join b 那么可以过滤很多无效数据,然后在和c关联得到最终结果。
但是如果a left join c,还是会保留不少于a表总记录数的数据,然后和b关联,整体上要比先做a inner join b多扫描较多数据。
因此是否可以做这样优化:对于a inner join b left join c这种存在多个inner join和outer join的多表关联的情况下,如果能确定a inner join b结果集不高于a,那么应该优先做inner join来更早的过滤更多数据,减少数据扫描?