join reorder没有对等值关联条件做转换

人如其名 · 2024 年5 月 25 日 08:05

【 TiDB 使用环境】测试
【 TiDB 版本】v7.5.1

tpch的Q5执行效率较慢，其中有一点是没有对等值关联条件做转换，帖子：因优化器问题导致TPCH的Q5语句执行过慢

但是上面这个帖子内容较多，没有聚焦该问题，因此这里简化测试该问题。
创建相关表结构：

drop table if exists a;
drop table if exists b;
drop table if exists c;

create table a(id int,c1 int,c2 int,primary key (id));
create table b(id int,c1 int,c2 int,primary key (id));
create table c(id int,c1 int,c2 int,primary key (id));

insert into a values (1,1,1),(2,2,2),(3,3,3);
insert into b values (1,1,1),(10,10,10),(11,11,11),(12,12,12);
insert into c values (1,1,1),(2,2,2);

analyze table a,b,c with 1 samplerate;

查看SQL语句：select count(*) from a,b,c where a.id=b.id and b.id=c.id and c.c1=a.c1 的执行计划：

mysql> explain select count(*) from a,b,c where a.id=b.id and b.id=c.id and c.c1=a.c1;
+------------------------------------+---------+-----------+---------------+-----------------------------------------------------------------------+
| id                                 | estRows | task      | access object | operator info                                                         |
+------------------------------------+---------+-----------+---------------+-----------------------------------------------------------------------+
| StreamAgg_15                       | 1.00    | root      |               | funcs:count(1)->Column#10                                             |
| └─HashJoin_45                      | 2.00    | root      |               | inner join, equal:[eq(test.a.id, test.b.id) eq(test.c.id, test.b.id)] |
|   ├─HashJoin_27(Build)             | 2.00    | root      |               | inner join, equal:[eq(test.c.c1, test.a.c1)]                          |
|   │ ├─TableReader_30(Build)        | 2.00    | root      |               | data:Selection_29                                                     |
|   │ │ └─Selection_29               | 2.00    | cop[tikv] |               | not(isnull(test.c.c1))                                                |
|   │ │   └─TableFullScan_28         | 2.00    | cop[tikv] | table:c       | keep order:false                                                      |
|   │ └─TableReader_33(Probe)        | 3.00    | root      |               | data:Selection_32                                                     |
|   │   └─Selection_32               | 3.00    | cop[tikv] |               | not(isnull(test.a.c1))                                                |
|   │     └─TableFullScan_31         | 3.00    | cop[tikv] | table:a       | keep order:false                                                      |
|   └─TableReader_35(Probe)          | 4.00    | root      |               | data:TableFullScan_34                                                 |
|     └─TableFullScan_34             | 4.00    | cop[tikv] | table:b       | keep order:false                                                      |
+------------------------------------+---------+-----------+---------------+-----------------------------------------------------------------------+
11 rows in set (0.01 sec)

可以看到优化器优先选择a join c 最后和b关联，但是a join c的condition条件只有：equal:[eq(test.c.c1, test.a.c1) 并没有感知a.id=b.id and b.id=c.id推出a.id=c.id，因此没有提前做a.id=c.id的过滤，手工优化为：


mysql> explain select count(*) from a,b,c where a.id=b.id and b.id=c.id and c.c1=a.c1 and a.id=c.id;
+------------------------------------+---------+-----------+---------------+-----------------------------------------------------------------------+
| id                                 | estRows | task      | access object | operator info                                                         |
+------------------------------------+---------+-----------+---------------+-----------------------------------------------------------------------+
| StreamAgg_15                       | 1.00    | root      |               | funcs:count(1)->Column#10                                             |
| └─HashJoin_65                      | 2.00    | root      |               | inner join, equal:[eq(test.a.id, test.b.id) eq(test.c.id, test.b.id)] |
|   ├─HashJoin_47(Build)             | 2.00    | root      |               | inner join, equal:[eq(test.c.c1, test.a.c1) eq(test.c.id, test.a.id)] |
|   │ ├─TableReader_50(Build)        | 2.00    | root      |               | data:Selection_49                                                     |
|   │ │ └─Selection_49               | 2.00    | cop[tikv] |               | not(isnull(test.c.c1))                                                |
|   │ │   └─TableFullScan_48         | 2.00    | cop[tikv] | table:c       | keep order:false                                                      |
|   │ └─TableReader_53(Probe)        | 3.00    | root      |               | data:Selection_52                                                     |
|   │   └─Selection_52               | 3.00    | cop[tikv] |               | not(isnull(test.a.c1))                                                |
|   │     └─TableFullScan_51         | 3.00    | cop[tikv] | table:a       | keep order:false                                                      |
|   └─TableReader_55(Probe)          | 4.00    | root      |               | data:TableFullScan_54                                                 |
|     └─TableFullScan_54             | 4.00    | cop[tikv] | table:b       | keep order:false                                                      |
+------------------------------------+---------+-----------+---------------+-----------------------------------------------------------------------+
11 rows in set (0.00 sec)

在复杂SQL语句中，人工做这种变换较为困难，希望优化器可以做下关联字段等值转换方面的优化。

yytest · 2024 年5 月 26 日 04:27

如果您需要在短期内提高查询性能，可以考虑使用TiDB提供的Hint机制来影响优化器的决策。虽然这并不是一个长期的解决方案，但它可以帮助您在特定情况下获得更好的性能。例如，您可以使用类似以下的Hint来强制优化器按照您期望的方式执行查询：

SELECT /*+ JOIN_ORDER(a, c, b) */ COUNT(*)
FROM a, b, c
WHERE a.id = b.id AND b.id = c.id AND c.c1 = a.c1;

在这个例子中，JOIN_ORDER(a, c, b) Hint告诉优化器首先执行 a 和 c 的连接，然后再与 b 连接。

FutureDB · 2024 年5 月 26 日 07:18

目前TiDB的优化器还没有那么智能，很多时候这种并不一定能推断出来，写SQL能够明示还是明示。

Raymond · 2024 年5 月 26 日 08:11

1.从MySQL和PG的测试来看，都支持自动推到出来a.id=c.id

2.有个疑问？

a和c表关联后，再和b表关联，应该只需要满足b.id = ``c.id` 就行了吧或者b.id=a.id
不需要 eq(test.a.id, test.b.id) eq(test.c.id, test.b.id)] 这两个条件同时去判断了吧？

人如其名 · 2024 年5 月 26 日 14:40

a.c1=c.c1 and b.id=c.id
a.c1=c.c1 and b.id=c.id and a.id=b.id
明显不相同吧，所以需要。

Raymond · 2024 年5 月 26 日 14:48

我的意思是语句的join 的时候
a和c的join 条件为a.c1 = c.c1 and a.id = c.id
然和和b表去关联是b.id=c.id
是这个意思

人如其名 · 2024 年5 月 26 日 14:54

那按道理不需要

Raymond · 2024 年5 月 26 日 14:57

是的，我就是这个意思所以我才觉得tidb 那里的又多判断了一次

Chad · 2024 年5 月 26 日 15:40

目前没有join_order的hints.