A question about merge join

To speed things up, please provide the following information; a clear problem description gets resolved faster:
【TiDB Environment】
4.0.16

【Overview】 Scenario + problem summary
1. After upgrading the database from 4.0.8 to 4.0.16, SQL execution plans changed: table-join statements that previously used hash join now run with both hash join and merge join.
2. When executed manually the SQL always uses hash join, but the same SQL issued from application sessions uses both merge join and hash join.
3. After creating SQL bindings, merge join occurs less frequently, but it still appears.

Note: statistics have been collected, the indexes are all in order, and neither table structures nor indexes were changed across the upgrade.

Merge join cannot run in parallel and consumes relatively little memory. Could heavy system load be pushing the optimizer toward merge join as the cheaper plan? If so, is there a parameter that steers SQL away from choosing merge join?

We would like table joins to use only hash join, never merge join. How can this be achieved?
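To my knowledge, TiDB 4.0 has no single switch that globally disables merge join, but an optimizer hint embedded in the statement (or wrapped in a binding, as discussed later in this thread) can steer the choice. A minimal sketch, reusing the anonymized table aliases from the SQL in this thread; the hint expresses a strong preference, not an absolute guarantee:

```sql
-- The HASH_JOIN hint asks the optimizer to use hash join for the
-- listed tables when joining them.
SELECT /*+ HASH_JOIN(wpo, wpi, pri) */
  sum(wpi.volume)        AS totalVolum,
  count(DISTINCT wpo.id) AS totalPacNum
FROM wpo
  INNER JOIN wpi ON wpi.package_no = wpo.package_no AND wpi.is_delete = 0
  INNER JOIN pri ON wpo.code = pri.code AND wpo.package_no = pri.package_no
WHERE wpo.is_delete = 0;
```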

【Background】 Operations performed
Upgraded TiDB from 4.0.8 to 4.0.16

【Symptoms】 Business and database symptoms
SQL execution plans changed

【Problem】 Current issue

【Business impact】
merge join makes the SQL run about 10x slower

【TiDB Version】
4.0.16

【Application software and version】

【Attachments】 Relevant logs and configuration

  • TiUP Cluster Display output
  • TiUP Cluster Edit Config output

Monitoring (https://metricstool.pingcap.com/)

  • TiDB-Overview Grafana dashboard
  • TiDB Grafana dashboard
  • TiKV Grafana dashboard
  • PD Grafana dashboard
  • Logs of the relevant components (covering 1 hour before and after the problem)

For performance-tuning or troubleshooting questions, please download and run the diagnostic script, then select all of the terminal output and paste it here.

Check the table health scores; table health affects execution plan selection.
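Table health can be inspected with `SHOW STATS_HEALTHY`. A sketch, assuming the real table names from the binding later in the thread; `your_db` is a hypothetical schema name:

```sql
-- A health score well below 100 usually means statistics are stale
-- enough to distort plan choice; re-analyze those tables.
SHOW STATS_HEALTHY
WHERE Db_name = 'your_db'   -- hypothetical schema name
  AND Table_name IN ('waybill_package_status_out_transfer',
                     'waybill_package_info',
                     'package_route_info');
```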

@db_user @h5n1 @qizheng @spc_monkey

Could you folks help take a look?

Please paste the SQL, the table schemas, and the execution plans.

I'll skip the table schemas; all the needed indexes are in place.

SQL:
create GLOBAL BINDING FOR
SELECT
sum(wpi.volume) AS totalVolum,
count(DISTINCT wpo.id) AS totalPacNum
FROM
wpo
INNER JOIN wpi ON wpi.package_no = wpo.package_no
AND wpi.is_delete = 0
INNER JOIN pri ON wpo.code = pri.code
AND wpo.package_no = pri.package_no
AND pri.type NOT IN (0, 6)
AND pri.finished = 0
AND pri.invalid = 0
WHERE
wpo.is_delete = 0
AND wpo.code != wpo.to_code
AND wpo.code IN ('2340')
AND wpo.status IN (
2,
11,
14
)
AND wpo.create_time >= '2022-01-01 00:00:00'
AND wpo.create_time <= '2022-04-14 21:00:00'

The incorrect execution plan:
```
id task estRows actRows execution info memory disk
StreamAgg_14 root 1 1 time:20m18.4s, loops:2 1.76 KB N/A
└─IndexJoin_21 root 0.06 9 time:20m18.4s, loops:2, inner:{total:3.09ms, concurrency:4, task:1, construct:47.9µs, fetch:3.03ms, build:8.19µs}, probe:7.34µs 179.9 KB N/A
├─MergeJoin_43 root 0.06 9 time:20m18.4s, loops:3 14.5 MB 0 Bytes
│ ├─IndexLookUp_72 root 0.05 11550 time:562.1ms, loops:13, index_task: {total_time: 21.6ms, fetch_handle: 20ms, build: 1.59ms, wait: 22.6µs}, table_task: {total_time: 1.04s, num: 5, concurrency: 4} 1.58 MB N/A
│ │ ├─IndexRangeScan_69 cop[tikv] 0.79 31769 time:19.7ms, loops:34, cop_task: {num: 1, max: 19.6ms, proc_keys: 31769, tot_proc: 19ms, rpc_num: 1, rpc_time: 19.6ms, copr_cache: disabled}, tikv_task:{time:18ms, loops:36}, scan_detail: {total_process_keys: 31769, total_keys: 31770} N/A N/A
│ │ └─Selection_71 cop[tikv] 0.05 11550 time:949.8ms, loops:17, cop_task: {num: 24, max: 538ms, min: 822µs, avg: 49.3ms, p95: 291.4ms, max_proc_keys: 8750, p95_proc_keys: 2778, tot_proc: 768ms, tot_wait: 396ms, rpc_num: 24, rpc_time: 1.18s, copr_cache: disabled}, tikv_task:{proc max:16ms, min:0s, p80:5ms, p95:7ms, iters:116, tasks:24}, scan_detail: {total_process_keys: 31769, total_keys: 32556} N/A N/A
│ │ └─TableRowIDScan_70 cop[tikv] 0.79 31769 tikv_task:{proc max:14ms, min:0s, p80:5ms, p95:7ms, iters:116, tasks:24} N/A N/A
│ └─Projection_82 root 1.30 168580 time:2.41s, loops:5239, Concurrency:OFF 3.91 MB N/A
│ └─IndexLookUp_81 root 1.30 168580 time:2.4s, loops:5239, index_task: {total_time: 1.37s, fetch_handle: 87.7ms, build: 17ms, wait: 1.27s}, table_task: {total_time: 8.53s, num: 14, concurrency: 4} 10.3 MB N/A
│ ├─IndexRangeScan_78 cop[tikv] 10.00 214358 time:83.9ms, loops:213, cop_task: {num: 2, max: 83.6ms, min: 17ms, avg: 50.3ms, p95: 83.6ms, max_proc_keys: 184557, p95_proc_keys: 184557, tot_proc: 93ms, tot_wait: 1ms, rpc_num: 2, rpc_time: 100.6ms, copr_cache: disabled}, tikv_task:{proc max:73ms, min:15ms, p80:73ms, p95:73ms, iters:219, tasks:2}, scan_detail: {total_process_keys: 214358, total_keys: 214360} N/A N/A
│ └─Selection_80 cop[tikv] 1.30 168580 time:7.98s, loops:183, cop_task: {num: 1046, max: 702.4ms, min: 415.6µs, avg: 42.1ms, p95: 164.2ms, max_proc_keys: 7171, p95_proc_keys: 757, tot_proc: 30s, tot_wait: 12.8s, rpc_num: 1048, rpc_time: 44s, copr_cache: disabled}, tikv_task:{proc max:179ms, min:0s, p80:7ms, p95:17ms, iters:2610, tasks:1046}, scan_detail: {total_process_keys: 214358, total_keys: 214361} N/A N/A
│ └─TableRowIDScan_79 cop[tikv] 10.00 214358 tikv_task:{proc max:179ms, min:0s, p80:7ms, p95:17ms, iters:2610, tasks:1046} N/A N/A
└─IndexLookUp_20 root 1 9 time:2.93ms, loops:2, index_task: {total_time: 1.38ms, fetch_handle: 1.37ms, build: 1µs, wait: 13.4µs}, table_task: {total_time: 7.02ms, num: 1, concurrency: 4} 10.4 KB N/A
├─IndexRangeScan_17 cop[tikv] 1 9 time:1.36ms, loops:3, cop_task: {num: 1, max: 1.33ms, proc_keys: 9, rpc_num: 1, rpc_time: 1.31ms, copr_cache: disabled}, tikv_task:{time:0s, loops:1}, scan_detail: {total_process_keys: 9, total_keys: 9} N/A N/A
└─Selection_19 cop[tikv] 1 9 time:1.34ms, loops:2, cop_task: {num: 1, max: 1.26ms, proc_keys: 9, rpc_num: 1, rpc_time: 1.25ms, copr_cache: disabled}, tikv_task:{time:0s, loops:1}, scan_detail: {total_process_keys: 9, total_keys: 10} N/A N/A
└─TableRowIDScan_18 cop[tikv] 1 9 tikv_task:{time:0s, loops:1} N/A N/A
```

The expected execution plan:
```
id task estRows actRows execution info memory disk
StreamAgg_14 root 1 1 time:1m12.8s, loops:2 49.8 KB N/A
└─IndexHashJoin_139 root 219491.87 4445 time:1m12.8s, loops:6, inner:{total:2.08s, concurrency:4, task:5, construct:6.25ms, fetch:2.06s, build:1.53ms, join:13.8ms} 734.4 KB N/A
├─HashJoin_60 root 219491.87 4550 time:1m12.4s, loops:10, build_hash_table:{total:3.2s, fetch:3.19s, build:15.6ms}, probe:{concurrency:5, total:6m1.9s, max:1m12.4s, probe:724.2ms, fetch:6m1.2s} 3.66 MB 0 Bytes
│ ├─IndexLookUp_101 root 128175.77 44371 time:3.2s, loops:45, index_task: {total_time: 2.4s, fetch_handle: 210.1ms, build: 27.1µs, wait: 2.19s}, table_task: {total_time: 11.6s, num: 16, concurrency: 4} 6.63 MB N/A
│ │ ├─IndexRangeScan_98 cop[tikv] 171668.24 240396 time:200.2ms, loops:238, cop_task: {num: 2, max: 200.1ms, min: 10.3ms, avg: 105.2ms, p95: 200.1ms, max_proc_keys: 231008, p95_proc_keys: 231008, tot_proc: 201ms, tot_wait: 2ms, rpc_num: 2, rpc_time: 210.3ms, copr_cache: disabled}, tikv_task:{proc max:186ms, min:9ms, p80:186ms, p95:186ms, iters:244, tasks:2}, scan_detail: {total_process_keys: 240396, total_keys: 516428} N/A N/A
│ │ └─Selection_100 cop[tikv] 128175.77 44371 time:11.3s, loops:68, cop_task: {num: 2718, max: 832.5ms, min: 356.5µs, avg: 38.1ms, p95: 328.8ms, max_proc_keys: 6049, p95_proc_keys: 344, tot_proc: 16.5s, tot_wait: 1m25.2s, rpc_num: 2718, rpc_time: 1m43.5s, copr_cache: disabled}, tikv_task:{proc max:33ms, min:0s, p80:1ms, p95:4ms, iters:4097, tasks:2718}, scan_detail: {total_process_keys: 240396, total_keys: 259225} N/A N/A
│ │ └─TableRowIDScan_99 cop[tikv] 171668.24 240396 tikv_task:{proc max:32ms, min:0s, p80:1ms, p95:4ms, iters:4097, tasks:2718} N/A N/A
│ └─IndexLookUp_124 root 1377393.02 1766291 time:1m9.4s, loops:1726, index_task: {total_time: 1m11.5s, fetch_handle: 296.1ms, build: 1.19ms, wait: 1m11.2s}, table_task: {total_time: 4m48.1s, num: 527, concurrency: 4} 93.3 MB N/A
│ ├─IndexRangeScan_121 cop[tikv] 10547632.28 10708231 time:155.6ms, loops:10467, cop_task: {num: 13, max: 758.1ms, min: 147.8ms, avg: 444.9ms, p95: 758.1ms, max_proc_keys: 1295624, p95_proc_keys: 1295624, tot_proc: 5.43s, tot_wait: 11ms, rpc_num: 14, rpc_time: 5.78s, copr_cache: disabled}, tikv_task:{proc max:693ms, min:132ms, p80:495ms, p95:693ms, iters:10516, tasks:13}, scan_detail: {total_process_keys: 10708231, total_keys: 10823104} N/A N/A
│ └─Selection_123 cop[tikv] 1377393.02 1766291 time:4m42.5s, loops:2719, cop_task: {num: 12576, max: 1.49s, min: 331.9µs, avg: 75ms, p95: 343.5ms, max_proc_keys: 20480, p95_proc_keys: 5369, tot_proc: 6m44.5s, tot_wait: 8m41.8s, rpc_num: 12577, rpc_time: 15m43s, copr_cache: disabled}, tikv_task:{proc max:538ms, min:0s, p80:12ms, p95:92ms, iters:35038, tasks:12576}, scan_detail: {total_process_keys: 10708231, total_keys: 10713826} N/A N/A
│ └─TableRowIDScan_122 cop[tikv] 10547632.28 10708231 tikv_task:{proc max:538ms, min:0s, p80:12ms, p95:91ms, iters:35038, tasks:12576} N/A N/A
└─IndexLookUp_136 root 1 4441 time:2.05s, loops:10, index_task: {total_time: 825.7ms, fetch_handle: 825.7ms, build: 6.12µs, wait: 40.7µs}, table_task: {total_time: 4.53s, num: 5, concurrency: 20} 95.5 KB N/A
├─IndexRangeScan_133 cop[tikv] 1 4546 time:825.6ms, loops:14, cop_task: {num: 105, max: 332.7ms, min: 465.6µs, avg: 16.9ms, p95: 120.5ms, max_proc_keys: 432, p95_proc_keys: 320, tot_proc: 112ms, tot_wait: 1.61s, rpc_num: 105, rpc_time: 1.78s, copr_cache: disabled}, tikv_task:{proc max:8ms, min:0s, p80:1ms, p95:5ms, iters:148, tasks:105}, scan_detail: {total_process_keys: 4546, total_keys: 4546} N/A N/A
└─Selection_135 cop[tikv] 1 4441 time:1.22s, loops:10, cop_task: {num: 274, max: 372.5ms, min: 322.9µs, avg: 25.8ms, p95: 198.7ms, max_proc_keys: 497, p95_proc_keys: 67, tot_proc: 732ms, tot_wait: 6.16s, rpc_num: 277, rpc_time: 7.08s, copr_cache: disabled}, backoff{regionMiss: 2ms}, tikv_task:{proc max:10ms, min:0s, p80:1ms, p95:2ms, iters:330, tasks:274}, scan_detail: {total_process_keys: 4546, total_keys: 5633} N/A N/A
└─TableRowIDScan_134 cop[tikv] 1 4546 tikv_task:{proc max:10ms, min:0s, p80:1ms, p95:2ms, iters:330, tasks:274} N/A N/A
```

There are no table schemas, and the access object column was stripped from the execution plans. The gap between actRows and estRows is still fairly large; re-collect the statistics.

It is not the case that every execution plan is wrong: the application ran 100 of these statements concurrently, 99 used hash join and only 1 used merge join. Is that really closely related to statistics?

Execution plans are generated from statistics, so their accuracy affects whether the optimal plan is chosen. That said, a bug in the optimizer itself could also produce an incorrect plan.

Please post the table health scores.

They are all 99 or 98.

A question: if we use a binding to force hash_join, is hash join then guaranteed? We have created the binding, but merge_join still occasionally appears. Is our binding not taking effect, or is a binding simply unable to force the plan?

create GLOBAL BINDING FOR
SELECT
  sum(wpi.package_volume) AS totalVolum,
  count(DISTINCT wpo.id) AS totalPacNum
FROM
  waybill_package_status_out_transfer wpo
  INNER JOIN waybill_package_info wpi ON wpi.package_no = wpo.package_no
  AND wpi.is_delete = 0
  INNER JOIN package_route_info pri ON wpo.cur_transit_center_code = pri.cur_transit_center_code
  AND wpo.package_no = pri.package_no
  AND pri.transport_type NOT IN (0, 6)
  AND pri.route_finished = 0
  AND pri.invalid = 0
WHERE
  wpo.is_delete = 0
  AND wpo.cur_transit_center_code != wpo.to_transit_center_code
  AND wpo.cur_transit_center_code IN ('2340')
  AND wpo.status IN (
    2,
    11,
    14
  )
  AND wpo.create_time >= '2022-01-01 00:00:00'
  AND wpo.create_time <= '2022-04-14 21:00:00'
USING
SELECT
  /*+ HASH_JOIN(wpo, pri) */
  sum(wpi.package_volume) AS totalVolum,
  count(DISTINCT wpo.id) AS totalPacNum
FROM
  waybill_package_status_out_transfer wpo FORCE INDEX(idx_transit_crtime)
  INNER JOIN waybill_package_info wpi ON wpi.package_no = wpo.package_no
  AND wpi.is_delete = 0
  INNER JOIN package_route_info pri ON wpo.cur_transit_center_code = pri.cur_transit_center_code
  AND wpo.package_no = pri.package_no
  AND pri.transport_type NOT IN (0, 6)
  AND pri.route_finished = 0
  AND pri.invalid = 0
WHERE
  wpo.is_delete = 0
  AND wpo.cur_transit_center_code != wpo.to_transit_center_code
  AND wpo.cur_transit_center_code IN ('2340')
  AND wpo.status IN (
    2,
    11,
    14
  )
  AND wpo.create_time >= '2022-01-01 00:00:00'
  AND wpo.create_time <= '2022-04-14 21:00:00';

In the vast majority of cases a binding does take effect. Plan binding is not implemented by storing the physically optimized execution plan; instead, the normalized abstract syntax tree (AST) plus the hints is stored. When a SQL statement executes, TiDB compares it against the stored bindings to find a match, restores the AST + hints back into SQL, and generates the plan from that. If some condition prevents the hint from being satisfied, the bound plan cannot be followed.
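Whether a given execution actually matched the binding can be checked directly. A minimal sketch:

```sql
-- List the stored bindings, their status, and the normalized SQL
-- they will be matched against.
SHOW GLOBAL BINDINGS;

-- After running the query in the current session, check whether the
-- last plan came from a binding (1 = yes, 0 = no).
SELECT @@last_plan_from_binding;
```

If `last_plan_from_binding` is 0 for the slow executions, the application's SQL is not matching the stored normalized form (e.g. it differs in structure, not just in parameter values).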


Try this: first create a binding on 4.0.8 and look at the plan text produced under the binding.
Then create a binding on 4.0.16, reproduce a run where it does not take effect, and look at that plan text.

Compare the two versions and see what differs.


I suspect that, after parameter normalization, the SQL failed to match any binding.

But incomplete binding coverage is not our main concern. What I want to know is why, after the upgrade, the optimizer shifted from hash join toward merge join, a join method that consumes fewer resources. Does it factor the database's overall load into the cost model?

I am not sure exactly how the cost is computed either. I suggest re-collecting the statistics first; sometimes that alone solves the problem, even when the health scores are high.
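Re-collecting statistics, as suggested here and earlier in the thread, is a sketch like the following (table names taken from the binding above):

```sql
-- Re-collect statistics on the three joined tables. ANALYZE is
-- resource-intensive on large tables, so run it off-peak.
ANALYZE TABLE waybill_package_status_out_transfer;
ANALYZE TABLE waybill_package_info;
ANALYZE TABLE package_route_info;
```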

The optimizer has probably chosen a wrong plan. Try collecting statistics first, or bind the execution plan.

The problem has been resolved: we rolled the version back to 4.0.8, and no plan instability has occurred since. I do not know what optimizer changes 4.0.16 made compared with 4.0.8.

Patch releases only contain bug fixes; they do not introduce new features.

If convenient, please provide detailed information so we can look into the plan instability:

1. Table schemas
2. SQL
3. Execution plans
4. An export of the statistics (one copy from before the upgrade and one from after)

This topic was automatically closed 1 minute after the last reply. New replies are no longer allowed.