HashAgg 导致TiDB内存占用过高

【 TiDB 使用环境】生产环境 or 测试环境 or POC
生产
【 TiDB 版本】
v5.1.0
【遇到的问题】
TiDB 内存占用过高,经查看起来是HashAgg 导致的, 如下图

PS 查看进程发现该实例占用内存达70%(物理机总共128G内存), 请问一下 这个情况有啥好的解决方法吗?
server-memory-quota 配置似乎不推荐生产使用
tidb_mem_quota_query 这个配置有点不合适, 我需要限制内存,但不希望查询oom报错
【复现路径】做过哪些操作出现的问题

【问题现象及影响】

【附件】

请提供各个组件的 version 信息,如 cdc/tikv,可通过执行 cdc version/tikv-server --version 获取。

只能找到这条 SQL,进行合理优化,好像没别的办法

如果可以接受sql执行慢一点 的话,可以改为stream agg

贴一下sql和执行计划看一下

agg 在CTE中, 没法用hint 绑定

SELECT
stats_date
FROM
(
WITH fusion_order AS (
SELECT
/*+ STREAM_AGG() /
stats_date,
union_id,
union_name,
zone_id,
zone_name,
org_id,
org_name,
(
CASE WHEN site_id = 1 THEN ‘one’ WHEN site_id = 5 THEN ‘two’ END
) AS org_type,
(
CASE WHEN shipment_type = 8 THEN ‘on-1’ WHEN shipment_type = 10 THEN ‘on-2’ WHEN (
(
shipment_type IN(7, 9)
AND (
order_from = ‘Shop_Lite’
OR order_type = ‘attr’
)
)
OR shipment_type = 1
) THEN ‘x1’ WHEN (
order_from != ‘Shop_Lite’
AND shipment_type = 7
AND order_type = ‘natural’
) THEN ‘x1’ WHEN (
order_from != ‘Shop_Lite’
AND shipment_type = 9
AND order_type = ‘natural’
) THEN ‘x3’ END
) AS shipment_type,
SUM(valid_ord_cnt) AS ord_cnt
FROM
nrdc_bi.ads_o2ofusion_ord_org_d
GROUP BY
1,
2,
3,
4,
5,
6,
7,
8,
9
),
fusion_order_org AS (
SELECT
/
+ STREAM_AGG() /
stats_date,
union_id,
union_name,
zone_id,
zone_name,
org_id,
org_name,
site_id,
SUM(valid_ord_cnt) AS ord_cnt
FROM
nrdc_bi.ads_o2ofusion_ord_org_d
GROUP BY
1,
2,
3,
4,
5,
6,
7,
8
),
retail_order AS (
SELECT
/
+ STREAM_AGG() */
t1.stats_date,
t1.union_id,
t1.union_name,
t1.zone_id,
t1.zone_name,
t1.org_id,
t1.org_name,
(
CASE WHEN t1.site_id = 1 THEN ‘x1’ WHEN t1.site_id = 5 THEN ‘x1’ END
) AS org_type,
‘x’ AS shipment_type,
SUM(t2.ord_cnt - t1.ord_cnt) AS ord_cnt
FROM
fusion_order_org t1
INNER JOIN nrdc_bi.ads_retail_ord_store_d t2 ON t1.stats_date = t2.stats_date
AND t1.org_id = t2.org_id
GROUP BY
1,
2,
3,
4,
5,
6,
7,
8,
9
)
SELECT
*
FROM
fusion_order
UNION ALL
SELECT
*
FROM
retail_order
) sql_model_virtual_table_new_27082_12067
where
stats_date like ‘%zghe%’
limit
1;

这个单独执行执行计划确实用了stream agg , 但是如果使用SPM 进行bind 就无法生效

SQL:
SELECT stats_date FROM ( WITH fusion_order AS ( SELECT stats_date, union_id, union_name, zone_id, zone_name, org_id, org_name, ( CASE WHEN site_id = 1 THEN ‘one’ WHEN site_id = 5 THEN ‘two’ END ) AS org_type, ( CASE WHEN shipment_type = 8 THEN ‘on-1’ WHEN shipment_type = 10 THEN ‘on-2’ WHEN ( ( shipment_type IN(7, 9) AND ( order_from = ‘Shop_Lite’ OR order_type = ‘attr’ ) ) OR shipment_type = 1 ) THEN ‘x1’ WHEN ( order_from != ‘Shop_Lite’ AND shipment_type = 7 AND order_type = ‘natural’ ) THEN ‘x1’ WHEN ( order_from != ‘Shop_Lite’ AND shipment_type = 9 AND order_type = ‘natural’ ) THEN ‘x3’ END ) AS shipment_type, SUM(valid_ord_cnt) AS ord_cnt FROM nrdc_bi.ads_o2ofusion_ord_org_d GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9 ), fusion_order_org AS ( SELECT stats_date, union_id, union_name, zone_id, zone_name, org_id, org_name, site_id, SUM(valid_ord_cnt) AS ord_cnt FROM nrdc_bi.ads_o2ofusion_ord_org_d GROUP BY 1, 2, 3, 4, 5, 6, 7, 8 ), retail_order AS ( SELECT t1.stats_date, t1.union_id, t1.union_name, t1.zone_id, t1.zone_name, t1.org_id, t1.org_name, ( CASE WHEN t1.site_id = 1 THEN ‘x1’ WHEN t1.site_id = 5 THEN ‘x1’ END ) AS org_type, ‘x’ AS shipment_type, SUM(t2.ord_cnt - t1.ord_cnt) AS ord_cnt FROM fusion_order_org t1 INNER JOIN nrdc_bi.ads_retail_ord_store_d t2 ON t1.stats_date = t2.stats_date AND t1.org_id = t2.org_id GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9 ) SELECT * FROM fusion_order UNION ALL SELECT * FROM retail_order ) sql_model_virtual_table_new_27082_12067 where stats_date like ‘%zghe%’ limit 1;

执行计划:

±-----------------------------------------±-----------±----------±------------------------------------------------±----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| id | estRows | task | access object | operator info
±-----------------------------------------±-----------±----------±------------------------------------------------±----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Limit_34 | 1.00 | root | | offset:0, count:1
| └─Union_35 | 1.00 | root | |
| ├─Limit_39 | 0.80 | root | | offset:0, count:1
| │ └─Selection_40 | 0.80 | root | | like(cast(nrdc_bi.ads_o2ofusion_ord_org_d.stats_date, var_string(10)), “%z%”, 92)
| │ └─CTEFullScan_50 | 1.00 | root | CTE:fusion_order | data:CTE_0
| └─Limit_54 | 0.80 | root | | offset:0, count:1
| └─Selection_55 | 0.80 | root | | like(cast(nrdc_bi.ads_o2ofusion_ord_org_d.stats_date, var_string(10)), “%z%”, 92)
| └─CTEFullScan_88 | 1.00 | root | CTE:retail_order | data:CTE_2
| CTE_0 | 1.00 | root | | Non-Recursive CTE
| └─Projection_41(Seed Part) | 9954.85 | root | | nrdc_bi.ads_o2ofusion_ord_org_d.stats_date, nrdc_bi.ads_o2ofusion_ord_org_d.union_id, nrdc_bi.ads_o2ofusion_ord_org_d.union_name, nrdc_bi.ads_o2ofusion_ord_org_d.zone_id, nrdc_bi.ads_o2ofusion_ord_org_d.zone_name, nrdc_bi.ads_o2ofusion_ord_org_d.org_id, nrdc_bi.
| └─HashAgg_46 | 9954.85 | root | | group by:Column#115, Column#116, nrdc_bi.ads_o2ofusion_ord_org_d.org_id, nrdc_bi.ads_o2ofusion_ord_org_d.org_name, nrdc_bi.ads_o2ofusion_ord_org_d.stats_date, nrdc_bi.ads_o2ofusion_ord_org_d.union_id, nrdc_bi.ads_o2ofusion_ord_org_d.union_name, nrdc_bi.ads_o2ofu
| └─TableReader_47 | 9954.85 | root | | data:HashAgg_42
| └─HashAgg_42 | 9954.85 | cop[tikv] | | group by:case(eq(cast(nrdc_bi.ads_o2ofusion_ord_org_d.shipment_type, double BINARY), 8), “美团闪购”, eq(cast(nrdc_bi.ads_o2ofusion_ord_org_d.shipment_type, double BINARY), 10), “京东到家”, or(and(or(eq(cast(nrdc_bi.ads_o2ofusion_ord_org_d.shipment_type, double B
| └─TableFullScan_45 | 1070944.00 | cop[tikv] | table:ads_o2ofusion_ord_org_d | keep order:false, stats:pseudo
| CTE_2 | 1.00 | root | | Non-Recursive CTE
| └─Projection_58(Seed Part) | 1.00 | root | | nrdc_bi.ads_o2ofusion_ord_org_d.stats_date, nrdc_bi.ads_o2ofusion_ord_org_d.union_id, nrdc_bi.ads_o2ofusion_ord_org_d.union_name, nrdc_bi.ads_o2ofusion_ord_org_d.zone_id, nrdc_bi.ads_o2ofusion_ord_org_d.zone_name, nrdc_bi.ads_o2ofusion_ord_org_d.org_id, nrdc_bi.
| └─HashAgg_59 | 1.00 | root | | group by:Column#152, Column#153, Column#154, Column#155, Column#156, Column#157, Column#158, Column#159, funcs:sum(Column#143)->Column#80, funcs:firstrow(Column#144)->nrdc_bi.ads_o2ofusion_ord_org_d.stats_date, funcs:firstrow(Column#145)->nrdc_bi.ads_o2ofusion_o
| └─Projection_87 | 0.80 | root | | minus(cast(nrdc_bi.ads_retail_ord_store_d.ord_cnt, decimal(20,0) BINARY), Column#65)->Column#143, nrdc_bi.ads_o2ofusion_ord_org_d.stats_date, nrdc_bi.ads_o2ofusion_ord_org_d.union_id, nrdc_bi.ads_o2ofusion_ord_org_d.union_name, nrdc_bi.ads_o2ofusion_ord_org_d.zo
| └─IndexJoin_63 | 0.80 | root | | inner join, inner:IndexLookUp_62, outer key:nrdc_bi.ads_o2ofusion_ord_org_d.stats_date, nrdc_bi.ads_o2ofusion_ord_org_d.org_id, inner key:nrdc_bi.ads_retail_ord_store_d.stats_date, nrdc_bi.ads_retail_ord_store_d.org_id, equal cond:eq(nrdc_bi.ads_o2ofusion_ord_or
| ├─Selection_74(Build) | 0.80 | root | | not(isnull(nrdc_bi.ads_o2ofusion_ord_org_d.org_id)), not(isnull(nrdc_bi.ads_o2ofusion_ord_org_d.stats_date))
| │ └─CTEFullScan_84 | 1.00 | root | CTE:t1 | data:CTE_1
| └─IndexLookUp_62(Probe) | 1.00 | root | |
| ├─IndexRangeScan_60(Build) | 1.00 | cop[tikv] | table:t2, index:uk_date_ord(stats_date, org_id) | range: decided by [eq(nrdc_bi.ads_retail_ord_store_d.stats_date, nrdc_bi.ads_o2ofusion_ord_org_d.stats_date) eq(nrdc_bi.ads_retail_ord_store_d.org_id, nrdc_bi.ads_o2ofusion_ord_org_d.org_id)], keep order:false, stats:pseudo
| └─TableRowIDScan_61(Probe) | 1.00 | cop[tikv] | table:t2 | keep order:false, stats:pseudo
| CTE_1 | 1.00 | root | | Non-Recursive CTE
| └─Projection_75(Seed Part) | 9954.85 | root | | nrdc_bi.ads_o2ofusion_ord_org_d.stats_date, nrdc_bi.ads_o2ofusion_ord_org_d.union_id, nrdc_bi.ads_o2ofusion_ord_org_d.union_name, nrdc_bi.ads_o2ofusion_ord_org_d.zone_id, nrdc_bi.ads_o2ofusion_ord_org_d.zone_name, nrdc_bi.ads_o2ofusion_ord_org_d.org_id, nrdc_bi.
| └─HashAgg_80 | 9954.85 | root | | group by:nrdc_bi.ads_o2ofusion_ord_org_d.org_id, nrdc_bi.ads_o2ofusion_ord_org_d.org_name, nrdc_bi.ads_o2ofusion_ord_org_d.site_id, nrdc_bi.ads_o2ofusion_ord_org_d.stats_date, nrdc_bi.ads_o2ofusion_ord_org_d.union_id, nrdc_bi.ads_o2ofusion_ord_org_d.union_name,
| └─TableReader_81 | 9954.85 | root | | data:HashAgg_76
| └─HashAgg_76 | 9954.85 | cop[tikv] | | group by:nrdc_bi.ads_o2ofusion_ord_org_d.org_id, nrdc_bi.ads_o2ofusion_ord_org_d.org_name, nrdc_bi.ads_o2ofusion_ord_org_d.site_id, nrdc_bi.ads_o2ofusion_ord_org_d.stats_date, nrdc_bi.ads_o2ofusion_ord_org_d.union_id, nrdc_bi.ads_o2ofusion_ord_org_d.union_name,
| └─TableFullScan_79 | 1070944.00 | cop[tikv] | table:ads_o2ofusion_ord_org_d | keep order:false, stats:pseudo
±-----------------------------------------±-----------±----------±------------------------------------------------±----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

主要有CTE, SQL binding 不生效
~
~
~

nrdc_bi.ads_o2ofusion_ord_org_d和nrdc_bi.ads_retail_ord_store_d这两张表的统计信息健康度如何?从执行计划上看返回给TiDB侧的estRows也就1万行左右,不太可能占用这么大内存。另外(stats_date, org_id)组合在fusion_order_org或nrdc_bi.ads_retail_ord_store_d中是否至少一边是唯一的?


这是实际的情况,百w 级别看起来也不大, 就这个SQL 调用频率比较高,目前看每s 有10~20个左右

统计信息健康度是好的

另外(stats_date, org_id)组合在fusion_order_org或nrdc_bi.ads_retail_ord_store_d中是否至少一边是唯一的?
是的, 这个是nrdc_bi.ads_retail_ord_store_d的主键

可能还是和并发量有些关系,如果CTE下binding没法用(PS:这个我没有验证,感觉应该是可以用的吧),可以尝试如下方式:
1、tidb_executor_concurrency限制HashAgg的并发
2、扩容TiDB实例个数
3、token-limit限制TiDB实例并发
4、让研发改SQL重新发版,增加hint,改成STREAM_AGG

除此之外,可能还需要TiDB产研大佬看一下是否是存在其它缘由

好的, 多谢:+1:

客气,其实你的截图往右边拉,memory字段也能看到具体的内存占用的

此话题已在最后回复的 1 分钟后被自动关闭。不再允许新回复。