tidb-server 5.3.0 内存控制不生效

【 TiDB 使用环境】生产环境
【 TiDB 版本】5.3.0
【遇到的问题】tidb-server跑一个巨慢的SQL(一张8千万的表和一张1亿的表join)会OOM,设置了mem-quota-query、oom-action控制内存,但不生效。

配置如下:
server_configs:
tidb:
log.slow-threshold: 300
mem-quota-query: 8589934592
oom-action: cancel
oom-use-tmp-storage: true
performance.server-memory-quota: 30212254720
performance.txn-total-size-limit: 2173741824
tmp-storage-path: /home/shared/tmp-storage

tidb-server的log:
[2022/09/28 16:31:38.100 +08:00] [WARN] [memory_usage_alarm.go:140] [“tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path”] [“is server-memory-quota set”=true] [server-memory-quota=30212254720] [“tidb-server memory usage”=24235258432] [memory-usage-alarm-ratio=0.8] [“record path”="/home/shared/tmp-storage/1004_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/record"]
[2022/09/28 16:31:46.500 +08:00] [WARN] [expensivequery.go:179] [expensive_query] [cost_time=60.07822628s] [cop_time=12.266157414s] [process_time=287.003s] [wait_time=0.973s] [request_count=374] [total_keys=347676927] [process_keys=174355378] [num_cop_tasks=374] [process_avg_time=0.7673877s] [process_p90_time=1.313s] [process_max_time=2.1390000000000002s] [process_max_addr=dx-op-tidb-tikv3-pm:20160] [wait_avg_time=0.002601604s] [wait_p90_time=0.003s] [wait_max_time=0.095s] [wait_max_addr=dx-op-tidb-tikv3-pm:20160] [stats=ods_bort_trade_t_order_item_part:436299555451174938,ods_bort_trade_t_order_part:436300359591002118] [conn_id=3] [user=dbadmin_01] [database=pipe_data_realtime] [table_ids="[18951,78473,18951,78473]"] [txn_start_ts=436300374716710913] [mem_max=“10375289917 Bytes (9.66 GB)”] [sql=“select 0 as id, scencekey ,10001 as resourceType, productid as resourceId, bizid as bizId, sum(quantity) as soldCount, CURRENT_timestamp() as dbctime, CURRENT_timestamp() as dbutime from (select id,scencekey from pipe_data_realtime.ods_bort_trade_t_order_part where state <> ‘4’) a join (select orderitemid, orderid, productid, variantid, quantity, bizid, refundstate, transferstate from pipe_data_realtime.ods_bort_trade_t_order_item_part where refundstate <> 2 and transferstate <> 1 )b on a.id = b.orderid group by scencekey, bizid , productid union all select 0 as id, scencekey ,10002 as resourceType, variantid as resourceId, bizid as bizId, sum(quantity) as soldCount , CURRENT_timestamp() as dbctime, CURRENT_timestamp() as dbutime from(select id,scencekey from pipe_data_realtime.ods_bort_trade_t_order_part where state <> ‘4’) a join (select orderitemid, orderid, productid, variantid, quantity, bizid, refundstate, transferstate from pipe_data_realtime.ods_bort_trade_t_order_item_part where refundstate <> 2 and transferstate <> 1 ) b on a.id = b.orderid group by scencekey, bizid , variantid”]
[2022/09/28 16:31:46.931 +08:00] [INFO] [coprocessor.go:1149] [“memory exceeds quota, destroy one token now.”] [consumed=10378389257] [quota=8589934592] [“total token count”=80] [“remaining token count”=37]

【期望效果】
tidb-server跑超过内存的SQL能按照配置kill掉,而不是OOM

[“memory exceeds quota, destroy one token now.”] [consumed=10378389257] [quota=8589934592] [“total token count”=80] [“remaining token count”=37]

这个说明配置是生效了。

但是sql并没有被kill掉。
查了下这个日志,只是被流控了。

流量控制

  • TiDB 支持对读数据算子的动态内存控制功能。读数据的算子默认启用 tidb_distsql_scan_concurrency 所允许的最大线程数来读取数据。当单条 SQL 语句的内存使用每超过 tidb_mem_quota_query 一次,读数据的算子就会停止一个线程。
  • 流控行为由参数 tidb_enable_rate_limit_action 控制。
  • 当流控被触发时,会在日志中打印一条包含关键字 memory exceeds quota, destroy one token now 的日志。

可以用tiflash进行加速看看 ;再者是否有条件进行过滤,主要是看结果集,而不是你的表有多大

结果集很大,所以会把tidb-server内存撑爆。所以想把这类查询kill掉,不让tidb-server OOM。但是内存控制不生效。

你没贴执行计划啊,其实我怀疑你这个内存没控制住主要是走了hashagg导致的。
请session级别设置:
set tidb_hashagg_final_concurrency=1;
set tidb_hashagg_partial_concurrency=1;
这样来控制聚合

应该是可以,看下是否有对应bug,不然就升级版本看看