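【TiDB Environment】Production
【TiDB Version】8.5.1
【Deployment Method】Private deployment in a self-hosted data center
【OS / CPU Architecture / Chip Details】CentOS 7.9
【Machine Deployment Details】Mixed (co-located) deployment
【Cluster Data Volume】500 GB per TiKV node
【Number of Cluster Nodes】7
【Problem: Symptoms and Impact】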
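We run a TiDB cluster in our own data center, with TiDB and PD co-located on the same nodes. Each node has at most 47 GB of memory. Because a tidb-server on one node previously restarted due to OOM, we set tidb_server_memory_limit to 30 GB. The longer the cluster runs, the larger heapinuse gets: it only grows and never shrinks. Monitoring shows that, with this limit in place, running SQL statements are now frequently killed, which affects the business (see the log under the ERROR log section below). What is the recommended way to optimize this? Also, why does heapinuse take up so much memory, and when is it released?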
【Resource Configuration】Go to TiDB Dashboard -> Cluster Info -> Hosts and attach a screenshot of that page
【Copy and Paste the ERROR Log】
2026-02-04 01:32:08 (UTC+08:00)TiDB 1.1.1.1:4000[coprocessor.go:1389] ["[TIME_COP_PROCESS] resp_time:425.120824ms txnStartTS:464031529392668674 region_id:11542710 store_addr:1.1.1.1:20160 stats:Cop:{num_rpc:1, total_time:425.1ms} kv_process_ms:277 kv_wait_ms:0 kv_read_ms:122 processed_versions:50144 total_versions:53086 rocksdb_delete_skipped_count:5168 rocksdb_key_skipped_count:108711 rocksdb_cache_hit_count:510 rocksdb_read_count:756 rocksdb_read_byte:1946526"] [conn=827668692] [session_alias=]
2026-02-04 01:32:09 (UTC+08:00)TiDB 1.1.1.1:4000[servermemorylimit.go:159] ["global memory controller tries to kill the top1 memory consumer"] [conn=827668692] ["sql digest"=74810cb713eae918c74284fbc62d75fd2a4cffe9a5038238f53e3b65191ad948] ["sql text"="select Rec_id\n ,Defect\n ,Stoc"] [tidb_server_memory_limit=32212254720] ["heap inuse"=32224280576] ["sql memory usage"=227603297]
2026-02-04 01:32:09 (UTC+08:00)TiDB 1.1.1.1:4000[sqlkiller.go:61] ["kill initiated"] ["connection ID"=827668692] [reason="[executor:8176]Your query has been cancelled due to exceeding the allowed memory limit for the tidb-server instance and this query is currently using the most memory. Please try narrowing your query scope or increase the tidb_server_memory_limit and try again.[conn=827668692]"]
2026-02-04 01:32:09 (UTC+08:00)TiDB 1.1.1.1:4000[sqlkiller.go:128] ["global memory controller, NeedKill signal is received successfully"] [conn=827668692]
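To make the numbers in the "global memory controller" entry easier to read, here is a minimal Python sketch that extracts the three byte counts from that log line and converts them. The helper names (`parse_field`, `summarize`) are made up for illustration and are not part of TiDB; the log line is abbreviated to the fields used, and the regex only assumes the `[name=value]` / `["name"=value]` layout seen above.

```python
import re

# The "global memory controller" entry from the log above, shortened to the
# fields this sketch reads (values copied verbatim).
LOG_LINE = (
    '[servermemorylimit.go:159] '
    '["global memory controller tries to kill the top1 memory consumer"] '
    '[conn=827668692] [tidb_server_memory_limit=32212254720] '
    '["heap inuse"=32224280576] ["sql memory usage"=227603297]'
)

GIB = 1024 ** 3
MIB = 1024 ** 2


def parse_field(line: str, name: str) -> int:
    """Return the integer value of a [name=value] or ["name"=value] field."""
    match = re.search(r'\["?' + re.escape(name) + r'"?=(\d+)\]', line)
    if match is None:
        raise ValueError(f"field {name!r} not found in log line")
    return int(match.group(1))


def summarize(line: str) -> dict:
    """Collect the limit, heap in use, and the killed statement's memory."""
    limit = parse_field(line, "tidb_server_memory_limit")
    heap = parse_field(line, "heap inuse")
    sql = parse_field(line, "sql memory usage")
    return {
        "limit_bytes": limit,
        "heap_bytes": heap,
        "sql_bytes": sql,
        "over_limit_bytes": heap - limit,  # how far heap in use exceeds the limit
        "sql_share_of_heap": sql / heap,   # fraction of the heap held by the killed SQL
    }


if __name__ == "__main__":
    s = summarize(LOG_LINE)
    print(f"limit      : {s['limit_bytes'] / GIB:.2f} GiB")
    print(f"heap inuse : {s['heap_bytes'] / GIB:.2f} GiB")
    print(f"over limit : {s['over_limit_bytes'] / MIB:.2f} MiB")
    print(f"killed SQL : {s['sql_bytes'] / MIB:.2f} MiB "
          f"({s['sql_share_of_heap']:.2%} of heap inuse)")
```

Read this way, the limit is exactly 30 GiB, heap inuse is about 11.5 MiB over it, and the killed statement holds only about 217 MiB, under 1% of heap inuse. So the statement that gets killed is the largest tracked consumer at that moment, but it is not what is filling the heap.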
