tidb组件节点内存缓慢增长,直至消耗完,BUG了?

【 TiDB 使用环境】生产环境
【 TiDB 版本】
Cluster version: v4.0.9
【遇到的问题:问题现象及影响】
今天业务反馈系统查询很慢。时快时慢不稳定,数据库也没啥变更,但是看监控发现tidb组件节点内存在逐渐消耗。服务器内存吃完。可能跟这个有关吗?有 没大佬帮忙分析一下,谢谢~

集群修改过tikv_gc_life_time时间为一个月(为了做增量备份修改的,我们业务场景也很少有修改),不知道跟这有关系没?我把tikv_gc_life_time改为1h了,内存也没降下来
【资源配置】
【附件:截图/日志/监控】
集群信息:


TIDB监控面板截图:

TIDB组件日志信息:(没啥异常的,但是刷的非常快~~)

看下那时间段的慢SQL, 表里有没有 text lob类型的字段

1 Like


我们慢查询好多啊
这个text lob类型有啥影响吗?

之前碰到过 text导致内存吃光的情况,不过那个是增长很快的,又看了下你这个 你的tidb 内存是个缓慢增长的状态 goroutine有突增,感觉像有泄露

1 Like

就在刚刚,有个tidb节点 heap memory瞬间降下来了, 这个heap memory是啥用途呀? 会不会gc的数据

瞬间都下来了,发生了啥?我那会修改过gc_life_time,gc的数据是存在tikv的吧?会影响tidb的内存吗?有点晕不太懂原理


-----知道了,突然下来了,是因为这个tidb节点进程崩溃重启了,所以内存释放了

发生OOM了,你看下 memory-usage-alarm-ratio 设置的多少。然后找下tidb日志,看下OOM之前,有没有把 heap ,running_sql, goroutinue 这些信息dump出来。

然后分析一下heap,running_sql,看看究竟是什么地方导致的OOM问题。

1 Like

OOM之前的dump信息贴一下看看,也可能是内存泄露相关的bug

1 Like

日志文件太大了,我过滤oom出来如下信息大佬看一下:
[root@b26 tidb-4000]#
[root@b26 tidb-4000]#
[root@b26 tidb-4000]# grep -i oom tidb-2022-11-09T00-34-18.088.log |more
[2022/11/08 20:17:39.510 +08:00] [WARN] [memory_usage_alarm.go:141] [“tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path”] [“is server-memory-quota set”=false] [“system me
mory total”=33533276160] [“system memory usage”=31310999552] [“tidb-server memory usage”=19779988200] [memory-usage-alarm-ratio=0.8] [“record path”=“/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/re
cord”]
[2022/11/08 20:17:51.116 +08:00] [WARN] [memory_usage_alarm.go:141] [“tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path”] [“is server-memory-quota set”=false] [“system me
mory total”=33533276160] [“system memory usage”=31315435520] [“tidb-server memory usage”=20917584008] [memory-usage-alarm-ratio=0.8] [“record path”=“/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/re
cord”]
[2022/11/08 20:18:02.945 +08:00] [WARN] [memory_usage_alarm.go:141] [“tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path”] [“is server-memory-quota set”=false] [“system me
mory total”=33533276160] [“system memory usage”=31339331584] [“tidb-server memory usage”=22594641656] [memory-usage-alarm-ratio=0.8] [“record path”=“/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/re
cord”]
[2022/11/08 20:18:15.052 +08:00] [WARN] [memory_usage_alarm.go:141] [“tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path”] [“is server-memory-quota set”=false] [“system me
mory total”=33533276160] [“system memory usage”=31360589824] [“tidb-server memory usage”=23886401472] [memory-usage-alarm-ratio=0.8] [“record path”=“/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/re
cord”]
[2022/11/08 20:18:36.624 +08:00] [WARN] [memory_usage_alarm.go:141] [“tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path”] [“is server-memory-quota set”=false] [“system me
mory total”=33533276160] [“system memory usage”=31433367552] [“tidb-server memory usage”=25614050064] [memory-usage-alarm-ratio=0.8] [“record path”=“/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/re
cord”]
[2022/11/08 20:18:48.469 +08:00] [WARN] [memory_usage_alarm.go:141] [“tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path”] [“is server-memory-quota set”=false] [“system me
mory total”=33533276160] [“system memory usage”=31401734144] [“tidb-server memory usage”=14070173040] [memory-usage-alarm-ratio=0.8] [“record path”=“/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/re
cord”]
[2022/11/08 20:19:00.565 +08:00] [WARN] [memory_usage_alarm.go:141] [“tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path”] [“is server-memory-quota set”=false] [“system me
mory total”=33533276160] [“system memory usage”=31400734720] [“tidb-server memory usage”=15317806152] [memory-usage-alarm-ratio=0.8] [“record path”=“/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/re
cord”]
[2022/11/08 20:19:13.193 +08:00] [WARN] [memory_usage_alarm.go:141] [“tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path”] [“is server-memory-quota set”=false] [“system me
mory total”=33533276160] [“system memory usage”=31428382720] [“tidb-server memory usage”=16562777040] [memory-usage-alarm-ratio=0.8] [“record path”=“/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/re
cord”]
[2022/11/08 20:19:24.745 +08:00] [WARN] [memory_usage_alarm.go:141] [“tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path”] [“is server-memory-quota set”=false] [“system me
mory total”=33533276160] [“system memory usage”=31451799552] [“tidb-server memory usage”=17912782440] [memory-usage-alarm-ratio=0.8] [“record path”=“/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/re
cord”]
[2022/11/08 20:19:36.860 +08:00] [WARN] [memory_usage_alarm.go:141] [“tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path”] [“is server-memory-quota set”=false] [“system me
mory total”=33533276160] [“system memory usage”=31475445760] [“tidb-server memory usage”=20011172064] [memory-usage-alarm-ratio=0.8] [“record path”=“/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/re
cord”]
[2022/11/08 20:19:48.782 +08:00] [WARN] [memory_usage_alarm.go:141] [“tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path”] [“is server-memory-quota set”=false] [“system me
mory total”=33533276160] [“system memory usage”=31427584000] [“tidb-server memory usage”=21546932872] [memory-usage-alarm-ratio=0.8] [“record path”=“/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/re
cord”]
[2022/11/08 20:20:00.006 +08:00] [WARN] [memory_usage_alarm.go:141] [“tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path”] [“is server-memory-quota set”=false] [“system me
mory total”=33533276160] [“system memory usage”=31404859392] [“tidb-server memory usage”=22523424224] [memory-usage-alarm-ratio=0.8] [“record path”=“/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/re
cord”]
[2022/11/08 20:20:11.863 +08:00] [WARN] [memory_usage_alarm.go:141] [“tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path”] [“is server-memory-quota set”=false] [“system me
mory total”=33533276160] [“system memory usage”=31412420608] [“tidb-server memory usage”=23954927736] [memory-usage-alarm-ratio=0.8] [“record path”=“/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/re
cord”]
[2022/11/08 20:20:56.659 +08:00] [INFO] [printer.go:47] [“loaded config”] [config=“{"host":"0.0.0.0","advertise-address":"192.168.241.26","port":4000,"cors":"","store":"tikv","path":"192.168.241
.49:12379,192.168.241.26:12379,192.168.241.21:12379","socket":"","lease":"45s","run-ddl":true,"split-table":true,"token-limit":1000,"oom-use-tmp-storage":true,"tmp-storage-path":"/tmp/1025_tidb/M
C4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage","oom-action":"log","mem-quota-query":8589934592,"tmp-storage-quota":-1,"enable-streaming":false,"enable-batch-dml":false,"lower-case-table-names":2,"s
erver-version":"","log":{"level":"info","format":"text","disable-timestamp":null,"enable-timestamp":null,"disable-error-stack":null,"enable-error-stack":null,"file":{"filename":"/data/deplo
y/install/log/tidb-4000/tidb.log","max-size":300,"max-days":0,"max-backups":0},"enable-slow-log":true,"slow-query-file":"log/tidb_slow_query.log","slow-threshold":300,"expensive-threshold":10000,"q
uery-log-max-len":4096,"record-plan-in-slow-log":1},"security":{"skip-grant-table":false,"ssl-ca":"","ssl-cert":"","ssl-key":"","require-secure-transport":false,"cluster-ssl-ca":"","cluster
-ssl-cert":"","cluster-ssl-key":"","cluster-verify-cn":null},"status":{"status-host":"0.0.0.0","metrics-addr":"","status-port":10080,"metrics-interval":15,"report-status":true,"record-db-qp
s":false},"performance":{"max-procs":0,"max-memory":0,"server-memory-quota":0,"memory-usage-alarm-ratio":0.8,"stats-lease":"3s","stmt-count-limit":5000,"feedback-probability":0,"query-feedback-li
mit":512,"pseudo-estimate-ratio":0.8,"force-priority":"NO_PRIORITY","bind-info-lease":"3s","txn-total-size-limit":10737418240,"tcp-keep-alive":true,"cross-join":true,"run-auto-analyze":true,"agg-
push-down-join":false,"committer-concurrency":16,"max-txn-ttl":600000,"gogc":100},"prepared-plan-cache":{"enabled":false,"capacity":100,"memory-guard-ratio":0.1},"opentracing":{"enable":false,"rp
c-metrics":false,"sampler":{"type":"const","param":1,"sampling-server-url":"","max-operations":0,"sampling-refresh-interval":0},"reporter":{"queue-size":0,"buffer-flush-interval":0,"log-spans
":false,"local-agent-host-port":""}},"proxy-protocol":{"networks":"","header-timeout":5},"tikv-client":{"grpc-connection-count":4,"grpc-keepalive-time":10,"grpc-keepalive-timeout":3,"commit-tim
eout":"41s","max-batch-size":128,"overload-threshold":200,"max-batch-wait-time":0,"batch-wait-size":8,"enable-chunk-rpc":true,"region-cache-ttl":600,"store-limit":0,"store-liveness-timeout":"5s"
,"copr-cache":{"enable":false,"capacity-mb":1000,"admission-max-ranges":500,"admission-max-result-mb":10,"admission-min-process-ms":5}},"binlog":{"enable":false,"ignore-error":false,"write-timeout
":"15s","binlog-socket":"","strategy":"range"},"compatible-kill-query":false,"plugin":{"dir":"","load":""},"pessimistic-txn":{"enable":true,"max-retry-count":256},"check-mb4-value-in-ut
f8":true,"max-index-length":3072,"graceful-wait-before-shutdown":0,"alter-primary-key":false,"treat-old-version-utf8-as-utf8mb4":true,"enable-table-lock":false,"delay-clean-table-lock":0,"split-region-
max-num":1000,"stmt-summary":{"enable":true,"enable-internal-query":false,"max-stmt-count":200,"max-sql-length":4096,"refresh-interval":1800,"history-size":24},"repair-mode":false,"repair-table-lis
t":[],"isolation-read":{"engines":["tikv","tiflash","tidb"]},"max-server-connections":0,"new_collations_enabled_on_first_bootstrap":false,"experimental":{"allow-expression-index":false},"enable-c
ollect-execution-info":true,"skip-register-to-dashboard":false,"enable-telemetry":true}”]
[2022/11/08 22:06:07.441 +08:00] [INFO] [2pc.go:822] [“prewrite encounters lock”] [conn=29] [lock=“key: {tableID=2230, handle=68540032299}, primary: {tableID=2230, indexID=1, indexValues={123, http://www.t-boom.com.
cn/wap/indeax.php?id=795213, }}, txnStartTS: 437234267549335567, lockForUpdateTS:0, ttl: 4321, type: Put”]
[2022/11/08 22:06:07.441 +08:00] [INFO] [2pc.go:822] [“prewrite encounters lock”] [conn=29] [lock=“key: {tableID=2230, handle=68540043835}, primary: {tableID=2230, indexID=1, indexValues={123, http://www.t-boom.com.
cn/wap/indeax.php?id=795213, }}, txnStartTS: 437234267549335567, lockForUpdateTS:0, ttl: 4321, type: Put”]
[root@b26 tidb-4000]#
[root@b26 tidb-4000]#

是0.8

日志级别可以改成warn,日志会少很多

1 Like

[2022/11/08 20:20:11.863 +08:00] [WARN] [memory_usage_alarm.go:141] [“tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path”] [“is server-memory-quota set”=false] [“system me
mory total”=33533276160] [“system memory usage”=31412420608] [“tidb-server memory usage”=23954927736] [memory-usage-alarm-ratio=0.8] [“record path”=“/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/re
cord”]

你看下这个路径的
/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/re
cord

先看下有running_sql,看下有没有什么异常的sql消耗内存的,没有的话就把heap上传一下吧

1 Like

有挺多的,这个running_sql里面的是异常的sql吗?什么情况会记录这个?


这是running_sql日志文件:
running_sql2022-11-08T20_19_13+08_00.txt (118.9 KB)

把时间最新的heap文件再传一下。

1 Like

这个文件杂看呀,没法直接more呢
heap2022-11-08T20_20_11+08_00.txt (9.6 MB)

直接用终端执行 go tool pprof -http=127.0.0.1:8080 heap文件,然后通过浏览器查看。

感觉这个问题还是出在SQL层面上,你那个SQL where 条件中in的值太多了。

2 Likes

好的,感谢大佬耐心解答 :pray:

感觉还是跟gc时间设置太久有关,调成1h后,有改善吗?

其实我也不太确定杂回事,之前我们gc时间一直都是设置的1个月,我问过开发最近也没有大量的数据修改。
我把gc时间改成1小时后,下次gc后也内存没有下来,直至tidb oom内存才释放。
我的结论是SQL查询的问题。后台确实也有大量的查询

这个sql是最近上线的业务吗?表数据量有多大?表健康度怎么样?是否需要分析下表?你那个sql里确实in的值很多。可以创建个联合索引,避免再回表去查另一个值,反正也就是求个count,看看有没有效果

原来是这么看的