Why do Grafana and kubectl top report such different memory usage?

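TiDBer_jYQINSnf · January 9, 2021, 02:06

To help us respond efficiently, please provide the following information when asking a question. Clearly described problems get priority.

[TiDB version]: v4.0.8
[Problem description]: Grafana monitoring shows 19G in use, while kubectl top shows 29G.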

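Supplementary answer:
The memory usage reported by kubectl top pod is not cAdvisor's container_memory_usage_bytes but container_memory_working_set_bytes. The two are calculated as follows:

container_memory_usage_bytes = container_memory_rss + container_memory_cache + kernel memory
container_memory_working_set_bytes = container_memory_usage_bytes - total_inactive_file (inactive file-backed page cache)

container_memory_working_set_bytes is treated as the memory the container is actually using, and it is also the figure used for the OOM decision when a memory limit is set.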
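The calculation above can be sketched as follows. This is a minimal sketch, assuming a cgroup v1 memory controller where usage comes from memory.usage_in_bytes and total_inactive_file from memory.stat. The function names, the clamp to zero, and the sample values are assumptions for illustration and are not taken from this thread.

```python
GIB = 1024 ** 3


def parse_memory_stat(text):
    """Parse memory.stat content ("key value" per line) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def working_set_bytes(usage_in_bytes, memory_stat_text):
    """Return usage minus total_inactive_file, never below zero (assumed clamp)."""
    stats = parse_memory_stat(memory_stat_text)
    inactive_file = stats.get("total_inactive_file", 0)
    return max(usage_in_bytes - inactive_file, 0)


# Made-up sample values for illustration only.
sample_stat = (
    f"total_cache {12 * GIB}\n"
    f"total_rss {17 * GIB}\n"
    f"total_inactive_file {3 * GIB}\n"
    f"total_active_file {9 * GIB}\n"
)
sample_usage = 29 * GIB  # as would be read from memory.usage_in_bytes

print(working_set_bytes(sample_usage, sample_stat) / GIB, "GiB")
```

In this sample only the inactive file cache is subtracted, so the active file cache still counts toward the working set.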

github.com/kubernetes/kubernetes

kubelet counts active page cache against memory.available (maybe it shouldn't?)

Opened 04:30PM - 31 Mar 17 UTC

vdavidoff

sig/node kind/feature lifecycle/frozen

**Is this a request for help?** (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.): No

**What keywords did you search in Kubernetes issues before filing this one?** (If you have found any duplicates, you should instead reply there.): active_file inactive_file working_set WorkingSet cAdvisor memory.available

---

**Is this a BUG REPORT or FEATURE REQUEST?** (choose one): We'll say BUG REPORT (though this is arguable)

**Kubernetes version** (use `kubectl version`): 1.5.3

**Environment**:
- **Cloud provider or hardware configuration**:
- **OS** (e.g. from /etc/os-release): NAME="Ubuntu" VERSION="14.04.5 LTS, Trusty Tahr" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 14.04.5 LTS" VERSION_ID="14.04"
- **Kernel** (e.g. `uname -a`): Linux HOSTNAME_REDACTED 3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
- **Install tools**:
- **Others**:

**What happened**: A pod was evicted due to memory pressure on the node, when it appeared to me that there shouldn't have been sufficient memory pressure to cause an eviction. Further digging seems to have revealed that active page cache is being counted against memory.available.

**What you expected to happen**: memory.available would not have active page cache counted against it, since it is reclaimable by the kernel. This also seems to greatly complicate a general case for configuring memory eviction policies, since in a general sense it's effectively impossible to understand how much page cache will be active at any given time on any given node, or how long it will stay active (in relation to eviction grace periods).

**How to reproduce it** (as minimally and precisely as possible): Cause a node to chew up enough active page cache that the existing calculation for memory.available trips a memory eviction threshold, even though the threshold would not be tripped if the page cache - active and inactive - were freed for anon memory.

**Anything else we need to know**: I discussed this with @derekwaynecarr in #sig-node and am opening this issue at his request ([conversation starts here](https://kubernetes.slack.com/archives/C0BP8PW9G/p1490970061856526)). Before poking around on Slack or opening this issue, I did my best to read through the 1.5.3 release code, Kubernetes documentation, and cgroup kernel documentation to make sure I understood what was going on here. The short of it is that I believe [this calculation](https://kubernetes.io/docs/concepts/cluster-administration/out-of-resource/#eviction-signals):

memory.available := node.status.capacity[memory] - node.stats.memory.workingSet

Is using cAdvisor's value for working set, which if I traced the code correctly, amounts to:

$cgroupfs/memory.usage_in_bytes - total_inactive_file

Where, according to my interpretation of the kernel documentation, usage_in_bytes includes all page cache:

$kernel/Documentation/cgroups/memory.txt
```
2.1. Design

The core of the design is a counter called the res_counter. The res_counter tracks the current memory usage and limit of the group of processes associated with the controller.

...

2.2.1 Accounting details

All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
```

Ultimately my issue is concerning how I can set generally applicable memory eviction thresholds if active page cache is counting against those, and there's no way to know (1) generally how much page cache will be active across a cluster's nodes, to use as part of general threshold calculations (2) how long active page cache will stay active, to use as part of eviction grace period calculations.

I understand that there are many layers here and that this is not a particularly simple problem to solve generally correctly, or even understand top to bottom. So I apologize up front if any of my conclusions are incorrect or I'm missing anything major, and I appreciate any feedback you all can provide.

As requested by @derekwaynecarr: cc @sjenning @derekwaynecarr

github.com/kubernetes/kubernetes

kubernetes should not count active_file as used memory, I have been waiting for 4 years!

Opened 02:36AM - 24 Aug 21 UTC

Closed 09:41PM - 14 Sep 21 UTC

zhxjdwh

kind/bug sig/node needs-triage

#### What happened:
kubernetes should not count active_file as used memory!
https://github.com/kubernetes/kubernetes/issues/43916

#### Environment:
- version: v1.15.3
- Cloud provider or hardware configuration: none
- OS : centos7.4
- Kernel: Linux master1 4.4.124-1.el7.elrepo.x86_64 #1 SMP Sun Mar 25 05:21:59 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

TiDBer_jYQINSnf · January 10, 2021, 05:17

A single TiKV holds only about 40G of data, so why does it use so much memory? Where can I see which components are using it?

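yilong · January 11, 2021, 05:35

Go into the pod and check how much memory TiKV is using:
kubectl exec -it <pod-name> -n <namespace> -- sh
Then use commands such as free -g and top to see which process is using the most.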

TiDBer_jYQINSnf · January 11, 2021, 06:18

The numbers from free are not accurate.

The pod has a 32G memory limit.

Grafana for tikv1:

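yilong · January 11, 2021, 06:27

Then let's start from Grafana and find out where the monitoring data comes from. Could you check this document and see whether you can track it down? Thanks.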

TiDBer_jYQINSnf · January 11, 2021, 06:46

I fetched it directly with curl from port 20180 on tikv1. That data is simply inaccurate.

yilong · January 11, 2021, 06:54

Which panel's memory are you looking at? Please check whether TiKV memory or System memory on the Overview dashboard is the one you want.

TiDBer_jYQINSnf · January 11, 2021, 07:02

TiKV memory under Overview is about 17G, the same as before. I couldn't find System memory.

yilong · January 11, 2021, 07:20

OK. Do you have any other systems that also use Prometheus? Could you check whether Prometheus is reading the pod's memory inaccurately?

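TiDBer_jYQINSnf · January 11, 2021, 07:21

It's the Grafana that ships with the k8s operator. I'm asking because TiKV often gets OOM-killed while monitoring shows memory at less than half of what was allocated.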

5kbpers-PingCAP · January 11, 2021, 10:07

Is THP enabled? The metric is calculated by taking rss from the procfs stat file and multiplying it by the page size.

TiDBer_jYQINSnf · January 12, 2021, 06:46

Does this mean it is turned off?

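5kbpers-PingCAP · January 13, 2021, 13:31

It looks like it is enabled.
The memory figure that Prometheus collects in our monitoring is calculated like this:

Read the rss field from /proc/self/stat, which is the number of pages the process actually has resident in memory
Multiply that page count by the PageSize reported by libc to get the number of bytes

In principle this number should be fairly accurate.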
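The calculation described above can be sketched in Python. This is an illustration under assumptions, not the TiKV implementation: the function names are hypothetical, and the position of rss as field 24 follows the proc(5) layout of /proc/[pid]/stat.

```python
import os


def rss_pages_from_stat(stat_line):
    """Extract the rss field (resident pages) from a /proc/[pid]/stat line."""
    # Field 2 (comm) may contain spaces or parentheses, so split after the last ")".
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 24 (rss) is rest[21].
    return int(rest[21])


def rss_bytes(stat_line, page_size):
    """Convert the resident page count to bytes."""
    return rss_pages_from_stat(stat_line) * page_size


# Only read the live value where procfs is available.
if os.path.exists("/proc/self/stat"):
    with open("/proc/self/stat") as f:
        line = f.read()
    print(rss_bytes(line, os.sysconf("SC_PAGE_SIZE")), "bytes resident")
```

The result depends on the page size passed in, which is the base page size rather than the huge page size.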

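My guess is that THP may affect the accuracy of the resulting byte count, but I have not managed to verify this.
Separately, for other reasons we also recommend against enabling THP in production. For details see this article: https://pingcap.com/blog-cn/why-should-we-disable-thp/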
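To check which THP mode is active, one option is to read /sys/kernel/mm/transparent_hugepage/enabled, where the kernel marks the active mode with square brackets. The helper below is a small sketch with a hypothetical function name. It only parses that file and says nothing about whether THP explains the discrepancy in this thread.

```python
import os

THP_ENABLED_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_thp_mode(enabled_text):
    """Return the bracketed mode, e.g. "never" for "always madvise [never]"."""
    for token in enabled_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None  # no bracketed entry found


# Only read the live setting where the file exists.
if os.path.exists(THP_ENABLED_PATH):
    with open(THP_ENABLED_PATH) as f:
        print("THP mode:", active_thp_mode(f.read()))
```

A result of "always" or "madvise" means THP can be used. "never" means it is disabled.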

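TiDBer_jYQINSnf · January 14, 2021, 08:05

A quick description of our environment: VMs run on physical machines, and the VMs serve as k8s nodes. The TiDB cluster is created inside k8s. In this setup the figure is almost always inaccurate: what TiKV reports is only a little more than half of what kubectl top actually shows.

Also, for a cluster started by the operator with default settings, what percentage of the limit does TiKV memory usage normally reach? In our environment, within half an hour of starting, a fresh TiKV pod is already above 90% of its limit according to kubectl top.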

5kbpers-PingCAP · January 14, 2021, 11:36

By default TiKV sizes the RocksDB block cache at 40% of memory.

From the earlier replies, free -g shows 220G? I suggest setting storage.block-cache.capacity manually.

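5kbpers-PingCAP · January 15, 2021, 05:03

Also, does the memory shown by top node match what monitoring shows? From what I can see, top node also reports rss, while top pod is derived from the cgroup counters (usage minus inactive files).

If convenient, please also look at /proc/${tikv-pid}/smaps. The memory sizes there should be somewhat more accurate than rss.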
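One way to total the figures in smaps is to sum a per-mapping field such as Rss or Pss, which the kernel reports in kB. The sketch below uses a hypothetical helper and made-up sample text. Reading another process's smaps usually requires matching privileges.

```python
def sum_smaps_kib(smaps_text, field="Rss"):
    """Sum one per-mapping field (values are in kB) across all mappings."""
    prefix = field + ":"
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith(prefix):
            total += int(line.split()[1])
    return total


# Made-up excerpt with two mappings, for illustration only.
sample_smaps = """00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/example
Size:                328 kB
Rss:                 128 kB
Pss:                  64 kB
7f0000000000-7f0000100000 rw-p 00000000 00:00 0
Size:               1024 kB
Rss:                 512 kB
Pss:                 512 kB
"""

print(sum_smaps_kib(sample_smaps, "Rss"), "kB Rss")
print(sum_smaps_kib(sample_smaps, "Pss"), "kB Pss")
```

For a real process, pass the contents of /proc/<pid>/smaps as smaps_text.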

TiDBer_jYQINSnf · January 15, 2021, 11:01

"storage": {
  "data-dir": "/var/lib/tikv",
  "gc-ratio-threshold": 1.1,
  "max-key-size": 4096,
  "scheduler-concurrency": 524288,
  "scheduler-worker-pool-size": 4,
  "scheduler-pending-write-threshold": "100MiB",
  "reserve-space": "2GiB",
  "block-cache": {
    "shared": true,
    "capacity": "14417MiB",
    "num-shard-bits": 6,
    "strict-capacity-limit": false,
    "high-pri-pool-ratio": 0.8,
    "memory-allocator": "nodump"
  }
}
Judging from this config, storage.block-cache.capacity is nowhere near 200G.
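As a quick check of that capacity against the pod limit, the arithmetic can be worked through as below. It assumes the 32G limit mentioned earlier in the thread means 32 GiB. The helper name is hypothetical.

```python
def parse_mib(size):
    """Parse a size string such as "14417MiB" into MiB (only the MiB suffix is handled)."""
    if not size.endswith("MiB"):
        raise ValueError(f"unsupported size string: {size!r}")
    return int(size[:-3])


capacity_mib = parse_mib("14417MiB")  # block-cache capacity from the config above
limit_mib = 32 * 1024                 # assumption: the 32G pod limit is 32 GiB

ratio = capacity_mib / limit_mib
print(f"block cache is {capacity_mib / 1024:.1f} GiB, {ratio:.1%} of the pod limit")
```

Under that assumption the block cache is a bit under half of the pod limit, which suggests it was sized from the limit and not from the node's total memory.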

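TiDBer_jYQINSnf · January 15, 2021, 12:59

Here is the current situation:
Inside the docker container, top and the total computed from cat /proc/1/smaps agree: both are 16.9G
The pod's memory usage from kubectl top is 26.2G
Grafana shows 14.6G

All three places show different numbers. The key point is that k8s OOM-kills based on the pod's memory usage.
How can I find out what is using the extra memory? The key difference is between what is measured outside the pod and what the process inside the pod actually uses.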
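The size of each gap can be worked out from the three figures above. If the formula in the supplementary answer holds, the first gap is memory charged to the pod's cgroup that is not resident in the TiKV process, for example page cache that is not inactive and kernel memory. That reading is an interpretation and was not verified in this thread.

```python
# Figures reported above, in G.
kubectl_top_g = 26.2        # pod working set as shown by kubectl top
in_container_rss_g = 16.9   # top and /proc/1/smaps inside the container
grafana_g = 14.6            # TiKV memory panel in Grafana

# Memory counted for the pod but not resident in the TiKV process.
outside_process_g = round(kubectl_top_g - in_container_rss_g, 1)

# Difference between in-container measurements and the Grafana metric.
rss_vs_grafana_g = round(in_container_rss_g - grafana_g, 1)

print(f"pod vs process: {outside_process_g}G")
print(f"process vs Grafana: {rss_vs_grafana_g}G")
```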

TiDBer_jYQINSnf · January 15, 2021, 13:26

https://www.ibm.com/support/pages/kubectl-top-pods-and-docker-stats-show-different-memory-statistics

Judging from this, the kubectl top figure is what's inaccurate.

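5kbpers-PingCAP · January 19, 2021, 12:29

Understood. To summarize, the memory discrepancy appears to come from a combination of the error in kubectl top and the error in Linux rss.
As for the OOM problem, does it still occur after setting the block cache size?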