TiKV service restarts for no apparent reason

TiDBer_ssvwtrcq · April 25, 2023, 06:43

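【TiDB environment】Test
【TiDB version】v6.5.2
【Reproduction path】This is a new cluster deployed less than 24 hours ago. It has only run a TPC-H test and the data generation step of a TPC-C test. After sitting idle for a while (overnight), 2 of the KV nodes restarted on their own at 8:49 in the morning.
After investigating, I found the following entries in the operating system log: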

Apr 25 08:49:04 tikv119 kernel: Out of memory: Kill process 4583 (tikv-server) score 917 or sacrifice child 
Apr 25 08:49:04 tikv119 kernel: Killed process 4583 (tikv-server), UID 0, total-vm:44109576kB, anon-rss:30524348kB, file-rss:520kB, shmem-rss:0kB 
Apr 25 08:49:08 tikv119 systemd: tikv-20160.service: main process exited, code=killed, status=9/KILL 
Apr 25 08:49:08 tikv119 systemd: Unit tikv-20160.service entered failed state. 
Apr 25 08:49:08 tikv119 systemd: tikv-20160.service failed. 
Apr 25 08:49:23 tikv119 systemd: tikv-20160.service holdoff time over, scheduling restart. 
Apr 25 08:49:23 tikv119 systemd: Stopped tikv service. 
Apr 25 08:49:23 tikv119 systemd: Started tikv service. 
Apr 25 08:49:23 tikv119 bash: sync ... 
Apr 25 08:49:23 tikv119 bash: real#0110m0.003s 
Apr 25 08:49:23 tikv119 bash: user#0110m0.000s 
Apr 25 08:49:23 tikv119 bash: sys#0110m0.001s 
Apr 25 08:49:23 tikv119 bash: ok
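
The sizes in the "Killed process" line are in kB, which makes them hard to compare with the machine's memory at a glance. Below is a minimal Python sketch that pulls those fields out of the line and converts them to GiB. The helper names `parse_oom_kill` and `kb_to_gib` are made up for illustration, and the regular expression only assumes the `name:NNNkB` layout seen in this particular log line.

```python
import re

# Kernel OOM-killer line copied verbatim from the log in this post.
LINE = (
    "Apr 25 08:49:04 tikv119 kernel: Killed process 4583 (tikv-server), UID 0, "
    "total-vm:44109576kB, anon-rss:30524348kB, file-rss:520kB, shmem-rss:0kB"
)


def parse_oom_kill(line):
    """Return the 'name:NNNkB' fields of the line as a dict of integers (kB)."""
    pairs = re.findall(r"([a-z-]+):(\d+)kB", line)
    return {name: int(kb) for name, kb in pairs}


def kb_to_gib(kb):
    """Convert kB (treated as 1024 bytes, as the kernel reports) to GiB."""
    return kb / (1024 * 1024)


sizes = parse_oom_kill(LINE)
for name, kb in sizes.items():
    print(f"{name}: {kb_to_gib(kb):.2f} GiB")
```

The `anon-rss` field works out to roughly 29.1 GiB, so the tikv-server process alone was holding most of the node's memory when the kernel killed it.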

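【Problem encountered: symptoms and impact】
【Resource configuration】
1 Server node, 1 PD node, 3 KV nodes.
【Attachments: screenshots/logs/monitoring】
PD node log
pd.log (3.2 KB)

Server node log
tidb-server.log (13.6 KB)

Log from one of the KV nodes
tikv119.log (3.3 MB)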

xfworld · April 25, 2023, 06:56

Out of memory…

TiDBer_ssvwtrcq · April 25, 2023, 07:34

I know it is an OOM, but I don't know the cause. There were no queries at all overnight.

TiDBer_ssvwtrcq · April 25, 2023, 07:35

I suspect a bug in the system is causing the memory to run out.

tidb菜鸟一只 · April 25, 2023, 07:42

SHOW CONFIG WHERE type='tikv' AND name LIKE '%storage.block-cache.capacity%'; -- check what this parameter is set to

TiDBer_ssvwtrcq · April 25, 2023, 07:45

TiDBer_ssvwtrcq · April 25, 2023, 07:46

The KV machines are configured with 32G of physical memory.
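
For reference, here is a rough Python sketch of what `storage.block-cache.capacity` would be on a 32G machine if it was never set explicitly. It assumes the default documented for TiKV, 45% of total system memory, and it treats "32G" as 32 GiB. Both are assumptions, since the actual value on this cluster does not appear in the thread. The function name `default_block_cache_bytes` is hypothetical.

```python
GIB = 1024 ** 3


def default_block_cache_bytes(total_memory_bytes, ratio=0.45):
    """Estimate the block cache size as a fixed share of total memory.

    ratio=0.45 is an assumption taken from the documented TiKV default;
    it is not a value read from this cluster.
    """
    return int(total_memory_bytes * ratio)


total_memory = 32 * GIB  # assumption: "32G" means 32 GiB
block_cache = default_block_cache_bytes(total_memory)
print(f"estimated block cache: {block_cache / GIB:.1f} GiB")
```

Under these assumptions the block cache would be about 14.4 GiB, which on its own does not account for the anon-rss of 30524348 kB (roughly 29 GiB) in the kernel log.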

tidb菜鸟一只 · April 25, 2023, 08:02

Are there other processes running on the TiKV machines? It's not a mixed deployment, is it?

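TiDBer_ssvwtrcq · April 25, 2023, 08:14

The logs I provided may not be accurate. I just found that the clocks on the KV nodes and the Server node have somehow ended up 30 minutes apart. It is not a mixed deployment.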

TiDBer_ssvwtrcq · April 25, 2023, 08:15

The KV node logs are accurate. The KV nodes are 30 minutes behind the PD and Server nodes.
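
When lining up the KV log with the PD and Server logs, the offset has to be applied by hand. Here is a small Python sketch, assuming the skew is exactly 30 minutes and stayed constant (the post only says "30 minutes", so this is an approximation). The name `kv_to_pd_time` is made up for illustration.

```python
from datetime import datetime, timedelta

# Assumption: the KV clock is behind the PD/Server clock by exactly 30 minutes.
KV_LAG = timedelta(minutes=30)


def kv_to_pd_time(kv_timestamp):
    """Map a timestamp from the KV node's clock to the PD/Server clock."""
    return kv_timestamp + KV_LAG


# Time of the OOM kill as recorded on tikv119.
oom_on_kv = datetime(2023, 4, 25, 8, 49, 4)
print(kv_to_pd_time(oom_on_kv))
```

Under that assumption, the 08:49:04 OOM kill on tikv119 should be looked for at around 09:19:04 in pd.log and tidb-server.log.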

ffeenn · April 25, 2023, 09:00

How much memory is the server actually using right now? Check the monitoring for resource usage at that point in time.

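xfworld · April 25, 2023, 10:15

How did you install your cluster? tiup runs an environment check, and if the environment is not OK, the installation cannot proceed…

A time difference that large…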

TiDBer_ssvwtrcq · April 25, 2023, 10:26

The clock on the server motherboard was faulty. We have rebuilt the cluster. Thanks, everyone.

system · June 24, 2023, 10:27

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.