Why does this error keep appearing? [ERROR] [server.rs:866] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]

One of the three existing TiKV nodes suddenly went down and would not start; the log reports the error in the title. After a round of scale-out/scale-in operations, a new node was added and the old node was taken offline. Now the new node has also suddenly gone down with the same error. Where exactly is the problem?

[2021/06/15 16:39:21.933 +08:00] [INFO] [server.rs:270] ["using config"] [config="{"log-level":"info","log-file":"/data/tidb-deploy/tikv-20160/log/tikv.log","log-format":"text","slow-log-file":"","slow-log-threshold":"1s","log-rotation-timespan":"1d","log-rotation-size":"300MiB","panic-when-unexpected-key-or-data":false,"enable-io-snoop":true,"abort-on-panic":false,"readpool":{"unified":{"min-thread-count":1,"max-thread-count":4,"stack-size":"10MiB","max-tasks-per-worker":2000},"storage":{"use-unified-pool":true,"high-concurrency":4,"normal-concurrency":4,"low-concurrency":4,"max-tasks-per-worker-high":2000,"max-tasks-per-worker-normal":2000,"max-tasks-per-worker-low":2000,"stack-size":"10MiB"},"coprocessor":{"use-unified-pool":true,"high-concurrency":3,"normal-concurrency":3,"low-concurrency":3,"max-tasks-per-worker-high":2000,"max-tasks-per-worker-normal":2000,"max-tasks-per-worker-low":2000,"stack-size":"10MiB"}},"server":{"addr":"0.0.0.0:20160","advertise-addr":"192.168.5.49:20160","status-addr":"0.0.0.0:20180","advertise-status-addr":"192.168.5.49:20180","status-thread-pool-size":1,"max-grpc-send-msg-len":10485760,"grpc-compression-type":"none","grpc-concurrency":5,"grpc-concurrent-stream":1024,"grpc-raft-conn-num":1,"grpc-memory-pool-quota":9223372036854775807,"grpc-stream-initial-window-size":"2MiB","grpc-keepalive-time":"10s","grpc-keepalive-timeout":"3s","concurrent-send-snap-limit":32,"concurrent-recv-snap-limit":32,"end-point-recursion-limit":1000,"end-point-stream-channel-size":8,"end-point-batch-row-limit":64,"end-point-stream-batch-row-limit":128,"end-point-enable-batch-if-possible":true,"end-point-request-max-handle-duration":"1m","end-point-max-concurrency":4,"snap-max-write-bytes-per-sec":"100MiB","snap-max-total-size":"0KiB","stats-concurrency":1,"heavy-load-threshold":300,"heavy-load-wait-duration":"1ms","enable-request-batch":true,"background-thread-count":2,"end-point-slow-log-threshold":"1s","forward-max-connections-per-address":4,"labels":{}},"storage":{"data-dir":"/data/tidb-data/tikv-20160","gc-ratio-threshold":1.1,"max-key-size":4096,"scheduler-concurrency":524288,"scheduler-worker-pool-size":4,"scheduler-pending-write-threshold":"100MiB","reserve-space":"0KiB","enable-async-apply-prewrite":false,"enable-ttl":false,"ttl-check-poll-interval":"12h","block-cache":{"shared":true,"capacity":"1331MiB","num-shard-bits":6,"strict-capacity-limit":false,"high-pri-pool-ratio":0.8,"memory-allocator":"nodump"}},"pd":{"endpoints":["192.168.5.43:2379","192.168.5.44:2379"],"retry-interval":"300ms","retry-max-count":9223372036854775807,"retry-log-every":10,"update-interval":"10m","enable-forwarding":false},"metric":{"job":"tikv"},"raftstore":{"prevote":true,"raftdb-path":"/data/tidb-data/tikv-20160/raft","capacity":"0KiB","raft-base-tick-interval":"1s","raft-heartbeat-ticks":2,"raft-election-timeout-ticks":10,"raft-min-election-timeout-ticks":10,"raft-max-election-timeout-ticks":20,"raft-max-size-per-msg":"1MiB","raft-max-inflight-msgs":256,"raft-entry-max-size":"8MiB","raft-log-gc-tick-interval":"10s","raft-log-gc-threshold":50,"raft-log-gc-count-limit":73728,"raft-log-gc-size-limit":"72MiB","raft-log-reserve-max-ticks":6,"raft-engine-purge-interval":"10s","raft-entry-cache-life-time":"30s","raft-reject-transfer-leader-duration":"3s","split-region-check-tick-interval":"10s","region-split-check-diff":"6MiB","region-compact-check-interval":"5m","region-compact-check-step":100,"region-compact-min-tombstones":10000,"region-compact-tombstones-percent":30,"pd-heartbeat-tick-interval":"1m","pd-store-heartbeat-tick-interval":"10s","snap-mgr-gc-tick-interval":"1m","snap-gc-timeout":"4h","lock-cf-compact-interval":"10m","lock-cf-compact-bytes-threshold":"256MiB","notify-capacity":40960,"messages-per-tick":4096,"max-peer-down-duration":"5m","max-leader-missing-duration":"2h","abnormal-leader-missing-duration":"10m","peer-stale-state-check-interval":"5m","leader-transfer-max-log-lag":128,"snap-apply-batch-size":"10MiB","consistency-check-interval":"0s","report-region-flow-interval":"1m","raft-store-max-leader-lease":"9s","right-derive-when-split":true,"allow-remove-leader":false,"merge-max-log-gap":10,"merge-check-tick-interval":"2s","use-delete-range":false,"cleanup-import-sst-interval":"10m","local-read-batch-size":1024,"apply-max-batch-size":256,"apply

[2021/06/15 16:39:21.937 +08:00] [ERROR] [server.rs:866] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]
[2021/06/15 16:39:21.937 +08:00] [INFO] [mod.rs:116] ["encryption: none of key dictionary and file dictionary are found."]
[2021/06/15 16:39:21.937 +08:00] [INFO] [mod.rs:477] ["encryption is disabled."]
[2021/06/15 16:39:22.007 +08:00] [INFO] [future.rs:146] ["starting working thread"] [worker=gc-worker]
[2021/06/15 16:39:22.071 +08:00] [INFO] [mod.rs:214] ["Storage started."]
[2021/06/15 16:39:22.080 +08:00] [INFO] [node.rs:176] ["put store to PD"] [store="id: 123045 address: \"192.168.5.49:20160\" version: \"5.0.2\" status_address: \"192.168.5.49:20180\" git_hash: \"6e6ea0e02c2caac556f95a821f92b28fc88dba85\" start_timestamp: 1623746362 deploy_path: \"/data/tidb-deploy/tikv-20160/bin\""]
[2021/06/15 16:39:22.083 +08:00] [INFO] [node.rs:243] ["initializing replication mode"] [store_id=123045] [status=Some()]
[2021/06/15 16:39:22.083 +08:00] [INFO] [replication_mode.rs:51] ["associated store labels"] [labels="[]"] [store_id=4]
[2021/06/15 16:39:22.083 +08:00] [INFO] [replication_mode.rs:51] ["associated store labels"] [labels="[key: \"host\" value: \"tikv44\"]"] [store_id=5]
[2021/06/15 16:39:22.083 +08:00] [INFO] [replication_mode.rs:51] ["associated store labels"] [labels="[]"] [store_id=123045]
[2021/06/15 16:39:22.083 +08:00] [INFO] [replication_mode.rs:51] ["associated store labels"] [labels="[key: \"host\" value: \"tikv45\"]"] [store_id=1]
[2021/06/15 16:39:22.083 +08:00] [INFO] [node.rs:387] ["start raft store thread"] [store_id=123045]
[2021/06/15 16:39:22.084 +08:00] [INFO] [snap.rs:1137] ["Initializing SnapManager, encryption is enabled: false"]
[2021/06/15 16:39:22.185 +08:00] [INFO] [peer.rs:191] ["create peer"] [peer_id=411767325] [region_id=411767323]
[2021/06/15 16:39:22.190 +08:00] [INFO] [raft.rs:2443] ["switched to configuration"] [config="Configuration { voters: Configuration { incoming: Configuration { voters: {411767324, 411767325, 411767326} }, outgoing: Configuration { voters: {} } }, learners: {}, learners_next: {}, auto_leave: false }"] [raft_id=411767325] [region_id=411767323]
[2021/06/15 16:39:22.190 +08:00] [INFO] [raft.rs:1064] ["became follower at term 166"] [term=166] [raft_id=411767325] [region_id=411767323]
[2021/06/15 16:39:22.190 +08:00] [INFO] [raft.rs:375] [newRaft] [peers="Configuration { incoming: Configuration { voters: {411767324, 411767325, 411767326} }, outgoing: Configuration { voters: {} } }"] ["last term"=143] ["last index"=170] [applied=157] [commit=170] [term=166] [raft_id=411767325] [region_id=411767323]
[2021/06/15 16:39:22.190 +08:00] [INFO] [raw_node.rs:285] ["RawNode created with id 411767325."] [id=411767325] [raft_id=411767325] [region_id=411767323]
[2021/06/15 16:39:22.190 +08:00] [INFO] [peer.rs:191] ["create peer"] [peer_id=414801547] [region_id=414801545]
[2021/06/15 16:39:22.191 +08:00] [INFO] [raft.rs:2443] ["switched to configuration"] [config="Configuration { voters: Configuration { incoming: Configuration { voters: {414801548, 414801546, 414801547} }, outgoing: Configuration { voters: {} } }, learners: {}, learners_next: {}, auto_leave: false }"] [raft_id=414801547] [region_id=414801545]
[2021/06/15 16:39:22.191 +08:00] [INFO] [raft.rs:1064] ["became follower at term 94"] [term=94] [raft_id=414801547] [region_id=414801545]


First check whether you have enough disk space. If this is a test environment and the data is not important, you can set reserve-space to 0; otherwise, move to a larger disk. (A config sketch follows the link below.)
https://docs.pingcap.com/zh/tidb/stable/tikv-configuration-file#reserve-space
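A minimal sketch of applying that setting with tiup, assuming the cluster name hydee-tidb mentioned later in this thread (adjust names to your environment):

# edit the cluster config and set the TiKV reserve-space option
tiup cluster edit-config hydee-tidb
#   under server_configs -> tikv, add or confirm:
#     storage.reserve-space: 0MiB
# roll the change out to the TiKV nodes
tiup cluster reload hydee-tidb -R tikv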

Disk space is sufficient, and reserve-space is already set to 0 (you can see it in the log pasted above); it made no difference.
[root@localhost ~]# df -lh
Filesystem Size Used Avail Use% Mounted on
devtmpfs 3.9G 0 3.9G 0% /dev
tmpfs 3.9G 0 3.9G 0% /dev/shm
tmpfs 3.9G 305M 3.6G 8% /run
tmpfs 3.9G 0 3.9G 0% /sys/fs/cgroup
/dev/mapper/centos-root 37G 2.0G 36G 6% /
/dev/sda1 1014M 193M 822M 19% /boot
/dev/mapper/data-data 30G 7.0G 24G 24% /data
tmpfs 783M 0 783M 0% /run/user/0

  1. Please post the output of tiup cluster display <cluster-name>.
  2. If everything lives under the /data directory, this still feels like it could be a space issue. First check whether the other two nodes have anything that can be cleaned up. If all nodes store data under /data, is a single TiKV currently using about 8 GB?
  3. Please post the complete log from the most recent startup.

[root@localhost ~]# tiup cluster display hydee-tidb
Starting component cluster: /root/.tiup/components/cluster/v1.5.1/tiup-cluster display hydee-tidb
Cluster type: tidb
Cluster name: hydee-tidb
Cluster version: v5.0.2
Deploy user: tidb
SSH type: builtin
Dashboard URL: http://192.168.5.44:2379/dashboard
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir


192.168.5.46:9093 alertmanager 192.168.5.46 9093/9094 linux/x86_64 Up /data/tidb-data/alertmanager-9093 /data/tidb-deploy/alertmanager-9093
192.168.5.43:8300 cdc 192.168.5.43 8300 linux/x86_64 Up /data/tidb-data/cdc-8300 /data/tidb-deploy/cdc-8300
192.168.5.44:8300 cdc 192.168.5.44 8300 linux/x86_64 Up /data/tidb-data/cdc-8300 /data/tidb-deploy/cdc-8300
192.168.5.46:3000 grafana 192.168.5.46 3000 linux/x86_64 Up - /data/tidb-deploy/grafana-3000
192.168.5.43:2379 pd 192.168.5.43 2379/2380 linux/x86_64 Up|L /data/tidb-data/pd-2379 /data/tidb-deploy/pd-2379
192.168.5.44:2379 pd 192.168.5.44 2379/2380 linux/x86_64 Up|UI /data/tidb-data/pd-2379 /data/tidb-deploy/pd-2379
192.168.5.46:9090 prometheus 192.168.5.46 9090 linux/x86_64 Up /data/tidb-data/prometheus-9090 /data/tidb-deploy/prometheus-9090
192.168.5.43:4000 tidb 192.168.5.43 4000/10080 linux/x86_64 Up - /data/tidb-deploy/tidb-4000
192.168.5.46:4000 tidb 192.168.5.46 4000/10080 linux/x86_64 Up - /data/tidb-deploy/tidb-4000
192.168.5.44:20160 tikv 192.168.5.44 20160/20180 linux/x86_64 Up /data/tidb-data/tikv-20160 /data/tidb-deploy/tikv-20160
192.168.5.45:20160 tikv 192.168.5.45 20160/20180 linux/x86_64 Up /data/tidb-data/tikv-20160 /data/tidb-deploy/tikv-20160
192.168.5.49:20160 tikv 192.168.5.49 20160/20180 linux/x86_64 Down /data/tidb-data/tikv-20160 /data/tidb-deploy/tikv-20160

The other two nodes have more disk space, so no problem there.
Node 1:
[root@localhost ~]# df -lh
Filesystem Size Used Avail Use% Mounted on
devtmpfs 3.9G 0 3.9G 0% /dev
tmpfs 3.9G 0 3.9G 0% /dev/shm
tmpfs 3.9G 393M 3.5G 11% /run
tmpfs 3.9G 0 3.9G 0% /sys/fs/cgroup
/dev/mapper/centos-root 37G 2.0G 36G 6% /
/dev/sda1 1014M 193M 822M 19% /boot
/dev/mapper/data-data 80G 15G 66G 18% /data
tmpfs 783M 0 783M 0% /run/user/0

Node 2:
[root@localhost ~]# df -lh
Filesystem Size Used Avail Use% Mounted on
devtmpfs 3.9G 0 3.9G 0% /dev
tmpfs 3.9G 0 3.9G 0% /dev/shm
tmpfs 3.9G 409M 3.5G 11% /run
tmpfs 3.9G 0 3.9G 0% /sys/fs/cgroup
/dev/mapper/centos-root 37G 4.5G 33G 13% /
/dev/sda1 1014M 193M 822M 19% /boot
/dev/mapper/data-data 80G 11G 70G 14% /data
tmpfs 783M 0 783M 0% /run/user/0

tikv_20210616.log (944.0 KB) tikv_stderr.log (817.2 KB)

  1. Please post the /var/log/messages log.
  2. Please try manually creating a file in the data directory to confirm it is writable.
  3. Also check the output of df -i. (A sketch of these checks follows this list.)
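A rough sketch of those three checks, assuming the data directory /data/tidb-data/tikv-20160 shown in the config above:

# 1. look for OOM-killer or other kernel events around the crash time
grep -i -E "oom|out of memory|tikv" /var/log/messages | tail -n 50
# 2. confirm the data directory is writable
touch /data/tidb-data/tikv-20160/.write_test && rm /data/tidb-data/tikv-20160/.write_test
# 3. check inode usage on the data filesystem
df -i /data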

1. /var/log/messages seems to point to a memory allocation problem?
2. I can manually create files in the directory.
3. df -i output is below.

Jun 17 09:09:41 localhost systemd: tikv-20160.service holdoff time over, scheduling restart.
Jun 17 09:09:41 localhost systemd: Stopped tikv service.
Jun 17 09:09:41 localhost systemd: Started tikv service.
Jun 17 09:09:41 localhost run_tikv.sh: sync …
Jun 17 09:09:41 localhost run_tikv.sh: real    0m0.002s
Jun 17 09:09:41 localhost run_tikv.sh: user    0m0.001s
Jun 17 09:09:41 localhost run_tikv.sh: sys     0m0.000s
Jun 17 09:09:41 localhost run_tikv.sh: ok
Jun 17 09:09:43 localhost kernel: raftstore-12304 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
Jun 17 09:09:43 localhost kernel: raftstore-12304 cpuset=/ mems_allowed=0
Jun 17 09:09:43 localhost kernel: CPU: 2 PID: 25567 Comm: raftstore-12304 Kdump: loaded Tainted: G ------------ T 3.10.0-1127.13.1.el7.x86_64 #1
Jun 17 09:09:43 localhost kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
Jun 17 09:09:43 localhost kernel: Call Trace:
Jun 17 09:09:43 localhost kernel: [] dump_stack+0x19/0x1b
Jun 17 09:09:43 localhost kernel: [] dump_header+0x90/0x229
Jun 17 09:09:43 localhost kernel: [] ? mem_cgroup_reclaim+0x4e/0x120
Jun 17 09:09:43 localhost kernel: [] oom_kill_process+0x25e/0x3f0
Jun 17 09:09:43 localhost kernel: [] ? cpuset_mems_allowed_intersects+0x21/0x30
Jun 17 09:09:43 localhost kernel: [] mem_cgroup_oom_synchronize+0x546/0x570
Jun 17 09:09:43 localhost kernel: [] ? mem_cgroup_charge_common+0xc0/0xc0
Jun 17 09:09:43 localhost kernel: [] pagefault_out_of_memory+0x14/0x90
Jun 17 09:09:43 localhost kernel: [] mm_fault_error+0x6a/0x157
Jun 17 09:09:43 localhost kernel: [] __do_page_fault+0x491/0x500
Jun 17 09:09:43 localhost kernel: [] do_page_fault+0x35/0x90
Jun 17 09:09:43 localhost kernel: [] page_fault+0x28/0x30
Jun 17 09:09:43 localhost kernel: Task in /system.slice/tikv-20160.service killed as a result of limit of /system.slice/tikv-20160.service
Jun 17 09:09:43 localhost kernel: memory: usage 2097152kB, limit 2097152kB, failcnt 34415
Jun 17 09:09:43 localhost kernel: memory+swap: usage 2097152kB, limit 9007199254740988kB, failcnt 0
Jun 17 09:09:43 localhost kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Jun 17 09:09:43 localhost kernel: Memory cgroup stats for /system.slice/tikv-20160.service: cache:4KB rss:2097148KB rss_huge:194560KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:2097108KB inactive_file:4KB active_file:0KB unevictable:0KB
Jun 17 09:09:43 localhost kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Jun 17 09:09:43 localhost kernel: [25482] 1000 25482 798110 526966 1213 0 0 tikv-server
Jun 17 09:09:43 localhost kernel: Memory cgroup out of memory: Kill process 25593 (status-server) score 1007 or sacrifice child
Jun 17 09:09:43 localhost kernel: Killed process 25482 (tikv-server), UID 1000, total-vm:3192440kB, anon-rss:2094720kB, file-rss:13144kB, shmem-rss:0kB
Jun 17 09:09:43 localhost systemd: tikv-20160.service: main process exited, code=killed, status=9/KILL
Jun 17 09:09:43 localhost systemd: Unit tikv-20160.service entered failed state.
Jun 17 09:09:43 localhost systemd: tikv-20160.service failed.

[root@localhost log]# df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
devtmpfs 998197 368 997829 1% /dev
tmpfs 1001132 1 1001131 1% /dev/shm
tmpfs 1001132 556 1000576 1% /run
tmpfs 1001132 16 1001116 1% /sys/fs/cgroup
/dev/mapper/centos-root 19394560 40797 19353763 1% /
/dev/sda1 524288 334 523954 1% /boot
/dev/mapper/data-data 15726592 532 15726060 1% /data
tmpfs 1001132 1 1001131 1% /run/user/0

  1. Please find tikv.log and messages entries whose timestamps actually line up; this messages log is from the 17th.
  2. Is this a VMware virtual machine? If possible, provision a new machine with roughly the same disk capacity as the other two nodes, scale it out, and then scale in the problematic machine (see the command sketch below).
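For reference, the node-replacement flow would look roughly like this with tiup; the topology file name here is a placeholder:

# scale out a new TiKV node described in a topology file
tiup cluster scale-out hydee-tidb scale-out.yaml
# after the new store is up and regions have balanced, scale in the problematic node
tiup cluster scale-in hydee-tidb --node 192.168.5.49:20160
# watch until the old store finishes migrating regions and becomes Tombstone
tiup cluster display hydee-tidb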

1. Both of those logs keep scrolling; there are matching entries from the 16th too, see below.
2. These are Alibaba Cloud machines, and I have already done exactly that once; the node now reporting the error is the newly scaled-out machine. Node .46 originally reported this error, .49 is the new node I scaled out, .46 was scaled in, and now .49 reports the same error again.

Jun 16 00:01:01 localhost systemd: Started Session 3245 of user root.
Jun 16 00:01:03 localhost systemd: tikv-20160.service holdoff time over, scheduling restart.
Jun 16 00:01:03 localhost run_tikv.sh: sync …
Jun 16 00:01:03 localhost run_tikv.sh: real    0m0.002s
Jun 16 00:01:03 localhost run_tikv.sh: user    0m0.001s
Jun 16 00:01:03 localhost run_tikv.sh: sys     0m0.000s
Jun 16 00:01:03 localhost run_tikv.sh: ok
Jun 16 00:01:04 localhost kernel: raftstore-12304 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
Jun 16 00:01:04 localhost kernel: raftstore-12304 cpuset=/ mems_allowed=0
Jun 16 00:01:04 localhost kernel: Call Trace:
Jun 16 00:01:04 localhost kernel: [] dump_stack+0x19/0x1b
Jun 16 00:01:04 localhost kernel: [] mem_cgroup_oom_synchronize+0x546/0x570
Jun 16 00:01:04 localhost kernel: [] ? mem_cgroup_charge_common+0xc0/0xc0
Jun 16 00:01:04 localhost kernel: [] pagefault_out_of_memory+0x14/0x90
Jun 16 00:01:04 localhost kernel: [] mm_fault_error+0x6a/0x157
Jun 16 00:01:04 localhost kernel: [] __do_page_fault+0x491/0x500
Jun 16 00:01:04 localhost kernel: [] do_page_fault+0x35/0x90
Jun 16 00:01:04 localhost kernel: [] page_fault+0x28/0x30
Jun 16 00:01:04 localhost kernel: [] ? cpuset_mems_allowed_intersects+0x21/0x30
Jun 16 00:01:04 localhost kernel: [] mem_cgroup_oom_synchronize+0x546/0x570
Jun 16 00:01:04 localhost kernel: [] ? mem_cgroup_charge_common+0xc0/0xc0
Jun 16 00:01:04 localhost kernel: [] pagefault_out_of_memory+0x14/0x90
Jun 16 00:01:04 localhost kernel: [] mm_fault_error+0x6a/0x157
Jun 16 00:01:04 localhost kernel: [] __do_page_fault+0x491/0x500
Jun 16 00:01:04 localhost kernel: [] do_page_fault+0x35/0x90
Jun 16 00:01:04 localhost kernel: [] page_fault+0x28/0x30
Jun 16 00:01:04 localhost kernel: Task in /system.slice/tikv-20160.service killed as a result of limit of /system.slice/tikv-20160.service
Jun 16 00:01:04 localhost kernel: memory: usage 2097152kB, limit 2097152kB, failcnt 35932
Jun 16 00:01:04 localhost kernel: memory+swap: usage 2097152kB, limit 9007199254740988kB, failcnt 0
Jun 16 00:01:04 localhost kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Jun 16 00:01:04 localhost kernel: Memory cgroup stats for /system.slice/tikv-20160.service: cache:12KB rss:2097056KB rss_huge:239616KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:2096936KB inactive_file:8KB active_file:4KB unevictable:0KB
Jun 16 00:01:04 localhost kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Jun 16 00:01:04 localhost kernel: [ 1611] 1000 1611 802205 527041 1211 0 0 tikv-server
Jun 16 00:01:04 localhost kernel: Memory cgroup out of memory: Kill process 1722 (status-server) score 1007 or sacrifice child
Jun 16 00:01:04 localhost kernel: Killed process 1611 (tikv-server), UID 1000, total-vm:3208820kB, anon-rss:2095096kB, file-rss:13068kB, shmem-rss:0kB
Jun 16 00:01:05 localhost systemd: tikv-20160.service: main process exited, code=killed, status=9/KILL
Jun 16 00:01:05 localhost systemd: Unit tikv-20160.service entered failed state.
Jun 16 00:01:05 localhost systemd: tikv-20160.service failed.
Jun 16 00:01:20 localhost systemd: tikv-20160.service holdoff time over, scheduling restart.
Jun 16 00:01:20 localhost systemd: Stopped tikv service.
Jun 16 00:01:20 localhost systemd: Started tikv service.
Jun 16 00:01:20 localhost run_tikv.sh: sync …
Jun 16 00:01:20 localhost run_tikv.sh: real    0m0.002s
Jun 16 00:01:20 localhost run_tikv.sh: user    0m0.001s
Jun 16 00:01:20 localhost run_tikv.sh: sys     0m0.000s
Jun 16 00:01:20 localhost run_tikv.sh: ok
Jun 16 00:01:22 localhost kernel: raftstore-12304 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
Jun 16 00:01:22 localhost kernel: raftstore-12304 cpuset=/ mems_allowed=0
Jun 16 00:01:22 localhost kernel: CPU: 2 PID: 1823 Comm: raftstore-12304 Kdump: loaded Tainted: G ------------ T 3.10.0-1127.13.1.el7.x86_64 #1
Jun 16 00:01:22 localhost kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014

  1. Is the memory limit really only 2 GB?
  2. Check the limit configured in the /system.slice/tikv-20160.service unit (a sketch of how to inspect and raise it follows).
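A sketch of how that limit could be checked and raised on the affected host (cgroup v1 on CentOS 7, as in the kernel log; the 2097152 kB in the OOM message is exactly 2 GiB):

# current cgroup memory limit of the systemd unit
systemctl show tikv-20160 -p MemoryLimit
cat /sys/fs/cgroup/memory/system.slice/tikv-20160.service/memory.limit_in_bytes
# see where MemoryLimit= is configured
systemctl cat tikv-20160
# raise it via a drop-in instead of editing the generated unit file directly,
# e.g. add under [Service]:  MemoryLimit=5G
systemctl edit tikv-20160
systemctl daemon-reload && systemctl restart tikv-20160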

I changed it to 5 GB; the error changed to the following, and it still won't start:

Jun 17 14:42:36 localhost systemd: tikv-20160.service holdoff time over, scheduling restart.
Jun 17 14:42:36 localhost systemd: Stopped tikv service.
Jun 17 14:42:36 localhost systemd: Started tikv service.
Jun 17 14:42:36 localhost run_tikv.sh: sync …
Jun 17 14:42:36 localhost run_tikv.sh: real    0m0.009s
Jun 17 14:42:36 localhost run_tikv.sh: user    0m0.000s
Jun 17 14:42:36 localhost run_tikv.sh: sys     0m0.001s
Jun 17 14:42:36 localhost run_tikv.sh: ok
Jun 17 14:42:37 localhost systemd: tikv-20160.service: main process exited, code=killed, status=6/ABRT
Jun 17 14:42:37 localhost systemd: Unit tikv-20160.service entered failed state.
Jun 17 14:42:37 localhost systemd: tikv-20160.service failed.
Jun 17 14:42:53 localhost systemd: tikv-20160.service holdoff time over, scheduling restart.
Jun 17 14:42:53 localhost systemd: Stopped tikv service.
Jun 17 14:42:53 localhost systemd: Started tikv service.
Jun 17 14:42:53 localhost run_tikv.sh: sync …
Jun 17 14:42:53 localhost run_tikv.sh: real    0m0.007s
Jun 17 14:42:53 localhost run_tikv.sh: user    0m0.001s
Jun 17 14:42:53 localhost run_tikv.sh: sys     0m0.000s
Jun 17 14:42:53 localhost run_tikv.sh: ok
Jun 17 14:42:54 localhost systemd: tikv-20160.service: main process exited, code=killed, status=6/ABRT
Jun 17 14:42:54 localhost systemd: Unit tikv-20160.service entered failed state.
Jun 17 14:42:54 localhost systemd: tikv-20160.service failed.

Also, tikv_stderr.log keeps reporting errors like the ones below. There is clearly enough memory, so why do these allocations fail, and for such small amounts?
memory allocation of 24124 bytes failed
: Malformed conf string
: Malformed conf string
memory allocation of 72847 bytes failed
: Malformed conf string
: Malformed conf string
memory allocation of 69762 bytes failed
: Malformed conf string
: Malformed conf string
: Malformed conf string
: Malformed conf string
memory allocation of 69762 bytes failed
: Malformed conf string
: Malformed conf string
memory allocation of 1880 bytes failed
: Malformed conf string
: Malformed conf string
memory allocation of 1616 bytes failed
: Malformed conf string
: Malformed conf string
memory allocation of 752 bytes failed

  1. Check how this unit file is configured on the two nodes that do start, and see how much memory they are allowed.
  2. Are the machine specs the same? How much memory does this VM have? Try adding a machine with the same size as the other two nodes.
     Don't run TiKV on nodes with different configurations.

It came up after increasing the memory: the machine originally had 8 GB, and after raising it to 16 GB this TiKV node started.
Why it went down in the first place is still unclear. After it went down, restarting required a lot of memory (about 5 GB in my case); together with everything else on the host, the node could not start on a machine with only 8 GB.
This problem is resolved.
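A follow-up thought, not verified in this thread: if the host has to stay small, the largest tunable TiKV memory consumer is the shared block cache (1331MiB in the config pasted above), and it can be capped explicitly, for example:

tiup cluster edit-config hydee-tidb
#   under server_configs -> tikv:
#     storage.block-cache.capacity: "1GiB"
tiup cluster reload hydee-tidb -R tikv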

Mine is also a newly scaled-out instance: the TiKV node has 8 cores, 48 GB of memory, and 1.5 TB of storage, and it reports the same error. What could be the cause?

It's best to open a new topic for that.

This topic was automatically closed 1 minute after the last reply. New replies are no longer allowed.