TiKV deployment error

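【TiDB environment】Production
【TiDB version】v6.5.0
【Reproduction steps】Deployed k8s on China Unicom Cloud servers, then applied tidb-cluster.yaml and got the error below
【Resource configuration】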

kubectl get nodes -o wide
NAME       STATUS   ROLES                  AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
master01   Ready    control-plane,master   8d      v1.23.4   10.128.0.24    <none>        CentOS Linux 7 (Core)   3.10.0-1160.88.1.el7.x86_64   docker://20.10.21
master02   Ready    control-plane,master   8d      v1.23.4   10.128.0.25    <none>        CentOS Linux 7 (Core)   3.10.0-1160.88.1.el7.x86_64   docker://20.10.21
node03     Ready    <none>                 13h     v1.23.4   10.128.0.202   <none>        CentOS Linux 7 (Core)   5.4.238-1.el7.elrepo.x86_64   docker://20.10.21
node04     Ready    <none>                 6d20h   v1.23.4   10.128.0.203   <none>        CentOS Linux 7 (Core)   5.4.238-1.el7.elrepo.x86_64   docker://20.10.21
node05     Ready    <none>                 6d20h   v1.23.4   10.128.0.205   <none>        CentOS Linux 7 (Core)   3.10.0-1160.88.1.el7.x86_64   docker://20.10.21
node06     Ready    <none>                 6d20h   v1.23.4   10.128.0.206   <none>        CentOS Linux 7 (Core)   3.10.0-1160.88.1.el7.x86_64   docker://20.10.21

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256943
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 655350
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 655350
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

cat /etc/sysctl.d/kubernetes.conf
net.bridge.bridge-nf-call-iptables=1
net.bridge.bridge-nf-call-ip6tables=1
net.ipv4.ip_forward = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.conf.all.secure_redirects = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
kernel.msgmnb = 655360
kernel.msgmax = 655360
kernel.shmmax = 68719476736
kernel.shmall = 4294967296
net.ipv4.tcp_max_tw_buckets = 6000
net.ipv4.tcp_sack = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 16384 4194304
net.core.wmem_default = 8388608
net.core.rmem_default = 8388608
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.netdev_max_backlog = 262144
net.core.somaxconn = 262144
net.ipv4.tcp_max_orphans = 3276800
net.ipv4.tcp_max_syn_backlog = 262144
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_mem = 94500000 915000000 927000000
net.ipv4.tcp_fin_timeout = 1
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_orphan_retries=3
net.ipv4.ip_local_port_range = 1024 65500
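
The file above sets net.core.somaxconn = 262144 on the host, yet the TiKV log further down reports 128. As far as I know, net.core.somaxconn is scoped per network namespace, so a Pod does not inherit the host value, and TiKV checks the value inside its container. A minimal Python sketch for comparing sysctl-style text against the thresholds TiKV v6.5 warns about (EXPECTED, parse_sysctl and check_kernel are hypothetical names). The vm.swappiness and net.core.somaxconn thresholds come from the WARN lines in the log; the net.ipv4.tcp_syncookies = 0 rule is an assumption and does not appear in this log.

```python
from typing import Dict, List

# Minimal sketch: compare sysctl-style "key = value" text against the kernel
# parameters TiKV v6.5 warns about at startup. Thresholds for vm.swappiness and
# net.core.somaxconn come from the WARN lines in the log; the
# net.ipv4.tcp_syncookies rule is an ASSUMPTION not shown in this log.
EXPECTED = {
    "vm.swappiness": (0, "=="),
    "net.core.somaxconn": (32768, ">="),
    "net.ipv4.tcp_syncookies": (0, "=="),  # assumption
}


def parse_sysctl(text: str) -> Dict[str, str]:
    """Parse 'key = value' lines, skipping blanks and '#' / ';' comments."""
    params: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            params[key.strip()] = value.strip()
    return params


def check_kernel(params: Dict[str, str]) -> List[str]:
    """Return one message per expected parameter that is missing or off."""
    problems: List[str] = []
    for key, (expect, op) in EXPECTED.items():
        if key not in params:
            problems.append(f"{key} not set, expect {op} {expect}")
            continue
        got = int(params[key].split()[0])  # first field if multi-valued
        ok = got == expect if op == "==" else got >= expect
        if not ok:
            problems.append(f"{key} got {got}, expect {op} {expect}")
    return problems


if __name__ == "__main__":
    # Two lines from the host file above; vm.swappiness is not set there.
    sample = """\
net.core.somaxconn = 262144
net.ipv4.tcp_syncookies = 1
"""
    for message in check_kernel(parse_sysctl(sample)):
        print(message)
```

Feeding it the output of `sysctl -a` run inside the TiKV container, rather than the host file, would be the more meaningful comparison.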

【Attachments: screenshots/logs/monitoring】

kubectl logs -f advanced-tidb-tikv-0 -n advanced-tidb
starting tikv-server ...
/tikv-server --pd=http://advanced-tidb-pd:2379 --advertise-addr=advanced-tidb-tikv-0.advanced-tidb-tikv-peer.advanced-tidb.svc:20160 --addr=0.0.0.0:20160 --status-addr=0.0.0.0:20180 --advertise-status-addr=advanced-tidb-tikv-0.advanced-tidb-tikv-peer.advanced-tidb.svc:20180 --data-dir=/var/lib/tikv --capacity=0 --config=/etc/tikv/tikv.toml

[2023/03/24 10:44:42.075 +08:00] [INFO] [lib.rs:85] ["Welcome to TiKV"]
[2023/03/24 10:44:42.076 +08:00] [INFO] [lib.rs:90] ["Release Version:   6.5.0"]
[2023/03/24 10:44:42.076 +08:00] [INFO] [lib.rs:90] ["Edition:           Community"]
[2023/03/24 10:44:42.076 +08:00] [INFO] [lib.rs:90] ["Git Commit Hash:   47b81680f75adc4b7200480cea5dbe46ae07c4b5"]
[2023/03/24 10:44:42.076 +08:00] [INFO] [lib.rs:90] ["Git Commit Branch: heads/refs/tags/v6.5.0"]
[2023/03/24 10:44:42.076 +08:00] [INFO] [lib.rs:90] ["UTC Build Time:    Unknown (env var does not exist when building)"]
[2023/03/24 10:44:42.076 +08:00] [INFO] [lib.rs:90] ["Rust Version:      rustc 1.67.0-nightly (96ddd32c4 2022-11-14)"]
[2023/03/24 10:44:42.076 +08:00] [INFO] [lib.rs:90] ["Enable Features:   pprof-fp jemalloc mem-profiling portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine cloud-aws cloud-gcp cloud-azure"]
[2023/03/24 10:44:42.076 +08:00] [INFO] [lib.rs:90] ["Profile:           dist_release"]
[2023/03/24 10:44:42.076 +08:00] [INFO] [mod.rs:79] ["cgroup quota: memory=Some(9223372036854771712), cpu=None, cores={12, 2, 1, 0, 8, 3, 9, 11, 4, 15, 10, 6, 7, 14, 5, 13}"]
[2023/03/24 10:44:42.079 +08:00] [INFO] [mod.rs:86] ["memory limit in bytes: 68995201024, cpu cores quota: 16"]
[2023/03/24 10:44:42.079 +08:00] [WARN] [server.rs:1877] ["check: kernel"] [err="kernel parameters net.core.somaxconn got 128, expect 32768"]
[2023/03/24 10:44:42.079 +08:00] [WARN] [server.rs:1877] ["check: kernel"] [err="check_kernel_params failed No such file or directory (os error 2)"]
[2023/03/24 10:44:42.079 +08:00] [WARN] [server.rs:1877] ["check: kernel"] [err="kernel parameters vm.swappiness got 30, expect 0"]
[2023/03/24 10:44:42.089 +08:00] [INFO] [util.rs:598] ["connecting to PD endpoint"] [endpoints=http://advanced-tidb-pd:2379]
[2023/03/24 10:44:42.091 +08:00] [INFO] [<unknown>] ["Disabling AF_INET6 sockets because ::1 is not available."]
[2023/03/24 10:44:42.092 +08:00] [INFO] [<unknown>] ["TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter"]
[2023/03/24 10:44:42.096 +08:00] [INFO] [util.rs:598] ["connecting to PD endpoint"] [endpoints=http://advanced-tidb-pd-0.advanced-tidb-pd-peer.advanced-tidb.svc:2379]
[2023/03/24 10:44:42.099 +08:00] [INFO] [util.rs:598] ["connecting to PD endpoint"] [endpoints=http://advanced-tidb-pd-1.advanced-tidb-pd-peer.advanced-tidb.svc:2379]
[2023/03/24 10:44:42.102 +08:00] [INFO] [util.rs:763] ["connected to PD member"] [endpoints=http://advanced-tidb-pd-1.advanced-tidb-pd-peer.advanced-tidb.svc:2379]
[2023/03/24 10:44:42.103 +08:00] [INFO] [util.rs:590] ["all PD endpoints are consistent"] [endpoints="[\"http://advanced-tidb-pd:2379\"]"]
[2023/03/24 10:44:42.108 +08:00] [INFO] [server.rs:461] ["connect to PD cluster"] [cluster_id=7213282108953497471]
[2023/03/24 10:44:42.110 +08:00] [INFO] [config.rs:2170] ["readpool.storage.use-unified-pool is not set, set to true by default"]
[2023/03/24 10:44:42.110 +08:00] [INFO] [config.rs:2193] ["readpool.coprocessor.use-unified-pool is not set, set to true by default"]
[2023/03/24 10:44:42.207 +08:00] [INFO] [server.rs:1885] ["beginning system configuration check"]
[2023/03/24 10:44:42.207 +08:00] [FATAL] [server.rs:1896] ["the maximum number of open file descriptors is too small, got 65536, expect greater or equal to 82920"]
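
The FATAL line says the TiKV process inside the container saw an open-file limit of 65536, while the hosts' shells report 655350 or more, so the container evidently does not get the host shell's limit. The required 82920 is consistent with 1000 reserved descriptors plus rocksdb.max-open-files and raftdb.max-open-files at 40960 each; treat that breakdown as an assumption based on TiKV's defaults, not something this log states. A minimal Python sketch that reproduces the arithmetic and classifies a (soft, hard) limit pair; required_fds, evaluate and the constants are hypothetical names.

```python
import resource

# Minimal sketch of the check behind "expect greater or equal to 82920".
# ASSUMPTION: TiKV reserves 1000 descriptors and adds rocksdb.max-open-files
# and raftdb.max-open-files (both assumed to default to 40960). This breakdown
# is inferred from TiKV defaults; it is not printed in the log.
RESERVED_OPEN_FDS = 1000
DEFAULT_MAX_OPEN_FILES = 40960


def required_fds(rocksdb_max_open_files: int = DEFAULT_MAX_OPEN_FILES,
                 raftdb_max_open_files: int = DEFAULT_MAX_OPEN_FILES) -> int:
    """Open file descriptors TiKV would ask for under the assumed formula."""
    return RESERVED_OPEN_FDS + rocksdb_max_open_files + raftdb_max_open_files


def _unlimited(value: int) -> bool:
    """True if getrlimit() reported RLIM_INFINITY for this value."""
    return value == resource.RLIM_INFINITY


def evaluate(soft: int, hard: int, expect: int) -> str:
    """Classify an RLIMIT_NOFILE (soft, hard) pair for an unprivileged process."""
    if _unlimited(soft) or soft >= expect:
        return "ok"
    if _unlimited(hard) or hard >= expect:
        # An unprivileged process may raise its soft limit up to the hard limit.
        return "raisable"
    # Raising the hard limit needs CAP_SYS_RESOURCE, which containers
    # do not get by default, so the process cannot fix this itself.
    return "too small"


if __name__ == "__main__":
    need = required_fds()
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"need {need}, soft={soft}, hard={hard}: {evaluate(soft, hard, need)}")
```

Run inside a container on the affected node (for example a throwaway Pod using the same container runtime), the `__main__` part would show the limit containers actually receive there, which is the number TiKV compares against.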

cat /etc/systemd/system/docker.service.d/limit-nofile.conf
[Service]
LimitNOFILE=1048576

I've already changed this as well :rofl:

What about adding this line to the docker.service settings?
LimitNPROC=1048576

Changed that too, still doesn't work.

open files (-n) 655350
That's too small; it requires at least 82920.

Isn't 655350 bigger than 82920? :joy:

Are all the nodes set to 655350?

ansible k8s -m shell -a "ulimit -n"
node05 | CHANGED | rc=0 >>
655350
master01 | CHANGED | rc=0 >>
655350
master02 | CHANGED | rc=0 >>
655350
node03 | CHANGED | rc=0 >>
1048576
node04 | CHANGED | rc=0 >>
1048576
node06 | CHANGED | rc=0 >>
655350
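
Every host reports at least 655350, well above 82920, so the shell limit alone does not explain the 65536 seen by TiKV. Note that `ulimit -n` in an Ansible shell task reflects the limit of that remote login session, not necessarily the limit the container runtime hands to the TiKV container. A minimal Python sketch for checking such output across many nodes; parse_ansible_ulimit and hosts_below are hypothetical names, and the regex assumes the `host | CHANGED | rc=0 >>` header layout shown above.

```python
import re
from typing import Dict, List, Optional

# Minimal sketch: parse `ansible <group> -m shell -a "ulimit -n"` output into a
# host -> limit map. ASSUMPTION: each result starts with a
# "<host> | <STATUS> | rc=<n> >>" header line, as in the output above.
HEADER = re.compile(r"^(\S+) \| \w+ \| rc=\d+ >>$")


def parse_ansible_ulimit(text: str) -> Dict[str, Optional[int]]:
    """Map each host to its `ulimit -n` value (None means "unlimited")."""
    limits: Dict[str, Optional[int]] = {}
    host = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            host = header.group(1)
        elif host is not None and line:
            limits[host] = None if line == "unlimited" else int(line)
            host = None  # only the first value line belongs to this host
    return limits


def hosts_below(limits: Dict[str, Optional[int]], expect: int) -> List[str]:
    """Hosts whose limit is finite and smaller than `expect`, sorted by name."""
    return sorted(h for h, v in limits.items() if v is not None and v < expect)


if __name__ == "__main__":
    sample = """\
node05 | CHANGED | rc=0 >>
655350
node03 | CHANGED | rc=0 >>
1048576
"""
    limits = parse_ansible_ulimit(sample)
    print(limits)
    print(hosts_below(limits, 82920))
```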

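This looks quite similar to the following thread, which may be worth a look: "Deploying TiDB v5.4.1 on k8s, TiKV stays in CrashLoopBackOff with a connection-limit error; settings were changed but the problem is not solved"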