Help locating the cause of a TiKV startup failure

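[TiDB environment] Production
[TiDB version] v7.1.0
[Reproduction path] What operations led to the problem
The TiDB cluster is deployed on VMs. When I started the cluster today, it suddenly failed to start. The startup error says to check the TiKV log, but the TiKV log shows no obvious error. Please help locate the cause.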

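[Problem: symptoms and impact]
TiKV cannot start, and the log shows no obvious error.
[Resource configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: screenshots/logs/monitoring]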

The startup log is below:
[2023/10/31 20:04:29.670 +08:00] [INFO] [lib.rs:88] [“Welcome to TiKV”]
[2023/10/31 20:04:29.674 +08:00] [INFO] [lib.rs:93] [“Release Version: 7.1.0”]
[2023/10/31 20:04:29.674 +08:00] [INFO] [lib.rs:93] [“Edition: Community”]
[2023/10/31 20:04:29.674 +08:00] [INFO] [lib.rs:93] [“Git Commit Hash: 0c34464e386940a60f2a2ce279a4ef18c9c6c45b”]
[2023/10/31 20:04:29.674 +08:00] [INFO] [lib.rs:93] [“Git Commit Branch: heads/refs/tags/v7.1.0”]
[2023/10/31 20:04:29.674 +08:00] [INFO] [lib.rs:93] [“UTC Build Time: Unknown (env var does not exist when building)”]
[2023/10/31 20:04:29.674 +08:00] [INFO] [lib.rs:93] [“Rust Version: rustc 1.67.0-nightly (96ddd32c4 2022-11-14)”]
[2023/10/31 20:04:29.674 +08:00] [INFO] [lib.rs:93] [“Enable Features: pprof-fp jemalloc mem-profiling portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine cloud-aws cloud-gcp cloud-azure”]
[2023/10/31 20:04:29.674 +08:00] [INFO] [lib.rs:93] [“Profile: dist_release”]
[2023/10/31 20:04:29.674 +08:00] [INFO] [mod.rs:80] [“cgroup quota: memory=Some(9223372036854771712), cpu=None, cores={8, 0, 5, 14, 9, 13, 10, 1, 6, 4, 3, 15, 11, 12, 7, 2}”]
[2023/10/31 20:04:29.674 +08:00] [INFO] [mod.rs:87] [“memory limit in bytes: 33547878400, cpu cores quota: 16”]
[2023/10/31 20:04:29.674 +08:00] [WARN] [lib.rs:544] [“environment variable TZ is missing, using /etc/localtime”]
[2023/10/31 20:04:29.674 +08:00] [WARN] [server.rs:1511] [“check: kernel”] [err=“kernel parameters net.core.somaxconn got 128, expect 32768”]
[2023/10/31 20:04:29.674 +08:00] [WARN] [server.rs:1511] [“check: kernel”] [err=“kernel parameters net.ipv4.tcp_syncookies got 1, expect 0”]
[2023/10/31 20:04:29.674 +08:00] [WARN] [server.rs:1511] [“check: kernel”] [err=“kernel parameters vm.swappiness got 30, expect 0”]
[2023/10/31 20:04:29.681 +08:00] [INFO] [util.rs:604] [“connecting to PD endpoint”] [endpoints=192.168.1.201:2379]
[2023/10/31 20:04:29.685 +08:00] [INFO] [] [“TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter”]
[2023/10/31 20:04:31.686 +08:00] [INFO] [util.rs:566] [“PD failed to respond”] [err=“Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: "Deadline Exceeded", details: [] }))”] [endpoints=192.168.1.201:2379]
[2023/10/31 20:04:31.686 +08:00] [INFO] [util.rs:604] [“connecting to PD endpoint”] [endpoints=192.168.1.202:2379]
[2023/10/31 20:04:31.687 +08:00] [INFO] [] [“subchannel 0x7f9e5864f000 {address=ipv4:192.168.1.202:2379, args=grpc.client_channel_factory=0x7f9e586979b0, grpc.default_authority=192.168.1.202:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f9e586389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f9e586b5cc0, grpc.server_uri=dns:///192.168.1.202:2379}: connect failed: {"created":"@1698753871.687002792","description":"Failed to connect to remote host: Connection refused","errno":111,"file":"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.10.3+1.44.0-patched/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":200,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:192.168.1.202:2379"}”]
[2023/10/31 20:04:31.687 +08:00] [INFO] [] [“subchannel 0x7f9e5864f000 {address=ipv4:192.168.1.202:2379, args=grpc.client_channel_factory=0x7f9e586979b0, grpc.default_authority=192.168.1.202:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f9e586389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f9e586b5cc0, grpc.server_uri=dns:///192.168.1.202:2379}: Retry in 999 milliseconds”]
[2023/10/31 20:04:31.687 +08:00] [INFO] [util.rs:566] [“PD failed to respond”] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "failed to connect to all addresses", details: [] }))”] [endpoints=192.168.1.202:2379]
[2023/10/31 20:04:31.687 +08:00] [INFO] [util.rs:604] [“connecting to PD endpoint”] [endpoints=192.168.1.203:2379]
[2023/10/31 20:04:31.688 +08:00] [INFO] [] [“subchannel 0x7f9e5864f400 {address=ipv4:192.168.1.203:2379, args=grpc.client_channel_factory=0x7f9e586979b0, grpc.default_authority=192.168.1.203:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f9e586389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f9e586b5cc0, grpc.server_uri=dns:///192.168.1.203:2379}: connect failed: {"created":"@1698753871.688366860","description":"Failed to connect to remote host: Connection refused","errno":111,"file":"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.10.3+1.44.0-patched/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":200,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:192.168.1.203:2379"}”]
[2023/10/31 20:04:31.688 +08:00] [INFO] [] [“subchannel 0x7f9e5864f400 {address=ipv4:192.168.1.203:2379, args=grpc.client_channel_factory=0x7f9e586979b0, grpc.default_authority=192.168.1.203:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f9e586389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f9e586b5cc0, grpc.server_uri=dns:///192.168.1.203:2379}: Retry in 999 milliseconds”]
[2023/10/31 20:04:31.688 +08:00] [INFO] [util.rs:566] [“PD failed to respond”] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "failed to connect to all addresses", details: [] }))”] [endpoints=192.168.1.203:2379]
[2023/10/31 20:04:31.688 +08:00] [WARN] [client.rs:166] [“validate PD endpoints failed”] [err=“Other("[components/pd_client/src/util.rs:599]: PD cluster failed to respond")”]
[2023/10/31 20:04:31.991 +08:00] [INFO] [util.rs:604] [“connecting to PD endpoint”] [endpoints=192.168.1.201:2379]
[2023/10/31 20:04:33.993 +08:00] [INFO] [util.rs:566] [“PD failed to respond”] [err=“Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: "Deadline Exceeded", details: [] }))”] [endpoints=192.168.1.201:2379]
[2023/10/31 20:04:33.993 +08:00] [INFO] [util.rs:604] [“connecting to PD endpoint”] [endpoints=192.168.1.202:2379]
[2023/10/31 20:04:33.994 +08:00] [INFO] [] [“subchannel 0x7f9e58791000 {address=ipv4:192.168.1.202:2379, args=grpc.client_channel_factory=0x7f9e586979b0, grpc.default_authority=192.168.1.202:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f9e586389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f9e586b5cc0, grpc.server_uri=dns:///192.168.1.202:2379}: connect failed: {"created":"@1698753873.994289323","description":"Failed to connect to remote host: Connection refused","errno":111,"file":"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.10.3+1.44.0-patched/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":200,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:192.168.1.202:2379"}”]
[2023/10/31 20:04:33.994 +08:00] [INFO] [] [“subchannel 0x7f9e58791000 {address=ipv4:192.168.1.202:2379, args=grpc.client_channel_factory=0x7f9e586979b0, grpc.default_authority=192.168.1.202:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f9e586389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f9e586b5cc0, grpc.server_uri=dns:///192.168.1.202:2379}: Retry in 998 milliseconds”]
[2023/10/31 20:04:33.994 +08:00] [INFO] [util.rs:566] [“PD failed to respond”] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "failed to connect to all addresses", details: [] }))”] [endpoints=192.168.1.202:2379]
[2023/10/31 20:04:33.994 +08:00] [INFO] [util.rs:604] [“connecting to PD endpoint”] [endpoints=192.168.1.203:2379]
[2023/10/31 20:04:33.995 +08:00] [INFO] [] [“subchannel 0x7f9e58791400 {address=ipv4:192.168.1.203:2379, args=grpc.client_channel_factory=0x7f9e586979b0, grpc.default_authority=192.168.1.203:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f9e586389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f9e586b5cc0, grpc.server_uri=dns:///192.168.1.203:2379}: connect failed: {"created":"@1698753873.995013352","description":"Failed to connect to remote host: Connection refused","errno":111,"file":"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.10.3+1.44.0-patched/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":200,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:192.168.1.203:2379"}”]
[2023/10/31 20:04:33.995 +08:00] [INFO] [] [“subchannel 0x7f9e58791400 {address=ipv4:192.168.1.203:2379, args=grpc.client_channel_factory=0x7f9e586979b0, grpc.default_authority=192.168.1.203:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f9e586389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f9e586b5cc0, grpc.server_uri=dns:///192.168.1.203:2379}: Retry in 1000 milliseconds”]
[2023/10/31 20:04:33.995 +08:00] [INFO] [util.rs:566] [“PD failed to respond”] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "failed to connect to all addresses", details: [] }))”] [endpoints=192.168.1.203:2379]
[2023/10/31 20:04:34.296 +08:00] [INFO] [util.rs:604] [“connecting to PD endpoint”] [endpoints=192.168.1.201:2379]

The TiKV on 201 cannot reach the PD on 203. Check whether there is some network restriction.
Also, did you deploy only 2 PDs?

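According to the error log, TiKV cannot establish a connection to PD.

  • Check whether the PD cluster is healthy, and look at the PD logs;
  • Check network connectivity between TiKV and PD, and also between every pair of nodes in the cluster;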
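The connectivity check above can be sketched in Python. This is a minimal sketch, not an official tool: `probe_tcp` and `probe_all` are hypothetical helper names, and the endpoints passed in are whatever PD addresses the cluster is configured with. It tests the TCP port itself, because a successful ping only shows that the host answers ICMP and says nothing about whether something is listening on 2379.

```python
import socket

def probe_tcp(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify one TCP connection attempt as open, refused, timeout, or error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on this port.
        return "refused"
    except socket.timeout:
        # No answer within the timeout: packets dropped or the peer is stuck.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"

def probe_all(endpoints, timeout: float = 2.0) -> dict:
    """Probe every "host:port" endpoint and return a mapping of endpoint to result."""
    results = {}
    for endpoint in endpoints:
        host, port = endpoint.rsplit(":", 1)
        results[endpoint] = probe_tcp(host, int(port), timeout)
    return results
```

Running `probe_all(["192.168.1.201:2379", "192.168.1.202:2379", "192.168.1.203:2379"])` from the TiKV host would show each PD port separately. A `refused` result matches the `Connection refused` (errno 111) entries in the log and usually means no process is listening on that port, which points at the PD process rather than at a firewall.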

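The firewalls are all off and ping works, so I don't know why it can't connect. Everything was fine before, and the problem appeared suddenly today.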

The firewalls are all off and ping works. I don't know why it can't connect, since everything was fine before.

Can you SSH to that node?

Everything else (SSH, FTP, ping, and so on) works normally. Only the startup fails.

Run tiup cluster display to check the cluster status. Could it be that the 2 PDs failed to elect a leader?
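Besides `tiup cluster display`, the leader question can be checked against PD directly. The sketch below assumes that PD serves its HTTP API over plain HTTP on the client port, that `/pd/api/v1/members` returns JSON, and that the JSON has a `leader` object with a `name` field. The helper names `fetch_members` and `leader_name` are made up for illustration.

```python
import json
import urllib.request

def fetch_members(pd_addr: str, timeout: float = 3.0) -> dict:
    """Fetch the member list from one PD node (assumes plain HTTP, no TLS)."""
    url = f"http://{pd_addr}/pd/api/v1/members"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

def leader_name(members: dict):
    """Return the leader's name, or None when the response carries no leader."""
    leader = members.get("leader") or {}
    return leader.get("name") or None
```

For example, `leader_name(fetch_members("192.168.1.201:2379"))` returning `None`, or the request itself failing, would both suggest that PD is not serving yet. That would be worth ruling out before looking further at TiKV.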

PD probably never came up. Check the system time on all nodes.

Can you check whether the PD and TiKV ports are already in use?

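The cluster starts with only 2 PDs, on 201 and 203, yet the log still tries to connect to a PD at 202:2379. First find out how 202 went away. Did you perform any operations on it?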
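To see at a glance which PD endpoints TiKV is trying and how each one fails, the `PD failed to respond` lines can be tallied. This is a rough sketch that assumes the log format shown in the original post. The regular expression and the name `summarize_pd_failures` are illustrative only.

```python
import re
from collections import Counter

# Matches lines such as:
#   ["PD failed to respond"] [err="... code: 14-UNAVAILABLE ..."] [endpoints=192.168.1.202:2379]
FAILED = re.compile(
    r"PD failed to respond.*?code: (\d+-[A-Z_]+).*?endpoints=([\d.]+:\d+)"
)

def summarize_pd_failures(log_text: str) -> Counter:
    """Count failures per (endpoint, gRPC status code) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED.search(line)
        if match:
            code, endpoint = match.groups()
            counts[(endpoint, code)] += 1
    return counts
```

Applied to the log in the original post, this counts two `4-DEADLINE_EXCEEDED` failures for 192.168.1.201:2379 and two `14-UNAVAILABLE` failures each for 192.168.1.202:2379 and 192.168.1.203:2379. So TiKV is still configured with three PD endpoints, including the removed 202.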

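We did a scale-in. The scale-in has already been executed, and the cluster status no longer shows that node, but startup still looks for the removed node.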

How did you perform the scale-in? You can remove 202 from run_tikv.sh in the script directory.
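If the startup script is edited by hand, the change amounts to removing one entry from the PD list. The sketch below assumes that run_tikv.sh passes the PD endpoints as a single comma-separated, double-quoted `--pd "..."` argument. That should be verified against the actual file first. `drop_pd_endpoint` is a hypothetical helper that works on the script text only, so the original file can be backed up before anything is written.

```python
import re

# Assumed shape of the argument inside run_tikv.sh:  --pd "host1:2379,host2:2379"
PD_ARG = re.compile(r'(--pd\s+")([^"]*)(")')

def drop_pd_endpoint(script_text: str, endpoint: str) -> str:
    """Return the script text with `endpoint` removed from the --pd list."""
    def rewrite(match):
        kept = [ep for ep in match.group(2).split(",") if ep.strip() != endpoint]
        return match.group(1) + ",".join(kept) + match.group(3)
    return PD_ARG.sub(rewrite, script_text)
```

With the script contents read into `text`, `drop_pd_endpoint(text, "192.168.1.202:2379")` leaves only 201 and 203 in the list. The script on every TiKV node would need the same change.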

What status does tiup display show?

Did you check the cluster status with tiup cluster display? Could you post the display output?


Does SSH work?