Cluster scale-in succeeded, but starting the cluster fails with a TiKV error; the TiKV log shows it requesting a scaled-in PD node

【TiDB Environment】Production
【TiDB Version】v7.1.0
【Reproduction Path】Starting the cluster
【Problem: Symptoms and Impact】
The cluster was scaled in successfully, but starting it now fails with a TiKV error. The TiKV log shows requests going to a PD node that was scaled in, and checking the cluster status confirms that node no longer exists.
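To confirm the node really is gone from the topology, the status of every instance can be listed with tiup (sjtidb is the cluster name used throughout below):

# Show the topology and status of every instance tiup records for this cluster
tiup cluster display sjtidb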



[root@localhost ~]# tiup cluster start sjtidb
tiup is checking updates for component cluster ...
Starting component cluster: /root/.tiup/components/cluster/v1.13.1/tiup-cluster start sjtidb
Starting cluster sjtidb...

  • [ Serial ] - SSHKeySet: privateKey=/root/.tiup/storage/cluster/clusters/sjtidb/ssh/id_rsa, publicKey=/root/.tiup/storage/cluster/clusters/sjtidb/ssh/id_rsa.pub
  • [Parallel] - UserSSH: user=root, host=192.168.1.201
  • [Parallel] - UserSSH: user=root, host=192.168.1.201
  • [Parallel] - UserSSH: user=root, host=192.168.1.201
  • [Parallel] - UserSSH: user=root, host=192.168.1.201
  • [Parallel] - UserSSH: user=root, host=192.168.1.203
  • [Parallel] - UserSSH: user=root, host=192.168.1.201
  • [Parallel] - UserSSH: user=root, host=192.168.1.203
  • [ Serial ] - StartCluster
    Starting component pd
    Starting instance 192.168.1.201:2379
    Starting instance 192.168.1.203:2379
    Start instance 192.168.1.201:2379 success
    Start instance 192.168.1.203:2379 success
    Starting component tikv
    Starting instance 192.168.1.203:20160
    Starting instance 192.168.1.201:20160

Error: failed to start tikv: failed to start: 192.168.1.201 tikv-20160.service, please check the instance's log(/tidb-deploy/tikv-20160/log) for more detail.: timed out waiting for port 20160 to be started after 2m0s

Verbose debug logs has been written to /root/.tiup/logs/tiup-cluster-debug-2023-11-01-11-28-54.log.


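The error message points at the instance log; it can be tailed directly on the TiKV host (the tikv.log filename below assumes the standard tiup deploy layout):

# Inspect the latest startup attempt in the TiKV instance log on 192.168.1.201
tail -n 100 /tidb-deploy/tikv-20160/log/tikv.log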
[2023/11/01 11:26:53.044 +08:00] [INFO] [lib.rs:88] ["Welcome to TiKV"]
[2023/11/01 11:26:53.045 +08:00] [INFO] [lib.rs:93] ["Release Version: 7.1.0"]
[2023/11/01 11:26:53.045 +08:00] [INFO] [lib.rs:93] ["Edition: Community"]
[2023/11/01 11:26:53.045 +08:00] [INFO] [lib.rs:93] ["Git Commit Hash: 0c34464e386940a60f2a2ce279a4ef18c9c6c45b"]
[2023/11/01 11:26:53.045 +08:00] [INFO] [lib.rs:93] ["Git Commit Branch: heads/refs/tags/v7.1.0"]
[2023/11/01 11:26:53.045 +08:00] [INFO] [lib.rs:93] ["UTC Build Time: Unknown (env var does not exist when building)"]
[2023/11/01 11:26:53.045 +08:00] [INFO] [lib.rs:93] ["Rust Version: rustc 1.67.0-nightly (96ddd32c4 2022-11-14)"]
[2023/11/01 11:26:53.045 +08:00] [INFO] [lib.rs:93] ["Enable Features: pprof-fp jemalloc mem-profiling portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine cloud-aws cloud-gcp cloud-azure"]
[2023/11/01 11:26:53.046 +08:00] [INFO] [lib.rs:93] ["Profile: dist_release"]
[2023/11/01 11:26:53.046 +08:00] [INFO] [mod.rs:80] ["cgroup quota: memory=Some(9223372036854771712), cpu=None, cores={13, 14, 5, 3, 15, 9, 2, 0, 4, 12, 1, 11, 7, 6, 8, 10}"]
[2023/11/01 11:26:53.046 +08:00] [INFO] [mod.rs:87] ["memory limit in bytes: 33547878400, cpu cores quota: 16"]
[2023/11/01 11:26:53.046 +08:00] [WARN] [lib.rs:544] ["environment variable TZ is missing, using /etc/localtime"]
[2023/11/01 11:26:53.046 +08:00] [WARN] [server.rs:1511] ["check: kernel"] [err="kernel parameters net.core.somaxconn got 128, expect 32768"]
[2023/11/01 11:26:53.046 +08:00] [WARN] [server.rs:1511] ["check: kernel"] [err="kernel parameters net.ipv4.tcp_syncookies got 1, expect 0"]
[2023/11/01 11:26:53.046 +08:00] [WARN] [server.rs:1511] ["check: kernel"] [err="kernel parameters vm.swappiness got 30, expect 0"]
[2023/11/01 11:26:53.057 +08:00] [INFO] [util.rs:604] ["connecting to PD endpoint"] [endpoints=192.168.1.201:2379]
[2023/11/01 11:26:53.060 +08:00] [INFO] [] ["TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter"]
[2023/11/01 11:26:55.061 +08:00] [INFO] [util.rs:566] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: "Deadline Exceeded", details: [] }))"] [endpoints=192.168.1.201:2379]
[2023/11/01 11:26:55.061 +08:00] [INFO] [util.rs:604] ["connecting to PD endpoint"] [endpoints=192.168.1.202:2379]
[2023/11/01 11:26:55.062 +08:00] [INFO] [] ["subchannel 0x7f39f364f000 {address=ipv4:192.168.1.202:2379, args=grpc.client_channel_factory=0x7f39f36979b0, grpc.default_authority=192.168.1.202:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f39f36389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f39f36b5cc0, grpc.server_uri=dns:///192.168.1.202:2379}: connect failed: {"created":"@1698809215.062233286","description":"Failed to connect to remote host: Connection refused","errno":111,"file":"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.10.3+1.44.0-patched/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":200,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:192.168.1.202:2379"}"]
[2023/11/01 11:26:55.062 +08:00] [INFO] [] ["subchannel 0x7f39f364f000 {address=ipv4:192.168.1.202:2379, args=grpc.client_channel_factory=0x7f39f36979b0, grpc.default_authority=192.168.1.202:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f39f36389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f39f36b5cc0, grpc.server_uri=dns:///192.168.1.202:2379}: Retry in 999 milliseconds"]
[2023/11/01 11:26:55.062 +08:00] [INFO] [util.rs:566] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "failed to connect to all addresses", details: [] }))"] [endpoints=192.168.1.202:2379]
[2023/11/01 11:26:55.062 +08:00] [INFO] [util.rs:604] ["connecting to PD endpoint"] [endpoints=192.168.1.203:2379]
[2023/11/01 11:26:55.063 +08:00] [INFO] [] ["subchannel 0x7f39f364f400 {address=ipv4:192.168.1.203:2379, args=grpc.client_channel_factory=0x7f39f36979b0, grpc.default_authority=192.168.1.203:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f39f36389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f39f36b5cc0, grpc.server_uri=dns:///192.168.1.203:2379}: connect failed: {"created":"@1698809215.063200446","description":"Failed to connect to remote host: Connection refused","errno":111,"file":"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.10.3+1.44.0-patched/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":200,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:192.168.1.203:2379"}"]
[2023/11/01 11:26:55.063 +08:00] [INFO] [] ["subchannel 0x7f39f364f400 {address=ipv4:192.168.1.203:2379, args=grpc.client_channel_factory=0x7f39f36979b0, grpc.default_authority=192.168.1.203:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f39f36389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f39f36b5cc0, grpc.server_uri=dns:///192.168.1.203:2379}: Retry in 1000 milliseconds"]
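In the excerpt, TiKV cycles through three PD endpoints, including 192.168.1.202, which was already scaled in; that list comes from the --pd argument in TiKV's generated start script. A way to check and refresh it, assuming the standard tiup scripts path:

# See which PD endpoints the start script still passes to TiKV
grep -- '--pd' /tidb-deploy/tikv-20160/scripts/run_tikv.sh
# Regenerate configs and start scripts from the current topology, then restart TiKV
tiup cluster reload sjtidb -R tikv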

【Resource Configuration】Go to TiDB Dashboard -> Cluster Info -> Hosts and take a screenshot of that page
【Attachments: screenshots / logs / monitoring】


TiDB startup brings up PD first, so check the PD log first.
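For example, on a PD host (the pd.log path below assumes the default /tidb-deploy layout seen in the startup output):

# Look for election or peer-connection errors in the PD log
tail -n 200 /tidb-deploy/pd-2379/log/pd.log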

How do you start a component on its own? Doesn't the whole cluster have to be started together?

How did you do this scale-in? You scaled both PD and TiKV down to two nodes each? And why did the whole cluster end up stopped?

Scale in one node at a time; don't do it in batches. A sketch of the usual flow follows below.
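The node ID here is illustrative:

# Scale in a single TiKV node; its regions must migrate away before it is removed
tiup cluster scale-in sjtidb -N 192.168.1.202:20160
# Wait until the store shows Tombstone in the status table
tiup cluster display sjtidb
# Remove Tombstone instances from the topology
tiup cluster prune sjtidb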

I'm curious how your TiKV got scaled down to two nodes. Normally that wouldn't succeed: with the default max-replicas of 3, regions can't evacuate a third store, so the scale-in should hang in Pending Offline.

Of course they don't all start at once; there is a fixed order.

The startup order is PD -> TiKV -> TiDB.
To start a single component: tiup cluster start <cluster-name> -R <component>
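Applied to this cluster, starting in order looks like:

# Bring up components in dependency order
tiup cluster start sjtidb -R pd
tiup cluster start sjtidb -R tikv
tiup cluster start sjtidb -R tidb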

You can start any single node by role or by node ID (host:port).
tiup cluster start [flags]

--init

Starts the cluster in a secure way. Recommended for the first start of a cluster: it automatically generates a password for the TiDB root user and prints it on the command line.

-N, --node (strings, defaults to [], meaning all nodes)

Specifies the nodes to start; if omitted, all nodes are started. The value is a comma-separated list of node IDs, where a node ID is the first column of the cluster status table.

-R, --role (strings, defaults to [], meaning all roles)

Specifies the roles to start; if omitted, all roles are started. The value is a comma-separated list of roles, where a role is the second column of the cluster status table.
https://docs.pingcap.com/zh/tidb/stable/tiup-component-cluster-start
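Example invocations of these flags (the node ID is illustrative):

# First-ever start: generate and print a password for the TiDB root user
tiup cluster start sjtidb --init
# Start only the instance at a given node ID (first column of the status table)
tiup cluster start sjtidb -N 192.168.1.201:2379
# Start two roles at once with a comma-separated list
tiup cluster start sjtidb -R pd,tikv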