[Resolved] Three tidb servers, two cannot start: can the tidb servers that cannot start be upgraded?

To help resolve this faster, please provide the following information; a clear problem description gets answered more quickly:
【TiDB environment in use】
【Overview】Scenario + problem summary
Production environment with three tidb servers; two of them cannot start and their logs show no errors. At the moment only one server is handling traffic.
Three PDs, one of which is down.
Can the tidb and pd services that cannot start be upgraded individually?
tiup cluster patch test-cluster /tmp/tidb-hotfix.tar.gz -N 172.16.4.5:4000
If they can be upgraded individually, which version is recommended?
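
For this cluster, the -N target would presumably be one of the down nodes rather than the example IP from the documentation (the hotfix package path is kept as the placeholder from the docs):

    tiup cluster patch test-cluster /tmp/tidb-hotfix.tar.gz -N 192.168.0.247:4000
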
【Background】Operations performed
【Symptoms】Business and database symptoms
【Business impact】
【TiDB version】
V4.0.0
【Attachments】

  1. TiUP Cluster Display information

[tidb@crm30 ~]$ tiup cluster display test-cluster
    Found cluster newer version:

    The latest version: v1.5.6
    Local installed version: v1.3.6
    Update current component: tiup update cluster
    Update all components: tiup update --all

Starting component cluster: /home/tidb/.tiup/components/cluster/v1.3.6/tiup-cluster display test-cluster
Cluster type: tidb
Cluster name: test-cluster
Cluster version: v4.0.0
SSH type: builtin
Dashboard URL: http://192.168.0.244:2379/dashboard
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir
--  ----  ----  -----  -------  ------  --------  ----------
192.168.0.244:9093 alertmanager 192.168.0.244 9093/9094 linux/x86_64 Up /home/tidb/deploy/data.alertmanager /home/tidb/deploy
192.168.0.244:3000 grafana 192.168.0.244 3000 linux/x86_64 Up - /home/tidb/deploy
192.168.0.244:2379 pd 192.168.0.244 2379/2380 linux/x86_64 Up|L|UI /home/tidb/deploy/data.pd /home/tidb/deploy
192.168.0.247:2379 pd 192.168.0.247 2379/2380 linux/x86_64 Up /home/tidb/deploy/data.pd /home/tidb/deploy
192.168.0.248:2379 pd 192.168.0.248 2379/2380 linux/x86_64 Down /home/tidb/deploy/data.pd /home/tidb/deploy
192.168.0.244:9090 prometheus 192.168.0.244 9090 linux/x86_64 Up /home/tidb/deploy/prometheus2.0.0.data.metrics /home/tidb/deploy
192.168.0.247:4000 tidb 192.168.0.247 4000/10080 linux/x86_64 Down - /home/tidb/deploy
192.168.0.248:4000 tidb 192.168.0.248 4000/10080 linux/x86_64 Up - /home/tidb/deploy
192.168.0.244:20160 tikv 192.168.0.244 20160/20180 linux/x86_64 Up /home/tidb/deploy/data /home/tidb/deploy
192.168.0.246:20162 tikv 192.168.0.246 20162/20182 linux/x86_64 Up /home/tidb/deploy3/data /home/tidb/deploy3
192.168.0.247:20160 tikv 192.168.0.247 20160/20180 linux/x86_64 Up /home/tidb/deploy/data /home/tidb/deploy
Total nodes: 11

  2. TiUP Cluster Edit Config information

  3. TiDB-Overview monitoring

  • Logs of the relevant modules (covering 1 hour before and after the issue)

Please share the logs of the nodes that did not come up, to help with the analysis.
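For example, on each node that is down, something like the following would show the relevant entries (the log path under the deploy directory is an assumption based on the default layout shown in the display output):

    tail -n 200 /home/tidb/deploy/log/tidb.log
    tail -n 200 /home/tidb/deploy/log/tidb_stderr.log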

#vi tidb_stderr.log
{"level":"warn","ts":"2021-09-17T09:37:04.818+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-053e66f7-5fbc-4187-9319-2d2c93193206/192.168.0.244:2379","attempt":99,"error":"rpc error: code = Canceled desc = grpc: the client connection is closing"}
[tidb@crm247 log]$ vi tidb.log

[2021/09/17 09:40:29.671 +08:00] [FATAL] [terror.go:348] ["unexpected error"] [error="[privilege:8049]mysql.db"] [stack="github.com/pingcap/log.Fatal\n\t/home/jenkins/agent/workspace/tidb_v4.0.0/go/pkg/mod/github.com/pingcap/log@v0.0.0-20200511115504-543df19646ad/global.go:59\ngithub.com/pingcap/parser/terror.MustNil\n\t/home/jenkins/agent/workspace/tidb_v4.0.0/go/pkg/mod/github.com/pingcap/parser@v0.0.0-20200525110646-f45c2cee1dca/terror/terror.go:348\nmain.createStoreAndDomain\n\t/home/jenkins/agent/workspace/tidb_v4.0.0/go/src/github.com/pingcap/tidb/tidb-server/main.go:296\nmain.main\n\t/home/jenkins/agent/workspace/tidb_v4.0.0/go/src/github.com/pingcap/tidb/tidb-server/main.go:181\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"]
[2021/09/17 09:40:44.768 +08:00] [INFO] [printer.go:42] ["Welcome to TiDB."] ["Release Version"=v4.0.0] [Edition=Community] ["Git Commit Hash"=689a6b6439ae7835947fcaccf329a3fc303986cb] ["Git Branch"=heads/refs/tags/v4.0.0] ["UTC Build Time"="2020-05-28 01:37:40"] [GoVersion=go1.13] ["Race Enabled"=false] ["Check Table Before Drop"=false] ["TiKV Min Version"=v3.0.0-60965b006877ca7234adaced7890d7b029ed1306]

Can tiup cluster patch upgrade a service that is down? Only one tidb server is left now, so the situation is urgent.

Upgrading a single node on its own is not recommended. tidb is stateless, so if you need to recover quickly you can simply scale it in and then scale it out again. Please also post the logs of the pd node.
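A rough sketch of that scale-in / scale-out path for the down tidb node (the node ID 192.168.0.247:4000 comes from the display output above; scale-out.yaml is a hypothetical file name):

    # remove the down tidb node from the topology (add --force only if tiup cannot reach the host)
    tiup cluster scale-in test-cluster -N 192.168.0.247:4000

    # scale-out.yaml only needs the tidb node being added back, for example:
    # tidb_servers:
    #   - host: 192.168.0.247
    #     port: 4000
    #     status_port: 10080
    #     deploy_dir: /home/tidb/deploy

    # add the node back once the underlying cause is fixed
    tiup cluster scale-out test-cluster scale-out.yaml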

#vi pd.log
[2021/09/16 09:10:19.422 +08:00] [ERROR] [grpclog.go:75] ["transport: Got too many pings from the client, closing the connection."]
[2021/09/16 09:10:19.422 +08:00] [ERROR] [grpclog.go:75] ["transport: loopyWriter.run returning. Err: transport: Connection closing"]
[2021/09/16 09:34:17.082 +08:00] [ERROR] [grpclog.go:75] ["transport: Got too many pings from the client, closing the connection."]
[2021/09/16 09:34:17.082 +08:00] [ERROR] [grpclog.go:75] ["transport: loopyWriter.run returning. Err: transport: Connection closing"]
[2021/09/16 09:51:48.672 +08:00] [ERROR] [grpclog.go:75] ["transport: Got too many pings from the client, closing the connection."]
[2021/09/16 09:51:48.672 +08:00] [ERROR] [grpclog.go:75] ["transport: loopyWriter.run returning. Err: transport: Connection closing"]
[2021/09/16 09:56:07.005 +08:00] [ERROR] [server.go:242] ["region syncer send data meet error"] [error="rpc error: code = Unavailable desc = transport is closing"]
[2021/09/16 17:49:12.271 +08:00] [ERROR] [grpclog.go:75] ["transport: Got too many pings from the client, closing the connection."]
[2021/09/16 17:49:12.271 +08:00] [ERROR] [grpclog.go:75] ["transport: loopyWriter.run returning. Err: transport: Connection closing"]

The cause seems to have been found: one TiKV in the cluster uses port 20162, but that port was not opened in the firewall, which is what caused the problem.
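For reference, if that host runs firewalld (an assumption; the thread does not say which firewall is in use), opening the ports of the TiKV instance on 192.168.0.246 would look roughly like this:

    firewall-cmd --permanent --add-port=20162/tcp
    firewall-cmd --permanent --add-port=20182/tcp
    firewall-cmd --reload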

This topic was automatically closed 1 minute after the last reply. New replies are no longer allowed.