TiDB deployed on Kubernetes: one of the 3 PD members keeps failing

While the cluster was in use, a server power outage corrupted the leveldb MANIFEST files on 2 of the 3 PD members. After the MANIFEST files were repaired, one PD still fails to start. The relevant basic information is below.

kubectl get po -njinfan

NAME                                                 READY   STATUS             RESTARTS   AGE
tidb-cluster-1605234515-discovery-86468cbbf8-nshvd   1/1     Running            1          4d1h
tidb-cluster-1605234515-monitor-9b8fc57b5-jvxgv      3/3     Running            3          4d1h
tidb-cluster-1605234515-pd-0                         0/1     CrashLoopBackOff   820        2d22h
tidb-cluster-1605234515-pd-1                         1/1     Running            3          4d
tidb-cluster-1605234515-pd-2                         1/1     Running            0          2d23h
tidb-cluster-1605234515-tidb-0                       2/2     Running            2          4d1h
tidb-cluster-1605234515-tidb-1                       2/2     Running            0          4d1h
tidb-cluster-1605234515-tikv-0                       1/1     Running            1          4d1h
tidb-cluster-1605234515-tikv-1                       1/1     Running            1          4d1h
tidb-cluster-1605234515-tikv-2                       1/1     Running            0          4d1h

kubectl logs -njinfan tidb-cluster-1605234515-pd-0

Name:      tidb-cluster-1605234515-pd-0.tidb-cluster-1605234515-pd-peer.jinfan.svc
Address 1: 10.244.1.226 tidb-cluster-1605234515-pd-0.tidb-cluster-1605234515-pd-peer.jinfan.svc.cluster.local
nslookup domain tidb-cluster-1605234515-pd-0.tidb-cluster-1605234515-pd-peer.jinfan.svc.svc success
test---http://tidb-cluster-1605234515-discovery.jinfan.svc:10261/new/dGlkYi1jbHVzdGVyLTE2MDUyMzQ1MTUtcGQtMC50aWRiLWNsdXN0ZXItMTYwNTIzNDUxNS1wZC1wZWVyLmppbmZhbi5zdmM6MjM4MAo=
starting pd-server ...
/pd-server --data-dir=/var/lib/pd --name=tidb-cluster-1605234515-pd-0 --peer-urls=http://0.0.0.0:2380 --advertise-peer-urls=http://tidb-cluster-1605234515-pd-0.tidb-cluster-1605234515-pd-peer.jinfan.svc:2380 --client-urls=http://0.0.0.0:2379 --advertise-client-urls=http://tidb-cluster-1605234515-pd-0.tidb-cluster-1605234515-pd-peer.jinfan.svc:2379 --config=/etc/pd/pd.toml --join=http://tidb-cluster-1605234515-pd-3.tidb-cluster-1605234515-pd-peer.jinfan.svc:2379,http://tidb-cluster-1605234515-pd-2.tidb-cluster-1605234515-pd-peer.jinfan.svc:2379
[2021/01/11 06:15:05.144 +00:00] [INFO] [util.go:42] ["Welcome to Placement Driver (PD)"]
[2021/01/11 06:15:05.144 +00:00] [INFO] [util.go:43] [PD] [release-version=v4.0.7]
[2021/01/11 06:15:05.144 +00:00] [INFO] [util.go:44] [PD] [edition=Community]
[2021/01/11 06:15:05.144 +00:00] [INFO] [util.go:45] [PD] [git-hash=8b0348f545611d5955e32fdcf3c57a3f73657d77]
[2021/01/11 06:15:05.144 +00:00] [INFO] [util.go:46] [PD] [git-branch=heads/refs/tags/v4.0.7]
[2021/01/11 06:15:05.144 +00:00] [INFO] [util.go:47] [PD] [utc-build-time="2020-09-29 06:52:41"]
[2021/01/11 06:15:05.145 +00:00] [INFO] [metricutil.go:81] ["disable Prometheus push client"]
[2021/01/11 06:15:05.145 +00:00] [ERROR] [join.go:213] ["failed to open directory"] [error="[PD:os:ErrOSOpen]open /var/lib/pd/member: no such file or directory"]
2021/01/11 06:15:05.145 grpclog.go:45: [info] parsed scheme: "endpoint"
2021/01/11 06:15:05.145 grpclog.go:45: [info] ccResolverWrapper: sending new addresses to cc: [{http://tidb-cluster-1605234515-pd-3.tidb-cluster-1605234515-pd-peer.jinfan.svc:2379 0  <nil>} {http://tidb-cluster-1605234515-pd-2.tidb-cluster-1605234515-pd-peer.jinfan.svc:2379 0  <nil>}]
2021/01/11 06:15:05.167 grpclog.go:60: [warning] grpc: addrConn.createTransport failed to connect to {http://tidb-cluster-1605234515-pd-3.tidb-cluster-1605234515-pd-peer.jinfan.svc:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 120.240.95.33:2379: connect: connection refused". Reconnecting...
{"level":"warn","ts":"2021-01-11T06:15:05.193Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-edc76111-4902-4530-b3b3-5376c0ceff52/tidb-cluster-1605234515-pd-3.tidb-cluster-1605234515-pd-peer.jinfan.svc:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
[2021/01/11 06:15:05.193 +00:00] [FATAL] [main.go:94] ["join meet error"] [error="etcdserver: unhealthy cluster"] [stack="github.com/pingcap/log.Fatal\
\t/home/jenkins/agent/workspace/build_pd_multi_branch_v4.0.7/go/pkg/mod/github.com/pingcap/log@v0.0.0-20200511115504-543df19646ad/global.go:59\
main.main\
\t/home/jenkins/agent/workspace/build_pd_multi_branch_v4.0.7/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:94\
runtime.main\
\t/usr/local/go/src/runtime/proc.go:203"]
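As a side note, the base64 payload in the discovery URL in the log above can be decoded to confirm exactly what pd-0 registered with the discovery service; it is simply pd-0's own advertise peer URL:

```shell
# Decode the payload pd-0 sent to the tidb-operator discovery service.
echo 'dGlkYi1jbHVzdGVyLTE2MDUyMzQ1MTUtcGQtMC50aWRiLWNsdXN0ZXItMTYwNTIzNDUxNS1wZC1wZWVyLmppbmZhbi5zdmM6MjM4MAo=' \
  | base64 -d
# → tidb-cluster-1605234515-pd-0.tidb-cluster-1605234515-pd-peer.jinfan.svc:2380
```

The discovery service answered that registration with the `--join` list visible in the startup command, which still includes the removed pd-3; the join then dies with `etcdserver: unhealthy cluster`.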

Member information from pd-ctl:

{
  "header": {
    "cluster_id": 6894429818718535692
  },
  "members": [
    {
      "name": "tidb-cluster-1605234515-pd-1",
      "member_id": 312349629294863285,
      "peer_urls": [
        "http://tidb-cluster-1605234515-pd-3.tidb-cluster-1605234515-pd-peer.jinfan.svc:2380"
      ],
      "client_urls": [
        "http://tidb-cluster-1605234515-pd-1.tidb-cluster-1605234515-pd-peer.jinfan.svc:2379"
      ],
      "deploy_path": "/",
      "binary_version": "v4.0.7",
      "git_hash": "8b0348f545611d5955e32fdcf3c57a3f73657d77"
    },
    {
      "name": "tidb-cluster-1605234515-pd-2",
      "member_id": 14664505328199855870,
      "peer_urls": [
        "http://tidb-cluster-1605234515-pd-2.tidb-cluster-1605234515-pd-peer.jinfan.svc:2380"
      ],
      "client_urls": [
        "http://tidb-cluster-1605234515-pd-2.tidb-cluster-1605234515-pd-peer.jinfan.svc:2379"
      ],
      "deploy_path": "/",
      "binary_version": "v4.0.7",
      "git_hash": "8b0348f545611d5955e32fdcf3c57a3f73657d77"
    }
  ],
  "leader": {
    "name": "tidb-cluster-1605234515-pd-1",
    "member_id": 312349629294863285,
    "peer_urls": [
      "http://tidb-cluster-1605234515-pd-3.tidb-cluster-1605234515-pd-peer.jinfan.svc:2380"
    ],
    "client_urls": [
      "http://tidb-cluster-1605234515-pd-1.tidb-cluster-1605234515-pd-peer.jinfan.svc:2379"
    ]
  },
  "etcd_leader": {
    "name": "tidb-cluster-1605234515-pd-1",
    "member_id": 312349629294863285,
    "peer_urls": [
      "http://tidb-cluster-1605234515-pd-3.tidb-cluster-1605234515-pd-peer.jinfan.svc:2380"
    ],
    "client_urls": [
      "http://tidb-cluster-1605234515-pd-1.tidb-cluster-1605234515-pd-peer.jinfan.svc:2379"
    ],
    "deploy_path": "/",
    "binary_version": "v4.0.7",
    "git_hash": "8b0348f545611d5955e32fdcf3c57a3f73657d77"
  }
}

During the earlier recovery a pd-3 was created; I removed it by lowering the PD `replicas` in the StatefulSet. Could some pd-3 state not have been cleaned up somewhere? The member list from pd-ctl no longer shows pd-3, though.
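One detail worth noting in the pd-ctl output above: the member *named* `tidb-cluster-1605234515-pd-1` still advertises pd-3's peer URL (`...pd-3...:2380`), so stale pd-3 membership metadata does appear to survive inside etcd even though no member is named pd-3. A rough way to inspect and clean this up with pd-ctl (a sketch; run against any healthy PD endpoint, pd-2 is assumed here):

```shell
# Any healthy PD endpoint works; pd-2 is used here.
PD='http://tidb-cluster-1605234515-pd-2.tidb-cluster-1605234515-pd-peer.jinfan.svc:2379'

# List members; each member's name and peer URL should be consistent.
# Above, the member named pd-1 still advertises pd-3's peer URL.
pd-ctl -u "$PD" member

# If a stale member were still listed by name, it could be removed:
pd-ctl -u "$PD" member delete name tidb-cluster-1605234515-pd-3
```

Deleting by name only helps if the stale entry is a distinct member; a healthy member advertising the wrong peer URL is a deeper inconsistency, which is why rebuilding the metadata with pd-recover is the safer path.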

This does look like a known issue. Did you recover with pd-recover?

Yes, this was likewise resolved with pd-recover. Thanks.
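For readers hitting the same problem, a pd-recover invocation would look roughly like the sketch below. The `-cluster-id` value comes from the `header.cluster_id` in the pd-ctl output earlier in this thread; the `-alloc-id` value here is an assumption and must be set comfortably above the largest ID the old cluster had already allocated:

```shell
# Rebuild PD cluster metadata through one surviving PD endpoint.
# -cluster-id: taken from pd-ctl's member output (header.cluster_id).
# -alloc-id:   placeholder; must exceed any previously allocated ID.
pd-recover \
  -endpoints http://tidb-cluster-1605234515-pd-2.tidb-cluster-1605234515-pd-peer.jinfan.svc:2379 \
  -cluster-id 6894429818718535692 \
  -alloc-id 100000000
```

After pd-recover completes, the PD pods need a restart (and the broken member's data directory wiped) so they rejoin with the rebuilt metadata.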

:+1: Thanks. Original thread: "TiDB deployed on Kubernetes: after scaling PD in and then back out, the scale-out never succeeds"