【URGENT!!!】PD fails to start on k8s (not tiup)

Background: after applying crd.yaml and operator.yaml, the CRD was recreated. The cluster ran normally for about 20 hours, then PD went down.
Question: how can I find out why communication between the PD members is failing, and what is the fix? Please don't suggest tiup or deleting and recreating the operator; the cost is too high. :( -.-

【TiDB Environment】Production
【TiDB Version】v5.2.1
【Problem Encountered】
【Reproduction Steps】What operations were performed before the problem appeared
【Symptoms and Impact】

[2022/08/25 07:20:22.565 +00:00] [WARN] [stream.go:277] ["established TCP streaming connection with remote peer"] [stream-writer-type="stream Message"] [local-member-id=caab82c67f3f4ad1] [remote-peer-id=6b27cfc0d7490063]
[2022/08/25 07:20:22.565 +00:00] [INFO] [stream.go:250] ["set message encoder"] [from=caab82c67f3f4ad1] [to=caab82c67f3f4ad1] [stream-type="stream MsgApp v2"]
[2022/08/25 07:20:22.565 +00:00] [WARN] [stream.go:277] ["established TCP streaming connection with remote peer"] [stream-writer-type="stream MsgApp v2"] [local-member-id=caab82c67f3f4ad1] [remote-peer-id=6b27cfc0d7490063]
2022/08/25 07:20:22.573 log.go:85: [warning] etcdserver: [could not get cluster response from http://basic-pd-1.basic-pd-peer.tidb-cluster.svc:2380: Get "http://basic-pd-1.basic-pd-peer.tidb-cluster.svc:2380/members": dial tcp 10.0.3.16:2380: connect: connection refused]
[2022/08/25 07:20:22.573 +00:00] [ERROR] [etcdutil.go:70] ["failed to get cluster from remote"] [error="[PD:etcd:ErrEtcdGetCluster]could not retrieve cluster information from the given URLs"]
[2022/08/25 07:20:22.767 +00:00] [PANIC] [cluster.go:460] ["failed to update; member unknown"] [cluster-id=d9e392fb342bfa96] [local-member-id=caab82c67f3f4ad1] [unknown-remote-peer-id=2b86c59db64a77fc]
panic: failed to update; member unknown
goroutine 450 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000750300, 0xc00067e0c0, 0x3, 0x3)
        /nfs/cache/mod/go.uber.org/zap@v1.16.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000276360, 0x2759a56, 0x20, 0xc00067e0c0, 0x3, 0x3)
        /nfs/cache/mod/go.uber.org/zap@v1.16.0/logger.go:226 +0x85
go.etcd.io/etcd/etcdserver/api/membership.(*RaftCluster).UpdateAttributes(0xc0006e0070, 0x2b86c59db64a77fc, 0xc005d8e630, 0xa, 0xc005dba940, 0x1, 0x4)
        /nfs/cache/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/api/membership/cluster.go:460 +0x9d1
go.etcd.io/etcd/etcdserver.(*applierV2store).Put(0xc001c4a540, 0xc005dc2580, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /nfs/cache/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/apply_v2.go:89 +0x966
go.etcd.io/etcd/etcdserver.(*EtcdServer).applyV2Request(0xc00017c680, 0xc005dc2580, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /nfs/cache/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/apply_v2.go:123 +0x248
go.etcd.io/etcd/etcdserver.(*EtcdServer).applyEntryNormal(0xc00017c680, 0xc0005e14d8)
        /nfs/cache/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/server.go:2178 +0xad4
go.etcd.io/etcd/etcdserver.(*EtcdServer).apply(0xc00017c680, 0xc004aef8e0, 0x240, 0x252, 0xc0001fc0a0, 0x0, 0xf3d34e, 0xc0005e1640)
        /nfs/cache/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/server.go:2117 +0x579
go.etcd.io/etcd/etcdserver.(*EtcdServer).applyEntries(0xc00017c680, 0xc0001fc0a0, 0xc001a1e200)
        /nfs/cache/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/server.go:1369 +0xe5
go.etcd.io/etcd/etcdserver.(*EtcdServer).applyAll(0xc00017c680, 0xc0001fc0a0, 0xc001a1e200)
        /nfs/cache/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/server.go:1093 +0x88
go.etcd.io/etcd/etcdserver.(*EtcdServer).run.func8(0x30f6530, 0xc001c20040)
        /nfs/cache/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/server.go:1038 +0x3c
go.etcd.io/etcd/pkg/schedule.(*fifo).run(0xc001c14000)
        /nfs/cache/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/pkg/schedule/schedule.go:157 +0xf3
created by go.etcd.io/etcd/pkg/schedule.NewFIFOScheduler
        /nfs/cache/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/pkg/schedule/schedule.go:70 +0x13b

【Attachments】

Please provide version information for each component, e.g. cdc/tikv; it can be obtained by running cdc version / tikv-server --version.

Network unreachable? Is a firewall turned on?

[2022/08/25 07:20:22.573 +00:00] [ERROR] [etcdutil.go:70] ["failed to get cluster from remote"] [error="[PD:etcd:ErrEtcdGetCluster]could not retrieve cluster information from the given URLs"]
[2022/08/25 07:20:22.767 +00:00] [PANIC] [cluster.go:460] ["failed to update; member unknown"] [cluster-id=d9e392fb342bfa96] [local-member-id=caab82c67f3f4ad1]

Connection setup is failing. Check connectivity between the cluster servers: IPs, ssh, and so on.

That's not it. I manually set up mutual trust (ssh) between the nodes hosting the PD containers, and the error is still the same.

We use the Service layer for unified networking, but the logs clearly show a connectivity problem between the peers.

2022/08/25 07:20:22.573 log.go:85: [warning] etcdserver: [could not get cluster response from http://basic-pd-1.basic-pd-peer.tidb-cluster.svc:2380: Get "http://basic-pd-1.basic-pd-peer.tidb-cluster.svc:2380/members": dial tcp 10.0.3.16:2380: connect: connection refused]

Going through the Service layer there is the LoadBalancer approach, but that is better suited to tidb-server.

Is the pod-to-pod network reachable?
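A quick way to check peer DNS resolution and the peer port from inside the surviving PD pod (a sketch only; it assumes the pingcap/pd image ships busybox tools such as nslookup/wget, otherwise run the same checks from a separate debug pod, and that basic-pd-2 stands in for whichever pod is still running):

# resolve the headless-service DNS name of the failing peer
kubectl exec -n tidb-cluster basic-pd-2 -- nslookup basic-pd-1.basic-pd-peer.tidb-cluster.svc
# probe the etcd peer port that the log says is refusing connections
kubectl exec -n tidb-cluster basic-pd-2 -- wget -qO- -T 3 http://basic-pd-1.basic-pd-peer.tidb-cluster.svc:2380/members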

For deploying TiDB on K8s, TiDB Operator is recommended.
You can refer to this document:
https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/get-started

It can also be installed directly from the marketplace…


Bump.

Bumping again.

Pinging the IPs of the other two failed pods from the only running pod works.

Compare basic-pd-0.yaml with basic-pd-2.yaml, for example as shown below.
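A sketch of one way to do that, dumping the live manifests of the two pods and diffing them (pod and namespace names taken from this thread):

# export the current pod specs and compare them
kubectl get pod basic-pd-0 -n tidb-cluster -o yaml > basic-pd-0.yaml
kubectl get pod basic-pd-2 -n tidb-cluster -o yaml > basic-pd-2.yaml
diff basic-pd-0.yaml basic-pd-2.yaml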

Add a nodeSelector to pod 0 so it gets scheduled onto node14.

The specs are almost identical, and both come from the same StatefulSet, so they are unlikely to differ.

Scale the StatefulSet up with --replicas=4 to add one pod and see whether the new pod reports the same error; see the sketch below.
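A sketch of the two ways to do that (direct StatefulSet scaling may be reverted by tidb-operator's reconciliation, so patching the TidbCluster CR is the operator-native route; the cluster name "basic" is assumed from this thread):

# scale the raw StatefulSet as suggested (the operator may scale it back)
kubectl scale statefulset basic-pd -n tidb-cluster --replicas=4
# or let the operator do it by raising pd.replicas on the TidbCluster CR
kubectl patch tc basic -n tidb-cluster --type merge -p '{"spec":{"pd":{"replicas":4}}}'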

It will. The cluster was in fact scaled down from 5 replicas to 3.

With 5 replicas it looked like 2 were usable; with 3, only 1 is usable (doesn't that look like a split brain?)
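To see what the surviving member thinks the membership is, pd-ctl can be run inside the running pod (a sketch; it assumes /pd-ctl is bundled in the pingcap/pd image and that basic-pd-2 is the surviving pod):

# list the PD members as recorded by the surviving node
kubectl exec -n tidb-cluster basic-pd-2 -- /pd-ctl -u http://127.0.0.1:2379 member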

Run kubectl describe pod basic-pd-0 -n tidb-cluster and see what it shows.

Option 1: first find out why pd-0 and pd-1 crashed and fix that, then see whether the 3 PD replicas can recover on their own.
Option 2: a majority of the replicas has already failed, so the cluster should be unavailable by now; you can follow the operator's pd-recover procedure for disaster recovery (see the sketch below).
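A rough outline of pd-recover usage (a sketch only; follow the tidb-operator pd-recover documentation for the full procedure; the cluster ID comes from the old PD logs and the alloc ID must be larger than any ID the old cluster ever allocated):

# run against a freshly started single PD, then restart the remaining PD members
./pd-recover -endpoints http://basic-pd:2379 -cluster-id <cluster-id-from-logs> -alloc-id <large-safe-id>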

Why is only one PD alive?

Are the other two in CrashLoopBackOff?
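To confirm their state and pull the logs from the last crashed run (a sketch; pod and label names assumed from this thread):

# list the PD pods and their restart counts
kubectl get pod -n tidb-cluster -l app.kubernetes.io/component=pd
# fetch the log of the previous (crashed) container instance
kubectl logs basic-pd-0 -n tidb-cluster --previous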

I suggest you track down the specific cause; you can refer to this blog:

I forgot to paste this output earlier; adding it now.
This line is unfamiliar to me:

Annotations:  kubernetes.io/limit-ranger: LimitRanger plugin set: memory request for container pd; memory limit for container pd
Name:         basic-pd-0
Namespace:    tidb-cluster
Priority:     0
Node:         node193.169.203.15/193.169.203.15
Start Time:   Thu, 25 Aug 2022 17:30:40 +0800
Labels:       app.kubernetes.io/component=pd
              app.kubernetes.io/instance=basic
              app.kubernetes.io/managed-by=tidb-operator
              app.kubernetes.io/name=tidb-cluster
              controller-revision-hash=basic-pd-766b5cb86
              statefulset.kubernetes.io/pod-name=basic-pd-0
Annotations:  kubernetes.io/limit-ranger: LimitRanger plugin set: memory request for container pd; memory limit for container pd
              prometheus.io/path: /metrics
              prometheus.io/port: 2379
              prometheus.io/scrape: true
Status:       Running
IP:           10.0.5.139
IPs:
  IP:           10.0.5.139
Controlled By:  StatefulSet/basic-pd
Containers:
  pd:
    Container ID:  docker://fc892a76a59b1eef82f45e9c54281aeee3495601e874d3ca3ad9ff6f2dafe597
    Image:         pingcap/pd:v5.2.1
    Image ID:      docker-pullable://pingcap/pd@sha256:e9766de6a85d3f262ac016e9a2421c8099f445eb70aba0741fbf8b7932ea117d
    Ports:         2380/TCP, 2379/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      /bin/sh
      /usr/local/bin/pd_start_script.sh
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Thu, 25 Aug 2022 22:17:37 +0800
      Finished:     Thu, 25 Aug 2022 22:17:52 +0800
    Ready:          False
    Restart Count:  58
    Limits:
      memory:  24Gi
    Requests:
      memory:  4Gi
    Environment:
      NAMESPACE:          tidb-cluster (v1:metadata.namespace)
      PEER_SERVICE_NAME:  basic-pd-peer
      SERVICE_NAME:       basic-pd
      SET_NAME:           basic-pd
      TZ:                 UTC
    Mounts:
      /etc/pd from config (ro)
      /etc/podinfo from annotations (ro)
      /usr/local/bin from startup-script (ro)
      /var/lib/pd from pd (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-8ntz6 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  pd:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pd-basic-pd-0
    ReadOnly:   false
  annotations:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      basic-pd-3731616
    Optional:  false
  startup-script:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      basic-pd-3731616
    Optional:  false
  default-token-8ntz6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-8ntz6
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                       From     Message
  ----     ------   ----                      ----     -------
  Normal   Pulled   38m (x52 over 4h48m)      kubelet  Container image "pingcap/pd:v5.2.1" already present on machine
  Warning  BackOff  3m48s (x1242 over 4h48m)  kubelet  Back-off restarting failed container

What is the current memory usage on your PD servers? How are the PD server memory parameters configured?

That annotation is just the resource limits set by LimitRanger: the lower bound (request) is 4Gi and the upper bound (limit) is 24Gi (see the check below).
You can delete this pod and see how the regenerated one behaves:
kubectl delete pod basic-pd-0 -n tidb-cluster
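If it is unclear where the LimitRanger defaults in that annotation come from, the namespace's LimitRange objects can be inspected (a sketch):

# show any LimitRange in the namespace that injects default memory request/limit values
kubectl get limitrange -n tidb-cluster -o yaml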