[Urgent!!!] PD fails to start on K8s (not deployed with TiUP)

No. I manually set up mutual trust (SSH) between the nodes hosting the PD containers, and I still get the same error.

The network is unified at the Service layer, but the logs show an obvious connectivity problem:

2022/08/25 07:20:22.573 log.go:85: [warning] etcdserver: [could not get cluster response from http://basic-pd-1.basic-pd-peer.tidb-cluster.svc:2380: Get "http://basic-pd-1.basic-pd-peer.tidb-cluster.svc:2380/members": dial tcp 10.0.3.16:2380: connect: connection refused]

Connecting things at the Service layer can be done with a LoadBalancer, which is the approach better suited to TiDB.

Is the network between the pods themselves actually reachable?
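One hedged way to check this concretely is to probe the failed pods' peer port 2380 from the one surviving pod. The pod, service, and namespace names below are taken from the error log in this thread; whether nc is available inside the pd image is an assumption. This sketch only builds and prints the commands rather than executing them:

```shell
# Dry-run sketch: build probes that test TCP reachability of the failed pods'
# peer port (2380) from the surviving pod. Names come from this thread; nc may
# not exist in the pd image, in which case substitute another TCP probe.
cmds=""
for peer in basic-pd-0 basic-pd-1; do
  host="${peer}.basic-pd-peer.tidb-cluster.svc"
  cmds="${cmds}kubectl exec -n tidb-cluster basic-pd-2 -- nc -z -w 2 ${host} 2380
"
done
printf '%s' "$cmds"   # print the commands instead of running them
```

Note that since the failed pods' containers are down, "connection refused" is expected even on a healthy network; the successful ping mentioned later in the thread already suggests IP-level reachability is fine.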

For deploying TiDB on K8s, TiDB Operator is recommended.
You can refer to this document:
https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/get-started

You can also install it directly through an app-store/marketplace…

Bump.

Bumping again.

Pinging the IPs of the other two failed pods from the only runnable pod works.

Compare basic-pd-0.yaml with basic-pd-2.yaml.

Add a nodeSelector to pod 0 to schedule it onto node14.
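Note that these pods are managed by TiDB Operator, so edits made directly to the pod or StatefulSet will be reconciled away; scheduling constraints normally go on the TidbCluster CR instead. A sketch only: the field names follow the TidbCluster spec, this applies to all PD pods (per-pod pinning is not expressible here), and the hostname label value is an assumption about your node labels.

```yaml
# Hypothetical TidbCluster fragment: constrain PD pods to a node.
# Applies to every PD pod, not just basic-pd-0.
spec:
  pd:
    nodeSelector:
      kubernetes.io/hostname: node14
```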

The specs are almost identical, and both pods come from the same StatefulSet, so they are unlikely to differ.

Scale the StatefulSet out with --replicas=4 to add a pod and see whether the new pod reports the same error.

It does. This cluster was in fact scaled down from 5 replicas to 3.

With 5 replicas, it seemed 2 were usable; with 3, only 1 is usable (doesn't that look like split-brain?)
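Those counts line up with Raft majority quorum rather than split-brain: etcd/PD needs floor(n/2)+1 voting members alive to elect a leader. A quick arithmetic check:

```shell
# Raft quorum: a cluster of n voting members needs floor(n/2)+1 of them alive.
quorum() { echo $(( $1 / 2 + 1 )); }
echo "5 members -> need $(quorum 5) for quorum"
echo "3 members -> need $(quorum 3) for quorum"
```

So with 2 of 5 (or 1 of 3) members alive, the cluster is below quorum and cannot elect a leader: unavailable, but not split-brain.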

Run kubectl describe pod basic-pd-0 -n tidb-cluster and share what it shows.

Option 1: first find and fix the reason pd-0 and pd-1 crash, then see whether the 3-replica PD can return to normal.
Option 2: the majority of replicas have already failed, so the cluster should be unavailable; refer to the Operator's pd-recover procedure for disaster recovery.
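For Option 2, the rough shape of a pd-recover invocation is sketched below. This is a dry run that only prints the command; the endpoint is the surviving pod's client URL from this thread, while the cluster ID and alloc ID are placeholders you must fill in from your own PD logs (see the TiDB Operator pd-recover documentation for the authoritative procedure).

```shell
# Dry-run sketch of PD disaster recovery. Placeholders, not real values:
# CLUSTER_ID comes from the PD logs ("init cluster id"), and ALLOC_ID must be
# larger than any ID the old cluster ever allocated.
CLUSTER_ID="<cluster-id-from-pd-log>"
ALLOC_ID="100000000"
ENDPOINT="http://basic-pd-2.basic-pd-peer.tidb-cluster.svc:2379"
cmd="pd-recover -endpoints ${ENDPOINT} -cluster-id ${CLUSTER_ID} -alloc-id ${ALLOC_ID}"
echo "$cmd"   # print the command instead of executing it
```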

Why is only one PD still alive?

Are the other two in CrashLoopBackOff?

I suggest tracking down the specific cause; you can refer to this blog post:

I forgot to paste this output earlier; adding it now.
This line is unfamiliar to me:

Annotations:  kubernetes.io/limit-ranger: LimitRanger plugin set: memory request for container pd; memory limit for container pd
Name:         basic-pd-0
Namespace:    tidb-cluster
Priority:     0
Node:         node193.169.203.15/193.169.203.15
Start Time:   Thu, 25 Aug 2022 17:30:40 +0800
Labels:       app.kubernetes.io/component=pd
              app.kubernetes.io/instance=basic
              app.kubernetes.io/managed-by=tidb-operator
              app.kubernetes.io/name=tidb-cluster
              controller-revision-hash=basic-pd-766b5cb86
              statefulset.kubernetes.io/pod-name=basic-pd-0
Annotations:  kubernetes.io/limit-ranger: LimitRanger plugin set: memory request for container pd; memory limit for container pd
              prometheus.io/path: /metrics
              prometheus.io/port: 2379
              prometheus.io/scrape: true
Status:       Running
IP:           10.0.5.139
IPs:
  IP:           10.0.5.139
Controlled By:  StatefulSet/basic-pd
Containers:
  pd:
    Container ID:  docker://fc892a76a59b1eef82f45e9c54281aeee3495601e874d3ca3ad9ff6f2dafe597
    Image:         pingcap/pd:v5.2.1
    Image ID:      docker-pullable://pingcap/pd@sha256:e9766de6a85d3f262ac016e9a2421c8099f445eb70aba0741fbf8b7932ea117d
    Ports:         2380/TCP, 2379/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      /bin/sh
      /usr/local/bin/pd_start_script.sh
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Thu, 25 Aug 2022 22:17:37 +0800
      Finished:     Thu, 25 Aug 2022 22:17:52 +0800
    Ready:          False
    Restart Count:  58
    Limits:
      memory:  24Gi
    Requests:
      memory:  4Gi
    Environment:
      NAMESPACE:          tidb-cluster (v1:metadata.namespace)
      PEER_SERVICE_NAME:  basic-pd-peer
      SERVICE_NAME:       basic-pd
      SET_NAME:           basic-pd
      TZ:                 UTC
    Mounts:
      /etc/pd from config (ro)
      /etc/podinfo from annotations (ro)
      /usr/local/bin from startup-script (ro)
      /var/lib/pd from pd (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-8ntz6 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  pd:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pd-basic-pd-0
    ReadOnly:   false
  annotations:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      basic-pd-3731616
    Optional:  false
  startup-script:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      basic-pd-3731616
    Optional:  false
  default-token-8ntz6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-8ntz6
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                       From     Message
  ----     ------   ----                      ----     -------
  Normal   Pulled   38m (x52 over 4h48m)      kubelet  Container image "pingcap/pd:v5.2.1" already present on machine
  Warning  BackOff  3m48s (x1242 over 4h48m)  kubelet  Back-off restarting failed container

What is the current memory usage on your PD servers, and how are the PD server memory parameters configured?

That is the resource-limit annotation: the lower bound (request) is 4Gi and the upper bound (limit) is 24Gi.
kubectl delete pod basic-pd-0 -n tidb-cluster
You can delete this pod and see how the regenerated one behaves.

Let's stay focused on the etcd side. The crashing process prints these stacks in the log and then dies, so the entry point should be leader election / membership.

[2022/08/26 02:03:13.164 +00:00] [INFO] [stream.go:250] ["set message encoder"] [from=caab82c67f3f4ad1] [to=caab82c67f3f4ad1] [stream-type="stream MsgApp v2"]
[2022/08/26 02:03:13.164 +00:00] [WARN] [stream.go:277] ["established TCP streaming connection with remote peer"] [stream-writer-type="stream MsgApp v2"] [local-member-id=caab82c67f3f4ad1] [remote-peer-id=6b27cfc0d7490063]
[2022/08/26 02:03:13.177 +00:00] [ERROR] [etcdutil.go:70] ["failed to get cluster from remote"] [error="[PD:etcd:ErrEtcdGetCluster]could not retrieve cluster information from the given URLs"]
2022/08/26 02:03:13.177 log.go:85: [warning] etcdserver: [could not get cluster response from http://basic-pd-1.basic-pd-peer.tidb-cluster.svc:2380: Get "http://basic-pd-1.basic-pd-peer.tidb-cluster.svc:2380/members": dial tcp 10.0.3.33:2380: connect: connection refused]
[2022/08/26 02:03:13.390 +00:00] [PANIC] [cluster.go:460] ["failed to update; member unknown"] [cluster-id=d9e392fb342bfa96] [local-member-id=caab82c67f3f4ad1] [unknown-remote-peer-id=2b86c59db64a77fc]
panic: failed to update; member unknown
goroutine 418 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0008260c0, 0xc000826000, 0x3, 0x3)
        /nfs/cache/mod/go.uber.org/zap@v1.16.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc0000f64e0, 0x2759a56, 0x20, 0xc000826000, 0x3, 0x3)
        /nfs/cache/mod/go.uber.org/zap@v1.16.0/logger.go:226 +0x85
...
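One detail worth pulling out of that panic: the unknown-remote-peer-id is a member ID that the restarting node's etcd no longer recognizes, which is consistent with stale membership state left over from the earlier 5-to-3 scale-down (the crashing pods may still carry old member data on their PersistentVolumes). A small sketch extracting the ID from the log line; the sample line is copied verbatim from the output above:

```shell
# Extract the unknown peer id from the PD panic line (sample copied verbatim
# from the log above).
line='[2022/08/26 02:03:13.390 +00:00] [PANIC] [cluster.go:460] ["failed to update; member unknown"] [cluster-id=d9e392fb342bfa96] [local-member-id=caab82c67f3f4ad1] [unknown-remote-peer-id=2b86c59db64a77fc]'
unknown_id=$(printf '%s\n' "$line" | sed -n 's/.*unknown-remote-peer-id=\([0-9a-f]*\)].*/\1/p')
echo "unknown peer: $unknown_id"
```

If that ID does not appear in the surviving PD's current member list, the crashing node is likely replaying membership entries that reference a removed member; that would be worth checking before choosing between repair and pd-recover.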

At the moment only the basic-pd-2 pod is usable; basic-pd-1 never came up, which is why you see the error about http://basic-pd-1.basic-pd-peer.tidb-cluster.svc:2380.

From the logs, pd-2 was the first PD to start, pd-1 joined the cluster next, and pd-0 joined last.

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.