No. I manually set up mutual trust between the nodes that host the PD containers, and I still get the same error.
Networking is unified through the Service layer, but the logs clearly show a connectivity problem:
2022/08/25 07:20:22.573 log.go:85: [warning] etcdserver: [could not get cluster response from http://basic-pd-1.basic-pd-peer.tidb-cluster.svc:2380: Get "http://basic-pd-1.basic-pd-peer.tidb-cluster.svc:2380/members": dial tcp 10.0.3.16:2380: connect: connection refused]
Connecting things at the Service layer usually means a LoadBalancer, which is the better fit for TiDB.
Is the network between the Pods themselves reachable?
For deploying TiDB on Kubernetes, TiDB Operator is the recommended approach.
You can refer to this document:
https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/get-started
You can also install it directly through the marketplace…
Compare basic-pd-0.yaml against basic-pd-2.yaml.
Add a nodeSelector to pod 0 so it gets scheduled onto node14.
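As a sketch of that suggestion: with TiDB Operator, the supported place for a nodeSelector is the TidbCluster spec (spec.pd.nodeSelector), since a selector patched onto the Pod or StatefulSet directly would be reconciled away. The label value for node14 below is an assumption based on the node name mentioned above:

```yaml
# Hypothetical fragment of the TidbCluster manifest ("basic"):
# pin the PD Pods to node14 via the Operator's nodeSelector field.
spec:
  pd:
    nodeSelector:
      kubernetes.io/hostname: node14
```

Note that this pins all PD Pods, not just pd-0; placing individual Pods differently would require affinity rules instead.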
The spec sections are almost identical, and both Pods come from the same StatefulSet, so they shouldn't differ.
Scale the StatefulSet out with --replicas=4 to add one more Pod, and see whether the new Pod reports the same error.
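One way to run that experiment, sketched out below. Note that with TiDB Operator the supported way to scale PD is through the TidbCluster spec, because the Operator reconciles the StatefulSet back otherwise; the cluster name `basic` is taken from the Pod names in this thread:

```shell
# Bump PD from 3 to 4 replicas via the TidbCluster object ("tc" is the
# CRD's short name), rather than scaling the StatefulSet directly:
kubectl -n tidb-cluster patch tc basic --type merge \
  -p '{"spec":{"pd":{"replicas":4}}}'

# Then watch whether the new Pod hits the same CrashLoopBackOff:
kubectl -n tidb-cluster get pods -l app.kubernetes.io/component=pd -w
```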
It will. The cluster was in fact scaled down from 5 replicas to 3.
With 5 replicas, it looked like 2 were available; with 3, only 1 is available (doesn't that look like split-brain?).
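What's described here is quorum loss rather than split-brain: an etcd/PD group needs a strict majority of its members alive to elect a leader, and the majority size is just floor(n/2)+1:

```shell
# Quorum for an n-member Raft/etcd group: floor(n/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 5   # prints 3: 5 members need 3 alive; with only 2 up, no quorum
quorum 3   # prints 2: 3 members need 2 alive; with only 1 up, no quorum
```

So in both states the cluster simply has no leader at all; a true split-brain would require two simultaneous leaders, which these quorum rules make impossible.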
Run kubectl describe pod basic-pd-0 -n tidb-cluster and see what it shows.
Option 1: first find out why pd-0 and pd-1 are crashing and fix that, then check whether the 3-replica PD group can recover on its own.
Option 2: since a majority of the replicas have already failed, the cluster should be unavailable; you can follow the Operator's pd-recover procedure for disaster recovery.
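For option 2, the rough pd-recover flow from the TiDB Operator disaster-recovery docs looks like the sketch below. The endpoint is built from this thread's Pod names, while the cluster ID and alloc ID are placeholders that must be recovered from your own PD logs; treat every concrete value as an assumption:

```shell
# Disaster-recovery sketch (all values are placeholders):
# 1. Find the old cluster ID in a surviving PD member's log
#    (look for "init cluster id" entries).
# 2. Point pd-recover at a reachable PD endpoint with that ID and a
#    deliberately large alloc ID so new allocations don't collide:
pd-recover -endpoints http://basic-pd-2.basic-pd-peer.tidb-cluster.svc:2379 \
  -cluster-id <old-cluster-id> \
  -alloc-id 100000000
# 3. Restart the PD Pods so they come back with the recovered metadata.
```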
I forgot to paste this output earlier, so here it is now.
This line is unfamiliar to me:
Annotations: kubernetes.io/limit-ranger: LimitRanger plugin set: memory request for container pd; memory limit for container pd
Name: basic-pd-0
Namespace: tidb-cluster
Priority: 0
Node: node193.169.203.15/193.169.203.15
Start Time: Thu, 25 Aug 2022 17:30:40 +0800
Labels: app.kubernetes.io/component=pd
app.kubernetes.io/instance=basic
app.kubernetes.io/managed-by=tidb-operator
app.kubernetes.io/name=tidb-cluster
controller-revision-hash=basic-pd-766b5cb86
statefulset.kubernetes.io/pod-name=basic-pd-0
Annotations: kubernetes.io/limit-ranger: LimitRanger plugin set: memory request for container pd; memory limit for container pd
prometheus.io/path: /metrics
prometheus.io/port: 2379
prometheus.io/scrape: true
Status: Running
IP: 10.0.5.139
IPs:
IP: 10.0.5.139
Controlled By: StatefulSet/basic-pd
Containers:
pd:
Container ID: docker://fc892a76a59b1eef82f45e9c54281aeee3495601e874d3ca3ad9ff6f2dafe597
Image: pingcap/pd:v5.2.1
Image ID: docker-pullable://pingcap/pd@sha256:e9766de6a85d3f262ac016e9a2421c8099f445eb70aba0741fbf8b7932ea117d
Ports: 2380/TCP, 2379/TCP
Host Ports: 0/TCP, 0/TCP
Command:
/bin/sh
/usr/local/bin/pd_start_script.sh
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Thu, 25 Aug 2022 22:17:37 +0800
Finished: Thu, 25 Aug 2022 22:17:52 +0800
Ready: False
Restart Count: 58
Limits:
memory: 24Gi
Requests:
memory: 4Gi
Environment:
NAMESPACE: tidb-cluster (v1:metadata.namespace)
PEER_SERVICE_NAME: basic-pd-peer
SERVICE_NAME: basic-pd
SET_NAME: basic-pd
TZ: UTC
Mounts:
/etc/pd from config (ro)
/etc/podinfo from annotations (ro)
/usr/local/bin from startup-script (ro)
/var/lib/pd from pd (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-8ntz6 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
pd:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: pd-basic-pd-0
ReadOnly: false
annotations:
Type: DownwardAPI (a volume populated by information about the pod)
Items:
metadata.annotations -> annotations
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: basic-pd-3731616
Optional: false
startup-script:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: basic-pd-3731616
Optional: false
default-token-8ntz6:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-8ntz6
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 38m (x52 over 4h48m) kubelet Container image "pingcap/pd:v5.2.1" already present on machine
Warning BackOff 3m48s (x1242 over 4h48m) kubelet Back-off restarting failed container
How is the memory usage on your PD servers right now? And how are the PD server's memory parameters configured?
That annotation is just the resource constraint: a 4Gi request (lower bound) and a 24Gi limit (upper bound).
kubectl delete pod basic-pd-0 -n tidb-cluster
You can delete this Pod and see how the regenerated one behaves.
Let's stay focused on the etcd side. The crashing process prints these stack traces in the log and then dies, so the entry point should be leader election.
[2022/08/26 02:03:13.164 +00:00] [INFO] [stream.go:250] ["set message encoder"] [from=caab82c67f3f4ad1] [to=caab82c67f3f4ad1] [stream-type="stream MsgApp v2"]
[2022/08/26 02:03:13.164 +00:00] [WARN] [stream.go:277] ["established TCP streaming connection with remote peer"] [stream-writer-type="stream MsgApp v2"] [local-member-id=caab82c67f3f4ad1] [remote-peer-id=6b27cfc0d7490063]
[2022/08/26 02:03:13.177 +00:00] [ERROR] [etcdutil.go:70] ["failed to get cluster from remote"] [error="[PD:etcd:ErrEtcdGetCluster]could not retrieve cluster information from the given URLs"]
2022/08/26 02:03:13.177 log.go:85: [warning] etcdserver: [could not get cluster response from http://basic-pd-1.basic-pd-peer.tidb-cluster.svc:2380: Get "http://basic-pd-1.basic-pd-peer.tidb-cluster.svc:2380/members": dial tcp 10.0.3.33:2380: connect: connection refused]
[2022/08/26 02:03:13.390 +00:00] [PANIC] [cluster.go:460] ["failed to update; member unknown"] [cluster-id=d9e392fb342bfa96] [local-member-id=caab82c67f3f4ad1] [unknown-remote-peer-id=2b86c59db64a77fc]
panic: failed to update; member unknown
goroutine 418 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0008260c0, 0xc000826000, 0x3, 0x3)
/nfs/cache/mod/go.uber.org/zap@v1.16.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc0000f64e0, 0x2759a56, 0x20, 0xc000826000, 0x3, 0x3)
/nfs/cache/mod/go.uber.org/zap@v1.16.0/logger.go:226 +0x85
...
From what I can see, only the basic-pd-2 Pod is usable; basic-pd-1 never came up, which is why it reports the error about http://basic-pd-1.basic-pd-peer.tidb-cluster.svc:2380…