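Bug report
After events such as an individual server node in the cluster failing and recovering, the PVC-to-PV binding for PD or TiKV goes wrong, leaving the PD or TiKV Pod stuck in Pending and unable to start.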
【TiDB version】 6.5.8
【Impact of the bug】 Once this failure occurs, some nodes of the cluster become unavailable.
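【Possible steps to reproduce】
1. While the cluster is running, an individual server node goes offline because of a node or network failure, then comes back online after operations staff repair it.
2. After the cluster has been running for some time, an individual server node is physically damaged and a new node is added.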
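【Unexpected behavior observed】 The PV belonging to the failed node ends up associated with the wrong PVC. My guess is that TiDB Operator acted on the PVC, so the original PVC was deleted, but the PV was never properly released and remains in the Released state. The new PVC cannot bind to it, so the corresponding Pod stays Pending. This happens with both PD and TiKV. Other common third-party open-source components that are also deployed as StatefulSets, and rely only on Kubernetes for management, do not have this problem.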
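【Expected behavior】 Judging from our experience with other third-party open-source components, TiDB Operator does not need to do anything to the PVC. If it simply left the existing binding in place, the bindings would not get mixed up.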
【Related components and versions】 TiDB Operator 1.4, 1.5
【Other background information or screenshots】
node-1:/ # kubectl get pv -n namespace | grep pd
pv-local-tidb-pd-namespace-node-1 1Gi RWO Retain Bound namespace/pd-basic-pd-2 tidb-pd-storage-namespace 30h
pv-local-tidb-pd-namespace-node-2 1Gi RWO Retain Bound namespace/pd-basic-pd-1 tidb-pd-storage-namespace 30h
pv-local-tidb-pd-namespace-node-3 1Gi RWO Retain Released namespace/pd-basic-pd-0 tidb-pd-storage-namespace 30h
node-1:/ # kubectl get pvc -n namespace | grep pd
pd-basic-pd-0 Pending tidb-pd-storage-namespace 29h
pd-basic-pd-1 Bound pv-local-tidb-pd-namespace-node-2 1Gi RWO tidb-pd-storage-namespace 30h
pd-basic-pd-2 Bound pv-local-tidb-pd-namespace-node-1 1Gi RWO tidb-pd-storage-namespace 30h
pd-basic-pd-3 Pending tidb-pd-storage-namespace 29h
node-1:/ # kubectl describe pv -n namespace pv-local-tidb-pd-namespace-node-3
Name: pv-local-tidb-pd-namespace-node-3
Labels: app.kubernetes.io/component=pd
app.kubernetes.io/instance=basic
app.kubernetes.io/managed-by=tidb-operator
app.kubernetes.io/name=tidb-cluster
app.kubernetes.io/namespace=namespace
ns=namespace
tidb.pingcap.com/cluster-id=7413947050473649857
tidb.pingcap.com/member-id=14365345130030816002
Annotations: pv.kubernetes.io/bound-by-controller: yes
tidb.pingcap.com/pod-name: basic-pd-0
Finalizers: [kubernetes.io/pv-protection]
StorageClass: tidb-pd-storage-namespace
Status: Released
Claim: namespace/pd-basic-pd-0
Reclaim Policy: Retain
Access Modes: RWO
VolumeMode: Filesystem
Capacity: 1Gi
Node Affinity:
Required Terms:
Term 0: kubernetes.io/hostname in [node-3]
Message:
Source:
Type: LocalVolume (a persistent volume backed by local storage on a node)
Path: /home/datas/namespace/tidb/pd
Events:
node-1:/ # kubectl describe pvc -n namespace pd-basic-pd-0
Name: pd-basic-pd-0
Namespace: namespace
StorageClass: tidb-pd-storage-namespace
Status: Pending
Volume:
Labels: app.kubernetes.io/component=pd
app.kubernetes.io/instance=basic
app.kubernetes.io/managed-by=tidb-operator
app.kubernetes.io/name=tidb-cluster
tidb.pingcap.com/cluster-id=7413947050473649857
tidb.pingcap.com/pod-name=basic-pd-0
Annotations: tidb.pingcap.com/pod-name: basic-pd-0
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode: Filesystem
Used By: basic-pd-0
Events:
Type Reason Age From Message
Normal WaitForPodScheduled 2m44s (x6964 over 29h) persistentvolume-controller waiting for pod basic-pd-0 to be scheduled
node-1:/ # kubectl describe pvc -n namespace pd-basic-pd-3
Name: pd-basic-pd-3
Namespace: namespace
StorageClass: tidb-pd-storage-namespace
Status: Pending
Volume:
Labels: app.kubernetes.io/component=pd
app.kubernetes.io/instance=basic
app.kubernetes.io/managed-by=tidb-operator
app.kubernetes.io/name=tidb-cluster
tidb.pingcap.com/cluster-id=7413947050473649857
tidb.pingcap.com/pod-name=basic-pd-3
Annotations: tidb.pingcap.com/pod-name: basic-pd-3
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode: Filesystem
Used By: basic-pd-3
Events:
Type Reason Age From Message
Normal WaitForPodScheduled 2m21s (x6984 over 29h) persistentvolume-controller waiting for pod basic-pd-3 to be scheduled
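The listings above can also be cross-checked programmatically. Below is a minimal Python sketch, assuming the PV and PVC lists have been exported with `kubectl get pv -o json` and `kubectl get pvc -n namespace -o json` and that their `items` arrays are passed in. The function name `find_stuck_claims` and the sample data are hypothetical and only mirror the fields shown in this report.

```python
def find_stuck_claims(pv_items, pvc_items):
    """Return (pv_name, pvc_name) pairs where a Released PV still
    references a PVC that currently exists and is Pending."""
    # Index Pending PVCs by (namespace, name).
    pending = {
        (c["metadata"].get("namespace"), c["metadata"]["name"])
        for c in pvc_items
        if c.get("status", {}).get("phase") == "Pending"
    }
    stuck = []
    for pv in pv_items:
        if pv.get("status", {}).get("phase") != "Released":
            continue
        ref = pv.get("spec", {}).get("claimRef") or {}
        key = (ref.get("namespace"), ref.get("name"))
        if key in pending:
            stuck.append((pv["metadata"]["name"], ref["name"]))
    return stuck


# Sample data reduced to the fields used above (values taken from this report).
SAMPLE_PVS = [
    {"metadata": {"name": "pv-local-tidb-pd-namespace-node-1"},
     "spec": {"claimRef": {"namespace": "namespace", "name": "pd-basic-pd-2"}},
     "status": {"phase": "Bound"}},
    {"metadata": {"name": "pv-local-tidb-pd-namespace-node-3"},
     "spec": {"claimRef": {"namespace": "namespace", "name": "pd-basic-pd-0"}},
     "status": {"phase": "Released"}},
]
SAMPLE_PVCS = [
    {"metadata": {"namespace": "namespace", "name": "pd-basic-pd-0"},
     "status": {"phase": "Pending"}},
    {"metadata": {"namespace": "namespace", "name": "pd-basic-pd-2"},
     "status": {"phase": "Bound"}},
]

if __name__ == "__main__":
    for pv_name, pvc_name in find_stuck_claims(SAMPLE_PVS, SAMPLE_PVCS):
        print(f"{pv_name} is Released but still references Pending PVC {pvc_name}")
```

A match here only flags the symptom (a Released PV and a Pending PVC sharing a name). It does not say which component deleted the original PVC.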
The details of pv-local-tidb-pd-namespace-node-3 are as follows:
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"PersistentVolume","metadata":{"annotations":{},"labels":{"ns":"namespace"},"name":"pv-local-tidb-pd-namespace-node-3"},"spec":{"accessModes":["ReadWriteOnce"],"capacity":{"storage":"1Gi"},"local":{"path":"/home/datas/namespace/tidb/pd"},"nodeAffinity":{"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"kubernetes.io/hostname","operator":"In","values":["node-3"]}]}]}},"persistentVolumeReclaimPolicy":"Retain","storageClassName":"tidb-pd-storage-namespace","volumeMode":"Filesystem"}}
    pv.kubernetes.io/bound-by-controller: "yes"
    tidb.pingcap.com/pod-name: basic-pd-0
  creationTimestamp: "2024-09-13T02:20:39Z"
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    app.kubernetes.io/component: pd
    app.kubernetes.io/instance: basic
    app.kubernetes.io/managed-by: tidb-operator
    app.kubernetes.io/name: tidb-cluster
    app.kubernetes.io/namespace: namespace
    ns: namespace
    tidb.pingcap.com/cluster-id: "7413947050473649857"
    tidb.pingcap.com/member-id: "14365345130030816002"
  name: pv-local-tidb-pd-namespace-node-3
  resourceVersion: "23887"
  uid: 3f26a940-358a-48d1-8a5e-d36a05144fac
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: pd-basic-pd-0
    namespace: namespace
    resourceVersion: "3257"
    uid: b7849fb3-ad1a-42b9-ad11-a678287aadfe
  local:
    path: /home/datas/namespace/tidb/pd
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - node-3
  persistentVolumeReclaimPolicy: Retain
  storageClassName: tidb-pd-storage-namespace
  volumeMode: Filesystem
status:
  phase: Released
However, the corresponding PVC pd-basic-pd-0 looks like this:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    tidb.pingcap.com/pod-name: basic-pd-0
  creationTimestamp: "2024-09-13T03:29:05Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app.kubernetes.io/component: pd
    app.kubernetes.io/instance: basic
    app.kubernetes.io/managed-by: tidb-operator
    app.kubernetes.io/name: tidb-cluster
    tidb.pingcap.com/cluster-id: "7413947050473649857"
    tidb.pingcap.com/pod-name: basic-pd-0
  name: pd-basic-pd-0
  namespace: namespace
  resourceVersion: "24071"
  uid: d3c1ce79-2fc7-48dc-8323-a2321fb3454d
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: tidb-pd-storage-namespace
  volumeMode: Filesystem
status:
  phase: Pending
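Comparing the two objects above: the PV's `spec.claimRef.uid` is b7849fb3-ad1a-42b9-ad11-a678287aadfe, while the PVC that now carries the name pd-basic-pd-0 has `metadata.uid` d3c1ce79-2fc7-48dc-8323-a2321fb3454d and was created about an hour after the PV. The Python sketch below models, in simplified form, the check that I understand the Kubernetes PV controller to apply when deciding whether a PV is already reserved for a given claim. The helper names `claim_ref_matches` and `claim_ref_reset_patch` are hypothetical, and the logic is an approximation for illustration, not the controller's actual code.

```python
import json


def claim_ref_matches(pv, pvc):
    """Simplified model: a PV counts as reserved for a PVC only if
    namespace and name match and, when claimRef.uid is set, the uid
    matches too."""
    ref = pv.get("spec", {}).get("claimRef")
    if not ref:
        return False
    meta = pvc["metadata"]
    if ref.get("namespace") != meta.get("namespace"):
        return False
    if ref.get("name") != meta.get("name"):
        return False
    ref_uid = ref.get("uid")
    # An empty uid means "pre-bound by name only"; any PVC with that name fits.
    return not ref_uid or ref_uid == meta.get("uid")


def claim_ref_reset_patch():
    """JSON merge patch body that removes spec.claimRef from a PV."""
    return json.dumps({"spec": {"claimRef": None}})


# Values copied from the two manifests in this report.
PV = {"spec": {"claimRef": {
    "namespace": "namespace",
    "name": "pd-basic-pd-0",
    "uid": "b7849fb3-ad1a-42b9-ad11-a678287aadfe",
}}}
PVC = {"metadata": {
    "namespace": "namespace",
    "name": "pd-basic-pd-0",
    "uid": "d3c1ce79-2fc7-48dc-8323-a2321fb3454d",
}}

if __name__ == "__main__":
    print("PV reserved for current PVC:", claim_ref_matches(PV, PVC))
    print("patch body:", claim_ref_reset_patch())
```

A commonly used manual workaround for a Retain-policy PV stuck in Released is to clear the stale reference, for example `kubectl patch pv pv-local-tidb-pd-namespace-node-3 --type merge -p '{"spec":{"claimRef":null}}'`, after which the PV should become Available again. Whether the old data under the local path should be reused by PD or TiKV afterwards is a separate question that this sketch does not address.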