TiDB deployed on Kubernetes: after scaling in PD, scaling it back out fails


After scaling in, the removed PD member shows up under unjoinedMembers in the TidbCluster status. Now I want to scale back out, but the scale-out fails. Even after deleting the PV and PVC and reformatting the volume, scaling out still does not work.
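For reference, with TiDB Operator the intended way to scale PD is to change spec.pd.replicas on the TidbCluster object and let the operator reconcile the StatefulSet; a minimal sketch, using the namespace and cluster name that appear in the outputs below:

# Sketch only: request the PD scale-out through the TidbCluster CR, not the StatefulSet.
kubectl patch tidbcluster tidb-cluster-1608600848 -n jinfan \
  --type merge -p '{"spec":{"pd":{"replicas":3}}}'

# Verify what the operator thinks the target size is.
kubectl get tidbcluster tidb-cluster-1608600848 -n jinfan \
  -o jsonpath='{.spec.pd.replicas}{"\n"}'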

  1. What does pd-ctl show for the current member info?
  2. Please share the output of kubectl get pod -n -o wide
  3. Check the error messages with kubectl logs -n

1. pd-ctl current member info:

{
  "header": {
    "cluster_id": 6916490224968328765
  },
  "members": [
    {
      "name": "tidb-cluster-1608600848-pd-0",
      "member_id": 11528851186513975627,
      "peer_urls": [
        "http://tidb-cluster-1608600848-pd-1.tidb-cluster-1608600848-pd-peer.jinfan.svc:2380"
      ],
      "client_urls": [
        "http://tidb-cluster-1608600848-pd-0.tidb-cluster-1608600848-pd-peer.jinfan.svc:2379"
      ],
      "deploy_path": "/",
      "binary_version": "v4.0.7",
      "git_hash": "8b0348f545611d5955e32fdcf3c57a3f73657d77"
    }
  ],
  "leader": {
    "name": "tidb-cluster-1608600848-pd-0",
    "member_id": 11528851186513975627,
    "peer_urls": [
      "http://tidb-cluster-1608600848-pd-1.tidb-cluster-1608600848-pd-peer.jinfan.svc:2380"
    ],
    "client_urls": [
      "http://tidb-cluster-1608600848-pd-0.tidb-cluster-1608600848-pd-peer.jinfan.svc:2379"
    ]
  },
  "etcd_leader": {
    "name": "tidb-cluster-1608600848-pd-0",
    "member_id": 11528851186513975627,
    "peer_urls": [
      "http://tidb-cluster-1608600848-pd-1.tidb-cluster-1608600848-pd-peer.jinfan.svc:2380"
    ],
    "client_urls": [
      "http://tidb-cluster-1608600848-pd-0.tidb-cluster-1608600848-pd-peer.jinfan.svc:2379"
    ],
    "deploy_path": "/",
    "binary_version": "v4.0.7",
    "git_hash": "8b0348f545611d5955e32fdcf3c57a3f73657d77"
  }
}
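Note that the member list above contains only pd-0 even though three replicas are requested, so pd-1 never joined the etcd cluster; the single remaining member is also named pd-0 while advertising pd-1's peer URL, which already looks inconsistent. The same check can be run from inside the cluster; a sketch, assuming the pd-ctl binary is shipped in the PD image at /pd-ctl:

# Sketch: query the PD member list from the running pd-0 pod.
kubectl exec -n jinfan tidb-cluster-1608600848-pd-0 -- /pd-ctl -u http://127.0.0.1:2379 member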

2. kubectl get pod -n -o wide

NAME                                                 READY   STATUS    RESTARTS   AGE     IP            NODE           NOMINATED NODE   READINESS GATES
tidb-cluster-1608600848-discovery-5df69cc7cd-vf7mk   1/1     Running   0          3h8m    10.244.2.41   k8s-node02     <none>           <none>
tidb-cluster-1608600848-monitor-699bbf5875-rdl49     3/3     Running   0          3h8m    10.244.1.55   k8s-node01     <none>           <none>
tidb-cluster-1608600848-pd-0                         1/1     Running   0          19m     10.244.1.66   k8s-node01     <none>           <none>
tidb-cluster-1608600848-pd-1                         1/1     Running   5          5m58s   10.244.2.51   k8s-node02     <none>           <none>
tidb-cluster-1608600848-tidb-0                       2/2     Running   0          174m    10.244.1.62   k8s-node01     <none>           <none>
tidb-cluster-1608600848-tidb-1                       2/2     Running   0          173m    10.244.0.51   k8s-master01   <none>           <none>
tidb-cluster-1608600848-tikv-0                       1/1     Running   0          3h4m    10.244.2.47   k8s-node02     <none>           <none>
tidb-cluster-1608600848-tikv-1                       1/1     Running   0          3h1m    10.244.1.61   k8s-node01     <none>           <none>
tidb-cluster-1608600848-tikv-2                       1/1     Running   0          3h1m    10.244.0.50   k8s-master01   <none>           <none>

3. kubectl logs -njinfan tidb-cluster-1608600848-pd-1

Name:      tidb-cluster-1608600848-pd-1.tidb-cluster-1608600848-pd-peer.jinfan.svc
Address 1: 10.244.2.51 tidb-cluster-1608600848-pd-1.tidb-cluster-1608600848-pd-peer.jinfan.svc.cluster.local
nslookup domain tidb-cluster-1608600848-pd-1.tidb-cluster-1608600848-pd-peer.jinfan.svc.svc success
starting pd-server ...
/pd-server --data-dir=/var/lib/pd --name=tidb-cluster-1608600848-pd-1 --peer-urls=http://0.0.0.0:2380 --advertise-peer-urls=http://tidb-cluster-1608600848-pd-1.tidb-cluster-1608600848-pd-peer.jinfan.svc:2380 --client-urls=http://0.0.0.0:2379 --advertise-client-urls=http://tidb-cluster-1608600848-pd-1.tidb-cluster-1608600848-pd-peer.jinfan.svc:2379 --config=/etc/pd/pd.toml --join=http://tidb-cluster-1608600848-pd-1.tidb-cluster-1608600848-pd-peer.jinfan.svc:2379
[2021/01/11 18:03:22.938 +00:00] [INFO] [util.go:42] ["Welcome to Placement Driver (PD)"]
[2021/01/11 18:03:22.938 +00:00] [INFO] [util.go:43] [PD] [release-version=v4.0.7]
[2021/01/11 18:03:22.938 +00:00] [INFO] [util.go:44] [PD] [edition=Community]
[2021/01/11 18:03:22.938 +00:00] [INFO] [util.go:45] [PD] [git-hash=8b0348f545611d5955e32fdcf3c57a3f73657d77]
[2021/01/11 18:03:22.938 +00:00] [INFO] [util.go:46] [PD] [git-branch=heads/refs/tags/v4.0.7]
[2021/01/11 18:03:22.938 +00:00] [INFO] [util.go:47] [PD] [utc-build-time="2020-09-29 06:52:41"]
[2021/01/11 18:03:22.938 +00:00] [INFO] [metricutil.go:81] ["disable Prometheus push client"]
[2021/01/11 18:03:22.938 +00:00] [FATAL] [main.go:94] ["join meet error"] [error="join self is forbidden"] [stack="github.com/pingcap/log.Fatal\
\t/home/jenkins/agent/workspace/build_pd_multi_branch_v4.0.7/go/pkg/mod/github.com/pingcap/log@v0.0.0-20200511115504-543df19646ad/global.go:59\
main.main\
\t/home/jenkins/agent/workspace/build_pd_multi_branch_v4.0.7/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:94\
runtime.main\
\t/usr/local/go/src/runtime/proc.go:203"]
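The fatal error matches the command line above: pd-1 is started with --join pointing at its own advertise URL (…pd-1…:2379), which PD rejects with "join self is forbidden". With TiDB Operator the join target is produced by the discovery service and should reference an already-running member; as an illustration only (the operator generates this argument, it is not typed by hand), a healthy invocation would join through pd-0 instead:

# Illustrative sketch: --join should reference an existing, healthy member (e.g. pd-0),
# not the new pod's own advertise-client-url.
/pd-server --data-dir=/var/lib/pd --name=tidb-cluster-1608600848-pd-1 \
  --peer-urls=http://0.0.0.0:2380 \
  --advertise-peer-urls=http://tidb-cluster-1608600848-pd-1.tidb-cluster-1608600848-pd-peer.jinfan.svc:2380 \
  --client-urls=http://0.0.0.0:2379 \
  --advertise-client-urls=http://tidb-cluster-1608600848-pd-1.tidb-cluster-1608600848-pd-peer.jinfan.svc:2379 \
  --config=/etc/pd/pd.toml \
  --join=http://tidb-cluster-1608600848-pd-0.tidb-cluster-1608600848-pd-peer.jinfan.svc:2379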
  1. What is PD's current replicas value?
  2. Please check the controller and scheduler logs in the tidb-admin namespace for errors, thanks.
  3. Is pd-1 currently Running? Is it still being re-added? Is the status constantly changing and ultimately ending in failure?
  4. Just to confirm: scaling out means increasing pd.spec.replicas and scaling in means decreasing it, with nothing else changed, right?

1. PD's replicas is 3.
2. There are error logs; there are quite a lot, so here is the last part.

kubectl logs -ntidb-admin tidb-controller-manager-6fc84b4c-rz8lp

E0111 18:53:04.523910       1 tidb_cluster_controller.go:123] TidbCluster: jinfan/tidb-cluster-1608600848, sync failed TidbCluster: jinfan/tidb-cluster-1608600848's pd 1/2 is ready, can't scale out now, requeuing
I0111 18:54:53.898897       1 pd_scaler.go:55] scaling out pd statefulset jinfan/tidb-cluster-1608600848-pd, ordinal: 2 (replicas: 3, delete slots: [])
I0111 18:54:53.898983       1 scaler.go:74] Cluster jinfan/tidb-cluster-1608600848 list pvc not found, selector: app.kubernetes.io/instance=tidb-cluster-1608600848,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster,tidb.pingcap.com/pod-name=tidb-cluster-1608600848-pd-2
I0111 18:54:53.917769       1 tidbcluster_control.go:66] TidbCluster: [jinfan/tidb-cluster-1608600848] updated successfully
E0111 18:54:53.917807       1 tidb_cluster_controller.go:123] TidbCluster: jinfan/tidb-cluster-1608600848, sync failed TidbCluster: jinfan/tidb-cluster-1608600848's pd 1/2 is ready, can't scale out now, requeuing
I0111 18:54:53.982001       1 pd_scaler.go:55] scaling out pd statefulset jinfan/tidb-cluster-1608600848-pd, ordinal: 2 (replicas: 3, delete slots: [])
I0111 18:54:53.982093       1 scaler.go:74] Cluster jinfan/tidb-cluster-1608600848 list pvc not found, selector: app.kubernetes.io/instance=tidb-cluster-1608600848,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster,tidb.pingcap.com/pod-name=tidb-cluster-1608600848-pd-2
I0111 18:54:54.004512       1 tidbcluster_control.go:66] TidbCluster: [jinfan/tidb-cluster-1608600848] updated successfully
E0111 18:54:54.004553       1 tidb_cluster_controller.go:123] TidbCluster: jinfan/tidb-cluster-1608600848, sync failed TidbCluster: jinfan/tidb-cluster-1608600848's pd 1/2 is ready, can't scale out now, requeuing
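The controller refuses to scale out while only 1 of the 2 expected PD pods is ready, so it keeps requeuing. The PD status the operator records on the TidbCluster object (members, failureMembers, unjoinedMembers) can be read directly; a sketch:

# Sketch: dump the PD status block recorded on the TidbCluster object.
kubectl get tidbcluster tidb-cluster-1608600848 -n jinfan -o jsonpath='{.status.pd}{"\n"}'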

kubectl logs -ntidb-admin tidb-scheduler-594c78f746-wsx59 tidb-scheduler

I0111 17:41:18.754090       1 ha.go:357] update PVC: [jinfan/pd-tidb-cluster-1608600848-pd-0] successfully, TidbCluster: tidb-cluster-1608600848
I0111 17:41:18.754121       1 ha.go:324] ha: delete pvc jinfan/pd-tidb-cluster-1608600848-pd-0 annotation tidb.pingcap.com/pod-scheduling successfully
I0111 17:41:18.757541       1 ha.go:357] update PVC: [jinfan/pd-tidb-cluster-1608600848-pd-1] successfully, TidbCluster: tidb-cluster-1608600848
I0111 17:41:18.757567       1 ha.go:401] ha: set pvc jinfan/pd-tidb-cluster-1608600848-pd-1 annotation tidb.pingcap.com/pod-scheduling to 2021-01-11T17:41:18Z successfully
I0111 17:41:18.761652       1 scheduler.go:131] leaving predicate: HAScheduling, nodes: [k8s-master01]
I0111 17:43:26.694870       1 scheduler.go:126] scheduling pod: jinfan/tidb-cluster-1608600848-pd-0
I0111 17:43:26.694912       1 scheduler.go:129] entering predicate: HAScheduling, nodes: [k8s-node01]
I0111 17:43:26.712275       1 ha.go:357] update PVC: [jinfan/pd-tidb-cluster-1608600848-pd-0] successfully, TidbCluster: tidb-cluster-1608600848
I0111 17:43:26.712301       1 ha.go:401] ha: set pvc jinfan/pd-tidb-cluster-1608600848-pd-0 annotation tidb.pingcap.com/pod-scheduling to 2021-01-11T17:43:26Z successfully
I0111 17:43:26.714641       1 scheduler.go:131] leaving predicate: HAScheduling, nodes: [k8s-node01]
I0111 17:59:47.470556       1 scheduler.go:126] scheduling pod: jinfan/tidb-cluster-1608600848-pd-1
I0111 17:59:47.470593       1 scheduler.go:129] entering predicate: HAScheduling, nodes: [k8s-master01 k8s-node02]
I0111 17:59:47.487878       1 ha.go:357] update PVC: [jinfan/pd-tidb-cluster-1608600848-pd-1] successfully, TidbCluster: tidb-cluster-1608600848
I0111 17:59:47.487905       1 ha.go:401] ha: set pvc jinfan/pd-tidb-cluster-1608600848-pd-1 annotation tidb.pingcap.com/pod-scheduling to 2021-01-11T17:59:47Z successfully
I0111 17:59:47.500117       1 ha.go:123] ha: tidbcluster jinfan/tidb-cluster-1608600848 component pd replicas 3
I0111 17:59:47.500141       1 ha.go:131] current topology key: kubernetes.io/hostname
I0111 17:59:47.502830       1 ha.go:169] pod tidb-cluster-1608600848-pd-0 is not bind
I0111 17:59:47.502851       1 ha.go:169] pod tidb-cluster-1608600848-pd-1 is not bind
I0111 17:59:47.502872       1 scheduler.go:131] leaving predicate: HAScheduling, nodes: [k8s-master01 k8s-node02]

kubectl logs -ntidb-admin tidb-scheduler-594c78f746-wsx59 kube-scheduler

I0111 17:01:47.253897       1 factory.go:445] "Unable to schedule pod; no fit; waiting" pod="jinfan/tidb-cluster-1608600848-pd-1" err="0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims."
I0111 17:01:47.264478       1 factory.go:445] "Unable to schedule pod; no fit; waiting" pod="jinfan/tidb-cluster-1608600848-pd-1" err="0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims."
I0111 17:01:50.076275       1 scheduler.go:597] "Successfully bound pod to node" pod="jinfan/tidb-cluster-1608600848-pd-1" node="k8s-master01" evaluatedNodes=3 feasibleNodes=1
I0111 17:38:36.720647       1 scheduler.go:597] "Successfully bound pod to node" pod="jinfan/tidb-cluster-1608600848-pd-0" node="k8s-node01" evaluatedNodes=3 feasibleNodes=1
I0111 17:41:18.766745       1 scheduler.go:597] "Successfully bound pod to node" pod="jinfan/tidb-cluster-1608600848-pd-1" node="k8s-master01" evaluatedNodes=3 feasibleNodes=1
I0111 17:43:26.719619       1 scheduler.go:597] "Successfully bound pod to node" pod="jinfan/tidb-cluster-1608600848-pd-0" node="k8s-node01" evaluatedNodes=3 feasibleNodes=1
I0111 17:57:22.566480       1 factory.go:445] "Unable to schedule pod; no fit; waiting" pod="jinfan/tidb-cluster-1608600848-pd-1" err="0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims."
I0111 17:57:22.644277       1 factory.go:445] "Unable to schedule pod; no fit; waiting" pod="jinfan/tidb-cluster-1608600848-pd-1" err="0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims."
I0111 17:58:31.174920       1 factory.go:445] "Unable to schedule pod; no fit; waiting" pod="jinfan/tidb-cluster-1608600848-pd-1" err="0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims."
I0111 17:59:14.236038       1 factory.go:445] "Unable to schedule pod; no fit; waiting" pod="jinfan/tidb-cluster-1608600848-pd-1" err="0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims."
I0111 17:59:37.339404       1 factory.go:445] "Unable to schedule pod; no fit; waiting" pod="jinfan/tidb-cluster-1608600848-pd-1" err="0/3 nodes are available: 3 node(s) didn't find available persistent volumes to bind."
I0111 17:59:37.350504       1 factory.go:445] "Unable to schedule pod; no fit; waiting" pod="jinfan/tidb-cluster-1608600848-pd-1" err="0/3 nodes are available: 3 node(s) didn't find available persistent volumes to bind."
I0111 17:59:47.506483       1 scheduler_binder.go:493] claim "jinfan/pd-tidb-cluster-1608600848-pd-1" bound to volume "local-pv-9e9eb2ef"
I0111 17:59:48.516665       1 scheduler.go:597] "Successfully bound pod to node" pod="jinfan/tidb-cluster-1608600848-pd-1" node="k8s-node02" evaluatedNodes=3 feasibleNodes=2
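The kube-scheduler entries above are about local PV binding: pd-1's PVC stayed unbound until a matching local PV (local-pv-9e9eb2ef) became available on k8s-node02. When a scale-out seems stuck at scheduling, checking the PVC and PV state directly is a quick sanity check; a sketch:

# Sketch: check whether the PD PVCs are bound and which local PVs are available.
kubectl get pvc -n jinfan | grep pd
kubectl get pv | grep local-pv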

3. pd-1 stays Running only for a while, then goes into CrashLoopBackOff.

4. At first I only changed pd.spec.replicas. Later, because of the problem, I also tried changing pd.spec.maxFailoverCount, and I also changed replicas directly on the PD StatefulSet. Is it the case that the StatefulSet replicas must not be modified directly?

Sorry, at this point a lot of things look inconsistent; resources like the StatefulSet cannot be modified directly, and I don't see a good way to repair this in place. If possible, try exporting the data, recreating the cluster, and then importing the data back. Thanks.
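If the export route is taken, one common tool in this version range is Dumpling; a sketch only, with the host, credentials, and output path as placeholders:

# Sketch: export all databases as SQL files; <tidb-host> and credentials are placeholders.
./dumpling -h <tidb-host> -P 4000 -u root --filetype sql -o /tmp/tidb-dump
# After the new cluster is up, the dump can be imported back (e.g. with TiDB Lightning).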

I got it fixed myself; I restored PD following the steps here:
https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/pd-recover#获取-cluster-id
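For anyone hitting the same state, the linked procedure comes down to rebuilding PD metadata with pd-recover: take the cluster ID (the cluster_id shown in the pd-ctl output above), choose an alloc ID larger than anything the old cluster has allocated, start a fresh single-replica PD, then run pd-recover against it and restart PD. A sketch, with the alloc ID and the PD endpoint as placeholders/assumptions:

# Sketch of the pd-recover step from the linked doc.
# <larger-alloc-id> must exceed the largest ID the old PD cluster ever allocated.
./pd-recover \
  -endpoints http://tidb-cluster-1608600848-pd.jinfan.svc:2379 \
  -cluster-id 6916490224968328765 \
  -alloc-id <larger-alloc-id>
# Restart the PD pod(s) afterwards so they pick up the recovered metadata.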

:+1: