tidb-operator is stuck in a scaling loop and we cannot recover it.



【Overview】: tidb-operator keeps trying to scale out a TiKV node but never completes, and we have found no way to stop it.

【Background】: After a server power failure, the operator automatically tried to scale out TiKV (failover), but the new Pod could not be scheduled because no PV was available. Once power was restored, the original TiKV stores came back online, but the extra Pod stayed in Pending and could not recover, so we manually edited the StatefulSet to reduce the TiKV replica count and manually deleted the Pending Pod.

【Problem】:

The controller-manager logs below show that the operator is still trying to scale the TiKV StatefulSet out:

kubectl logs tidb-controller-manager-67d596c978-28nhk -n tidb-admin
I0529 23:52:04.042789       1 scaler.go:163] scale statefulset: push-namespace/push-tidb-tikv replicas from 5 to 6
I0529 23:52:05.722662       1 tidbcluster_control.go:66] TidbCluster: [pay-back/pay-bk] updated successfully
I0529 23:52:06.720871       1 tidbcluster_control.go:66] TidbCluster: [push-namespace/push-tidb] updated successfully
I0529 23:52:08.444044       1 tikv_scaler.go:61] scaling out tikv statefulset push-namespace/push-tidb-tikv, ordinal: 5 (replicas: 6, delete slots: [])
I0529 23:52:08.444372       1 scaler.go:163] scale statefulset: push-namespace/push-tidb-tikv replicas from 5 to 6
I0529 23:52:09.519014       1 tidbcluster_control.go:66] TidbCluster: [pay-back/pay-bk] updated successfully
I0529 23:52:10.113590       1 tidbcluster_control.go:66] TidbCluster: [push-namespace/push-tidb] updated successfully
I0529 23:52:11.842568       1 tikv_scaler.go:61] scaling out tikv statefulset push-namespace/push-tidb-tikv, ordinal: 5 (replicas: 6, delete slots: [])
I0529 23:52:11.842706       1 scaler.go:163] scale statefulset: push-namespace/push-tidb-tikv replicas from 5 to 6
I0529 23:52:12.482600       1 tidbcluster_control.go:66] TidbCluster: [pay-back/pay-bk] updated successfully
I0529 23:52:13.528035       1 tidbcluster_control.go:66] TidbCluster: [push-namespace/push-tidb] updated successfully
I0529 23:52:15.242257       1 tikv_scaler.go:61] scaling out tikv statefulset push-namespace/push-tidb-tikv, ordinal: 5 (replicas: 6, delete slots: [])
I0529 23:52:15.242394       1 scaler.go:163] scale statefulset: push-namespace/push-tidb-tikv replicas from 5 to 6
I0529 23:52:15.482759       1 tidbcluster_control.go:66] TidbCluster: [pay-back/pay-bk] updated successfully
I0529 23:52:16.915460       1 tidbcluster_control.go:66] TidbCluster: [push-namespace/push-tidb] updated successfully
I0529 23:52:18.443082       1 tikv_scaler.go:61] scaling out tikv statefulset push-namespace/push-tidb-tikv, ordinal: 5 (replicas: 6, delete slots: [])
I0529 23:52:18.443308       1 scaler.go:163] scale statefulset: push-namespace/push-tidb-tikv replicas from 5 to 6
I0529 23:52:18.684710       1 tidbcluster_control.go:66] TidbCluster: [pay-back/pay-bk] updated successfully
I0529 23:52:20.106858       1 tidbcluster_control.go:66] TidbCluster: [push-namespace/push-tidb] updated successfully
I0529 23:52:21.642483       1 tikv_scaler.go:61] scaling out tikv statefulset push-namespace/push-tidb-tikv, ordinal: 5 (replicas: 6, delete slots: [])
I0529 23:52:21.642580       1 scaler.go:163] scale statefulset: push-namespace/push-tidb-tikv replicas from 5 to 6
I0529 23:52:21.883085       1 tidbcluster_control.go:66] TidbCluster: [pay-back/pay-bk] updated successfully
I0529 23:52:23.312042       1 tidbcluster_control.go:66] TidbCluster: [push-namespace/push-tidb] updated successfully
I0529 23:52:24.885513       1 tidbcluster_control.go:66] TidbCluster: [pay-back/pay-bk] updated successfully
 tikv:
    failureStores:
      "736542":
        createdAt: "2021-05-24T07:52:29Z"
        podName: push-tidb-tikv-4
        storeID: "736542"
    image: harbor.fcbox.com/tidb/pingcap/tikv:v4.0.9
    phase: Scale
    statefulSet:
      collisionCount: 0
      currentReplicas: 5
      currentRevision: push-tidb-tikv-8db8bc99d
      observedGeneration: 22
      readyReplicas: 5
      replicas: 5
      updateRevision: push-tidb-tikv-8db8bc99d
      updatedReplicas: 5
    stores:
      "1":
        id: "1"
        ip: push-tidb-tikv-0.push-tidb-tikv-peer.push-namespace.svc
        lastHeartbeatTime: "2021-06-03T15:48:15Z"
        lastTransitionTime: "2021-01-13T03:23:07Z"
        leaderCount: 3458
        podName: push-tidb-tikv-0
        state: Up
      "1362":
        id: "1362"
        ip: push-tidb-tikv-2.push-tidb-tikv-peer.push-namespace.svc
        lastHeartbeatTime: "2021-06-03T15:48:13Z"
        lastTransitionTime: "2021-01-13T03:19:30Z"
        leaderCount: 3463
        podName: push-tidb-tikv-2
        state: Up
      "1363":
        id: "1363"
        ip: push-tidb-tikv-1.push-tidb-tikv-peer.push-namespace.svc
        lastHeartbeatTime: "2021-06-03T15:48:15Z"
        lastTransitionTime: "2021-01-13T03:21:20Z"
        leaderCount: 3467
        podName: push-tidb-tikv-1
        state: Up
      "724701":
        id: "724701"
        ip: push-tidb-tikv-3.push-tidb-tikv-peer.push-namespace.svc
        lastHeartbeatTime: "2021-06-03T15:48:18Z"
        lastTransitionTime: "2021-04-28T02:51:28Z"
        leaderCount: 3461
        podName: push-tidb-tikv-3
        state: Up
      "736542":
        id: "736542"
        ip: push-tidb-tikv-4.push-tidb-tikv-peer.push-namespace.svc
        lastHeartbeatTime: "2021-06-03T15:48:17Z"
        lastTransitionTime: "2021-05-24T11:52:28Z"
        leaderCount: 3468
        podName: push-tidb-tikv-4
        state: Up
    synced: true

【Business Impact】: We cannot adjust the cluster configuration.

【TiDB Version】: v4.0.9

【TiDB Operator Version】: v1.1.4

【K8s Version】: v1.18.8

【Attachments】:

Name:               push-tidb-tikv
Namespace:          push-namespace
CreationTimestamp:  Thu, 24 Sep 2020 15:15:31 +0800
Selector:           app.kubernetes.io/component=tikv,app.kubernetes.io/instance=push-tidb,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster
Labels:             app.kubernetes.io/component=tikv
                    app.kubernetes.io/instance=push-tidb
                    app.kubernetes.io/managed-by=tidb-operator
                    app.kubernetes.io/name=tidb-cluster
Annotations:        pingcap.com/last-applied-configuration:
                      {"replicas":6,"selector":{"matchLabels":{"app.kubernetes.io/component":"tikv","app.kubernetes.io/instance":"push-tidb","app.kubernetes.io/...
Replicas:           5 desired | 5 total
Update Strategy:    RollingUpdate
  Partition:        6
Pods Status:        5 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:       app.kubernetes.io/component=tikv
                app.kubernetes.io/instance=push-tidb
                app.kubernetes.io/managed-by=tidb-operator
                app.kubernetes.io/name=tidb-cluster
  Annotations:  prometheus.io/path: /metrics
                prometheus.io/port: 20180
                prometheus.io/scrape: true
  Containers:
   tikv:
    Image:      harbor.fcbox.com/tidb/pingcap/tikv:v4.0.9
    Port:       20160/TCP
    Host Port:  0/TCP
    Command:
      /bin/sh
      /usr/local/bin/tikv_start_script.sh
    Requests:
      cpu:     8
      memory:  45Gi
    Environment:
      NAMESPACE:               (v1:metadata.namespace)
      CLUSTER_NAME:           push-tidb
      HEADLESS_SERVICE_NAME:  push-tidb-tikv-peer
      CAPACITY:               0
      TZ:                     Asia/Shanghai
    Mounts:
      /etc/podinfo from annotations (ro)
      /etc/tikv from config (ro)
      /usr/local/bin from startup-script (ro)
      /var/lib/tikv from tikv (rw)
  Volumes:
   annotations:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
   config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      push-tidb-tikv
    Optional:  false
   startup-script:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      push-tidb-tikv
    Optional:  false
Volume Claims:
  Name:          tikv
  StorageClass:  kv-storage
  Labels:        <none>
  Annotations:   <none>
  Capacity:      50Gi
  Access Modes:  [ReadWriteOnce]
Events:          <none>

The status in the TidbCluster object also looks strange; why is push-tidb-tikv-4 judged to have failed?

 tikv:
    failureStores:
      "736542":
        createdAt: "2021-05-24T07:52:29Z"
        podName: push-tidb-tikv-4
        storeID: "736542"
    image: harbor.fcbox.com/tidb/pingcap/tikv:v4.0.9
    phase: Scale
    statefulSet:
      collisionCount: 0
      currentReplicas: 5
      currentRevision: push-tidb-tikv-8db8bc99d
      observedGeneration: 22
      readyReplicas: 5
      replicas: 5
      updateRevision: push-tidb-tikv-8db8bc99d
      updatedReplicas: 5
    stores:
      "1":
        id: "1"
        ip: push-tidb-tikv-0.push-tidb-tikv-peer.push-namespace.svc
        lastHeartbeatTime: "2021-06-04T03:27:41Z"
        lastTransitionTime: "2021-01-13T03:23:07Z"
        leaderCount: 3571
        podName: push-tidb-tikv-0
        state: Up
      "1362":
        id: "1362"
        ip: push-tidb-tikv-2.push-tidb-tikv-peer.push-namespace.svc
        lastHeartbeatTime: "2021-06-04T03:27:38Z"
        lastTransitionTime: "2021-01-13T03:19:30Z"
        leaderCount: 3571
        podName: push-tidb-tikv-2
        state: Up
      "1363":
        id: "1363"
        ip: push-tidb-tikv-1.push-tidb-tikv-peer.push-namespace.svc
        lastHeartbeatTime: "2021-06-04T03:27:40Z"
        lastTransitionTime: "2021-01-13T03:21:20Z"
        leaderCount: 3570
        podName: push-tidb-tikv-1
        state: Up
      "724701":
        id: "724701"
        ip: push-tidb-tikv-3.push-tidb-tikv-peer.push-namespace.svc
        lastHeartbeatTime: "2021-06-04T03:27:43Z"
        lastTransitionTime: "2021-04-28T02:51:28Z"
        leaderCount: 3561
        podName: push-tidb-tikv-3
        state: Up
      "736542":
        id: "736542"
        ip: push-tidb-tikv-4.push-tidb-tikv-peer.push-namespace.svc
        lastHeartbeatTime: "2021-06-04T03:27:43Z"
        lastTransitionTime: "2021-05-24T11:52:28Z"
        leaderCount: 3569
        podName: push-tidb-tikv-4
        state: Up

Try reverting the sts to its original configuration and adding the parameter spec.tikv.recoverFailover: true.

https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/use-auto-failover#tikv-故障转移策略
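For reference, a minimal sketch of where that field sits in the TidbCluster spec, per the auto-failover documentation linked above (applied via kubectl edit tc push-tidb -n push-namespace; the replica count shown is this cluster's current value):

```yaml
# Sketch only: set on the TidbCluster object, not on the StatefulSet directly.
spec:
  tikv:
    replicas: 5
    recoverFailover: true  # let the operator automatically scale the failover Pod back in
```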

failureStores:
  "736542":
    createdAt: "2021-05-24T07:52:29Z"
    podName: push-tidb-tikv-4
    storeID: "736542"

The problem is this stale entry in the status: that TiKV store has in fact already started and is up in the cluster.

Is everything back to normal now?

Not yet; we are still looking for a fix. Can the StatefulSet be adjusted directly? For example, what would happen if this entry were deleted?

  1. In k8s you generally should not modify the sts directly; what were you trying to achieve by changing it? Even while the cluster was unhealthy, if there were enough resources the operator should have added a replacement TiKV Pod. If you currently have 4 TiKV Pods, follow the documentation above and check whether the failure node can be scaled in automatically; if not, then consider modifying the sts or other approaches.
  2. Please share the current cluster state:
    kubectl get all -n <namespace> -o wide
  3. Please also share the logs of the failed node.

We have found that the operator is deadlocked, so cluster maintenance is impossible; tell me where else logs are needed and I will pull them.
Currently 5 replicas are requested and 5 actually exist, but failureStores contains store 736542, and that same store also appears in the status with state: Up.

[root@dcn-tidb-k8s-p-l-11:/home/appdeploy]#kubectl get all -n push-namespace  -o wide
NAME                                             READY   STATUS    RESTARTS   AGE    IP             NODE           NOMINATED NODE   READINESS GATES
pod/push-monitor-meta-monitor-674c9d68f9-vhkcl   3/3     Running   0          13d    172.32.91.7    10.204.11.91   <none>           <none>
pod/push-tidb-discovery-657cc7b98-5rs9j          1/1     Running   0          255d   172.32.90.9    10.204.11.90   <none>           <none>
pod/push-tidb-pd-0                               1/1     Running   3          144d   172.32.91.11   10.204.11.91   <none>           <none>
pod/push-tidb-pd-1                               1/1     Running   0          144d   172.32.92.14   10.204.11.92   <none>           <none>
pod/push-tidb-pd-2                               1/1     Running   0          144d   172.32.90.13   10.204.11.90   <none>           <none>
pod/push-tidb-pump-0                             1/1     Running   0          144d   172.32.92.13   10.204.11.92   <none>           <none>
pod/push-tidb-pump-1                             1/1     Running   0          144d   172.32.91.14   10.204.11.91   <none>           <none>
pod/push-tidb-pump-2                             1/1     Running   0          144d   172.32.90.14   10.204.11.90   <none>           <none>
pod/push-tidb-tidb-0                             2/2     Running   0          144d   172.32.89.6    10.204.11.89   <none>           <none>
pod/push-tidb-tidb-1                             2/2     Running   0          144d   172.32.91.15   10.204.11.91   <none>           <none>
pod/push-tidb-tidb-2                             2/2     Running   0          13d    172.32.85.6    10.204.11.85   <none>           <none>
pod/push-tidb-tikv-0                             1/1     Running   0          144d   172.32.92.12   10.204.11.92   <none>           <none>
pod/push-tidb-tikv-1                             1/1     Running   0          144d   172.32.90.12   10.204.11.90   <none>           <none>
pod/push-tidb-tikv-2                             1/1     Running   0          144d   172.32.91.12   10.204.11.91   <none>           <none>
pod/push-tidb-tikv-3                             1/1     Running   0          39d    172.32.89.8    10.204.11.89   <none>           <none>
pod/push-tidb-tikv-4                             1/1     Running   0          13d    172.32.85.7    10.204.11.85   <none>           <none>

NAME                                         TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)               AGE    SELECTOR
service/push-monitor-meta-grafana            ClusterIP   172.202.44.73     <none>        3000/TCP              255d   app.kubernetes.io/component=monitor,app.kubernetes.io/instance=push-monitor-meta,app.kubernetes.io/name=tidb-cluster
service/push-monitor-meta-monitor-reloader   ClusterIP   172.202.83.95     <none>        9089/TCP              255d   app.kubernetes.io/component=monitor,app.kubernetes.io/instance=push-monitor-meta,app.kubernetes.io/name=tidb-cluster
service/push-monitor-meta-prometheus         NodePort    172.202.129.174   <none>        9090:38187/TCP        255d   app.kubernetes.io/component=monitor,app.kubernetes.io/instance=push-monitor-meta,app.kubernetes.io/name=tidb-cluster
service/push-tidb-discovery                  ClusterIP   172.202.250.216   <none>        10261/TCP,10262/TCP   255d   app.kubernetes.io/component=discovery,app.kubernetes.io/instance=push-tidb,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster
service/push-tidb-pd                         ClusterIP   172.202.122.153   <none>        2379/TCP              255d   app.kubernetes.io/component=pd,app.kubernetes.io/instance=push-tidb,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster
service/push-tidb-pd-peer                    ClusterIP   None              <none>        2380/TCP              255d   app.kubernetes.io/component=pd,app.kubernetes.io/instance=push-tidb,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster
service/push-tidb-pump                       ClusterIP   None              <none>        8250/TCP              255d   app.kubernetes.io/component=pump,app.kubernetes.io/instance=push-tidb,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster
service/push-tidb-tidb                       ClusterIP   172.202.111.125   <none>        4000/TCP,10080/TCP    255d   app.kubernetes.io/component=tidb,app.kubernetes.io/instance=push-tidb,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster
service/push-tidb-tidb-peer                  ClusterIP   None              <none>        10080/TCP             255d   app.kubernetes.io/component=tidb,app.kubernetes.io/instance=push-tidb,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster
service/push-tidb-tikv-peer                  ClusterIP   None              <none>        20160/TCP             255d   app.kubernetes.io/component=tikv,app.kubernetes.io/instance=push-tidb,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster

NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE    CONTAINERS                    IMAGES                                                                                                                                                 SELECTOR
deployment.apps/push-monitor-meta-monitor   1/1     1            1           255d   prometheus,reloader,grafana   harbor.fcbox.com/tidb/prom/prometheus:v2.18.1,harbor.fcbox.com/tidb/pingcap/tidb-monitor-reloader:v1.0.1,harbor.fcbox.com/tidb/grafana/grafana:6.0.1   app.kubernetes.io/component=monitor,app.kubernetes.io/instance=push-monitor-meta,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster
deployment.apps/push-tidb-discovery         1/1     1            1           255d   discovery                     harbor.fcbox.com/tidb/pingcap/tidb-operator:v1.1.4                                                                                                     app.kubernetes.io/component=discovery,app.kubernetes.io/instance=push-tidb,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster

NAME                                                   DESIRED   CURRENT   READY   AGE    CONTAINERS                    IMAGES                                                                                                                                                 SELECTOR
replicaset.apps/push-monitor-meta-monitor-674c9d68f9   1         1         1       145d   prometheus,reloader,grafana   harbor.fcbox.com/tidb/prom/prometheus:v2.18.1,harbor.fcbox.com/tidb/pingcap/tidb-monitor-reloader:v1.0.1,harbor.fcbox.com/tidb/grafana/grafana:6.0.1   app.kubernetes.io/component=monitor,app.kubernetes.io/instance=push-monitor-meta,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster,pod-template-hash=674c9d68f9
replicaset.apps/push-monitor-meta-monitor-75695b6d55   0         0         0       159d   prometheus,reloader,grafana   harbor.fcbox.com/tidb/prom/prometheus:v2.18.1,harbor.fcbox.com/tidb/pingcap/tidb-monitor-reloader:v1.0.1,harbor.fcbox.com/tidb/grafana/grafana:6.0.1   app.kubernetes.io/component=monitor,app.kubernetes.io/instance=push-monitor-meta,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster,pod-template-hash=75695b6d55
replicaset.apps/push-monitor-meta-monitor-7fd87c45fd   0         0         0       255d   prometheus,reloader,grafana   harbor.fcbox.com/tidb/prom/prometheus:v2.18.1,harbor.fcbox.com/tidb/pingcap/tidb-monitor-reloader:v1.0.1,harbor.fcbox.com/tidb/grafana/grafana:6.0.1   app.kubernetes.io/component=monitor,app.kubernetes.io/instance=push-monitor-meta,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster,pod-template-hash=7fd87c45fd
replicaset.apps/push-monitor-meta-monitor-d99b5dbc8    0         0         0       255d   prometheus,reloader,grafana   harbor.fcbox.com/tidb/prom/prometheus:v2.18.1,harbor.fcbox.com/tidb/pingcap/tidb-monitor-reloader:v1.0.1,harbor.fcbox.com/tidb/grafana/grafana:6.0.1   app.kubernetes.io/component=monitor,app.kubernetes.io/instance=push-monitor-meta,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster,pod-template-hash=d99b5dbc8
replicaset.apps/push-tidb-discovery-599d8d5f9d         0         0         0       255d   discovery                     harbor.fcbox.com/tidb/pingcap/tidb-operator:v1.1.4                                                                                                     app.kubernetes.io/component=discovery,app.kubernetes.io/instance=push-tidb,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster,pod-template-hash=599d8d5f9d
replicaset.apps/push-tidb-discovery-657cc7b98          1         1         1       255d   discovery                     harbor.fcbox.com/tidb/pingcap/tidb-operator:v1.1.4                                                                                                     app.kubernetes.io/component=discovery,app.kubernetes.io/instance=push-tidb,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster,pod-template-hash=657cc7b98

NAME                              READY   AGE    CONTAINERS     IMAGES
statefulset.apps/push-tidb-pd     3/3     255d   pd             harbor.fcbox.com/tidb/pingcap/pd:v4.0.9
statefulset.apps/push-tidb-pump   3/3     255d   pump           harbor.fcbox.com/tidb/pingcap/tidb-binlog:v4.0.9
statefulset.apps/push-tidb-tidb   3/3     255d   slowlog,tidb   harbor.fcbox.com/tidb/busybox:1.26.2,harbor.fcbox.com/tidb/pingcap/tidb:v4.0.9
statefulset.apps/push-tidb-tikv   5/5     255d   tikv           harbor.fcbox.com/tidb/pingcap/tikv:v4.0.9

Do you remember which sts parameters you modified?

We have not touched it; all the changes were made by the operator itself.

How was it modified previously, then?

That was a different TiDB cluster. We adjusted the tikv.statefulSet.replicas value and then manually deleted the Pending TiKV Pod.

This scaled-out Pod is created by automatic failover: the cluster's expected Pod count = configured replica count + number of failure Pods. Here that is 5 + 1 = 6, which matches the scale-out loop in the logs.

You can kubectl edit tc the cluster object and set tc.Spec.TiKV.maxFailoverCount = 0, then kubectl edit tc push-tidb and remove the tc.Status.TiKV.FailureStore entry.

To turn this feature off, see https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/use-auto-failover#关闭故障自动转移
You can also update the tidb-operator configuration and disable the failover feature in its parameters.
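Putting the two suggested changes together, a sketch of the relevant TidbCluster fields (field names as given in the advice above; whether status can be edited directly depends on how the CRD's status subresource is configured in your deployment):

```yaml
# kubectl edit tc push-tidb -n push-namespace
spec:
  tikv:
    maxFailoverCount: 0  # stop the operator from creating new failover Pods
status:
  tikv:
    failureStores: {}    # clear the stale "736542" failure record
```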

OK, thank you.