PDMemberUnhealthy: recovering PD after a failover

To help us respond more efficiently, please provide the following information when asking; clearly described questions get prioritized.

Information as follows:

 Pd:
    Failure Members:
      test02-shanghai-pd-2:
        Created At:      2020-10-25T06:11:03Z
        Member Deleted:  true
        Member ID:       468945150637516394
        Pod Name:        test02-shanghai-pd-2
        Pvc UID:         481215ba-591e-462f-bc9b-42e4e1cb50c3
    Image:               repo-test/pingcap/pd:latest
    Leader:
      Client URL:            http://test02-shanghai-pd-1.test02-shanghai-pd-peer.qa-ybsp.svc:2379
      Health:                true
      Id:                    60032817622559773
      Last Transition Time:  2020-10-24T07:52:33Z
      Name:                  test02-shanghai-pd-1
    Members:
      test02-shanghai-pd-0:
        Client URL:            http://test02-shanghai-pd-0.test02-shanghai-pd-peer.qa-ybsp.svc:2379
        Health:                true
        Id:                    12675796510240523469
        Last Transition Time:  2020-10-25T13:59:29Z
        Name:                  test02-shanghai-pd-0
      test02-shanghai-pd-1:
        Client URL:            http://test02-shanghai-pd-1.test02-shanghai-pd-peer.qa-ybsp.svc:2379
        Health:                true
        Id:                    60032817622559773
        Last Transition Time:  2020-10-24T07:52:33Z
        Name:                  test02-shanghai-pd-1
      test02-shanghai-pd-2:
        Client URL:            http://test02-shanghai-pd-2.test02-shanghai-pd-peer.qa-ybsp.svc:2379
        Health:                true
        Id:                    17892037682370165410
        Last Transition Time:  2020-10-27T10:38:29Z
        Name:                  test02-shanghai-pd-2
      test02-shanghai-pd-3:
        Client URL:            http://test02-shanghai-pd-3.test02-shanghai-pd-peer.qa-ybsp.svc:2379
        Health:                true
        Id:                    8705712002070658541
        Last Transition Time:  2020-10-27T10:19:43Z
        Name:                  test02-shanghai-pd-3
    Phase:                     Normal
    Stateful Set:
      Collision Count:      0
      Current Replicas:     4
      Current Revision:     test02-shanghai-pd-65487fdf4f
      Observed Generation:  7
      Ready Replicas:       3
      Replicas:             4
      Update Revision:      test02-shanghai-pd-65487fdf4f
      Updated Replicas:     4
    Synced:                 true

For performance tuning or troubleshooting questions, please download and run the diagnostic script, then select all of the terminal output and paste it here.

Could you check what pdFailoverPeriod is set to in your values.yaml?

It is the default, 5 minutes.
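
For reference, a minimal sketch of where that key usually sits in the tidb-cluster chart's values.yaml (the sibling keys are assumptions based on common chart defaults; verify against your chart version):

pdFailoverPeriod: 5m     # how long a PD member must stay unhealthy before the operator triggers failover
tikvFailoverPeriod: 5m
tidbFailoverPeriod: 5m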

It has already lasted a day. Looking at the status of both PD and the tc, Health is true everywhere.

One abnormal thing is that the PD status shows 3 replicas, while there are actually 4 PD pods:

NAME                 READY   PD                                             STORAGE   READY   DESIRE   TIKV                                             STORAGE   READY   DESIRE   TIDB                                             READY   DESIRE   AGE
test   True   test/pingcap/pd:latest   20Gi      3       3        tests/pingcap/tikv:latest   20Gi      3       3        test/pingcap/tidb:latest   2       2        49d


NAME                                            READY   STATUS    RESTARTS   AGE
test-discovery-547cd69975-45n59   1/1     Running   0          21h
test-pd-0                         1/1     Running   0          32d
test-pd-1                         1/1     Running   0          32d
test-pd-2                         1/1     Running   0          20h
test-pd-3                         1/1     Running   1          3d1h
  1. Could you grab the controller manager logs for us?
  2. To recover, you can kubectl edit tc ${name} and delete the PD Failure Members entry from the status (a sketch follows below).
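
A minimal sketch of that edit, assuming the raw status path matches the describe output above (status.pd.failureMembers, keyed by pod name; field names are inferred from the describe output):

kubectl edit tc test -n qa-ybsp
# in the editor, delete the whole failureMembers block under status -> pd, e.g.:
#   failureMembers:
#     test02-shanghai-pd-2:
#       memberDeleted: true
#       memberID: "468945150637516394"
#       ...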

Part of the controller-manager log:

I1028 08:22:48.937660       1 pd_scaler.go:62] scaling out pd statefulset qa-ybsp/qa-ynn-cn-shanghai-pd, ordinal: 3 (replicas: 4, delete slots: [])
I1028 08:22:48.937929       1 scaler.go:163] scale statefulset: qa-ybsp/qa-ynn-cn-shanghai-pd replicas from 3 to 4
I1028 08:22:54.931400       1 pd_scaler.go:62] scaling out pd statefulset qa-ybsp/qa-ynn-cn-shanghai-pd, ordinal: 3 (replicas: 4, delete slots: [])
I1028 08:22:54.931551       1 scaler.go:163] scale statefulset: qa-ybsp/qa-ynn-cn-shanghai-pd replicas from 3 to 4
I1028 08:22:59.731917       1 pd_scaler.go:62] scaling out pd statefulset qa-ybsp/qa-ynn-cn-shanghai-pd, ordinal: 3 (replicas: 4, delete slots: [])
I1028 08:22:59.732104       1 scaler.go:163] scale statefulset: qa-ybsp/qa-ynn-cn-shanghai-pd replicas from 3 to 4
I1028 08:23:04.534832       1 pd_scaler.go:62] scaling out pd statefulset qa-ybsp/qa-ynn-cn-shanghai-pd, ordinal: 3 (replicas: 4, delete slots: [])
I1028 08:23:04.534972       1 scaler.go:163] scale statefulset: qa-ybsp/qa-ynn-cn-shanghai-pd replicas from 3 to 4
I1028 08:23:09.331312       1 pd_scaler.go:62] scaling out pd statefulset qa-ybsp/qa-ynn-cn-shanghai-pd, ordinal: 3 (replicas: 4, delete slots: [])
I1028 08:23:09.331454       1 scaler.go:163] scale statefulset: qa-ybsp/qa-ynn-cn-shanghai-pd replicas from 3 to 4
I1028 08:23:12.931674       1 pd_scaler.go:62] scaling out pd statefulset qa-ybsp/qa-ynn-cn-shanghai-pd, ordinal: 3 (replicas: 4, delete slots: [])
I1028 08:23:12.931803       1 scaler.go:163] scale statefulset: qa-ybsp/qa-ynn-cn-shanghai-pd replicas from 3 to 4
I1028 08:23:16.532000       1 pd_scaler.go:62] scaling out pd statefulset qa-ybsp/qa-ynn-cn-shanghai-pd, ordinal: 3 (replicas: 4, delete slots: [])
I1028 08:23:16.532129       1 scaler.go:163] scale statefulset: qa-ybsp/qa-ynn-cn-shanghai-pd replicas from 3 to 4
I1028 08:23:22.932600       1 pd_scaler.go:62] scaling out pd statefulset qa-ybsp/qa-ynn-cn-shanghai-pd, ordinal: 3 (replicas: 4, delete slots: [])
I1028 08:23:22.932736       1 scaler.go:163] scale statefulset: qa-ybsp/qa-ynn-cn-shanghai-pd replicas from 3 to 4
I1028 08:23:27.731816       1 pd_scaler.go:62] scaling out pd statefulset qa-ybsp/qa-ynn-cn-shanghai-pd, ordinal: 3 (replicas: 4, delete slots: [])
I1028 08:23:27.731979       1 scaler.go:163] scale statefulset: qa-ybsp/qa-ynn-cn-shanghai-pd replicas from 3 to 4
I1028 08:23:32.341867       1 pd_scaler.go:62] scaling out pd statefulset qa-ybsp/qa-ynn-cn-shanghai-pd, ordinal: 3 (replicas: 4, delete slots: [])
I1028 08:23:32.342032       1 scaler.go:163] scale statefulset: qa-ybsp/qa-ynn-cn-shanghai-pd replicas from 3 to 4
I1028 08:23:36.131792       1 event.go:255] Event(v1.ObjectReference{Kind:"TidbCluster", Namespace:"qa-ybsp", Name:"test", UID:"5e57a74c-464b-42da-8bad-e99bd8b72422", APIVersion:"pingcap.com/v1alpha1", ResourceVersion:"188897334", FieldPath:""}): type: 'Warning' reason: 'PDMemberUnhealthy' test-pd-3(8705712002070658541) is unhealthy
I1028 08:23:39.933093       1 event.go:255] Event(v1.ObjectReference{Kind:"TidbCluster", Namespace:"qa-ybsp", Name:"test", UID:"5e57a74c-464b-42da-8bad-e99bd8b72422", APIVersion:"pingcap.com/v1alpha1", ResourceVersion:"188897410", FieldPath:""}): type: 'Warning' reason: 'PDMemberUnhealthy' test-pd-3(8705712002070658541) is unhealthy
I1028 08:23:43.532874       1 event.go:255] Event(v1.ObjectReference{Kind:"TidbCluster", Namespace:"qa-ybsp", Name:"test", UID:"5e57a74c-464b-42da-8bad-e99bd8b72422", APIVersion:"pingcap.com/v1alpha1", ResourceVersion:"188897449", FieldPath:""}): type: 'Warning' reason: 'PDMemberUnhealthy' test-pd-3(8705712002070658541) is unhealthy
I1028 08:23:46.937818       1 event.go:255] Event(v1.ObjectReference{Kind:"TidbCluster", Namespace:"qa-ybsp", Name:"test", UID:"5e57a74c-464b-42da-8bad-e99bd8b72422", APIVersion:"pingcap.com/v1alpha1", ResourceVersion:"188897449", FieldPath:""}): type: 'Warning' reason: 'PDMemberUnhealthy' test-pd-3(8705712002070658541) is unhealthy
I1028 08:23:54.537144       1 event.go:255] Event(v1.ObjectReference{Kind:"TidbCluster", Namespace:"qa-ybsp", Name:"test", UID:"5e57a74c-464b-42da-8bad-e99bd8b72422", APIVersion:"pingcap.com/v1alpha1", ResourceVersion:"188897553", FieldPath:""}): type: 'Warning' reason: 'PDMemberUnhealthy' test-pd-3(8705712002070658541) is unhealthy
I1028 08:23:59.332603       1 event.go:255] Event(v1.ObjectReference{Kind:"TidbCluster", Namespace:"qa-ybsp", Name:"test", UID:"5e57a74c-464b-42da-8bad-e99bd8b72422", APIVersion:"pingcap.com/v1alpha1", ResourceVersion:"188897597", FieldPath:""}): type: 'Warning' reason: 'PDMemberUnhealthy' test-pd-3(8705712002070658541) is unhealthy
I1028 08:24:03.937963       1 event.go:255] Event(v1.ObjectReference{Kind:"TidbCluster", Namespace:"qa-ybsp", Name:"test", UID:"5e57a74c-464b-42da-8bad-e99bd8b72422", APIVersion:"pingcap.com/v1alpha1", ResourceVersion:"188897649", FieldPath:""}): type: 'Warning' reason: 'PDMemberUnhealthy' test-pd-3(8705712002070658541) is unhealthy
I1028 08:24:08.732663       1 event.go:255] Event(v1.ObjectReference{Kind:"TidbCluster", Namespace:"qa-ybsp", Name:"test", UID:"5e57a74c-464b-42da-8bad-e99bd8b72422", APIVersion:"pingcap.com/v1alpha1", ResourceVersion:"188897696", FieldPath:""}): type: 'Warning' reason: 'PDMemberUnhealthy' test-pd-3(8705712002070658541) is unhealthy
I1028 08:24:12.534828       1 event.go:255] Event(v1.ObjectReference{Kind:"TidbCluster", Namespace:"qa-ybsp", Name:"test", UID:"5e57a74c-464b-42da-8bad-e99bd8b72422", APIVersion:"pingcap.com/v1alpha1", ResourceVersion:"188897733", FieldPath:""}): type: 'Warning' reason: 'PDMemberUnhealthy' test-pd-3(8705712002070658541) is unhealthy
  1. To recover, you can kubectl edit tc ${name} and delete the PD Failure Members entry from the status.
    -- Re: after I finish editing, it just changes back again.

Do you have more logs? The cluster name in these looks different from the one shown earlier in the thread?

I anonymized the names; some of them were not replaced completely.

Looking at the PD StatefulSet, its READY is 3/4, yet each of the 4 PD pods is 1/1 Ready.
I suspect this mismatch is the cause, but so far I haven't found why the sts READY is stuck at 3/4.

NAME                    READY   AGE
qa-ynn-cn-shanghai-pd   3/4     49d

qa-ynn-cn-shanghai-pd-0                         1/1     Running            0          32d
qa-ynn-cn-shanghai-pd-1                         1/1     Running            0          32d
qa-ynn-cn-shanghai-pd-2                         1/1     Running            0          23h
qa-ynn-cn-shanghai-pd-3                         1/1     Running            0          111m
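
A quick way to cross-check that mismatch (a sketch; the label selector below assumes the standard tidb-operator pod labels and may need adjusting):

kubectl -n qa-ybsp get sts qa-ynn-cn-shanghai-pd -o jsonpath='{.status.replicas} {.status.readyReplicas} {.status.updatedReplicas}{"\n"}'
kubectl -n qa-ybsp get pods -l app.kubernetes.io/component=pd -o wide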

After I deleted all the PD pods and the sts READY went back to normal, PD scaled back in automatically.
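
For anyone repeating this, a cautious sketch of the deletion (one pod at a time so PD keeps quorum; pod names assume the listing above):

for i in 0 1 2 3; do
  kubectl -n qa-ybsp delete pod qa-ynn-cn-shanghai-pd-$i
  sleep 10   # give the StatefulSet controller time to recreate the pod with the same name
  kubectl -n qa-ybsp wait --for=condition=Ready pod/qa-ynn-cn-shanghai-pd-$i --timeout=5m
done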

Has it fully recovered now? What does kubectl describe sts qa-ynn-cn-shanghai-pd show?

It has recovered now.

describe sts looks normal (even back when the READY count was abnormal and the pods themselves were fine, it also looked normal, so it doesn't reveal what caused the problem).

In that case you probably need to look at Kubernetes' own controller (kube-controller-manager) to see what happened and why it decided the sts did not have 4 Ready pods.
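
A sketch of how to pull those logs on a kubeadm-style cluster (the component= label is set by kubeadm's static pods; on managed Kubernetes the controller manager may not be visible as a pod at all):

kubectl -n kube-system get pods -l component=kube-controller-manager
kubectl -n kube-system logs -l component=kube-controller-manager --tail=2000 | grep -i statefulset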
