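TiKV store stuck in Offline state and cannot be decommissioned

panqiao · 2022-12-12 03:20

[TiDB environment] Production
[TiDB version] v6.1.0
[Steps to reproduce] What operations led to the problem
I had to delete the PVCs of all three TiKV nodes, so I scaled TiKV in to 0 and then back out to 3. After that, the Offline problem appeared.
[Problem: symptoms and impact]
[Resource configuration]
[Attachments: screenshots/logs/monitoring]
I currently have five TiKV pods.

I want to scale in to 3, but as soon as I scale in, it removes pods 3 and 4 and keeps the broken pods 1 and 2.
The logs of pods 1 and 2 are:

The store information is as follows (both broken stores are Down):

I ran the following command to bring the store back up:
curl -X POST http://127.0.0.1:2379/pd/api/v1/store/${store_id}/state?state=Up
The state then changes to Offline, but no matter how long I wait, its regions are never migrated.
Running pd-ctl store delete id also leaves the store in Offline state indefinitely.
I don't know what else to try. I want to get back to three healthy TiKV nodes. How should I do that? Any help would be appreciated.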

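WalterWj · 2022-12-12 03:29

When scaling in, you need to consider both the replica count and the number of TiKV instances. If the number of nodes left after the scale-in would be lower than the replica count, the scale-in cannot complete. The reason is that scale-in has to re-create the replicas on other nodes, and a single node cannot hold more than one replica of the same data. This rule cannot be violated.
On Kubernetes, the default scale-in policy removes pods starting from the highest ordinal, so you need to change the policy. See: https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/advanced-statefulset#通过-kubectl-查看-advancedstatefulset-对象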
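
The ordinal behavior described in this reply can be illustrated with a small sketch. It models, in simplified form, which pod ordinals survive a scale-in under a plain StatefulSet versus one that supports delete slots (the feature of the advanced StatefulSet linked in this reply). The function names are hypothetical, and the model is an assumption about the documented behavior, not TiDB Operator code.

```python
def default_scale_in(current: int, desired: int) -> list[int]:
    """Plain StatefulSet model: pods are removed from the highest ordinal down,
    so the survivors are always the lowest ordinals."""
    return list(range(min(current, desired)))


def scale_in_with_delete_slots(desired: int, delete_slots: set[int]) -> list[int]:
    """Delete-slots model (assumed): ordinals listed in delete_slots are skipped,
    and counting continues until `desired` pods have been assigned."""
    kept: list[int] = []
    ordinal = 0
    while len(kept) < desired:
        if ordinal not in delete_slots:
            kept.append(ordinal)
        ordinal += 1
    return kept


# Five TiKV pods (ordinals 0-4) scaled in to three.
print(default_scale_in(5, 3))                  # [0, 1, 2]: pods 3 and 4 are removed
print(scale_in_with_delete_slots(3, {1, 2}))   # [0, 3, 4]: pods 1 and 2 are removed
```

Under the plain model the broken pods 1 and 2 can never be the ones removed, which matches what the original poster observed.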

panqiao · 2022-12-12 03:59

I changed the deletion policy, but it still deleted my pod 4.

WalterWj · 2022-12-12 04:25

Shouldn't name be at the outer level?

tidb菜鸟一只 · 2022-12-12 06:39

Isn't this part written incorrectly?

WalterWj · 2022-12-12 06:43

That is how the official documentation writes it. Could the official docs be wrong? I don't have an environment on hand to test this for you.

yiduoyunQ · 2022-12-12 08:39

Please provide the output of kubectl get tc {tc-name} -o yaml
Please provide the operator log, that is, the output of kubectl logs tidb-controller-manager-{xxxx}
Which operator version are you running?

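panqiao · 2022-12-13 07:24

What bothers me most right now is that the store information for 1 and 2 cannot be cleared from my PD nodes. Their leaders and regions are simply never migrated. As long as these two stores are not cleaned up, they remain a hidden risk.

panqiao · 2022-12-13 07:31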

kubectl -n tidb-cluster get tc pre-test -ojson

{
  "apiVersion": "pingcap.com/v1alpha1",
  "kind": "TidbCluster",
  "metadata": {
    "annotations": {
      "meta.helm.sh/release-name": "pre-test",
      "meta.helm.sh/release-namespace": "tidb-cluster",
      "pingcap.com/ha-topology-key": "kubernetes.io/hostname",
      "pingcap.com/pd.pre-test-pd.sha": "cfa0d77a",
      "pingcap.com/tidb.pre-test-tidb.sha": "866b9771",
      "pingcap.com/tikv.pre-test-tikv.sha": "1c8d5543"
    },
    "creationTimestamp": "2022-11-24T03:30:41Z",
    "generation": 5747,
    "labels": {
      "app.kubernetes.io/component": "tidb-cluster",
      "app.kubernetes.io/instance": "pre-test",
      "app.kubernetes.io/managed-by": "Helm",
      "app.kubernetes.io/name": "tidb-cluster",
      "helm.sh/chart": "tidb-cluster-v1.3.9"
    },
    "name": "pre-test",
    "namespace": "tidb-cluster",
    "resourceVersion": "46472491",
    "uid": "b249cfb6-8ad0-4c50-897e-d21c0acee6f3"
  },
  "spec": {
    "discovery": {},
    "enablePVReclaim": false,
    "helper": {
      "image": "busybox:1.34.1"
    },
    "imagePullPolicy": "IfNotPresent",
    "pd": {
      "affinity": {},
      "baseImage": "pingcap/pd",
      "hostNetwork": false,
      "image": "pingcap/pd:v6.1.0",
      "imagePullPolicy": "IfNotPresent",
      "maxFailoverCount": 3,
      "replicas": 3,
      "requests": {
        "storage": "5Gi"
      },
      "storageClassName": "longhorn"
    },
    "pvReclaimPolicy": "Retain",
    "schedulerName": "default-scheduler",
    "services": [
      {
        "name": "pd",
        "type": "ClusterIP"
      }
    ],
    "tidb": {
      "affinity": {},
      "baseImage": "pingcap/tidb",
      "binlogEnabled": false,
      "hostNetwork": false,
      "image": "pingcap/tidb:v6.1.0",
      "imagePullPolicy": "IfNotPresent",
      "maxFailoverCount": 3,
      "replicas": 2,
      "separateSlowLog": true,
      "slowLogTailer": {
        "image": "busybox:1.33.0",
        "imagePullPolicy": "IfNotPresent",
        "limits": {
          "cpu": "100m",
          "memory": "50Mi"
        },
        "requests": {
          "cpu": "20m",
          "memory": "5Mi"
        }
      },
      "tlsClient": {}
    },
    "tikv": {
      "affinity": {},
      "baseImage": "pingcap/tikv",
      "hostNetwork": false,
      "image": "pingcap/tikv:v6.1.0",
      "imagePullPolicy": "IfNotPresent",
      "maxFailoverCount": 3,
      "replicas": 3,
      "requests": {
        "storage": "110Gi"
      },
      "storageClassName": "longhorn"
    },
    "timezone": "UTC",
    "tlsCluster": {},
    "version": "v6.1.0"
  },
  "status": {
    "clusterID": "7169420034058589617",
    "conditions": [
      {
        "lastTransitionTime": "2022-12-09T06:27:02Z",
        "lastUpdateTime": "2022-12-12T04:00:25Z",
        "message": "TiKV store(s) are not up",
        "reason": "TiKVStoreNotUp",
        "status": "False",
        "type": "Ready"
      }
    ],
    "pd": {
      "image": "pingcap/pd:v6.1.0",
      "leader": {
        "clientURL": "http://pre-test-pd-2.pre-test-pd-peer.tidb-cluster.svc:2379",
        "health": true,
        "id": "6858210497469881484",
        "lastTransitionTime": "2022-11-24T03:37:24Z",
        "name": "pre-test-pd-2"
      },
      "members": {
        "pre-test-pd-0": {
          "clientURL": "http://pre-test-pd-0.pre-test-pd-peer.tidb-cluster.svc:2379",
          "health": true,
          "id": "7715448974209056711",
          "lastTransitionTime": "2022-11-24T03:38:11Z",
          "name": "pre-test-pd-0"
        },
        "pre-test-pd-1": {
          "clientURL": "http://pre-test-pd-1.pre-test-pd-peer.tidb-cluster.svc:2379",
          "health": true,
          "id": "13787701961152413026",
          "lastTransitionTime": "2022-11-24T03:37:41Z",
          "name": "pre-test-pd-1"
        },
        "pre-test-pd-2": {
          "clientURL": "http://pre-test-pd-2.pre-test-pd-peer.tidb-cluster.svc:2379",
          "health": true,
          "id": "6858210497469881484",
          "lastTransitionTime": "2022-11-24T03:37:24Z",
          "name": "pre-test-pd-2"
        }
      },
      "phase": "Normal",
      "statefulSet": {
        "collisionCount": 0,
        "currentReplicas": 3,
        "currentRevision": "pre-test-pd-6f74b4fbff",
        "observedGeneration": 6,
        "readyReplicas": 3,
        "replicas": 3,
        "updateRevision": "pre-test-pd-6f74b4fbff",
        "updatedReplicas": 3
      },
      "synced": true,
      "volumes": {
        "pd": {
          "boundCount": 3,
          "currentCapacity": "5Gi",
          "currentCount": 3,
          "name": "pd",
          "resizedCapacity": "5Gi",
          "resizedCount": 3
        }
      }
    },
    "pump": {},
    "ticdc": {},
    "tidb": {
      "image": "pingcap/tidb:v6.1.0",
      "members": {
        "pre-test-tidb-0": {
          "health": true,
          "lastTransitionTime": "2022-11-24T03:40:44Z",
          "name": "pre-test-tidb-0",
          "node": "amj-3"
        },
        "pre-test-tidb-1": {
          "health": true,
          "lastTransitionTime": "2022-11-24T03:39:33Z",
          "name": "pre-test-tidb-1",
          "node": "amj-2"
        }
      },
      "phase": "Normal",
      "statefulSet": {
        "collisionCount": 0,
        "currentReplicas": 2,
        "currentRevision": "pre-test-tidb-6dfc65fff7",
        "observedGeneration": 5,
        "readyReplicas": 2,
        "replicas": 2,
        "updateRevision": "pre-test-tidb-6dfc65fff7",
        "updatedReplicas": 2
      }
    },
    "tiflash": {},
    "tikv": {
      "bootStrapped": true,
      "image": "pingcap/tikv:v6.1.0",
      "phase": "Scale",
      "statefulSet": {
        "collisionCount": 0,
        "currentReplicas": 5,
        "currentRevision": "pre-test-tikv-78b778fcb",
        "observedGeneration": 3,
        "readyReplicas": 3,
        "replicas": 5,
        "updateRevision": "pre-test-tikv-78b778fcb",
        "updatedReplicas": 5
      },
      "stores": {
        "1": {
          "id": "1",
          "ip": "pre-test-tikv-0.pre-test-tikv-peer.tidb-cluster.svc",
          "lastTransitionTime": "2022-12-12T04:00:49Z",
          "leaderCount": 4,
          "podName": "pre-test-tikv-0",
          "state": "Up"
        },
        "4": {
          "id": "4",
          "ip": "pre-test-tikv-1.pre-test-tikv-peer.tidb-cluster.svc",
          "lastTransitionTime": "2022-12-12T02:59:04Z",
          "leaderCount": 8,
          "podName": "pre-test-tikv-1",
          "state": "Down"
        },
        "5": {
          "id": "5",
          "ip": "pre-test-tikv-2.pre-test-tikv-peer.tidb-cluster.svc",
          "lastTransitionTime": "2022-12-12T02:59:04Z",
          "leaderCount": 2,
          "podName": "pre-test-tikv-2",
          "state": "Down"
        },
        "72180": {
          "id": "72180",
          "ip": "pre-test-tikv-3.pre-test-tikv-peer.tidb-cluster.svc",
          "lastTransitionTime": "2022-12-12T04:03:59Z",
          "leaderCount": 0,
          "podName": "pre-test-tikv-3",
          "state": "Up"
        },
        "72223": {
          "id": "72223",
          "ip": "pre-test-tikv-4.pre-test-tikv-peer.tidb-cluster.svc",
          "lastTransitionTime": "2022-12-12T04:03:59Z",
          "leaderCount": 0,
          "podName": "pre-test-tikv-4",
          "state": "Up"
        }
      },
      "synced": true,
      "tombstoneStores": {
        "79736": {
          "id": "79736",
          "ip": "pre-test-tikv-5.pre-test-tikv-peer.tidb-cluster.svc",
          "lastTransitionTime": null,
          "leaderCount": 0,
          "podName": "pre-test-tikv-5",
          "state": "Tombstone"
        }
      },
      "volumes": {
        "tikv": {
          "boundCount": 5,
          "currentCapacity": "110Gi",
          "currentCount": 5,
          "name": "tikv",
          "resizedCapacity": "110Gi",
          "resizedCount": 5
        }
      }
    }
  }
}

Logs

I1213 07:29:34.989188 1 tikv_scaler.go:90] scaling in tikv statefulset tidb-cluster/pre-test-tikv, ordinal: 4 (replicas: 4, delete slots: )
E1213 07:29:34.997998 1 tikv_scaler.go:250] can't scale in TiKV of TidbCluster [tidb-cluster/pre-test], cause the number of up stores is equal to MaxReplicas in PD configuration(3), and the store in Pod pre-test-tikv-4 which is going to be deleted is up too
I1213 07:29:34.998262 1 event.go:282] Event(v1.ObjectReference{Kind:"TidbCluster", Namespace:"tidb-cluster", Name:"pre-test", UID:"b249cfb6-8ad0-4c50-897e-d21c0acee6f3", APIVersion:"pingcap.com/v1alpha1", ResourceVersion:"46472491", FieldPath:""}): type: 'Warning' reason: 'FailedScaleIn' can't scale in TiKV of TidbCluster [tidb-cluster/pre-test], cause the number of up stores is equal to MaxReplicas in PD configuration(3), and the store in Pod pre-test-tikv-4 which is going to be deleted is up too
I1213 07:29:45.724813 1 tikv_scaler.go:90] scaling in tikv statefulset tidb-cluster/pre-test-tikv, ordinal: 4 (replicas: 4, delete slots: )
E1213 07:29:45.732652 1 tikv_scaler.go:250] can't scale in TiKV of TidbCluster [tidb-cluster/pre-test], cause the number of up stores is equal to MaxReplicas in PD configuration(3), and the store in Pod pre-test-tikv-4 which is going to be deleted is up too
I1213 07:29:45.732806 1 event.go:282] Event(v1.ObjectReference{Kind:"TidbCluster", Namespace:"tidb-cluster", Name:"pre-test", UID:"b249cfb6-8ad0-4c50-897e-d21c0acee6f3", APIVersion:"pingcap.com/v1alpha1", ResourceVersion:"46472491", FieldPath:""}): type: 'Warning' reason: 'FailedScaleIn' can't scale in TiKV of TidbCluster [tidb-cluster/pre-test], cause the number of up stores is equal to MaxReplicas in PD configuration(3), and the store in Pod pre-test-tikv-4 which is going to be deleted is up too
I1213 07:29:45.754481 1 tidbcluster_control.go:69] TidbCluster: [tidb-cluster/pre-test] updated successfully
I1213 07:29:45.865868 1 tikv_scaler.go:90] scaling in tikv statefulset tidb-cluster/pre-test-tikv, ordinal: 4 (replicas: 4, delete slots: )
E1213 07:29:45.875868 1 tikv_scaler.go:250] can't scale in TiKV of TidbCluster [tidb-cluster/pre-test], cause the number of up stores is equal to MaxReplicas in PD configuration(3), and the store in Pod pre-test-tikv-4 which is going to be deleted is up too
I1213 07:29:45.876047 1 event.go:282] Event(v1.ObjectReference{Kind:"TidbCluster", Namespace:"tidb-cluster", Name:"pre-test", UID:"b249cfb6-8ad0-4c50-897e-d21c0acee6f3", APIVersion:"pingcap.com/v1alpha1", ResourceVersion:"46474051", FieldPath:""}): type: 'Warning' reason: 'FailedScaleIn' can't scale in TiKV of TidbCluster [tidb-cluster/pre-test], cause the number of up stores is equal to MaxReplicas in PD configuration(3), and the store in Pod pre-test-tikv-4 which is going to be deleted is up too
I1213 07:29:51.843729 1 tikv_scaler.go:90] scaling in tikv statefulset tidb-cluster/pre-test-tikv, ordinal: 4 (replicas: 4, delete slots: )
E1213 07:29:51.851744 1 tikv_scaler.go:250] can't scale in TiKV of TidbCluster [tidb-cluster/pre-test], cause the number of up stores is equal to MaxReplicas in PD configuration(3), and the store in Pod pre-test-tikv-4 which is going to be deleted is up too
I1213 07:29:51.852085 1 event.go:282] Event(v1.ObjectReference{Kind:"TidbCluster", Namespace:"tidb-cluster", Name:"pre-test", UID:"b249cfb6-8ad0-4c50-897e-d21c0acee6f3", APIVersion:"pingcap.com/v1alpha1", ResourceVersion:"46474051", FieldPath:""}): type: 'Warning' reason: 'FailedScaleIn' can't scale in TiKV of TidbCluster [tidb-cluster/pre-test], cause the number of up stores is equal to MaxReplicas in PD configuration(3), and the store in Pod pre-test-tikv-4 which is going to be deleted is up too
I1213 07:29:51.869532 1 tidbcluster_control.go:69] TidbCluster: [tidb-cluster/pre-test] updated successfully
I1213 07:29:51.965092 1 tikv_scaler.go:90] scaling in tikv statefulset tidb-cluster/pre-test-tikv, ordinal: 4 (replicas: 4, delete slots: )
E1213 07:29:51.972914 1 tikv_scaler.go:250] can't scale in TiKV of TidbCluster [tidb-cluster/pre-test], cause the number of up stores is equal to MaxReplicas in PD configuration(3), and the store in Pod pre-test-tikv-4 which is going to be deleted is up too
I1213 07:29:51.973294 1 event.go:282] Event(v1.ObjectReference{Kind:"TidbCluster", Namespace:"tidb-cluster", Name:"pre-test", UID:"b249cfb6-8ad0-4c50-897e-d21c0acee6f3", APIVersion:"pingcap.com/v1alpha1", ResourceVersion:"46474101", FieldPath:""}): type: 'Warning' reason: 'FailedScaleIn' can't scale in TiKV of TidbCluster [tidb-cluster/pre-test], cause the number of up stores is equal to MaxReplicas in PD configuration(3), and the store in Pod pre-test-tikv-4 which is going to be deleted is up too

Operator version: 1.3.9
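
The FailedScaleIn event in this post can be read as a simple guard: the operator refuses to remove an Up store when the number of Up stores already equals PD's MaxReplicas. The sketch below paraphrases that condition using the store states from the TidbCluster status in the same post. The function name and the exact comparison are assumptions drawn from the log wording, not from the operator source.

```python
# Store states copied from status.tikv.stores in the TidbCluster output above.
STORES = {
    "pre-test-tikv-0": "Up",
    "pre-test-tikv-1": "Down",
    "pre-test-tikv-2": "Down",
    "pre-test-tikv-3": "Up",
    "pre-test-tikv-4": "Up",
}
MAX_REPLICAS = 3  # value reported in the operator log


def scale_in_allowed(stores: dict[str, str], target_pod: str, max_replicas: int) -> bool:
    """Hypothetical paraphrase of the guard named in the FailedScaleIn event."""
    up_count = sum(1 for state in stores.values() if state == "Up")
    target_is_up = stores[target_pod] == "Up"
    # Removing an Up store when only max_replicas Up stores exist would leave
    # too few healthy stores to hold every replica.
    return not (target_is_up and up_count <= max_replicas)


# The operator targets the highest ordinal, pre-test-tikv-4, which is Up.
print(scale_in_allowed(STORES, "pre-test-tikv-4", MAX_REPLICAS))  # False
```

With three Up stores and the target pod among them, the guard rejects the scale-in, which is consistent with the repeated error lines in the log.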

WalterWj · 2022-12-13 07:36

This can be cleaned up manually. Enter the PD pod and use pd-ctl, removing the store by hand with store delete. See the pd-ctl documentation on the official site for usage.

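panqiao · 2022-12-13 07:37

I have tried that. After I remove the store with delete, it changes from Down to Offline and then stays that way. Even after waiting several days, its regions still have not been moved.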

WalterWj · 2022-12-13 08:06

Are there enough remaining nodes to replenish the replicas?

panqiao · 2022-12-13 08:43

Do you mean the region replica count? If so, how do I check it?

WalterWj · 2022-12-13 08:46

If you have not changed it, it is 3. After the scale-in, the number of remaining TiKV pods must be >= 3.
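
The replica count discussed here is PD's max-replicas setting, which pd-ctl prints as JSON via config show replication. The sketch below parses such output to derive the minimum number of TiKV stores that must remain. The sample string is hand-written for illustration and was not captured from this cluster, and the helper name is hypothetical.

```python
import json

# Hand-written sample shaped like `pd-ctl config show replication` output.
# It is illustrative only and was not captured from the cluster in this thread.
sample_output = '{"max-replicas": 3, "location-labels": ""}'


def min_stores_after_scale_in(replication_json: str) -> int:
    """Return the minimum number of TiKV stores that must remain,
    which equals the configured region replica count."""
    return int(json.loads(replication_json)["max-replicas"])


print(min_stores_after_scale_in(sample_output))  # 3
```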

yiduoyunQ · 2022-12-13 08:49

store-1 tikv-0 up leader count 4
store-4 tikv-1 down leader count 8
store-5 tikv-2 down leader count 2
store-72180 tikv-3 auto failover up leader count 0
store-72223 tikv-4 auto failover up leader count 0

You first need to move the leaders on store-4 (tikv-1) and store-5 (tikv-2) to other TiKV stores that are Up (it is unclear why they were not scheduled automatically). Continue with the remaining steps once that is done.
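
The summary in this reply can be derived mechanically from the status.tikv.stores section of the kubectl get tc output. The sketch below uses an abbreviated copy of that section (only the fields it needs) and lists the Down stores that still report leaders. The function name is hypothetical.

```python
# Abbreviated copy of status.tikv.stores from the TidbCluster output in this thread.
TIKV_STORES = {
    "1": {"id": "1", "podName": "pre-test-tikv-0", "state": "Up", "leaderCount": 4},
    "4": {"id": "4", "podName": "pre-test-tikv-1", "state": "Down", "leaderCount": 8},
    "5": {"id": "5", "podName": "pre-test-tikv-2", "state": "Down", "leaderCount": 2},
    "72180": {"id": "72180", "podName": "pre-test-tikv-3", "state": "Up", "leaderCount": 0},
    "72223": {"id": "72223", "podName": "pre-test-tikv-4", "state": "Up", "leaderCount": 0},
}


def down_stores_with_leaders(stores: dict) -> list[tuple[str, str, int]]:
    """Return (store id, pod name, leader count) for every Down store
    that PD still reports as holding leaders."""
    return sorted(
        (s["id"], s["podName"], s["leaderCount"])
        for s in stores.values()
        if s["state"] == "Down" and s["leaderCount"] > 0
    )


for store_id, pod, leaders in down_stores_with_leaders(TIKV_STORES):
    print(f"store-{store_id} ({pod}) is Down but still reports {leaders} leaders")
```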

panqiao · 2022-12-13 08:56

I already meet that condition. When I first asked this question I had five TiKV pods, three healthy and two broken, and the regions still would not move.

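panqiao · 2022-12-13 08:58

I have also tried several methods from the official documentation. A store can no longer be set to Tombstone directly; you can only wait for PD to schedule the regions away. But I kept the cluster at 5 pods, 3 of them running, for 2 days and nothing was migrated.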

panqiao · 2022-12-13 08:59

Is there any way to clear the store information?

panqiao · 2022-12-13 09:00

What is also strange is that my other two healthy pods do not hold a single leader.

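yiduoyunQ · 2022-12-13 09:09

A brief explanation of the current situation:

These 3 are the original TiKV stores:
store-1 tikv-0 up leader count 4
store-4 tikv-1 down leader count 8
store-5 tikv-2 down leader count 2

These 2 are the auto-failover TiKV stores:
store-72180 tikv-3 auto failover up leader count 0
store-72223 tikv-4 auto failover up leader count 0

Scale-in can only remove the auto-failover TiKV stores first, so:

1. First try scheduling the leaders manually, so that the original store-4 and store-5 hold no leaders.
2. Use pd-ctl store delete to delete store-4 and store-5, and make sure their state changes from Down to Tombstone.
3. Delete the PVCs and pods of store-4 and store-5 (discarding the data). The operator will reschedule them automatically and they will join with empty data. Wait for region scheduling to balance, and confirm that the new TiKV stores (whose store IDs will not necessarily be 4 and 5) are Up.
4. Follow https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/use-auto-failover#tikv-故障转移策略 to recover from failover and scale in store-72180 and store-72223.
5. (Optional) Use pd-ctl store remove-tombstone to clean up the Tombstone TiKV stores from step 3.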
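
The pd-ctl portion of the steps above can be sketched as a small command generator. The thread does not say how to move the leaders manually; using evict-leader-scheduler is an assumption here, and whether it has any effect on a store that is already Down is not confirmed in this thread. The PD address flag is omitted, the binary path inside the PD pod may differ, and the kubectl steps (deleting PVCs and pods, recovering from failover) are not covered. The function name is hypothetical.

```python
def pd_ctl_plan(store_ids: list[int]) -> list[tuple[str, list[str]]]:
    """Build the pd-ctl commands for the PD-side steps, grouped by purpose.
    The commands are only returned as strings; nothing is executed."""
    return [
        # Assumption: evict-leader-scheduler is one way to move leaders off a store.
        ("evict leaders",
         [f"pd-ctl scheduler add evict-leader-scheduler {sid}" for sid in store_ids]),
        # Mark the stores for removal; they should end up as Tombstone.
        ("delete stores",
         [f"pd-ctl store delete {sid}" for sid in store_ids]),
        # Optional cleanup once the stores are Tombstone.
        ("remove tombstones",
         ["pd-ctl store remove-tombstone"]),
    ]


for label, commands in pd_ctl_plan([4, 5]):
    print(f"# {label}")
    for command in commands:
        print(command)
```

Generating the commands rather than running them keeps the order reviewable before anything touches a production cluster.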