TiKV node's store stuck in Offline state and cannot be removed

[TiDB Environment] Production
[TiDB Version] v6.1.0
[Reproduction Steps] Operations that led to the issue:
Because I needed to delete the PVCs of three TiKV nodes, I had no choice but to scale TiKV in to 0 and then back out to 3, and that is when the Offline problem appeared.
[Problem: Symptoms and Impact]
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]
I currently have five TiKV pods.


I want to scale in to 3, but as soon as I scale in, it removes pods 3 and 4 and keeps the broken 1 and 2.
The logs for 1 and 2 are:

The store information is as follows (the two broken ones are both Down):

I ran the following command to bring it back Up:
curl -X POST http://127.0.0.1:2379/pd/api/v1/store/${store_id}/state?state=Up
The state then changes to Offline, but no matter how long I wait, the Regions on it simply don't migrate away.
Using pd-ctl store delete <id> also leaves it stuck in Offline.
I don't know what else to do. I just want to get back to three healthy TiKV nodes. How should I proceed? Any help would be appreciated.
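For reference, one way to confirm whether the Regions on such a store are actually draining is to query the same PD API and watch the counters drop over time (a sketch, reusing the ${store_id} placeholder from the curl command above):

curl http://127.0.0.1:2379/pd/api/v1/store/${store_id} | grep -E '"state_name"|"leader_count"|"region_count"'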

  1. When scaling in, you need to consider both the replica count and the number of TiKV instances. If the node count after scale-in would be lower than the replica count, the scale-in cannot complete: scaling in requires the replicas to be re-placed on other nodes, and a single node cannot hold more than one replica of the same data. This rule cannot be violated.
  2. When scaling in on Kubernetes, the default policy is to remove pods starting from the highest ordinal, so you need to change that policy; see https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/advanced-statefulset#通过-kubectl-查看-advancedstatefulset-对象 (an example annotation is sketched below).
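For example, per that page, the pod ordinals to remove can be pinned with a delete-slots annotation on the TidbCluster object instead of letting the StatefulSet drop the highest ordinals (a sketch using the names in this thread; it assumes the operator was deployed with the AdvancedStatefulSet feature enabled, so verify against the linked doc first):

kubectl -n tidb-cluster annotate tc pre-test tikv.tidb.pingcap.com/delete-slots='[1,2]' --overwrite

With the annotation in place, lowering spec.tikv.replicas should remove pre-test-tikv-1 and pre-test-tikv-2 rather than the highest-numbered pods.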


I modified the deletion policy, but it still deleted my pod 4.

Shouldn't name be at the outer level?

[screenshot]
Isn't what's written here wrong?


That's how the official docs write it. Could the docs be wrong?.... I don't have an environment on hand to help test this....
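For reference, in the linked docs the annotation sits under metadata.annotations of the TidbCluster, with name and namespace as siblings of annotations inside metadata, roughly like this (a sketch using this thread's names; the exact placement should be checked against the documentation page):

apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: pre-test
  namespace: tidb-cluster
  annotations:
    tikv.tidb.pingcap.com/delete-slots: '[1,2]'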

  1. Please provide the output of kubectl get tc {tc-name} -o yaml
  2. Please provide the operator logs, i.e. the output of kubectl logs tidb-controller-manager-{xxxx}
  3. What is the operator version?

What bothers me most right now is that the store information for 1 and 2 cannot be cleared from my PD nodes, and the leaders and Regions on them simply won't move away. As long as these two stores aren't cleaned up, they remain a hidden risk.

kubectl -n tidb-cluster get tc pre-test -ojson

{
"apiVersion": "pingcap.com/v1alpha1",
"kind": "TidbCluster",
"metadata": {
"annotations": {
"meta.helm.sh/release-name": "pre-test",
"meta.helm.sh/release-namespace": "tidb-cluster",
"pingcap.com/ha-topology-key": "kubernetes.io/hostname",
"pingcap.com/pd.pre-test-pd.sha": "cfa0d77a",
"pingcap.com/tidb.pre-test-tidb.sha": "866b9771",
"pingcap.com/tikv.pre-test-tikv.sha": "1c8d5543"
},
"creationTimestamp": "2022-11-24T03:30:41Z",
"generation": 5747,
"labels": {
"app.kubernetes.io/component": "tidb-cluster",
"app.kubernetes.io/instance": "pre-test",
"app.kubernetes.io/managed-by": "Helm",
"app.kubernetes.io/name": "tidb-cluster",
"helm.sh/chart": "tidb-cluster-v1.3.9"
},
"name": "pre-test",
"namespace": "tidb-cluster",
"resourceVersion": "46472491",
"uid": "b249cfb6-8ad0-4c50-897e-d21c0acee6f3"
},
"spec": {
"discovery": {},
"enablePVReclaim": false,
"helper": {
"image": "busybox:1.34.1"
},
"imagePullPolicy": "IfNotPresent",
"pd": {
"affinity": {},
"baseImage": "pingcap/pd",
"hostNetwork": false,
"image": "pingcap/pd:v6.1.0",
"imagePullPolicy": "IfNotPresent",
"maxFailoverCount": 3,
"replicas": 3,
"requests": {
"storage": "5Gi"
},
"storageClassName": "longhorn"
},
"pvReclaimPolicy": "Retain",
"schedulerName": "default-scheduler",
"services": [
{
"name": "pd",
"type": "ClusterIP"
}
],
"tidb": {
"affinity": {},
"baseImage": "pingcap/tidb",
"binlogEnabled": false,
"hostNetwork": false,
"image": "pingcap/tidb:v6.1.0",
"imagePullPolicy": "IfNotPresent",
"maxFailoverCount": 3,
"replicas": 2,
"separateSlowLog": true,
"slowLogTailer": {
"image": "busybox:1.33.0",
"imagePullPolicy": "IfNotPresent",
"limits": {
"cpu": "100m",
"memory": "50Mi"
},
"requests": {
"cpu": "20m",
"memory": "5Mi"
}
},
"tlsClient": {}
},
"tikv": {
"affinity": {},
"baseImage": "pingcap/tikv",
"hostNetwork": false,
"image": "pingcap/tikv:v6.1.0",
"imagePullPolicy": "IfNotPresent",
"maxFailoverCount": 3,
"replicas": 3,
"requests": {
"storage": "110Gi"
},
"storageClassName": "longhorn"
},
"timezone": "UTC",
"tlsCluster": {},
"version": "v6.1.0"
},
"status": {
"clusterID": "7169420034058589617",
"conditions": [
{
"lastTransitionTime": "2022-12-09T06:27:02Z",
"lastUpdateTime": "2022-12-12T04:00:25Z",
"message": "TiKV store(s) are not up",
"reason": "TiKVStoreNotUp",
"status": "False",
"type": "Ready"
}
],
"pd": {
"image": "pingcap/pd:v6.1.0",
"leader": {
"clientURL": "http://pre-test-pd-2.pre-test-pd-peer.tidb-cluster.svc:2379",
"health": true,
"id": "6858210497469881484",
"lastTransitionTime": "2022-11-24T03:37:24Z",
"name": "pre-test-pd-2"
},
"members": {
"pre-test-pd-0": {
"clientURL": "http://pre-test-pd-0.pre-test-pd-peer.tidb-cluster.svc:2379",
"health": true,
"id": "7715448974209056711",
"lastTransitionTime": "2022-11-24T03:38:11Z",
"name": "pre-test-pd-0"
},
"pre-test-pd-1": {
"clientURL": "http://pre-test-pd-1.pre-test-pd-peer.tidb-cluster.svc:2379",
"health": true,
"id": "13787701961152413026",
"lastTransitionTime": "2022-11-24T03:37:41Z",
"name": "pre-test-pd-1"
},
"pre-test-pd-2": {
"clientURL": "http://pre-test-pd-2.pre-test-pd-peer.tidb-cluster.svc:2379",
"health": true,
"id": "6858210497469881484",
"lastTransitionTime": "2022-11-24T03:37:24Z",
"name": "pre-test-pd-2"
}
},
"phase": "Normal",
"statefulSet": {
"collisionCount": 0,
"currentReplicas": 3,
"currentRevision": "pre-test-pd-6f74b4fbff",
"observedGeneration": 6,
"readyReplicas": 3,
"replicas": 3,
"updateRevision": "pre-test-pd-6f74b4fbff",
"updatedReplicas": 3
},
"synced": true,
"volumes": {
"pd": {
"boundCount": 3,
"currentCapacity": "5Gi",
"currentCount": 3,
"name": "pd",
"resizedCapacity": "5Gi",
"resizedCount": 3
}
}
},
"pump": {},
"ticdc": {},
"tidb": {
"image": "pingcap/tidb:v6.1.0",
"members": {
"pre-test-tidb-0": {
"health": true,
"lastTransitionTime": "2022-11-24T03:40:44Z",
"name": "pre-test-tidb-0",
"node": "amj-3"
},
"pre-test-tidb-1": {
"health": true,
"lastTransitionTime": "2022-11-24T03:39:33Z",
"name": "pre-test-tidb-1",
"node": "amj-2"
}
},
"phase": "Normal",
"statefulSet": {
"collisionCount": 0,
"currentReplicas": 2,
"currentRevision": "pre-test-tidb-6dfc65fff7",
"observedGeneration": 5,
"readyReplicas": 2,
"replicas": 2,
"updateRevision": "pre-test-tidb-6dfc65fff7",
"updatedReplicas": 2
}
},
"tiflash": {},
"tikv": {
"bootStrapped": true,
"image": "pingcap/tikv:v6.1.0",
"phase": "Scale",
"statefulSet": {
"collisionCount": 0,
"currentReplicas": 5,
"currentRevision": "pre-test-tikv-78b778fcb",
"observedGeneration": 3,
"readyReplicas": 3,
"replicas": 5,
"updateRevision": "pre-test-tikv-78b778fcb",
"updatedReplicas": 5
},
"stores": {
"1": {
"id": "1",
"ip": "pre-test-tikv-0.pre-test-tikv-peer.tidb-cluster.svc",
"lastTransitionTime": "2022-12-12T04:00:49Z",
"leaderCount": 4,
"podName": "pre-test-tikv-0",
"state": "Up"
},
"4": {
"id": "4",
"ip": "pre-test-tikv-1.pre-test-tikv-peer.tidb-cluster.svc",
"lastTransitionTime": "2022-12-12T02:59:04Z",
"leaderCount": 8,
"podName": "pre-test-tikv-1",
"state": "Down"
},
"5": {
"id": "5",
"ip": "pre-test-tikv-2.pre-test-tikv-peer.tidb-cluster.svc",
"lastTransitionTime": "2022-12-12T02:59:04Z",
"leaderCount": 2,
"podName": "pre-test-tikv-2",
"state": "Down"
},
"72180": {
"id": "72180",
"ip": "pre-test-tikv-3.pre-test-tikv-peer.tidb-cluster.svc",
"lastTransitionTime": "2022-12-12T04:03:59Z",
"leaderCount": 0,
"podName": "pre-test-tikv-3",
"state": "Up"
},
"72223": {
"id": "72223",
"ip": "pre-test-tikv-4.pre-test-tikv-peer.tidb-cluster.svc",
"lastTransitionTime": "2022-12-12T04:03:59Z",
"leaderCount": 0,
"podName": "pre-test-tikv-4",
"state": "Up"
}
},
"synced": true,
"tombstoneStores": {
"79736": {
"id": "79736",
"ip": "pre-test-tikv-5.pre-test-tikv-peer.tidb-cluster.svc",
"lastTransitionTime": null,
"leaderCount": 0,
"podName": "pre-test-tikv-5",
"state": "Tombstone"
}
},
"volumes": {
"tikv": {
"boundCount": 5,
"currentCapacity": "110Gi",
"currentCount": 5,
"name": "tikv",
"resizedCapacity": "110Gi",
"resizedCount": 5
}
}
}
}
}

Logs:

I1213 07:29:34.989188 1 tikv_scaler.go:90] scaling in tikv statefulset tidb-cluster/pre-test-tikv, ordinal: 4 (replicas: 4, delete slots: )
E1213 07:29:34.997998 1 tikv_scaler.go:250] can't scale in TiKV of TidbCluster [tidb-cluster/pre-test], cause the number of up stores is equal to MaxReplicas in PD configuration(3), and the store in Pod pre-test-tikv-4 which is going to be deleted is up too
I1213 07:29:34.998262 1 event.go:282] Event(v1.ObjectReference{Kind:"TidbCluster", Namespace:"tidb-cluster", Name:"pre-test", UID:"b249cfb6-8ad0-4c50-897e-d21c0acee6f3", APIVersion:"pingcap.com/v1alpha1", ResourceVersion:"46472491", FieldPath:""}): type: 'Warning' reason: 'FailedScaleIn' can't scale in TiKV of TidbCluster [tidb-cluster/pre-test], cause the number of up stores is equal to MaxReplicas in PD configuration(3), and the store in Pod pre-test-tikv-4 which is going to be deleted is up too
I1213 07:29:45.724813 1 tikv_scaler.go:90] scaling in tikv statefulset tidb-cluster/pre-test-tikv, ordinal: 4 (replicas: 4, delete slots: )
E1213 07:29:45.732652 1 tikv_scaler.go:250] can't scale in TiKV of TidbCluster [tidb-cluster/pre-test], cause the number of up stores is equal to MaxReplicas in PD configuration(3), and the store in Pod pre-test-tikv-4 which is going to be deleted is up too
I1213 07:29:45.732806 1 event.go:282] Event(v1.ObjectReference{Kind:"TidbCluster", Namespace:"tidb-cluster", Name:"pre-test", UID:"b249cfb6-8ad0-4c50-897e-d21c0acee6f3", APIVersion:"pingcap.com/v1alpha1", ResourceVersion:"46472491", FieldPath:""}): type: 'Warning' reason: 'FailedScaleIn' can't scale in TiKV of TidbCluster [tidb-cluster/pre-test], cause the number of up stores is equal to MaxReplicas in PD configuration(3), and the store in Pod pre-test-tikv-4 which is going to be deleted is up too
I1213 07:29:45.754481 1 tidbcluster_control.go:69] TidbCluster: [tidb-cluster/pre-test] updated successfully
I1213 07:29:45.865868 1 tikv_scaler.go:90] scaling in tikv statefulset tidb-cluster/pre-test-tikv, ordinal: 4 (replicas: 4, delete slots: )
E1213 07:29:45.875868 1 tikv_scaler.go:250] can't scale in TiKV of TidbCluster [tidb-cluster/pre-test], cause the number of up stores is equal to MaxReplicas in PD configuration(3), and the store in Pod pre-test-tikv-4 which is going to be deleted is up too
I1213 07:29:45.876047 1 event.go:282] Event(v1.ObjectReference{Kind:"TidbCluster", Namespace:"tidb-cluster", Name:"pre-test", UID:"b249cfb6-8ad0-4c50-897e-d21c0acee6f3", APIVersion:"pingcap.com/v1alpha1", ResourceVersion:"46474051", FieldPath:""}): type: 'Warning' reason: 'FailedScaleIn' can't scale in TiKV of TidbCluster [tidb-cluster/pre-test], cause the number of up stores is equal to MaxReplicas in PD configuration(3), and the store in Pod pre-test-tikv-4 which is going to be deleted is up too
I1213 07:29:51.843729 1 tikv_scaler.go:90] scaling in tikv statefulset tidb-cluster/pre-test-tikv, ordinal: 4 (replicas: 4, delete slots: )
E1213 07:29:51.851744 1 tikv_scaler.go:250] can't scale in TiKV of TidbCluster [tidb-cluster/pre-test], cause the number of up stores is equal to MaxReplicas in PD configuration(3), and the store in Pod pre-test-tikv-4 which is going to be deleted is up too
I1213 07:29:51.852085 1 event.go:282] Event(v1.ObjectReference{Kind:"TidbCluster", Namespace:"tidb-cluster", Name:"pre-test", UID:"b249cfb6-8ad0-4c50-897e-d21c0acee6f3", APIVersion:"pingcap.com/v1alpha1", ResourceVersion:"46474051", FieldPath:""}): type: 'Warning' reason: 'FailedScaleIn' can't scale in TiKV of TidbCluster [tidb-cluster/pre-test], cause the number of up stores is equal to MaxReplicas in PD configuration(3), and the store in Pod pre-test-tikv-4 which is going to be deleted is up too
I1213 07:29:51.869532 1 tidbcluster_control.go:69] TidbCluster: [tidb-cluster/pre-test] updated successfully
I1213 07:29:51.965092 1 tikv_scaler.go:90] scaling in tikv statefulset tidb-cluster/pre-test-tikv, ordinal: 4 (replicas: 4, delete slots: )
E1213 07:29:51.972914 1 tikv_scaler.go:250] can't scale in TiKV of TidbCluster [tidb-cluster/pre-test], cause the number of up stores is equal to MaxReplicas in PD configuration(3), and the store in Pod pre-test-tikv-4 which is going to be deleted is up too
I1213 07:29:51.973294 1 event.go:282] Event(v1.ObjectReference{Kind:"TidbCluster", Namespace:"tidb-cluster", Name:"pre-test", UID:"b249cfb6-8ad0-4c50-897e-d21c0acee6f3", APIVersion:"pingcap.com/v1alpha1", ResourceVersion:"46474101", FieldPath:""}): type: 'Warning' reason: 'FailedScaleIn' can't scale in TiKV of TidbCluster [tidb-cluster/pre-test], cause the number of up stores is equal to MaxReplicas in PD configuration(3), and the store in Pod pre-test-tikv-4 which is going to be deleted is up too

Version: 1.3.9

This can be cleaned up manually: exec into the PD pod and use pd-ctl, deleting the stores by hand with store delete. See the pd-ctl section on the official site for usage.
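A minimal sketch of that, assuming the pd-ctl binary is available at /pd-ctl inside the pingcap/pd image (store IDs 4 and 5 are taken from the status output above):

kubectl -n tidb-cluster exec -it pre-test-pd-0 -- /pd-ctl -u http://127.0.0.1:2379 store
kubectl -n tidb-cluster exec -it pre-test-pd-0 -- /pd-ctl -u http://127.0.0.1:2379 store delete 4
kubectl -n tidb-cluster exec -it pre-test-pd-0 -- /pd-ctl -u http://127.0.0.1:2379 store delete 5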

I've already tried that. After I run store delete, the store changes from Down to Offline and then just stays that way; even after waiting several days its Regions still haven't moved away.


Are there enough remaining nodes to replenish the replicas?

Do you mean the Region replica count? If so, how can I check that?

If it hasn't been changed, it's 3. After scaling in, the remaining TiKV pods must be >= 3.
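For reference, one way to check it (a sketch, run from a PD pod as above, assuming /pd-ctl is present in the image):

kubectl -n tidb-cluster exec -it pre-test-pd-0 -- /pd-ctl -u http://127.0.0.1:2379 config show replication

The max-replicas field in the output is the Region replica count (3 by default).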

store-1 tikv-0 up leader count 4
store-4 tikv-1 down leader count 8
store-5 tikv-2 down leader count 2
store-72180 tikv-3 auto failover up leader count 0
store-72223 tikv-4 auto failover up leader count 0

Right now you first need to move the leaders on store-4 (tikv-1) and store-5 (tikv-2) to the other Up TiKV stores (it's unclear why this wasn't scheduled automatically), and only continue once that is done.
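If PD isn't doing this on its own, leader eviction can be nudged manually with an evict-leader scheduler (a sketch, using the same pd-ctl invocation as above, one scheduler per store):

kubectl -n tidb-cluster exec -it pre-test-pd-0 -- /pd-ctl -u http://127.0.0.1:2379 scheduler add evict-leader-scheduler 4
kubectl -n tidb-cluster exec -it pre-test-pd-0 -- /pd-ctl -u http://127.0.0.1:2379 scheduler add evict-leader-scheduler 5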

That condition is met too. When I first asked this question, I had five TiKV pods: three healthy and two broken, and the Regions still wouldn't move.

I've also tried several approaches from the official docs. A store can no longer be set to Tombstone directly; you can only wait for it to be scheduled away on its own. But I kept the cluster at 5 pods, with 3 of them running, for 2 days and the Regions still weren't migrated.

Is there any way to clear out this store information?

What's also strange is that my other two healthy pods don't have a single leader on them.

A brief summary of the current situation:

These 3 are the original TiKV stores:
store-1 tikv-0 up leader count 4
store-4 tikv-1 down leader count 8
store-5 tikv-2 down leader count 2

These 2 are the auto-failover TiKV stores:
store-72180 tikv-3 auto failover up leader count 0
store-72223 tikv-4 auto failover up leader count 0

Scale-in can only remove the auto-failover TiKV instances first, so (a command sketch for some of these steps follows the list):

  1. Please first try to manually schedule the leaders away, so that the original store-4 and store-5 hold no leaders.
  2. Delete store-4 and store-5 with pd-ctl store delete, and make sure their state goes from Down to Tombstone.
  3. Delete the PVCs and pods of store-4 and store-5 (their data is not needed). The operator will reschedule them automatically and they will rejoin with empty data. Wait for the Regions to rebalance and confirm the new TiKV stores (whose store IDs will not necessarily be 4 and 5) are Up.
  4. Follow https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/use-auto-failover#tikv-故障转移策略 to scale in store-72180 and store-72223 and recover from the failover.
  5. (Optional) Use pd-ctl store remove-tombstone to clean up the Tombstone TiKV stores from step 3.
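A rough command sketch for steps 3 and 5 (steps 1–2 use the pd-ctl commands already shown earlier in the thread; step 4 follows the linked doc; the PVC names below assume the default tikv-<pod-name> pattern, so verify them with kubectl get pvc first):

# step 3: drop the PVCs and pods of tikv-1 / tikv-2 so they rejoin empty
kubectl -n tidb-cluster delete pvc tikv-pre-test-tikv-1 tikv-pre-test-tikv-2
kubectl -n tidb-cluster delete pod pre-test-tikv-1 pre-test-tikv-2
# step 5 (optional): clear the Tombstone records in PD
kubectl -n tidb-cluster exec -it pre-test-pd-0 -- /pd-ctl -u http://127.0.0.1:2379 store remove-tombstone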