Deploying TiDB on Kubernetes: TiKV cannot connect to PD, even though the network and PD itself look healthy

[TiDB Usage Environment] Test
[TiDB Version] V8.5.0
[Operating System] Linux
[Deployment Method] Kubernetes
[Cluster Data Size]
[Number of Cluster Nodes]
[Reproduction Path] Operations performed before the problem appeared
[Problem Encountered: Symptoms and Impact]
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and screenshot that page
[Copy and Paste the ERROR Logs]
[Other Attachments: Screenshots/Logs/Monitoring]

The deployment manifest is the quick-start basic.yaml from the Kubernetes deployment docs, with only the storage configuration changed:

# IT IS NOT SUITABLE FOR PRODUCTION USE.
# This YAML describes a basic TiDB cluster with minimum resource requirements,
# which should be able to run in any Kubernetes cluster with storage support.
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: basic
spec:
  version: v8.5.0
  timezone: UTC
  pvReclaimPolicy: Delete
  enableDynamicConfiguration: true
  configUpdateStrategy: RollingUpdate
  discovery: {}
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: uhub.service.ucloud.cn/pingcap/pd
    maxFailoverCount: 0
    replicas: 1
    # if storageClassName is not set, the default Storage Class of the Kubernetes cluster will be used
    storageClassName: managed-nfs-storage
    requests:
      storage: "1Gi"
    additionalVolumes:
    - name: nfs
      nfs:
        server: 10.3.254.100
        path: /data/k8s-nfs/nfs-provisioner
    config: {}
  tikv:
    baseImage: uhub.service.ucloud.cn/pingcap/tikv
    maxFailoverCount: 0
    # If only 1 TiKV is deployed, the TiKV region leader
    # cannot be transferred during upgrade, so we have
    # to configure a short timeout
    evictLeaderTimeout: 1m
    replicas: 1
    # if storageClassName is not set, the default Storage Class of the Kubernetes cluster will be used
    storageClassName: managed-nfs-storage
    requests:
      storage: "1Gi"
    additionalVolumes:
    - name: nfs
      nfs:
        server: 10.3.254.100
        path: /data/k8s-nfs/nfs-provisioner
    config:
      storage:
        # In basic examples, we set this to avoid using too much storage.
        reserve-space: "0MB"
      rocksdb:
        # In basic examples, we set this to avoid the following error in some Kubernetes clusters:
        # "the maximum number of open file descriptors is too small, got 1024, expect greater or equal to 82920"
        max-open-files: 256
      raftdb:
        max-open-files: 256
  tidb:
    baseImage: uhub.service.ucloud.cn/pingcap/tidb
    maxFailoverCount: 0
    replicas: 1
    service:
      type: ClusterIP
    config: {}

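The manifest was applied roughly as follows (a sketch; tidb-operator and the TidbCluster CRDs were already installed, and tidb-cluster is the namespace used throughout):

kubectl apply -f basic.yaml -n tidb-cluster
# watch startup order: PD should come up first, then TiKV, then TiDB
kubectl get pods -n tidb-cluster -w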
Symptoms after deployment: PD and TiKV show as Running, but TiDB never starts.
root@master:/usr/local/kubernetes/tidb# kubectl get all -n tidb-cluster
NAME                                   READY   STATUS    RESTARTS   AGE
pod/basic-discovery-5b74547b78-8jl9b   1/1     Running   0          21m
pod/basic-pd-0                         1/1     Running   0          21m
pod/basic-tikv-0                       1/1     Running   0          21m
pod/dnsutils                           1/1     Running   0          48m

NAME                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)               AGE
service/basic-discovery   ClusterIP   10.108.28.239   <none>        10261/TCP,10262/TCP   21m
service/basic-pd          ClusterIP   10.103.2.132    <none>        2379/TCP              21m
service/basic-pd-peer     ClusterIP   None            <none>        2380/TCP,2379/TCP     21m
service/basic-tikv-peer   ClusterIP   None            <none>        20160/TCP             21m

NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/basic-discovery   1/1     1            1           21m

NAME                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/basic-discovery-5b74547b78   1         1         1       21m

NAME                          READY   AGE
statefulset.apps/basic-pd     1/1     21m
statefulset.apps/basic-tikv   1/1     21m
root@master:/usr/local/kubernetes/tidb#
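The missing TiDB pod is most likely a downstream effect rather than a separate failure: tidb-operator brings components up in order (PD, then TiKV, then TiDB) and does not start TiDB until TiKV stores have registered with PD, which is consistent with there being no basic-tidb StatefulSet above. Useful places to look while TiKV is stuck (a sketch):

kubectl get tidbcluster basic -n tidb-cluster -o wide   # per-component status columns
kubectl describe tidbcluster basic -n tidb-cluster      # operator events
kubectl logs -n tidb-cluster basic-tikv-0 --tail=50     # the retry loop shown below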

The PD logs (PD campaigns and becomes leader, then notes that the cluster is not yet bootstrapped):
[2025/03/13 06:46:03.746 +00:00] [INFO] [leadership.go:197] ["check campaign resp"] [resp="{\"header\":{\"cluster_id\":15967295672222012935,\"member_id\":7441053368211532809,\"revision\":6,\"raft_term\":2},\"succeeded\":true,\"responses\":[{\"Response\":{\"response_put\":{\"header\":{\"revision\":6}}}}]}"]
[2025/03/13 06:46:03.746 +00:00] [INFO] [leadership.go:206] ["write leaderData to leaderPath ok"] [leader-key=/pd/7481181755356414886/leader] [purpose="leader election"]
[2025/03/13 06:46:03.746 +00:00] [INFO] [server.go:1717] ["campaign PD leader ok"] [campaign-leader-name=basic-pd-0]
[2025/03/13 06:46:03.746 +00:00] [INFO] [lease.go:167] ["start lease keep alive worker"] [interval=1s] [purpose="leader election"]
[2025/03/13 06:46:03.748 +00:00] [INFO] [server.go:1840] ["server enable region storage"]
[2025/03/13 06:46:03.749 +00:00] [INFO] [server.go:1734] ["triggering the leader callback functions"]
[2025/03/13 06:46:03.749 +00:00] [WARN] [manager.go:124] ["un-marshall controller config failed, fallback to default"] [error="unexpected end of JSON input"] [v=]
[2025/03/13 06:46:03.759 +00:00] [INFO] [manager.go:187] ["resource group manager finishes initialization"]
[2025/03/13 06:46:03.766 +00:00] [INFO] [cluster.go:510] ["initializing the global TSO allocator"]
[2025/03/13 06:46:03.766 +00:00] [INFO] [tso.go:161] ["start to sync timestamp"]
[2025/03/13 06:46:03.771 +00:00] [INFO] [tso.go:221] ["sync and save timestamp"] [last=0001/01/01 00:00:00.000 +00:00] [last-saved=0001/01/01 00:00:00.000 +00:00] [save=2025/03/13 06:46:06.767 +00:00] [next=2025/03/13 06:46:03.767 +00:00]
[2025/03/13 06:46:03.772 +00:00] [WARN] [cluster.go:361] ["cluster is not bootstrapped"]
[2025/03/13 06:46:03.775 +00:00] [INFO] [id.go:175] ["idAllocator allocates a new id"] [new-end=1000] [new-base=0] [label=idalloc] [check-curr-end=true]
[2025/03/13 06:46:03.775 +00:00] [INFO] [util.go:50] ["load pd and cluster version"] [pd-version=8.5.0] [cluster-version=0.0.0]
[2025/03/13 06:46:03.775 +00:00] [INFO] [server.go:1768] ["PD leader is ready to serve"] [leader-name=basic-pd-0]
[2025/03/13 06:46:04.745 +00:00] [INFO] [server.go:1303] ["PD server config is updated"] [new="{\"use-region-storage\":\"true\",\"max-gap-reset-ts\":\"24h0m0s\",\"key-type\":\"table\",\"runtime-services\":\"\",\"metric-storage\":\"\",\"dashboard-address\":\"http://basic-pd-0.basic-pd-peer.tidb-cluster.svc:2379\",\"flow-round-by-digit\":3,\"min-resolved-ts-persistence-interval\":\"1s\",\"server-memory-limit\":0,\"server-memory-limit-gc-trigger\":0.7,\"enable-gogc-tuner\":\"false\",\"gc-tuner-threshold\":0.6,\"block-safe-point-v1\":\"false\"}"] [old="{\"use-region-storage\":\"true\",\"max-gap-reset-ts\":\"24h0m0s\",\"key-type\":\"table\",\"runtime-services\":\"\",\"metric-storage\":\"\",\"dashboard-address\":\"auto\",\"flow-round-by-digit\":3,\"min-resolved-ts-persistence-interval\":\"1s\",\"server-memory-limit\":0,\"server-memory-limit-gc-trigger\":0.7,\"enable-gogc-tuner\":\"false\",\"gc-tuner-threshold\":0.6,\"block-safe-point-v1\":\"false\"}"]
[2025/03/13 06:46:05.871 +00:00] [INFO] [dbstore.go:33] ["Dashboard initializing local storage file"] [path=/var/lib/pd/dashboard.sqlite.db]
[2025/03/13 06:46:06.160 +00:00] [INFO] [version.go:33] ["TiDB Dashboard started"] [internal-version=8.4.0-618b5cde] [standalone=No] [pd-version=v8.5.0] [build-time="2024-12-11 06:55:18"] [build-git-hash=618b5cded5bf]
[2025/03/13 06:46:06.160 +00:00] [INFO] [manager.go:201] ["dashboard server is started"]
[2025/03/13 06:46:06.160 +00:00] [INFO] [proxy.go:211] ["start serve requests to remotes"] [endpoint=127.0.0.1:34971] [remotes=""]
[2025/03/13 06:46:06.160 +00:00] [INFO] [proxy.go:211] ["start serve requests to remotes"] [endpoint=127.0.0.1:33033] [remotes=""]
[2025/03/13 06:46:06.161 +00:00] [WARN] [dynamic_config_manager.go:165] ["Dynamic config does not exist in etcd"]
[2025/03/13 06:46:06.264 +00:00] [INFO] [manager.go:74] ["Key visual service is started"]

The TiKV logs (an endless retry loop against the PD endpoint, each attempt timing out):
[2025/03/13 07:02:17.956 +00:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=http://basic-pd:2379] [thread_id=1]
[2025/03/13 07:02:19.957 +00:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://basic-pd:2379] [thread_id=1]
[2025/03/13 07:02:20.258 +00:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=http://basic-pd:2379] [thread_id=1]
[2025/03/13 07:02:22.259 +00:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://basic-pd:2379] [thread_id=1]
[2025/03/13 07:02:22.259 +00:00] [WARN] [client.rs:169] ["validate PD endpoints failed"] [err="Other(\"[components/pd_client/src/util.rs:634]: PD cluster failed to respond\")"] [thread_id=1]
[2025/03/13 07:02:22.560 +00:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=http://basic-pd:2379] [thread_id=1]
[2025/03/13 07:02:24.561 +00:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://basic-pd:2379] [thread_id=1]
[2025/03/13 07:02:24.862 +00:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=http://basic-pd:2379] [thread_id=1]
[2025/03/13 07:02:26.863 +00:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://basic-pd:2379] [thread_id=1]
[2025/03/13 07:02:27.164 +00:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=http://basic-pd:2379] [thread_id=1]
[2025/03/13 07:02:29.165 +00:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://basic-pd:2379] [thread_id=1]
[2025/03/13 07:02:29.466 +00:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=http://basic-pd:2379] [thread_id=1]
[2025/03/13 07:02:31.467 +00:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://basic-pd:2379] [thread_id=1]
[2025/03/13 07:02:31.768 +00:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=http://basic-pd:2379] [thread_id=1]
[2025/03/13 07:02:33.769 +00:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://basic-pd:2379] [thread_id=1]
[2025/03/13 07:02:34.070 +00:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=http://basic-pd:2379] [thread_id=1]

From inside the TiKV pod, DNS resolution of basic-pd works, and PD's HTTP API on port 2379 is reachable:
[root@basic-tikv-0 /]# curl http://basic-pd:2379/health
{"health":"true","reason":""}[root@basic-tikv-0 /]#
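Note that curl /health is plain HTTP/1.1 against the Service VIP, whereas TiKV speaks gRPC (HTTP/2) and, after the initial GetMembers call, dials each PD member's advertised client URL directly. So probes along these lines may also be relevant (a sketch; the FQDN is taken from the dashboard-address in the PD log above):

# list PD members and their advertised client URLs
curl http://basic-pd:2379/pd/api/v1/members
# probe the per-pod peer DNS name that PD advertises
curl http://basic-pd-0.basic-pd-peer.tidb-cluster.svc:2379/health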

What is causing this behavior, and how should it be fixed?

Which version of tidb-operator are you running?

Try this:
https://github.com/pingcap/tidb-operator/issues/5372#issuecomment-1794020036
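The authoritative details are in that comment. As a hypothetical illustration of the shape such workarounds usually take (a one-line TidbCluster spec change; not necessarily the exact fix in the linked comment), aligning spec.clusterDomain with the cluster's actual DNS domain would look like:

# hypothetical sketch: make PD/TiKV advertise FQDNs matching the real cluster domain
kubectl patch tidbcluster basic -n tidb-cluster --type merge \
  -p '{"spec":{"clusterDomain":"cluster.local"}}'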

That configuration did indeed work. My Operator version is 1.6.1. Many thanks!