Could you share the PD logs? Judging from the operator controller's log, the operator seems unable to communicate with PD over HTTP, so it considers PD not yet up and therefore does not start the TiKV cluster.
nslookup domain tidb-cluster-pd-0.tidb-cluster-pd-peer.tidb-cluster.svc failed
Name: tidb-cluster-pd-0.tidb-cluster-pd-peer.tidb-cluster.svc
Address 1: 192.168.128.52 apiserver.demo
nslookup domain tidb-cluster-pd-0.tidb-cluster-pd-peer.tidb-cluster.svc.svc success
waiting for discovery service to return start args ...
waiting for discovery service to return start args ...
waiting for discovery service to return start args ...
waiting for discovery service to return start args ...
# kubectl logs -n tidb-cluster tidb-cluster-discovery-7d4cfc7f4d-w5m2k
I0526 01:17:45.524197 1 version.go:38] Welcome to TiDB Operator.
I0526 01:17:45.524297 1 version.go:39] TiDB Operator Version: version.Info{GitVersion:"v1.0.6", GitCommit:"982720cd563ece6dbebfc4c579b17fa66a93c550", GitTreeState:"clean", BuildDate:"2019-12-27T16:53:46Z", GoVersion:"go1.13", Compiler:"gc", Platform:"linux/amd64"}
I0526 01:17:45.525474 1 mux.go:40] starting TiDB Discovery server, listening on 0.0.0.0:10261
It looks like the discovery service is not working properly, or the state it maintains in-process may be wrong. I suggest deleting the discovery service's Pod, waiting for a new Pod to be created, and trying again.
I deleted the discovery Pod but it had no effect. After exec-ing into the discovery Pod I can ping the PD node:
# kubectl get pods -n tidb-cluster -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
tidb-cluster-discovery-7d4cfc7f4d-6g5kl 1/1 Running 0 52s 10.100.137.223 acdm1 <none> <none>
tidb-cluster-monitor-588fd6bcd5-pfsqz 3/3 Running 0 26m 10.100.137.221 acdm1 <none> <none>
tidb-cluster-pd-0 1/1 Running 0 26m 192.168.128.52 acdm2 <none> <none>
[root@acdm2 ~]# kubectl logs -n tidb-cluster tidb-cluster-discovery-7d4cfc7f4d-6g5kl
I0526 01:43:16.544964 1 version.go:38] Welcome to TiDB Operator.
I0526 01:43:16.545040 1 version.go:39] TiDB Operator Version: version.Info{GitVersion:"v1.0.6", GitCommit:"982720cd563ece6dbebfc4c579b17fa66a93c550", GitTreeState:"clean", BuildDate:"2019-12-27T16:53:46Z", GoVersion:"go1.13", Compiler:"gc", Platform:"linux/amd64"}
I0526 01:43:16.546561 1 mux.go:40] starting TiDB Discovery server, listening on 0.0.0.0:10261
[root@acdm2 ~]# kubectl exec -it tidb-cluster-discovery-7d4cfc7f4d-6g5kl -n tidb-cluster -- /bin/sh
/ # ping 192.168.128.52
PING 192.168.128.52 (192.168.128.52): 56 data bytes
64 bytes from 192.168.128.52: seq=0 ttl=63 time=0.458 ms
64 bytes from 192.168.128.52: seq=1 ttl=63 time=0.706 ms
^C
--- 192.168.128.52 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.458/0.582/0.706 ms
/ #
@yunlingfly Judging from the discovery logs, the HTTP request PD sends on startup never seems to arrive. I suggest using exec to log into one of the PD Pods and running wget http://tidb-cluster-discovery/new/xxxxx to inspect the actual response and check connectivity from PD to the discovery service.
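For reference, such a check might look like the following sketch. Pod and service names are taken from the listings above; the /new/xxxxx path placeholder is kept as-is and must be replaced with the real argument PD passes, and port 10261 is assumed from the discovery log line above.

```shell
# Open a shell in the PD Pod.
kubectl exec -it tidb-cluster-pd-0 -n tidb-cluster -- /bin/sh

# Inside the PD container: -S prints the HTTP response headers,
# -O - writes the response body to stdout. A connection refused or
# timeout here points at the network path, not at discovery itself.
wget -S -O - http://tidb-cluster-discovery:10261/new/xxxxx
```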
I just upgraded the operator to v1.1.0-rc.4 and there is still no TiKV. After entering the PD Pod, the PD log shows:
nslookup domain basic-pd-0.basic-pd-peer.tidb-cluster.svc failed
Can I set hostNetwork to true for discovery the same way as for PD? It seems the problem is that discovery's IP is allocated from the internal Pod network?
# kubectl get pods -n tidb-cluster -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
basic-discovery-788bf6c7cd-c64vj 1/1 Running 0 4m6s 10.100.137.229 acdm1 <none> <none>
basic-pd-0 1/1 Running 1 4m5s 192.168.128.51 acdm1 <none> <none>
When a Pod starts, DNS for in-cluster services is not immediately ready; normally this error should stop appearing within 10 to 20 seconds. If the error keeps occurring, you need to check whether the Kubernetes cluster's DNS service is working properly.
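One way to check in-cluster DNS independently of the TiDB components is to resolve a well-known service name from a throwaway Pod. This is a sketch; the busybox image tag and label selector are assumptions.

```shell
# Run a one-off Pod and resolve the API server's service name; if this
# fails, CoreDNS itself is the problem, not the PD/discovery Pods.
# busybox:1.28 is used because nslookup is broken in some later tags.
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 -- \
  nslookup kubernetes.default.svc.cluster.local

# Also confirm the CoreDNS Pods and the kube-dns Service are healthy:
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl get svc -n kube-system kube-dns
```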
Could you also share the current TidbCluster YAML definition?
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: basic
spec:
  version: v3.0.13
  timezone: UTC
  pvReclaimPolicy: Retain
  hostNetwork: true
  pd:
    baseImage: pingcap/pd
    replicas: 1
    # if storageClassName is not set, the default Storage Class of the Kubernetes cluster will be used
    storageClassName: nfs
    requests:
      storage: "1Gi"
    config: {}
    hostNetwork: true
  tikv:
    baseImage: pingcap/tikv
    replicas: 1
    # if storageClassName is not set, the default Storage Class of the Kubernetes cluster will be used
    storageClassName: nfs
    requests:
      storage: "1Gi"
    config: {}
    hostNetwork: true
  tidb:
    baseImage: pingcap/tidb
    replicas: 2
    service:
      type: ClusterIP
    config: {}
    hostNetwork: true
So to fix this, is my only option to go into each PD Pod and edit its hosts file to add the DNS entries manually?
I noticed that the peer service has no IP assigned:
# kubectl get services --namespace tidb-cluster -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
basic-discovery ClusterIP 10.96.201.137 <none> 10261/TCP 16m app.kubernetes.io/component=discovery,app.kubernetes.io/instance=basic,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster
basic-pd ClusterIP 10.96.20.95 <none> 2379/TCP 16m app.kubernetes.io/component=pd,app.kubernetes.io/instance=basic,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster
basic-pd-peer ClusterIP None <none> 2380/TCP 16m app.kubernetes.io/component=pd,app.kubernetes.io/instance=basic,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster
The peer service exists as a headless service for peer-to-peer communication within the PD cluster, so it indeed does not get a ClusterIP assigned. You could first try setting hostNetwork to false. I used your TidbCluster YAML definition and could not reproduce the problem in my environment, so I suspect this is related to your cluster's network connectivity.
Isn't false the default? The error message is the same:
nslookup domain basic-pd-0.basic-pd-peer.tidb-cluster.svc failed
Then how do I fix the "cluster network connectivity" so that they can reach each other?
By changing kube-proxy/CoreDNS?
# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-5b8b769fcd-6xs92 1/1 Running 3 4d21h
calico-node-7x4c4 0/1 Running 3 4d21h
calico-node-fw49d 0/1 Running 4 4d21h
calico-node-xjqdh 1/1 Running 3 4d21h
coredns-546565776c-lb8dv 1/1 Running 3 4d21h
coredns-546565776c-vl826 1/1 Running 3 4d21h
eip-nfs-nfs-56759ddf65-frjv9 1/1 Running 5 20h
etcd-acdm2 1/1 Running 3 4d21h
kube-apiserver-acdm2 1/1 Running 3 4d21h
kube-controller-manager-acdm2 1/1 Running 3 4d21h
kube-proxy-jsb5j 1/1 Running 3 4d21h
kube-proxy-p5vf2 1/1 Running 2 4d21h
kube-proxy-r488r 1/1 Running 3 4d21h
kube-scheduler-acdm2 1/1 Running 3 4d21h
kuboard-8b8574658-fbmmr 1/1 Running 0 20h
local-volume-provisioner-55h2f 1/1 Running 3 4d19h
local-volume-provisioner-c6hbz 1/1 Running 3 4d1h
local-volume-provisioner-wxq5r 1/1 Running 3 4d19h
metrics-server-65cf9d584c-jjfbh 1/1 Running 0 20h
tiller-deploy-674cd64556-jhr5r 1/1 Running 2 4d20h
Judging from the networking components under your kube-system namespace, you are using Calico. I suggest checking whether any Calico policies are blocking communication between the discovery service and PD.
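A few commands that may help narrow this down. This is a sketch: the calicoctl commands assume it is installed and configured against this cluster.

```shell
# Kubernetes-native NetworkPolicy objects in all namespaces:
kubectl get networkpolicies --all-namespaces

# Calico-specific policies, which kubectl alone does not show:
calicoctl get networkpolicy --all-namespaces
calicoctl get globalnetworkpolicy

# Two calico-node Pods in the listing above are 0/1 (not Ready),
# which itself can break cross-node Pod traffic; check why:
kubectl describe pod -n kube-system calico-node-7x4c4
```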
Yes, after I reconfigured Calico it works now.