Deploying a TiDB cluster on K8s v1.18 fails _(:ι」∠)_

Please share the PD logs. From the operator controller's log, it looks like the operator cannot communicate with PD over HTTP, so it considers PD not yet started and therefore does not start the TiKV cluster.

nslookup domain tidb-cluster-pd-0.tidb-cluster-pd-peer.tidb-cluster.svc failed
Name:      tidb-cluster-pd-0.tidb-cluster-pd-peer.tidb-cluster.svc
Address 1: 192.168.128.52 apiserver.demo
nslookup domain tidb-cluster-pd-0.tidb-cluster-pd-peer.tidb-cluster.svc.svc success
waiting for discovery service to return start args ...
waiting for discovery service to return start args ...
waiting for discovery service to return start args ...
waiting for discovery service to return start args ...
# kubectl logs -n tidb-cluster tidb-cluster-discovery-7d4cfc7f4d-w5m2k
I0526 01:17:45.524197       1 version.go:38] Welcome to TiDB Operator.
I0526 01:17:45.524297       1 version.go:39] TiDB Operator Version: version.Info{GitVersion:"v1.0.6", GitCommit:"982720cd563ece6dbebfc4c579b17fa66a93c550", GitTreeState:"clean", BuildDate:"2019-12-27T16:53:46Z", GoVersion:"go1.13", Compiler:"gc", Platform:"linux/amd64"}
I0526 01:17:45.525474       1 mux.go:40] starting TiDB Discovery server, listening on 0.0.0.0:10261

It looks like the discovery service is not working properly, or the state it maintains in-process may be wrong. I suggest deleting the discovery service's Pod, waiting for a new Pod to be created and started, and trying again.

I deleted the discovery Pod, but it made no difference. After exec-ing into the discovery container, I can ping the PD node:

# kubectl get pods -n tidb-cluster -o wide
NAME                                      READY   STATUS    RESTARTS   AGE   IP               NODE    NOMINATED NODE   READINESS GATES
tidb-cluster-discovery-7d4cfc7f4d-6g5kl   1/1     Running   0          52s   10.100.137.223   acdm1   <none>           <none>
tidb-cluster-monitor-588fd6bcd5-pfsqz     3/3     Running   0          26m   10.100.137.221   acdm1   <none>           <none>
tidb-cluster-pd-0                         1/1     Running   0          26m   192.168.128.52   acdm2   <none>           <none>
[root@acdm2 ~]# kubectl logs -n tidb-cluster tidb-cluster-discovery-7d4cfc7f4d-6g5kl
I0526 01:43:16.544964       1 version.go:38] Welcome to TiDB Operator.
I0526 01:43:16.545040       1 version.go:39] TiDB Operator Version: version.Info{GitVersion:"v1.0.6", GitCommit:"982720cd563ece6dbebfc4c579b17fa66a93c550", GitTreeState:"clean", BuildDate:"2019-12-27T16:53:46Z", GoVersion:"go1.13", Compiler:"gc", Platform:"linux/amd64"}
I0526 01:43:16.546561       1 mux.go:40] starting TiDB Discovery server, listening on 0.0.0.0:10261
[root@acdm2 ~]# kubectl exec -it tidb-cluster-discovery-7d4cfc7f4d-6g5kl -n tidb-cluster -- /bin/sh
/ # ping 192.168.128.52
PING 192.168.128.52 (192.168.128.52): 56 data bytes
64 bytes from 192.168.128.52: seq=0 ttl=63 time=0.458 ms
64 bytes from 192.168.128.52: seq=1 ttl=63 time=0.706 ms
^C
--- 192.168.128.52 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.458/0.582/0.706 ms
/ #

@yunlingfly Judging from the discovery logs, it seems discovery never received the HTTP request that PD sends at startup. I suggest using kubectl exec to get a shell inside one of the PD Pods, then running wget http://tidb-cluster-discovery/new/xxxxx to see the actual response and check connectivity from PD to discovery.
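For reference, this is roughly how the PD container builds that discovery request. This sketch approximates the tidb-operator v1.x PD startup script from memory (the exact encoding and endpoint may differ in your operator version); the names come from the output earlier in this thread. The script base64-encodes its own peer address and polls the /new/ endpoint, which is what produces the "waiting for discovery service to return start args ..." log lines above:

```shell
# Sketch of how the PD container requests its start args from discovery
# (approximating the tidb-operator v1.x startup script; not verbatim).
domain="tidb-cluster-pd-0.tidb-cluster-pd-peer.tidb-cluster.svc"

# The startup script base64-encodes "<domain>:2380" (the peer port),
# stripping newlines so the result is URL-safe as a path segment.
encoded=$(printf '%s' "${domain}:2380" | base64 | tr -d '\n')

url="http://tidb-cluster-discovery.tidb-cluster.svc:10261/new/${encoded}"
echo "$url"

# From inside the PD Pod you would then fetch it manually to inspect the response:
#   wget -qO- -T 3 "$url"
```

If the wget from inside the PD Pod times out or returns an error, the problem is on the PD-to-discovery path rather than in discovery itself.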

I just upgraded the operator to v1.1.0-rc.4, and there is still no TiKV. After entering the PD Pod, I found the following in the PD log:

nslookup domain basic-pd-0.basic-pd-peer.tidb-cluster.svc failed

Can I set hostNetwork to true for discovery the same way I did for PD? It seems the problem is that discovery's IP is allocated from the internal Pod network?

# kubectl get pods -n tidb-cluster -o wide
NAME                               READY   STATUS    RESTARTS   AGE    IP               NODE    NOMINATED NODE   READINESS GATES
basic-discovery-788bf6c7cd-c64vj   1/1     Running   0          4m6s   10.100.137.229   acdm1   <none>           <none>
basic-pd-0                         1/1     Running   1          4m5s   192.168.128.51   acdm1   <none>           <none>

When a Pod starts, DNS for Services inside Kubernetes is not immediately ready; this error should normally stop appearing within 10-20 seconds. If it keeps occurring, you need to check whether the DNS service of this Kubernetes cluster's network is working properly.
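Because Service DNS can lag Pod startup by a few seconds, startup scripts typically retry the lookup rather than fail on the first attempt. A minimal sketch of such a retry loop (the helper name and timings are my own, not taken from the operator's scripts):

```shell
# retry: run a command repeatedly until it succeeds, up to $1 attempts,
# sleeping 1 second between attempts. Returns non-zero if all attempts fail.
retry() {
  max=$1; shift
  n=0
  until "$@"; do
    n=$((n + 1))
    [ "$n" -ge "$max" ] && return 1
    sleep 1
  done
}

# Usage against the DNS name from this thread, e.g.:
#   retry 20 nslookup basic-pd-0.basic-pd-peer.tidb-cluster.svc
```

If the name still fails to resolve well past that window, the problem is the cluster's DNS service itself, not startup timing.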

Please share the YAML definition of your current TidbCluster.

apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: basic
spec:
  version: v3.0.13
  timezone: UTC
  pvReclaimPolicy: Retain
  hostNetwork: true
  pd:
    baseImage: pingcap/pd
    replicas: 1
    # if storageClassName is not set, the default Storage Class of the Kubernetes cluster will be used
    storageClassName: nfs
    requests:
      storage: "1Gi"
    config: {}
    hostNetwork: true
  tikv:
    baseImage: pingcap/tikv
    replicas: 1
    # if storageClassName is not set, the default Storage Class of the Kubernetes cluster will be used
    storageClassName: nfs
    requests:
      storage: "1Gi"
    config: {}
    hostNetwork: true
  tidb:
    baseImage: pingcap/tidb
    replicas: 2
    service:
      type: ClusterIP
    config: {}
    hostNetwork: true

So if I want to fix this, is my only option to go into each PD Pod and edit its hosts file to add the resolution manually?

I noticed that the peer service has no IP assigned at all:

# kubectl get services --namespace tidb-cluster -o wide
NAME              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE   SELECTOR
basic-discovery   ClusterIP   10.96.201.137   <none>        10261/TCP   16m   app.kubernetes.io/component=discovery,app.kubernetes.io/instance=basic,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster
basic-pd          ClusterIP   10.96.20.95     <none>        2379/TCP    16m   app.kubernetes.io/component=pd,app.kubernetes.io/instance=basic,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster
basic-pd-peer     ClusterIP   None            <none>        2380/TCP    16m   app.kubernetes.io/component=pd,app.kubernetes.io/instance=basic,app.kubernetes.io/managed-by=tidb-operator,app.kubernetes.io/name=tidb-cluster

The peer service is a headless service used for peer-to-peer communication within the PD cluster, so it is expected to have no cluster IP. You could first try setting hostNetwork to false. I applied your TidbCluster YAML and could not reproduce the problem in my environment, so I suspect this is related to network connectivity inside your cluster.
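For context, a headless Service is simply one declared with clusterIP: None; instead of a single virtual IP, cluster DNS publishes one record per backing Pod (e.g. basic-pd-0.basic-pd-peer.tidb-cluster.svc), which is exactly the name PD is failing to resolve. A simplified sketch of what the operator's peer service looks like, with values taken from the kubectl get services output above (trimmed, not the operator's exact manifest):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: basic-pd-peer
  namespace: tidb-cluster
spec:
  clusterIP: None            # headless: DNS resolves directly to Pod IPs
  ports:
    - name: peer
      port: 2380
  selector:
    app.kubernetes.io/component: pd
    app.kubernetes.io/instance: basic
```

So the `CLUSTER-IP None` in your output is correct behavior; the failing per-Pod DNS lookup points at the cluster's DNS or network layer instead.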

Isn't false the default? The error message is the same either way:

nslookup domain basic-pd-0.basic-pd-peer.tidb-cluster.svc failed

Then how do I configure the "cluster network connectivity" so they can reach each other?

By modifying kube-proxy / CoreDNS?

# kubectl get pods -n kube-system
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-5b8b769fcd-6xs92   1/1     Running   3          4d21h
calico-node-7x4c4                          0/1     Running   3          4d21h
calico-node-fw49d                          0/1     Running   4          4d21h
calico-node-xjqdh                          1/1     Running   3          4d21h
coredns-546565776c-lb8dv                   1/1     Running   3          4d21h
coredns-546565776c-vl826                   1/1     Running   3          4d21h
eip-nfs-nfs-56759ddf65-frjv9               1/1     Running   5          20h
etcd-acdm2                                 1/1     Running   3          4d21h
kube-apiserver-acdm2                       1/1     Running   3          4d21h
kube-controller-manager-acdm2              1/1     Running   3          4d21h
kube-proxy-jsb5j                           1/1     Running   3          4d21h
kube-proxy-p5vf2                           1/1     Running   2          4d21h
kube-proxy-r488r                           1/1     Running   3          4d21h
kube-scheduler-acdm2                       1/1     Running   3          4d21h
kuboard-8b8574658-fbmmr                    1/1     Running   0          20h
local-volume-provisioner-55h2f             1/1     Running   3          4d19h
local-volume-provisioner-c6hbz             1/1     Running   3          4d1h
local-volume-provisioner-wxq5r             1/1     Running   3          4d19h
metrics-server-65cf9d584c-jjfbh            1/1     Running   0          20h
tiller-deploy-674cd64556-jhr5r             1/1     Running   2          4d20h

Judging from the networking components under kube-system, you are using the Calico network plugin. I suggest checking whether any Calico-related network policies are blocking communication between discovery and PD.
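One concrete thing to check is whether a default-deny NetworkPolicy (or Calico GlobalNetworkPolicy) selects these Pods. As a purely hypothetical illustration, a policy that would allow PD Pods to reach the discovery service on its port could look like the following; the labels are copied from the Service selectors shown earlier in this thread, and this is not something the operator installs for you:

```yaml
# Hypothetical allow rule; only needed if a deny-by-default policy is in effect.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-pd-to-discovery
  namespace: tidb-cluster
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: discovery
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/component: pd
      ports:
        - protocol: TCP
          port: 10261       # discovery's listening port, per its startup log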

Yes, that was it. After reconfiguring Calico, it works now.

:crossed_fingers:

@Yisaer Thanks to the engineering team for the help.

@yunlingfly Feel free to open a new thread if you run into further issues.