No tidb pod appears after helm install tidb-cluster

  • [TiDB version]: v1.1.9
  • [Problem description]:
    After helm install tidb-cluster, no tidb pod shows up.

The controller manager log is as follows:

     E0122 10:11:51.319762       1 pd_member_manager.go:183] failed to sync TidbCluster: [test-tidb/test]'s status, error: Get http://test-pd.test-tidb:2379/pd/health: dial tcp 10.98.17.173:2379: connect: connection refused
     E0122 10:11:51.319967       1 tidb_cluster_controller.go:123] TidbCluster: test-tidb/test, sync failed TidbCluster: test-tidb/test's pd status sync failed, can't failover, requeuing
     E0122 10:12:21.308840       1 pd_member_manager.go:183] failed to sync TidbCluster: [test-tidb/test]'s status, error: Get http://test-pd.test-tidb:2379/pd/health: dial tcp 10.98.17.173:2379: connect: connection refused
     E0122 10:12:21.309010       1 tidb_cluster_controller.go:123] TidbCluster: test-tidb/test, sync failed TidbCluster: test-tidb/test's pd status sync failed, can't failover, requeuing
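
To confirm what the operator is seeing, you can probe the same PD health endpoint from inside the cluster. A minimal sketch (the curlimages/curl image and the throwaway pod name are arbitrary choices, not from this thread):

     kubectl run pd-health-probe -n test-tidb --rm -it --restart=Never \
       --image=curlimages/curl -- curl -sv http://test-pd.test-tidb:2379/pd/health

Getting "connection refused" here as well would point at PD itself rather than at the operator.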

The kubectl get all output is as follows:


    [root@k8s-client tidb-operator]# kubectl get all -n test-tidb
    NAME                                  READY   STATUS    RESTARTS   AGE
    pod/test-discovery-684b5b6479-2mx4q   1/1     Running   0          13m
    pod/test-monitor-6cbbbf6f77-scvz2     3/3     Running   0          13m
    pod/test-pd-0                         1/1     Running   2          7m23s

    NAME                            TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                          AGE
    service/test-discovery          ClusterIP      10.110.83.186    <none>        10261/TCP,10262/TCP              13m
    service/test-grafana            NodePort       10.100.174.220   <none>        3000:31562/TCP                   13m
    service/test-monitor-reloader   NodePort       10.100.175.42    <none>        9089:31875/TCP                   13m
    service/test-pd                 LoadBalancer   10.98.17.173     <pending>     2379:31542/TCP                   13m
    service/test-pd-peer            ClusterIP      None             <none>        2380/TCP                         13m
    service/test-prometheus         NodePort       10.107.150.199   <none>        9090:30368/TCP                   13m
    service/test-tidb               NodePort       10.100.174.117   <none>        4000:31969/TCP,10080:31576/TCP   13m

    NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/test-discovery   1/1     1            1           13m
    deployment.apps/test-monitor     1/1     1            1           13m

    NAME                                        DESIRED   CURRENT   READY   AGE
    replicaset.apps/test-discovery-684b5b6479   1         1         1       13m
    replicaset.apps/test-discovery-776c6b7f47   0         0         0       13m
    replicaset.apps/test-monitor-6cbbbf6f77     1         1         1       13m

    NAME                       READY   AGE
    statefulset.apps/test-pd   1/1     13m

I'm losing my mind here, I've been staring at this for ages. Does anyone know why this happens?

The pd service is of type LoadBalancer and its EXTERNAL-IP is stuck at <pending>. If this is a test environment, try installing with the pd service set to ClusterIP first.
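
A minimal sketch of that change, assuming the tidb-cluster chart of this era exposes the service type through a top-level services list in values.yaml (verify against your chart version before applying):

    # values.yaml for the tidb-cluster chart
    services:
      - name: pd
        type: ClusterIP    # was LoadBalancer

    # then roll it out, e.g.:
    # helm upgrade <release> pingcap/tidb-cluster -f values.yaml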

After changing it to ClusterIP, there doesn't seem to be much difference:

    [root@k8s-client tidb-operator]# kubectl get all -n test-tidb
    NAME                                  READY   STATUS    RESTARTS   AGE
    pod/test-discovery-684b5b6479-7f6m8   1/1     Running   0          4m10s
    pod/test-monitor-6cbbbf6f77-fqslx     3/3     Running   0          4m15s
    pod/test-pd-0                         1/1     Running   1          4m15s

    NAME                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                          AGE
    service/test-discovery          ClusterIP   10.107.146.141   <none>        10261/TCP,10262/TCP              4m15s
    service/test-grafana            NodePort    10.103.232.247   <none>        3000:30891/TCP                   4m15s
    service/test-monitor-reloader   NodePort    10.99.221.8      <none>        9089:30515/TCP                   4m15s
    service/test-pd                 ClusterIP   10.111.90.0      <none>        2379/TCP                         4m15s
    service/test-pd-peer            ClusterIP   None             <none>        2380/TCP                         4m15s
    service/test-prometheus         NodePort    10.98.224.132    <none>        9090:32389/TCP                   4m15s
    service/test-tidb               NodePort    10.99.138.27     <none>        4000:30318/TCP,10080:31172/TCP   4m15s

    NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/test-discovery   1/1     1            1           4m15s
    deployment.apps/test-monitor     1/1     1            1           4m15s

    NAME                                        DESIRED   CURRENT   READY   AGE
    replicaset.apps/test-discovery-684b5b6479   1         1         1       4m10s
    replicaset.apps/test-discovery-776c6b7f47   0         0         0       4m15s
    replicaset.apps/test-monitor-6cbbbf6f77     1         1         1       4m15s

    NAME                       READY   AGE
    statefulset.apps/test-pd   1/1     4m15s

The controller manager errors are still the same:

     E0124 05:37:54.653598       1 pd_member_manager.go:183] failed to sync TidbCluster: [test-tidb/test]'s status, error: Get http://test-pd.test-tidb:2379/pd/health: dial tcp 10.111.90.0:2379: connect: connection refused
     E0124 05:37:54.653968       1 tidb_cluster_controller.go:123] TidbCluster: test-tidb/test, sync failed TidbCluster: test-tidb/test's pd status sync failed, can't failover, requeuing
     E0124 05:38:24.662756       1 pd_member_manager.go:183] failed to sync TidbCluster: [test-tidb/test]'s status, error: Get http://test-pd.test-tidb:2379/pd/health: dial tcp 10.111.90.0:2379: connect: connection refused
     E0124 05:38:24.663119       1 tidb_cluster_controller.go:123] TidbCluster: test-tidb/test, sync failed TidbCluster: test-tidb/test's pd status sync failed, can't failover, requeuing
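
Since "connection refused" means nothing is listening on 2379 at all, it is also worth inspecting the PD pod itself (note the non-zero RESTARTS in the output above). For example:

    kubectl logs -n test-tidb test-pd-0
    kubectl logs -n test-tidb test-pd-0 --previous   # logs from the run before the last restart
    kubectl describe pod -n test-tidb test-pd-0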
  1. Please share the state of tidb-operator itself. Is everything there currently normal?
    kubectl get all -n tidb-admin

  2. Please share the tidb-cluster configuration file so we can look at the pd, tidb, and tikv settings (a sketch for dumping the effective values follows this list).

  3. Given the error below, check whether the clocks on all the nodes are consistent (see https://blog.liu-kevin.com/2018/12/22/k8syi-chang-wen-ti/),
    and also verify that the network between the nodes is healthy.

     E0124 05:37:54.653598       1 pd_member_manager.go:183] failed to sync TidbCluster: [test-tidb/test]'s status, error: Get http://test-pd.test-tidb:2379/pd/health: dial tcp 10.111.90.0:2379: connect: connection refused
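
As a concrete sketch for checks 2 and 3 (assuming the helm release is named tidb-cluster and the nodes run systemd; adjust the names to your environment):

    # 2. dump the effective chart values to review the pd/tidb/tikv sections
    helm get values tidb-cluster

    # 3. on each Kubernetes node, compare clocks and NTP sync status
    date
    timedatectl status | grep -i synchronized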