After upgrading tidb-operator, creating an instance gets stuck at PD

k8s version: v1.12.2
operator: v1.1.8
tidb version: v4.0.2
I went through the operator controller log, the PD log, and the discovery log step by step, but I still can't tell what is wrong. Could you take a look? Thanks!

controller.log

E1224 10:48:10.832995       1 pd_member_manager.go:183] failed to sync TidbCluster: [gejb-test/test-001-01]'s status, error: Get "http://test-001-01-pd.gejb-test:2379/pd/health": dial tcp 10.233.36.113:2379: connect: connection refused
E1224 10:48:11.458573       1 tidb_cluster_controller.go:123] TidbCluster: gejb-test/test-001-01, sync failed TidbCluster: gejb-test/test-001-01's pd status sync failed, can't failover, requeuing

pd.log

Address 1: 10.233.95.97 test-001-01-pd-0.test-001-01-pd-peer.gejb-test.svc.cluster.local
nslookup domain test-001-01-pd-0.test-001-01-pd-peer.gejb-test.svc.svc success
waiting for discovery service to return start args ...
waiting for discovery service to return start args ...
waiting for discovery service to return start args ...
waiting for discovery service to return start args ...
waiting for discovery service to return start args ...
waiting for discovery service to return start args ...
waiting for discovery service to return start args ...

discovery.log

I1224 09:44:57.897042       1 discovery.go:62] advertisePeerUrl is: test-001-01-pd-0.test-001-01-pd-peer.gejb-test.svc:2380
E1224 09:45:00.902199       1 server.go:68] failed to discover: test-001-01-pd-0.test-001-01-pd-peer.gejb-test.svc:2380
, Get "https://10.233.0.1:443/apis/pingcap.com/v1alpha1/namespaces/gejb-test/tidbclusters/test-001-01": dial tcp 10.233.0.1:443: connect: connection timed out
E1224 09:45:00.902258       1 server.go:70] failed to writeError: Get "https://10.233.0.1:443/apis/pingcap.com/v1alpha1/namespaces/gejb-test/tidbclusters/test-001-01": dial tcp 10.233.0.1:443: connect: connection timed out

I logged in to the discovery pod and ran:

# kubectl exec -it -n gejb-test test-001-01-discovery-5d9c476d87-hzxtw -- /bin/sh
/ # nc -nvv 10.233.0.1:443
nc: 10.233.0.1:443 (10.233.0.1:443): Operation timed out
sent 0, rcvd 0

/ # nc -nvv 10.226.132.106 6443
10.226.132.106 (10.226.132.106:6443) open    # 10.226.132.106 is one of the endpoints behind the ClusterIP 10.233.0.1
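
The two nc checks above (ClusterIP times out, endpoint answers) can be wrapped into a small helper. This is only a sketch: the function names `probe` and `compare` are made up for illustration, and it assumes a shell with `bash` and coreutils `timeout` available (for example on the node itself or in a debug pod), which the discovery container's /bin/sh may not provide.

```shell
#!/bin/sh
# probe HOST PORT: succeed if a TCP connection opens within 3 seconds.
# Assumption: bash (for /dev/tcp) and coreutils timeout are installed.
probe() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# compare VIP VIP_PORT EP EP_PORT: report which hop looks broken.
compare() {
    if probe "$1" "$2"; then
        echo "ClusterIP reachable"
    elif probe "$3" "$4"; then
        # Endpoint answers but the virtual IP does not: the Service
        # translation on this node (kube-proxy) is the likely suspect.
        echo "ClusterIP unreachable, endpoint reachable: check kube-proxy on this node"
    else
        echo "ClusterIP and endpoint both unreachable: check pod network or apiserver"
    fi
}
```

With the addresses from this thread, the call would be `compare 10.233.0.1 443 10.226.132.106 6443`.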

The k8s service information:

# kubectl describe  service kubernetes
Name:              kubernetes
Namespace:         default
Labels:            component=apiserver
                   provider=kubernetes
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP:                10.233.0.1
Port:              https  443/TCP
TargetPort:        6443/TCP
Endpoints:         10.226.132.106:6443,10.226.132.107:6443,10.226.132.108:6443
Session Affinity:  None
Events:            <none>

Is something misconfigured that is causing instance creation to hang?

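1. Was everything working normally before the upgrade?
2. Please share the upgrade steps you followed.
3. Please post the current output of kubectl get pod.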

It turned out that kube-proxy on one of the k8s nodes was faulty, so pods scheduled onto that node could not be reached normally. It is resolved now, thanks.
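
For anyone retracing this, one way to find the kube-proxy instance serving a given pod is sketched below. The helper names `node_of` and `kube_proxy_on` are hypothetical, and the sketch assumes kube-proxy runs as pods in kube-system with the label k8s-app=kube-proxy, which depends on how the cluster was installed.

```shell
#!/bin/sh
# node_of NAMESPACE POD: print the node the pod is scheduled on.
node_of() {
    kubectl get pod -n "$1" "$2" -o jsonpath='{.spec.nodeName}'
}

# kube_proxy_on NODE: list the kube-proxy pods running on that node.
# Assumption: kube-proxy pods carry the label k8s-app=kube-proxy.
kube_proxy_on() {
    kubectl get pods -n kube-system -l k8s-app=kube-proxy \
        --field-selector "spec.nodeName=$1" -o wide
}
```

For example, `kube_proxy_on "$(node_of gejb-test test-001-01-discovery-5d9c476d87-hzxtw)"` lists the relevant pod, whose logs can then be read with `kubectl logs -n kube-system` followed by the pod name. If kube-proxy runs as a host service instead, its logs have to be checked on the node directly.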

Thanks for the feedback ~

Hello, I saw the question you posted at https://asktug.com/t/topic/66867, and the problem I am hitting now is the same as yours. How did you solve it?

Are the log errors similar as well? Check whether the kube-proxy service is healthy on the node where the pod is scheduled.

It is healthy. The node's proxy log looks like this:

It's not clear which step you are stuck at. Is it that PD has been created, but no TiKV or TiDB nodes come up afterwards?
