AWS EKS deployment issue: no k8s nodes come up

When running terraform apply under /deploy/aws, I hit this error:
Error: tiller was not found. polling deadline exceeded
At first I assumed this was the cause, but it turned out not to be a version-number problem.

Instead, Kubernetes initialization itself failed; it looks like the EC2 worker nodes never started:

no nodes available to schedule pods


➜  aws git:(develop) ✗ kubectl get pod --all-namespaces
NAMESPACE     NAME                             READY   STATUS    RESTARTS   AGE
kube-system   coredns-9b6bd4456-2qkjt          0/1     Pending   0          165m
kube-system   coredns-9b6bd4456-pgk2n          0/1     Pending   0          165m
kube-system   tiller-deploy-7f4678d5c5-g2kjs   0/1     Pending   0          21m
➜  aws git:(develop) ✗ kubectl describe pod tiller-deploy-7f4678d5c5-g2kjs -n kube-system|tail -n 4
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  62s (x16 over 21m)  default-scheduler  no nodes available to schedule pods
➜  aws git:(develop) ✗ kubectl describe pod coredns-9b6bd4456-2qkjt -n kube-system|tail -n 4
Events:
  Type     Reason            Age                     From               Message
  ----     ------            ----                    ----               -------
  Warning  FailedScheduling  4m15s (x111 over 165m)  default-scheduler  no nodes available to schedule pods
➜  aws git:(develop) ✗ kubectl describe pod coredns-9b6bd4456-pgk2n -n kube-system|tail -n 4
Events:
  Type     Reason            Age                     From               Message
  ----     ------            ----                    ----               -------
  Warning  FailedScheduling  4m28s (x110 over 165m)  default-scheduler  no nodes available to schedule pods


➜  aws git:(develop) ✗ kubectl get nodes
No resources found.

请问应该怎么诊断?我需要还提供什么信息?

➜  aws git:(develop) ✗ kubectl config current-context
eks_my-tidb-cluster

➜  aws git:(develop) ✗ helm version
WARNING: "kubernetes-charts.storage.googleapis.com" is deprecated for "stable" and will be deleted Nov. 13, 2020.
WARNING: You should switch to "https://charts.helm.sh/stable"
Client: &version.Version{SemVer:"v2.17.0", GitCommit:"a690bad98af45b015bd3da1a41f6218b1a451dbe", GitTreeState:"clean"}
Error: could not find a ready tiller pod
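
(Generic commands that may help narrow down why no nodes register; the Auto Scaling group name below is a placeholder, not a value from this deployment:)

# Did the worker Auto Scaling group actually launch any instances?
aws autoscaling describe-auto-scaling-groups \
  --query 'AutoScalingGroups[].{Name:AutoScalingGroupName,Desired:DesiredCapacity,Running:length(Instances)}'

# Recent scaling activity often shows launch failures (placeholder ASG name)
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name my-tidb-cluster-workers --max-items 5

# Is the EKS control plane itself healthy?
aws eks describe-cluster --name my-tidb-cluster --query 'cluster.status'

# If instances run but never join, check the aws-auth ConfigMap mapping
kubectl -n kube-system get configmap aws-auth -o yaml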

Answer

  • The aws terraform for tidb-operator is community-maintained; PingCAP does not officially recommend it. The officially recommended approach is aws eksctl.
  • A few points to note about aws eks:
    • When creating the cluster, you can omit vpc_id
      • This is the common approach and usually works fine: aws eksctl creates a VPC from scratch and sets up the igw/subnet/nat inside it.
    • When creating the cluster, you can also specify vpc_id (i.e., use an existing VPC)
      • This was my case. Because Direct Connect (DX) is involved, company policy requires VPCs to be allocated by a dedicated team.
      • In this case you have to set up the network resources inside the VPC yourself (igw/subnet/nat).
      • Points to note (see the CLI sketch after this list):
        1. Public subnets need map_public_ip_on_launch enabled, so that instances get a public IP (and can therefore reach the internet).
        2. Private subnets need a NAT gateway and route tables configured, so they can reach the internet.
        3. Subnets need plenty of free IP addresses. The official docs require this but never say how many; my feeling is at least a /26 (64 addresses).
        4. Subnets need a few tags set.
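
A rough AWS CLI sketch of points 1 and 4 (the subnet IDs are placeholders; the cluster name my-tidb-cluster matches the context above):

# Point 1: let instances in the public subnet get a public IP on launch
aws ec2 modify-subnet-attribute \
  --subnet-id subnet-0123456789abcdef0 --map-public-ip-on-launch

# Point 4: tags EKS expects on subnets
aws ec2 create-tags --resources subnet-0123456789abcdef0 --tags \
  Key=kubernetes.io/cluster/my-tidb-cluster,Value=shared \
  Key=kubernetes.io/role/elb,Value=1            # public subnets
aws ec2 create-tags --resources subnet-0fedcba9876543210 --tags \
  Key=kubernetes.io/cluster/my-tidb-cluster,Value=shared \
  Key=kubernetes.io/role/internal-elb,Value=1   # private subnets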

@dulao5 We have abandoned the terraform deployment in our official docs because the eks terraform module is not stable enough, could you please follow our doc here to deploy EKS with eksctl?
Just let us know if any issues with the procedure, thanks!
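
(For reference, the smallest possible eksctl invocation is sketched below; the name, region, and node count are placeholders, and the official doc uses a fuller config file:)

eksctl create cluster \
  --name my-tidb-cluster \
  --region ap-northeast-1 \
  --nodes 3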

Oh, thank you.
I read the wrong document and code -_-!

Okay, I’ll use eksctl.

By the way, which version of helm should I use? 2 or 3?

For now, please use the latest v2 version.

Thank you very much!

You’re welcome!
Just update here if any issues with the eksctl procedures.


Hello!
I would like to ask you some questions:

  1. Neither the GCP nor the AWS documentation uses terraform now.
    Do you plan to discard the existing terraform code (e.g. aws main.tf, gcp main.tf)?
    Or are you planning to solve the terraform instability problem?

  2. Could you advise on how to achieve “infrastructure as code” using aws eksctl without using terraform?

  3. About using local storage

    During the EKS upgrade, data in the local storage will be lost due to the node reconstruction. When the node reconstruction occurs, you need to migrate data in TiKV.

    If I upgrade my EKS cluster, can I do so without stopping the service?

    What does “migrate data in TiKV” mean? Taking a TiKV node offline within the same TiKV cluster and migrating its data to a new node?

  1. Neither the GCP nor the AWS documentation uses terraform now.
    Do you plan to discard the existing terraform code (e.g. aws main.tf, gcp main.tf)?
    Or are you planning to solve the terraform instability problem?

We do not maintain the terraform scripts anymore, but we leave them in the repo in case some community members want to use or improve them.

  2. Could you advise on how to achieve “infrastructure as code” using aws eksctl without using terraform?

Actually, we would like to focus on the functionalities of TiDB Operator. The setup of Kubernetes is not our focus.

  3. About using local storage

During the EKS upgrade, data in the local storage will be lost due to the node reconstruction. When the node reconstruction occurs, you need to migrate data in TiKV.
If I upgrade my EKS cluster, can I do so without stopping the service? And what does “migrate data in TiKV” mean? Taking a TiKV node offline within the same TiKV cluster and migrating its data to a new node?

Local storage is not recommended: if you upgrade your EKS cluster nodes, you will lose all the data on instance stores. Please use EBS instead; the TiKV team is also working on performance optimization with EBS.

Thank you very much.
I have understood the first two questions.
Regarding question 3, could you share further details about the “performance optimization with EBS”?

Sorry for the confusion, I mean that TiKV team is working to improve the performance of TiKV running on EBS.

Thank you very much.

Next question:

  1. How to customize the node instance specifications (for this)?
  2. How to specify the EBS size for EKS nodes? Do I have to define launch-templates?
  3. Do I have to define PVC & PV myself? Or is that TiDB’s job?
  4. Which of the following do you recommend for the EKS cluster nodes? EKS managed node groups, self-managed nodes, or AWS Fargate?
  1. How to customize the node instance specifications (for this)?

You can follow the eksctl docs here to set the instance type for each node group.
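
A sketch of what that looks like in an eksctl config file (the group names, instance types, and counts are illustrative, not values from the official doc):

cat <<'EOF' > cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-tidb-cluster
  region: ap-northeast-1
nodeGroups:
  - name: tikv                  # one group per workload, sized independently
    instanceType: r5.2xlarge
    desiredCapacity: 3
  - name: tidb
    instanceType: c5.2xlarge
    desiredCapacity: 2
EOF
eksctl create cluster -f cluster.yaml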

  2. How to specify the EBS size for EKS nodes? Do I have to define launch-templates?

No, you only need to specify the storage request in the TidbCluster yaml, then PVC/PV will be created.
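
(A minimal sketch of such a TidbCluster; the image versions, replica counts, sizes, and the gp2 storageClassName are placeholders, not the official example:)

kubectl -n tidb-cluster apply -f - <<'EOF'
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: basic
spec:
  version: v4.0.0
  timezone: UTC
  pvReclaimPolicy: Retain
  pd:
    baseImage: pingcap/pd
    replicas: 3
    requests:
      storage: 10Gi
    config: {}
  tikv:
    baseImage: pingcap/tikv
    replicas: 3
    requests:
      storage: 100Gi            # the operator creates a PVC of this size per TiKV pod
    storageClassName: gp2
    config: {}
  tidb:
    baseImage: pingcap/tidb
    replicas: 2
    service:
      type: LoadBalancer
    config: {}
EOF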

  3. Do I have to define PVC & PV myself? Or is that TiDB’s job?

No need. You just need to define your own storageClass on EKS.
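
(For example, a gp2-backed class; the name ebs-gp2 and the parameters are assumptions:)

kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp2
provisioner: kubernetes.io/aws-ebs   # in-tree EBS provisioner
parameters:
  type: gp2
  fsType: ext4
volumeBindingMode: WaitForFirstConsumer
EOF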

  4. Which of the following do you recommend for the EKS cluster nodes? EKS managed node groups, self-managed nodes, or AWS Fargate?

I think you can go with the procedure in our document if you do not have any preference.

Thank you very much.

By the way,
judging from part of the code, the eks module seems to use low-level APIs such as aws_autoscaling_group and aws_launch_configuration (eg).

If it used aws_eks_cluster, aws_eks_node_group, or other high-level APIs, it might be more stable.

@Hacker_sAKN3wef
I was able to build my cluster, but the EXTERNAL-IP of the basic-tidb svc is stuck at <pending>.
How should I troubleshoot this?

kubectl get svc basic-tidb -n tidb-cluster
NAME         TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                          AGE
basic-tidb   LoadBalancer   172.20.97.137   <pending>     4000:32069/TCP,10080:32646/TCP   3h23m
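
(One generic way to check: the Service's events usually explain why the load balancer cannot be provisioned:)

kubectl describe svc basic-tidb -n tidb-cluster | tail -n 10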

Found the cause: each node reserves a large pool of IP addresses for pod networking, which used up the subnet's free IPs:
https://docs.aws.amazon.com/eks/latest/userguide/pod-networking.html

:+1:

@Hacker_sAKN3wef
How do I expose the tidb-dashboard service as a LoadBalancer (like basic-grafana)?

kubectl get svc -n tidb-cluster
NAME                     TYPE           CLUSTER-IP       EXTERNAL-IP                                                                          PORT(S)                          AGE
basic-discovery          ClusterIP      172.20.241.183   <none>                                                                               10261/TCP,10262/TCP              18h
basic-grafana            LoadBalancer   172.20.244.6     a**1-1**2.ap-northeast-1.elb.amazonaws.com         3000:32577/TCP                   18h
basic-monitor-reloader   NodePort       172.20.242.120   <none>                                                                               9089:31824/TCP                   18h
basic-pd                 ClusterIP      172.20.104.142   <none>                                                                               2379/TCP                         18h
basic-pd-peer            ClusterIP      None             <none>                                                                               2380/TCP                         18h
basic-prometheus         NodePort       172.20.39.19     <none>                                                                               9090:30979/TCP                   18h
basic-tidb               LoadBalancer   172.20.116.81    ae0**7-fe5**7.elb.ap-northeast-1.amazonaws.com   4000:32298/TCP,10080:31053/TCP   17h
basic-tidb-peer          ClusterIP      None             <none>                                                                               10080/TCP                        17h
basic-tikv-peer          ClusterIP      None             <none>                                                                               20160/TCP                        18h

PS: I tried this port-forward method, but was redirected to http://basic-pd-2.basic-pd-peer.tidb-cluster.svc:2379/dashboard/

kubectl port-forward svc/basic-discovery -n tidb-cluster 10262:10262
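
(Not an official answer, but a workaround sketch: TiDB Dashboard is served by PD, and the redirect above names the PD member that currently owns it (basic-pd-2), so forwarding that pod directly avoids the unresolvable in-cluster hostname:)

kubectl port-forward pod/basic-pd-2 -n tidb-cluster 2379:2379
# then open http://127.0.0.1:2379/dashboard in a browser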

@Hacker_sAKN3wef
Hi, I have a few questions.
I noticed that you recommend using EBS as TiKV storage on EKS.
Performance-wise, what is the difference between using EBS and a local disk for TiKV?

This topic was automatically closed 1 minute after the last reply. New replies are no longer allowed.