Deployment on K8s in an IPv4 environment fails: the PD container keeps restarting

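[TiDB environment] Test / PoC
[TiDB version] TiDB v6.5.8, TiDB Operator v1.5.2
[Steps to reproduce] Installation and deployment
[Problem: symptoms and impact] After installation on K8s, PD restarts continuously
[Resource configuration]
[Attachments: screenshots / logs / monitoring]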

docker Version: 20.10.2
k8s Version: v1.23.17

The cluster was deployed on K8s with the default YAML configuration files.

While the PD pod is shown as Running, its IP address disappears:

node1:/# kubectl -n name-space exec -ti basic-pd-1 -- sh
/ # ifconfig
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:10 errors:0 dropped:0 overruns:0 frame:0
TX packets:10 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:500 (500.0 B) TX bytes:500 (500.0 B)

/ # command terminated with exit code 137
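
A note on the last line of the session above: by the usual shell convention, an exit status above 128 means the process was ended by a signal whose number is the status minus 128, so 137 corresponds to signal 9 (SIGKILL). This suggests the shell was killed from outside rather than exiting on its own. A minimal Python sketch of that arithmetic follows; the helper name describe_exit_code is hypothetical and only for illustration.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret an exit status using the 128 + signal-number convention.

    Hypothetical helper for illustration; not part of kubectl or TiDB.
    """
    if code > 128:
        signum = code - 128
        try:
            # Map the signal number to its symbolic name, e.g. 9 -> SIGKILL.
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


# Exit status reported by the exec session above.
print(describe_exit_code(137))
```

Note that SIGKILL alone does not say who sent it: the kubelet tearing down the container and the kernel OOM killer both end a process this way.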

The PD log is as follows:

Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name: basic-pd-1.basic-pd-peer.name-space.svc
Address 1: 100.74.135.23
nslookup domain basic-pd-1.basic-pd-peer.name-space.svc.svc success
starting pd-server …
/pd-server --data-dir=/var/lib/pd --name=basic-pd-1 --peer-urls=http://0.0.0.0:2380 --advertise-peer-urls=http://basic-pd-1.basic-pd-peer.name-space.svc:2380 --client-urls=http://0.0.0.0:2379 --advertise-client-urls=http://basic-pd-1.basic-pd-peer.name-space.svc:2379 --config=/etc/pd/pd.toml
[2024/06/19 07:39:57.435 +00:00] [INFO] [util.go:43] [“Welcome to Placement Driver (PD)”]
[2024/06/19 07:39:57.435 +00:00] [INFO] [util.go:44] [PD] [release-version=v6.5.8]
[2024/06/19 07:39:57.435 +00:00] [INFO] [util.go:45] [PD] [edition=Community]
[2024/06/19 07:39:57.435 +00:00] [INFO] [util.go:46] [PD] [git-hash=4506d63ba4fba7123ecc8277da7ef5f635efee90]
[2024/06/19 07:39:57.435 +00:00] [INFO] [util.go:47] [PD] [git-branch=heads/refs/tags/v6.5.8]
[2024/06/19 07:39:57.435 +00:00] [INFO] [util.go:48] [PD] [utc-build-time=“2024-01-25 10:03:20”]
[2024/06/19 07:39:57.435 +00:00] [INFO] [metricutil.go:83] [“disable Prometheus push client”]
[2024/06/19 07:39:57.435 +00:00] [INFO] [server.go:253] [“PD Config”] [config=“{"client-urls":"http://0.0.0.0:2379","peer-urls":"http://0.0.0.0:2380","advertise-client-urls":"http://basic-pd-1.basic-pd-peer.name-space.svc:2379","advertise-peer-urls":"http://basic-pd-1.basic-pd-peer.name-space.svc:2380","name":"basic-pd-1","data-dir":"/var/lib/pd","force-new-cluster":false,"enable-grpc-gateway":true,"initial-cluster":"basic-pd-1=http://basic-pd-1.basic-pd-peer.name-space.svc:2380","initial-cluster-state":"new","initial-cluster-token":"pd-cluster","join":"","lease":3,"log":{"level":"info","format":"text","disable-timestamp":false,"file":{"filename":"","max-size":0,"max-days":0,"max-backups":0},"development":false,"disable-caller":false,"disable-stacktrace":false,"disable-error-verbose":true,"sampling":null,"error-output-path":""},"tso-save-interval":"3s","tso-update-physical-interval":"50ms","enable-local-tso":false,"metric":{"job":"basic-pd-1","address":"","interval":"15s"},"schedule":{"max-snapshot-count":64,"max-pending-peer-count":64,"max-merge-region-size":20,"max-merge-region-keys":0,"split-merge-interval":"1h0m0s","swtich-witness-interval":"1h0m0s","enable-one-way-merge":"false","enable-cross-table-merge":"true","patrol-region-interval":"10ms","max-store-down-time":"30m0s","max-store-preparing-time":"48h0m0s","leader-schedule-limit":4,"leader-schedule-policy":"count","region-schedule-limit":2048,"replica-schedule-limit":64,"merge-schedule-limit":8,"hot-region-schedule-limit":4,"hot-region-cache-hits-threshold":3,"store-limit":{},"tolerant-size-ratio":0,"low-space-ratio":0.8,"high-space-ratio":0.7,"region-score-formula-version":"v2","scheduler-max-waiting-operator":5,"enable-remove-down-replica":"true","enable-replace-offline-replica":"true","enable-make-up-replica":"true","enable-remove-extra-replica":"true","enable-location-replacement":"true","enable-debug-metrics":"false","enable-joint-consensus":"true","enable-tikv-split-region":"true","schedulers-v2":[{"ty
pe":"balance-region","args":null,"disable":false,"args-payload":""},{"type":"balance-leader","args":null,"disable":false,"args-payload":""},{"type":"hot-region","args":null,"disable":false,"args-payload":""},{"type":"split-bucket","args":null,"disable":false,"args-payload":""}],"schedulers-payload":null,"store-limit-mode":"manual","hot-regions-write-interval":"10m0s","hot-regions-reserved-days":7,"enable-diagnostic":"false","enable-witness":"false"},"replication":{"max-replicas":3,"location-labels":"","strictly-match-label":"false","enable-placement-rules":"true","enable-placement-rules-cache":"false","isolation-level":""},"pd-server":{"use-region-storage":"true","max-gap-reset-ts":"24h0m0s","key-type":"table","runtime-services":"","metric-storage":"","dashboard-address":"auto","trace-region-flow":"true","flow-round-by-digit":3,"min-resolved-ts-persistence-interval":"1s"},"cluster-version":"0.0.0","labels":{},"quota-backend-bytes":"8GiB","auto-compaction-mode":"periodic","auto-compaction-retention-v2":"1h","TickInterval":"500ms","ElectionInterval":"3s","PreVote":true,"max-request-bytes":157286400,"security":{"cacert-path":"","cert-path":"","key-path":"","cert-allowed-cn":null,"SSLCABytes":null,"SSLCertBytes":null,"SSLKEYBytes":null,"redact-info-log":false,"encryption":{"data-encryption-method":"plaintext","data-key-rotation-period":"168h0m0s","master-key":{"type":"plaintext","key-id":"","region":"","endpoint":"","path":""}}},"label-property":null,"WarningMsgs":null,"DisableStrictReconfigCheck":false,"HeartbeatStreamBindInterval":"1m0s","LeaderPriorityCheckInterval":"1m0s","dashboard":{"tidb-cacert-path":"","tidb-cert-path":"","tidb-key-path":"","public-path-prefix":"","internal-proxy":false,"enable-telemetry":false,"enable-experimental":false},"replication-mode":{"replication-mode":"majority","dr-auto-sync":{"label-key":"","primary":"","dr":"","primary-replicas":0,"dr-replicas":0,"wait-store-timeout":"1m0s","pause-region-split":"false"}}}”]
[2024/06/19 07:39:57.443 +00:00] [INFO] [server.go:228] [“register REST path”] [path=/pd/api/v1]
[2024/06/19 07:39:57.443 +00:00] [INFO] [server.go:228] [“register REST path”] [path=/pd/api/v2/]
[2024/06/19 07:39:57.443 +00:00] [INFO] [server.go:228] [“register REST path”] [path=/autoscaling]
[2024/06/19 07:39:57.443 +00:00] [INFO] [distro.go:51] [“Using distribution strings”] [strings={}]
[2024/06/19 07:39:57.445 +00:00] [INFO] [server.go:228] [“register REST path”] [path=/dashboard/api/]
[2024/06/19 07:39:57.445 +00:00] [INFO] [server.go:228] [“register REST path”] [path=/dashboard/]
[2024/06/19 07:39:57.445 +00:00] [INFO] [etcd.go:117] [“configuring peer listeners”] [listen-peer-urls=“[http://0.0.0.0:2380]”]
[2024/06/19 07:39:57.445 +00:00] [INFO] [etcd.go:127] [“configuring client listeners”] [listen-client-urls=“[http://0.0.0.0:2379]”]
[2024/06/19 07:39:57.446 +00:00] [INFO] [etcd.go:611] [“pprof is enabled”] [path=/debug/pprof]
[2024/06/19 07:39:57.446 +00:00] [INFO] [systimemon.go:30] [“start system time monitor”]
[2024/06/19 07:39:57.446 +00:00] [INFO] [etcd.go:305] [“starting an etcd server”] [etcd-version=3.4.21] [git-sha=“Not provided (use ./build instead of go build)”] [go-version=go1.19.13] [go-os=linux] [go-arch=amd64] [max-cpu-set=16] [max-cpu-available=16] [member-initialized=true] [name=basic-pd-1] [data-dir=/var/lib/pd] [wal-dir=] [wal-dir-dedicated=] [member-dir=/var/lib/pd/member] [force-new-cluster=false] [heartbeat-interval=500ms] [election-timeout=3s] [initial-election-tick-advance=true] [snapshot-count=100000] [snapshot-catchup-entries=5000] [initial-advertise-peer-urls=“[http://basic-pd-1.basic-pd-peer.name-space.svc:2380]”] [listen-peer-urls=“[http://0.0.0.0:2380]”] [advertise-client-urls=“[http://basic-pd-1.basic-pd-peer.name-space.svc:2379]”] [listen-client-urls=“[http://0.0.0.0:2379]”] [listen-metrics-urls=“”] [cors=“[]“] [host-whitelist=”[]”] [initial-cluster=] [initial-cluster-state=new] [initial-cluster-token=] [quota-backend-bytes=8589934592] [max-request-bytes=157286400] [max-concurrent-streams=4294967295] [pre-vote=true] [initial-corrupt-check=false] [corrupt-check-time-interval=0s] [auto-compaction-mode=periodic] [auto-compaction-retention=1h0m0s] [auto-compaction-interval=1h0m0s] [discovery-url=] [discovery-proxy=]
[2024/06/19 07:39:57.446 +00:00] [WARN] [server.go:297] [“exceeded recommended request limit”] [max-request-bytes=157286400] [max-request-size=“157 MB”] [recommended-request-bytes=10485760] [recommended-request-size=“10 MB”]
2024-06-19 07:39:57.446236 W | pkg/fileutil: check file permission: directory “/var/lib/pd” exist, but the permission is “drwxr-xr-x”. The recommended permission is “-rwx------” to prevent possible unprivileged access to the data.
[2024/06/19 07:39:57.453 +00:00] [INFO] [backend.go:80] [“opened backend db”] [path=/var/lib/pd/member/snap/db] [took=7.123285ms]
[2024/06/19 07:39:57.478 +00:00] [INFO] [raft.go:586] [“restarting local member”] [cluster-id=dd0f8a758c7a4da3] [local-member-id=1ceea2c50a83b82a] [commit-index=5715]
[2024/06/19 07:39:57.479 +00:00] [INFO] [raft.go:1523] [“1ceea2c50a83b82a switched to configuration voters=()”]
[2024/06/19 07:39:57.479 +00:00] [INFO] [raft.go:706] [“1ceea2c50a83b82a became follower at term 29”]
[2024/06/19 07:39:57.479 +00:00] [INFO] [raft.go:389] [“newRaft 1ceea2c50a83b82a [peers: , term: 29, commit: 5715, applied: 0, lastindex: 5715, lastterm: 29]”]
[2024/06/19 07:39:57.479 +00:00] [WARN] [store.go:1379] [“simple token is not cryptographically signed”]
[2024/06/19 07:39:57.490 +00:00] [INFO] [quota.go:126] [“enabled backend quota”] [quota-name=v3-applier] [quota-size-bytes=8589934592] [quota-size=“8.6 GB”]
[2024/06/19 07:39:57.491 +00:00] [INFO] [server.go:816] [“starting etcd server”] [local-member-id=1ceea2c50a83b82a] [local-server-version=3.4.21] [cluster-version=to_be_decided]
[2024/06/19 07:39:57.491 +00:00] [INFO] [server.go:704] [“starting initial election tick advance”] [election-ticks=6]
[2024/06/19 07:39:57.491 +00:00] [INFO] [raft.go:1523] [“1ceea2c50a83b82a switched to configuration voters=(2084782644687779882)”]
[2024/06/19 07:39:57.491 +00:00] [INFO] [cluster.go:392] [“added member”] [cluster-id=dd0f8a758c7a4da3] [local-member-id=1ceea2c50a83b82a] [added-peer-id=1ceea2c50a83b82a] [added-peer-peer-urls=“[http://basic-pd-1.basic-pd-peer.name-space.svc:2380]”]
[2024/06/19 07:39:57.492 +00:00] [INFO] [cluster.go:558] [“set initial cluster version”] [cluster-id=dd0f8a758c7a4da3] [local-member-id=1ceea2c50a83b82a] [cluster-version=3.4]
[2024/06/19 07:39:57.492 +00:00] [INFO] [capability.go:76] [“enabled capabilities for version”] [cluster-version=3.4]
[2024/06/19 07:39:57.492 +00:00] [INFO] [raft.go:1523] [“1ceea2c50a83b82a switched to configuration voters=(2084782644687779882 11509752368320045231)”]
[2024/06/19 07:39:57.492 +00:00] [INFO] [cluster.go:392] [“added member”] [cluster-id=dd0f8a758c7a4da3] [local-member-id=1ceea2c50a83b82a] [added-peer-id=9fbadacc366d60af] [added-peer-peer-urls=“[http://basic-pd-2.basic-pd-peer.name-space.svc:2380]”]
[2024/06/19 07:39:57.492 +00:00] [INFO] [peer.go:128] [“starting remote peer”] [remote-peer-id=9fbadacc366d60af]
[2024/06/19 07:39:57.492 +00:00] [INFO] [pipeline.go:71] [“started HTTP pipelining with remote peer”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=9fbadacc366d60af]
[2024/06/19 07:39:57.492 +00:00] [INFO] [stream.go:166] [“started stream writer with remote peer”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=9fbadacc366d60af]
[2024/06/19 07:39:57.493 +00:00] [INFO] [stream.go:166] [“started stream writer with remote peer”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=9fbadacc366d60af]
[2024/06/19 07:39:57.493 +00:00] [INFO] [etcd.go:585] [“serving peer traffic”] [address=“[::]:2380”]
[2024/06/19 07:39:57.493 +00:00] [INFO] [etcd.go:247] [“now serving peer/client/metrics”] [local-member-id=1ceea2c50a83b82a] [initial-advertise-peer-urls=“[http://basic-pd-1.basic-pd-peer.name-space.svc:2380]”] [listen-peer-urls=“[http://0.0.0.0:2380]”] [advertise-client-urls=“[http://basic-pd-1.basic-pd-peer.name-space.svc:2379]”] [listen-client-urls=“[http://0.0.0.0:2379]”] [listen-metrics-urls=“”]
[2024/06/19 07:39:57.493 +00:00] [INFO] [peer.go:134] [“started remote peer”] [remote-peer-id=9fbadacc366d60af]
[2024/06/19 07:39:57.493 +00:00] [INFO] [transport.go:327] [“added remote peer”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=9fbadacc366d60af] [remote-peer-urls=“[http://basic-pd-2.basic-pd-peer.name-space.svc:2380]”]
[2024/06/19 07:39:57.493 +00:00] [INFO] [stream.go:406] [“started stream reader with remote peer”] [stream-reader-type=“stream MsgApp v2”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=9fbadacc366d60af]
[2024/06/19 07:39:57.493 +00:00] [INFO] [stream.go:406] [“started stream reader with remote peer”] [stream-reader-type=“stream Message”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=9fbadacc366d60af]
[2024/06/19 07:39:57.493 +00:00] [INFO] [raft.go:1523] [“1ceea2c50a83b82a switched to configuration voters=(2084782644687779882 4137953763183718604 11509752368320045231)”]
[2024/06/19 07:39:57.493 +00:00] [INFO] [cluster.go:392] [“added member”] [cluster-id=dd0f8a758c7a4da3] [local-member-id=1ceea2c50a83b82a] [added-peer-id=396cf706178e70cc] [added-peer-peer-urls=“[http://basic-pd-0.basic-pd-peer.name-space.svc:2380]”]
[2024/06/19 07:39:57.493 +00:00] [INFO] [peer.go:128] [“starting remote peer”] [remote-peer-id=396cf706178e70cc]
[2024/06/19 07:39:57.493 +00:00] [INFO] [pipeline.go:71] [“started HTTP pipelining with remote peer”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=396cf706178e70cc]
[2024/06/19 07:39:57.494 +00:00] [INFO] [stream.go:166] [“started stream writer with remote peer”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=396cf706178e70cc]
[2024/06/19 07:39:57.494 +00:00] [INFO] [stream.go:166] [“started stream writer with remote peer”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=396cf706178e70cc]
[2024/06/19 07:39:57.494 +00:00] [INFO] [peer.go:134] [“started remote peer”] [remote-peer-id=396cf706178e70cc]
[2024/06/19 07:39:57.494 +00:00] [INFO] [stream.go:406] [“started stream reader with remote peer”] [stream-reader-type=“stream MsgApp v2”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=396cf706178e70cc]
[2024/06/19 07:39:57.494 +00:00] [INFO] [stream.go:406] [“started stream reader with remote peer”] [stream-reader-type=“stream Message”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=396cf706178e70cc]
[2024/06/19 07:39:57.494 +00:00] [INFO] [transport.go:327] [“added remote peer”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=396cf706178e70cc] [remote-peer-urls=“[http://basic-pd-0.basic-pd-peer.name-space.svc:2380]”]
[2024/06/19 07:39:57.501 +00:00] [INFO] [peer_status.go:51] [“peer became active”] [peer-id=396cf706178e70cc]
[2024/06/19 07:39:57.501 +00:00] [INFO] [stream.go:425] [“established TCP streaming connection with remote peer”] [stream-reader-type=“stream Message”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=396cf706178e70cc]
[2024/06/19 07:39:57.501 +00:00] [INFO] [stream.go:425] [“established TCP streaming connection with remote peer”] [stream-reader-type=“stream MsgApp v2”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=396cf706178e70cc]
[2024/06/19 07:39:57.525 +00:00] [INFO] [stream.go:250] [“set message encoder”] [from=1ceea2c50a83b82a] [to=1ceea2c50a83b82a] [stream-type=“stream Message”]
[2024/06/19 07:39:57.525 +00:00] [WARN] [stream.go:277] [“established TCP streaming connection with remote peer”] [stream-writer-type=“stream Message”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=396cf706178e70cc]
[2024/06/19 07:39:57.526 +00:00] [INFO] [stream.go:250] [“set message encoder”] [from=1ceea2c50a83b82a] [to=1ceea2c50a83b82a] [stream-type=“stream MsgApp v2”]
[2024/06/19 07:39:57.526 +00:00] [WARN] [stream.go:277] [“established TCP streaming connection with remote peer”] [stream-writer-type=“stream MsgApp v2”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=396cf706178e70cc]
[2024/06/19 07:39:59.868 +00:00] [INFO] [raft.go:902] [“1ceea2c50a83b82a [logterm: 29, index: 5715, vote: 9fbadacc366d60af] rejected MsgPreVote from 396cf706178e70cc [logterm: 27, index: 5698] at term 29”]
[2024/06/19 07:40:00.479 +00:00] [INFO] [raft.go:929] [“1ceea2c50a83b82a is starting a new election at term 29”]
[2024/06/19 07:40:00.479 +00:00] [INFO] [raft.go:735] [“1ceea2c50a83b82a became pre-candidate at term 29”]
[2024/06/19 07:40:00.479 +00:00] [INFO] [raft.go:830] [“1ceea2c50a83b82a received MsgPreVoteResp from 1ceea2c50a83b82a at term 29”]
[2024/06/19 07:40:00.479 +00:00] [INFO] [raft.go:817] [“1ceea2c50a83b82a [logterm: 29, index: 5715] sent MsgPreVote request to 396cf706178e70cc at term 29”]
[2024/06/19 07:40:00.479 +00:00] [INFO] [raft.go:817] [“1ceea2c50a83b82a [logterm: 29, index: 5715] sent MsgPreVote request to 9fbadacc366d60af at term 29”]
[2024/06/19 07:40:00.480 +00:00] [INFO] [raft.go:830] [“1ceea2c50a83b82a received MsgPreVoteResp from 396cf706178e70cc at term 29”]
[2024/06/19 07:40:00.480 +00:00] [INFO] [raft.go:1295] [“1ceea2c50a83b82a has received 2 MsgPreVoteResp votes and 0 vote rejections”]
[2024/06/19 07:40:00.480 +00:00] [INFO] [raft.go:719] [“1ceea2c50a83b82a became candidate at term 30”]
[2024/06/19 07:40:00.480 +00:00] [INFO] [raft.go:830] [“1ceea2c50a83b82a received MsgVoteResp from 1ceea2c50a83b82a at term 30”]
[2024/06/19 07:40:00.480 +00:00] [INFO] [raft.go:817] [“1ceea2c50a83b82a [logterm: 29, index: 5715] sent MsgVote request to 396cf706178e70cc at term 30”]
[2024/06/19 07:40:00.480 +00:00] [INFO] [raft.go:817] [“1ceea2c50a83b82a [logterm: 29, index: 5715] sent MsgVote request to 9fbadacc366d60af at term 30”]
[2024/06/19 07:40:00.482 +00:00] [INFO] [raft.go:830] [“1ceea2c50a83b82a received MsgVoteResp from 396cf706178e70cc at term 30”]
[2024/06/19 07:40:00.482 +00:00] [INFO] [raft.go:1295] [“1ceea2c50a83b82a has received 2 MsgVoteResp votes and 0 vote rejections”]
[2024/06/19 07:40:00.482 +00:00] [INFO] [raft.go:771] [“1ceea2c50a83b82a became leader at term 30”]
[2024/06/19 07:40:00.482 +00:00] [INFO] [node.go:327] [“raft.node: 1ceea2c50a83b82a elected leader 1ceea2c50a83b82a at term 30”]
[2024/06/19 07:40:00.484 +00:00] [INFO] [server.go:2069] [“published local member to cluster through raft”] [local-member-id=1ceea2c50a83b82a] [local-member-attributes=“{Name:basic-pd-1 ClientURLs:[http://basic-pd-1.basic-pd-peer.name-space.svc:2379]}”] [request-path=/0/members/1ceea2c50a83b82a/attributes] [cluster-id=dd0f8a758c7a4da3] [publish-timeout=11s]
[2024/06/19 07:40:00.484 +00:00] [INFO] [server.go:335] [“create etcd v3 client”] [endpoints=“[http://basic-pd-1.basic-pd-peer.name-space.svc:2379]”] [cert=“{"cacert-path":"","cert-path":"","key-path":"","cert-allowed-cn":null,"SSLCABytes":null,"SSLCertBytes":null,"SSLKEYBytes":null,"redact-info-log":false,"encryption":{"data-encryption-method":"plaintext","data-key-rotation-period":"168h0m0s","master-key":{"type":"plaintext","key-id":"","region":"","endpoint":"","path":""}}}”]
[2024/06/19 07:40:00.485 +00:00] [INFO] [serve.go:145] [“serving client traffic insecurely; this is strongly discouraged!”] [address=“[::]:2379”]
[2024/06/19 07:40:00.490 +00:00] [INFO] [server.go:400] [“init cluster id”] [cluster-id=7382110550679565481]
[2024/06/19 07:40:00.491 +00:00] [WARN] [cluster_util.go:315] [“failed to reach the peer URL”] [address=http://basic-pd-2.basic-pd-peer.name-space.svc:2380/version] [remote-member-id=9fbadacc366d60af] [error=“Get "http://basic-pd-2.basic-pd-peer.name-space.svc:2380/version\”: dial tcp 100.108.11.195:2380: connect: connection refused"]
[2024/06/19 07:40:00.491 +00:00] [WARN] [cluster_util.go:168] [“failed to get version”] [remote-member-id=9fbadacc366d60af] [error=“Get "http://basic-pd-2.basic-pd-peer.name-space.svc:2380/version\”: dial tcp 100.108.11.195:2380: connect: connection refused"]
[2024/06/19 07:40:00.496 +00:00] [INFO] [allocator_manager.go:262] [“delete the dc-location key previously written in etcd”] [server-id=2084782644687779882]
[2024/06/19 07:40:00.504 +00:00] [INFO] [history_buffer.go:147] [“start from history index”] [start-index=1700]
[2024/06/19 07:40:00.511 +00:00] [INFO] [server.go:1489] [“start to campaign pd leader”] [campaign-pd-leader-name=basic-pd-1]
[2024/06/19 07:40:00.514 +00:00] [INFO] [lease.go:66] [“lease granted”] [lease-id=4047205748565255964] [lease-timeout=3] [purpose=“pd leader election”]
[2024/06/19 07:40:00.515 +00:00] [INFO] [leadership.go:124] [“check campaign resp”] [resp=“{"header":{"cluster_id":15929102644505365923,"member_id":2084782644687779882,"revision":5211,"raft_term":30},"succeeded":true,"responses":[{"Response":{"ResponsePut":{"header":{"revision":5211}}}}]}”]
[2024/06/19 07:40:00.516 +00:00] [INFO] [leadership.go:133] [“write leaderData to leaderPath ok”] [leaderPath=/pd/7382110550679565481/leader] [purpose=“pd leader election”]
[2024/06/19 07:40:00.516 +00:00] [INFO] [server.go:1515] [“campaign pd leader ok”] [campaign-pd-leader-name=basic-pd-1]
[2024/06/19 07:40:00.516 +00:00] [INFO] [server.go:1522] [“initializing the global TSO allocator”]
[2024/06/19 07:40:00.516 +00:00] [INFO] [lease.go:137] [“start lease keep alive worker”] [interval=1s] [purpose=“pd leader election”]
[2024/06/19 07:40:00.519 +00:00] [INFO] [tso.go:227] [“sync and save timestamp”] [last=2024/06/19 07:39:34.414 +00:00] [save=2024/06/19 07:40:03.517 +00:00] [next=2024/06/19 07:40:00.517 +00:00]
[2024/06/19 07:40:00.521 +00:00] [INFO] [server.go:1639] [“server enable region storage”]
[2024/06/19 07:40:00.530 +00:00] [INFO] [cluster.go:374] [“load stores”] [count=3] [cost=5.587ms]
[2024/06/19 07:40:00.531 +00:00] [INFO] [cluster.go:385] [“load regions”] [count=230] [cost=1.330871ms]
[2024/06/19 07:40:00.539 +00:00] [INFO] [coordinator.go:312] [“coordinator starts to collect cluster information”]
[2024/06/19 07:40:00.539 +00:00] [INFO] [server.go:212] [“establish sync region stream”] [requested-server=basic-pd-0] [url=http://basic-pd-0.basic-pd-peer.name-space.svc:2379]
[2024/06/19 07:40:00.539 +00:00] [INFO] [server.go:230] [“requested server has already in sync with server”] [requested-server=basic-pd-0] [server=basic-pd-1] [last-index=1700]
[2024/06/19 07:40:00.541 +00:00] [INFO] [id.go:174] [“idAllocator allocates a new id”] [new-end=13000] [new-base=12000] [label=idalloc] [check-curr-end=true]
[2024/06/19 07:40:00.541 +00:00] [INFO] [util.go:79] [“load cluster version”] [cluster-version=6.5.8]
[2024/06/19 07:40:00.541 +00:00] [INFO] [server.go:1573] [“PD cluster leader is ready to serve”] [pd-leader-name=basic-pd-1]
[2024/06/19 07:40:00.543 +00:00] [INFO] [store_config.go:204] [“sync the store config successful”] [store-address=100.74.135.43:20180] [store-config=“{\n "coprocessor": {\n "region-max-size": "144MiB",\n "region-split-size": "96MiB",\n "region-max-keys": 1440000,\n "region-split-keys": 960000,\n "enable-region-bucket": false,\n "region-bucket-size": "96MiB"\n }\n}”]
[2024/06/19 07:40:00.589 +00:00] [INFO] [main.go:124] [“Got signal to exit”] [signal=terminated]
[2024/06/19 07:40:00.589 +00:00] [INFO] [server.go:504] [“closing server”]
[2024/06/19 07:40:00.589 +00:00] [INFO] [server.go:629] [“server is closed, exit allocator loop”]
[2024/06/19 07:40:00.589 +00:00] [INFO] [server.go:128] [“region syncer has been stopped”]
[2024/06/19 07:40:00.589 +00:00] [INFO] [lease.go:165] [“stop lease keep alive worker”] [purpose=“pd leader election”]
[2024/06/19 07:40:00.589 +00:00] [INFO] [server.go:1619] [“server is closed, exit etcd leader loop”]
[2024/06/19 07:40:00.589 +00:00] [INFO] [cluster.go:314] [“sync store config job is stopped”]
[2024/06/19 07:40:00.589 +00:00] [INFO] [server.go:1602] [“server is closed”]
[2024/06/19 07:40:00.589 +00:00] [INFO] [cluster.go:2281] [“min resolved ts background jobs has been stopped”]
[2024/06/19 07:40:00.589 +00:00] [INFO] [cluster.go:469] [“update store stats background jobs has been stopped”]
[2024/06/19 07:40:00.589 +00:00] [INFO] [coordinator.go:321] [“coordinator stops running”]
[2024/06/19 07:40:00.589 +00:00] [INFO] [cluster.go:451] [“statistics background jobs has been stopped”]
[2024/06/19 07:40:00.589 +00:00] [INFO] [coordinator.go:301] [“coordinator is stopping”]
[2024/06/19 07:40:00.589 +00:00] [INFO] [labeler.go:67] [“RegionLabeler GC stopped”]
[2024/06/19 07:40:00.589 +00:00] [INFO] [coordinator.go:303] [“coordinator has been stopped”]
[2024/06/19 07:40:00.589 +00:00] [INFO] [server.go:640] [“server is closed, exist encryption key manager loop”]
[2024/06/19 07:40:00.589 +00:00] [INFO] [cluster.go:410] [“metrics are reset”]
[2024/06/19 07:40:00.589 +00:00] [INFO] [server.go:615] [“server is closed, exit metrics loop”]
[2024/06/19 07:40:00.589 +00:00] [INFO] [cluster.go:433] [“node state check job has been stopped”]
[2024/06/19 07:40:00.589 +00:00] [INFO] [cluster.go:412] [“metrics collection job has been stopped”]
[2024/06/19 07:40:00.601 +00:00] [INFO] [cluster.go:516] [“raftcluster is stopped”]
[2024/06/19 07:40:00.601 +00:00] [INFO] [tso.go:416] [“reset the timestamp in memory”]
[2024/06/19 07:40:00.601 +00:00] [INFO] [server.go:1420] [“server is closed, return pd leader loop”]
[2024/06/19 07:40:00.601 +00:00] [INFO] [etcd.go:369] [“closing etcd server”] [name=basic-pd-1] [data-dir=/var/lib/pd] [advertise-peer-urls=“[http://basic-pd-1.basic-pd-peer.name-space.svc:2380]”] [advertise-client-urls=“[http://basic-pd-1.basic-pd-peer.name-space.svc:2379]”]
[2024/06/19 07:40:00.601 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 100.74.135.10:2379->100.66.209.207:56284: use of closed network connection”]
[2024/06/19 07:40:00.601 +00:00] [INFO] [server.go:1456] [“leadership transfer starting”] [local-member-id=1ceea2c50a83b82a] [current-leader-member-id=1ceea2c50a83b82a] [transferee-member-id=396cf706178e70cc]
[2024/06/19 07:40:00.601 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 100.74.135.10:2379->100.74.135.20:47442: use of closed network connection”]
[2024/06/19 07:40:00.601 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:2379->127.0.0.1:40374: use of closed network connection”]
[2024/06/19 07:40:00.601 +00:00] [INFO] [raft.go:1254] [“1ceea2c50a83b82a [term 30] starts to transfer leadership to 396cf706178e70cc”]
[2024/06/19 07:40:00.601 +00:00] [INFO] [raft.go:1260] [“1ceea2c50a83b82a sends MsgTimeoutNow to 396cf706178e70cc immediately as 396cf706178e70cc already has up-to-date log”]
[2024/06/19 07:40:00.601 +00:00] [WARN] [grpclog.go:60] [“grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0 }. Err :connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: operation was canceled". Reconnecting…”]
[2024/06/19 07:40:00.601 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 100.74.135.10:2379->100.66.209.207:56286: use of closed network connection”]
[2024/06/19 07:40:00.601 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 100.74.135.10:2379->100.108.11.209:34356: use of closed network connection”]
[2024/06/19 07:40:00.602 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 100.74.135.10:2379->100.108.11.209:34372: use of closed network connection”]
[2024/06/19 07:40:00.602 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 100.74.135.10:2379->100.74.135.20:47438: use of closed network connection”]
[2024/06/19 07:40:00.602 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 100.74.135.10:2379->100.108.11.209:34388: use of closed network connection”]
[2024/06/19 07:40:00.615 +00:00] [INFO] [raft.go:865] [“1ceea2c50a83b82a [term: 30] received a MsgVote message with higher term from 396cf706178e70cc [term: 31]”]
[2024/06/19 07:40:00.615 +00:00] [INFO] [raft.go:706] [“1ceea2c50a83b82a became follower at term 31”]
[2024/06/19 07:40:00.615 +00:00] [INFO] [raft.go:966] [“1ceea2c50a83b82a [logterm: 30, index: 5746, vote: 0] cast MsgVote for 396cf706178e70cc [logterm: 30, index: 5746] at term 31”]
[2024/06/19 07:40:00.615 +00:00] [INFO] [node.go:333] [“raft.node: 1ceea2c50a83b82a lost leader 1ceea2c50a83b82a at term 31”]
[2024/06/19 07:40:00.625 +00:00] [INFO] [node.go:327] [“raft.node: 1ceea2c50a83b82a elected leader 396cf706178e70cc at term 31”]
[2024/06/19 07:40:01.102 +00:00] [INFO] [server.go:1477] [“leadership transfer finished”] [local-member-id=1ceea2c50a83b82a] [old-leader-member-id=1ceea2c50a83b82a] [new-leader-member-id=396cf706178e70cc] [took=500.673273ms]
[2024/06/19 07:40:01.102 +00:00] [INFO] [peer.go:333] [“stopping remote peer”] [remote-peer-id=9fbadacc366d60af]
[2024/06/19 07:40:01.102 +00:00] [WARN] [stream.go:301] [“stopped TCP streaming connection with remote peer”] [stream-writer-type=“unknown stream”] [remote-peer-id=9fbadacc366d60af]
[2024/06/19 07:40:01.102 +00:00] [WARN] [stream.go:301] [“stopped TCP streaming connection with remote peer”] [stream-writer-type=“unknown stream”] [remote-peer-id=9fbadacc366d60af]
[2024/06/19 07:40:01.102 +00:00] [INFO] [pipeline.go:86] [“stopped HTTP pipelining with remote peer”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=9fbadacc366d60af]
[2024/06/19 07:40:01.102 +00:00] [INFO] [stream.go:459] [“stopped stream reader with remote peer”] [stream-reader-type=“stream MsgApp v2”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=9fbadacc366d60af]
[2024/06/19 07:40:01.102 +00:00] [INFO] [stream.go:459] [“stopped stream reader with remote peer”] [stream-reader-type=“stream Message”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=9fbadacc366d60af]
[2024/06/19 07:40:01.102 +00:00] [INFO] [peer.go:340] [“stopped remote peer”] [remote-peer-id=9fbadacc366d60af]
[2024/06/19 07:40:01.102 +00:00] [INFO] [peer.go:333] [“stopping remote peer”] [remote-peer-id=396cf706178e70cc]
[2024/06/19 07:40:01.103 +00:00] [WARN] [stream.go:291] [“closed TCP streaming connection with remote peer”] [stream-writer-type=“stream MsgApp v2”] [remote-peer-id=396cf706178e70cc]
[2024/06/19 07:40:01.103 +00:00] [WARN] [stream.go:301] [“stopped TCP streaming connection with remote peer”] [stream-writer-type=“stream MsgApp v2”] [remote-peer-id=396cf706178e70cc]
[2024/06/19 07:40:01.104 +00:00] [WARN] [stream.go:291] [“closed TCP streaming connection with remote peer”] [stream-writer-type=“stream Message”] [remote-peer-id=396cf706178e70cc]
[2024/06/19 07:40:01.104 +00:00] [WARN] [stream.go:301] [“stopped TCP streaming connection with remote peer”] [stream-writer-type=“stream Message”] [remote-peer-id=396cf706178e70cc]
[2024/06/19 07:40:01.104 +00:00] [INFO] [pipeline.go:86] [“stopped HTTP pipelining with remote peer”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=396cf706178e70cc]
[2024/06/19 07:40:01.104 +00:00] [WARN] [stream.go:436] [“lost TCP streaming connection with remote peer”] [stream-reader-type=“stream MsgApp v2”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=396cf706178e70cc] [error=“context canceled”]
[2024/06/19 07:40:01.104 +00:00] [WARN] [peer_status.go:68] [“peer became inactive (message send to peer failed)”] [peer-id=396cf706178e70cc] [error=“failed to read 396cf706178e70cc on stream MsgApp v2 (context canceled)”]
[2024/06/19 07:40:01.104 +00:00] [INFO] [stream.go:459] [“stopped stream reader with remote peer”] [stream-reader-type=“stream MsgApp v2”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=396cf706178e70cc]
[2024/06/19 07:40:01.104 +00:00] [WARN] [stream.go:436] [“lost TCP streaming connection with remote peer”] [stream-reader-type=“stream Message”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=396cf706178e70cc] [error=“context canceled”]
[2024/06/19 07:40:01.104 +00:00] [INFO] [stream.go:459] [“stopped stream reader with remote peer”] [stream-reader-type=“stream Message”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id=396cf706178e70cc]
[2024/06/19 07:40:01.104 +00:00] [INFO] [peer.go:340] [“stopped remote peer”] [remote-peer-id=396cf706178e70cc]
[2024/06/19 07:40:01.105 +00:00] [WARN] [http.go:448] [“failed to find remote peer in cluster”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id-stream-handler=1ceea2c50a83b82a] [remote-peer-id-from=396cf706178e70cc] [cluster-id=dd0f8a758c7a4da3]
[2024/06/19 07:40:01.105 +00:00] [WARN] [http.go:448] [“failed to find remote peer in cluster”] [local-member-id=1ceea2c50a83b82a] [remote-peer-id-stream-handler=1ceea2c50a83b82a] [remote-peer-id-from=396cf706178e70cc] [cluster-id=dd0f8a758c7a4da3]
[2024/06/19 07:40:01.139 +00:00] [INFO] [etcd.go:564] [“stopping serving peer traffic”] [address=“[::]:2380”]
[2024/06/19 07:40:01.139 +00:00] [INFO] [etcd.go:571] [“stopped serving peer traffic”] [address=“[::]:2380”]
[2024/06/19 07:40:01.139 +00:00] [INFO] [etcd.go:373] [“closed etcd server”] [name=basic-pd-1] [data-dir=/var/lib/pd] [advertise-peer-urls=“[http://basic-pd-1.basic-pd-peer.name-space.svc:2380]”] [advertise-client-urls=“[http://basic-pd-1.basic-pd-peer.name-space.svc:2379]”]
[2024/06/19 07:40:01.139 +00:00] [INFO] [manager.go:73] [“exit dashboard loop”]
[2024/06/19 07:40:01.139 +00:00] [INFO] [server.go:543] [“close server”]

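What do you mean by the IP address being lost?

After PD starts, its IP address disappears a short while later. See the ifconfig output above: only 127.0.0.1 is left.
The end result is that PD cannot form a cluster.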

This log line shows that PD received a signal and exited on its own. Check whether Kubernetes did anything to the pod.

Is the network OK?

Got signal to exit: check the environment, for example OOM or Kubernetes scheduling.
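One way to follow up on this suggestion is sketched below. It assumes the namespace ns and pod basic-pd-0 seen in the later logs; the function names are made up for illustration. The exec session in the original post ended with exit code 137, and by shell convention an exit code above 128 means the process was killed by signal (code - 128).

```shell
# Decode an exit code: values above 128 mean "killed by signal (code - 128)".
signal_from_exit_code() {
  if [ "$1" -gt 128 ]; then
    echo $(( $1 - 128 ))
  else
    echo 0
  fi
}

# Print name, last termination reason, and exit code for each container in a pod.
# $1 = namespace, $2 = pod name
pd_last_termination() {
  kubectl -n "$1" get pod "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{" "}{.lastState.terminated.reason}{" "}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# List events that involve the pod, sorted by last timestamp.
# $1 = namespace, $2 = pod name
pd_recent_events() {
  kubectl -n "$1" get events \
    --field-selector "involvedObject.name=$2" \
    --sort-by=.lastTimestamp
}
```

For example, signal_from_exit_code 137 prints 9 (SIGKILL). Running pd_last_termination ns basic-pd-0 would show whether the kubelet recorded a reason such as OOMKilled, and pd_recent_events ns basic-pd-0 would list what Kubernetes did to the pod.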

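It has been a while; I looked at this problem again today.
TiDB Operator is v1.5.2, TiDB is v6.5.8, and Kubernetes is v1.23.17.
PD just keeps restarting: Running -> ContainerCreating -> Pending -> Running.
After some time it does start normally, but with luck that takes 10 minutes, and without luck it is still in this state after several hours.
Debug mode also has no effect: with debug mode enabled, the pod still restarts automatically.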

For comparison, these are the versions I used before:
TiDB Operator v1.4.3, TiDB v6.5.0, Kubernetes v1.20.7

Log from a failed start:

[2024/07/11 12:38:48.620 +00:00] [WARN] [v3_server.go:814] [“waiting for ReadIndex response took too long, retrying”] [sent-request-id=1124651556563639575] [retry-timeout=500ms]
[2024/07/11 12:38:48.620 +00:00] [INFO] [raft.go:1358] [“43c224d21a700f9b no leader at term 1584; dropping index reading msg”]
[2024/07/11 12:38:48.893 +00:00] [INFO] [raft.go:929] [“43c224d21a700f9b is starting a new election at term 1584”]
[2024/07/11 12:38:48.893 +00:00] [INFO] [raft.go:735] [“43c224d21a700f9b became pre-candidate at term 1584”]
[2024/07/11 12:38:48.893 +00:00] [INFO] [raft.go:830] [“43c224d21a700f9b received MsgPreVoteResp from 43c224d21a700f9b at term 1584”]
[2024/07/11 12:38:48.893 +00:00] [INFO] [raft.go:817] [“43c224d21a700f9b [logterm: 1584, index: 113074] sent MsgPreVote request to 51c5dd9dd1f119ae at term 1584”]
[2024/07/11 12:38:48.893 +00:00] [INFO] [raft.go:817] [“43c224d21a700f9b [logterm: 1584, index: 113074] sent MsgPreVote request to 738369e98f51ba86 at term 1584”]
[2024/07/11 12:38:49.120 +00:00] [WARN] [v3_server.go:814] [“waiting for ReadIndex response took too long, retrying”] [sent-request-id=1124651556563639575] [retry-timeout=500ms]
[2024/07/11 12:38:49.664 +00:00] [WARN] [v3_server.go:814] [“waiting for ReadIndex response took too long, retrying”] [sent-request-id=1124651556563639575] [retry-timeout=500ms]
[2024/07/11 12:38:49.665 +00:00] [WARN] [probing_status.go:70] [“prober detected unhealthy status”] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=738369e98f51ba86] [rtt=0s] [error=“dial tcp: lookup basic-pd-2.basic-pd-peer.ns.svc on 10.96.0.10:53: dial udp 10.96.0.10:53: connect: network is unreachable”]
[2024/07/11 12:38:49.665 +00:00] [WARN] [probing_status.go:70] [“prober detected unhealthy status”] [round-tripper-name=ROUND_TRIPPER_SNAPSHOT] [remote-peer-id=738369e98f51ba86] [rtt=0s] [error=“dial tcp 100.108.11.194:2380: connect: connection refused”]
[2024/07/11 12:38:50.165 +00:00] [WARN] [v3_server.go:814] [“waiting for ReadIndex response took too long, retrying”] [sent-request-id=1124651556563639575] [retry-timeout=500ms]
[2024/07/11 12:38:50.665 +00:00] [WARN] [v3_server.go:814] [“waiting for ReadIndex response took too long, retrying”] [sent-request-id=1124651556563639575] [retry-timeout=500ms]
[2024/07/11 12:38:51.188 +00:00] [WARN] [v3_server.go:814] [“waiting for ReadIndex response took too long, retrying”] [sent-request-id=1124651556563639575] [retry-timeout=500ms]
[2024/07/11 12:38:51.303 +00:00] [WARN] [util.go:163] [“apply request took too long”] [took=10.00007245s] [expected-duration=100ms] [prefix=“read-only range “] [request=“key:"/pd/7389506872184522927" range_end:"/pd/7389506872184522928" “] [response=] [error=“context deadline exceeded”]
[2024/07/11 12:38:51.303 +00:00] [INFO] [trace.go:152] [“trace[107107765] range”] [detail=”{range_begin:/pd/7389506872184522927; range_end:/pd/7389506872184522928; }”] [duration=10.000303807s] [start=2024/07/11 12:38:41.303 +00:00] [end=2024/07/11 12:38:51.303 +00:00] [steps=”["trace[107107765] ‘agreement among raft nodes before linearized reading’ (duration: 10.000092564s)"]”]
[2024/07/11 12:38:51.303 +00:00] [WARN] [retry_interceptor.go:62] [“retrying of unary invoker failed”] [target=endpoint://client-9be0eacd-99c7-4f44-9693-baf0775a244d/basic-pd-0.basic-pd-peer.ns.svc:2379] [attempt=0] [error=“rpc error: code = DeadlineExceeded desc = context deadline exceeded”]
[2024/07/11 12:38:51.303 +00:00] [WARN] [etcdutil.go:121] [“kv gets too slow”] [request-key=/pd/7389506872184522927] [cost=10.000653535s] [error=“context deadline exceeded”]
[2024/07/11 12:38:51.303 +00:00] [ERROR] [etcdutil.go:126] [“load from etcd meet error”] [key=/pd/7389506872184522927] [error=“[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded”]
[2024/07/11 12:38:51.303 +00:00] [ERROR] [server.go:1524] [“failed to initialize the global TSO allocator”] [error=“[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded”]
[2024/07/11 12:38:51.689 +00:00] [WARN] [v3_server.go:814] [“waiting for ReadIndex response took too long, retrying”] [sent-request-id=1124651556563639575] [retry-timeout=500ms]
[2024/07/11 12:38:52.190 +00:00] [WARN] [v3_server.go:814] [“waiting for ReadIndex response took too long, retrying”] [sent-request-id=1124651556563639575] [retry-timeout=500ms]
[2024/07/11 12:38:52.338 +00:00] [WARN] [retry_interceptor.go:62] [“retrying of unary invoker failed”] [target=endpoint://client-9be0eacd-99c7-4f44-9693-baf0775a244d/basic-pd-0.basic-pd-peer.ns.svc:2379] [attempt=0] [error=“rpc error: code = DeadlineExceeded desc = context deadline exceeded”]
[2024/07/11 12:38:52.338 +00:00] [WARN] [v3_server.go:830] [“timed out waiting for read index response (local node might have slow network)”] [timeout=11s]
[2024/07/11 12:38:52.338 +00:00] [INFO] [server.go:1420] [“server is closed, return pd leader loop”]
[2024/07/11 12:38:52.338 +00:00] [INFO] [etcd.go:369] [“closing etcd server”] [name=basic-pd-0] [data-dir=/var/lib/pd] [advertise-peer-urls=“[http://basic-pd-0.basic-pd-peer.ns.svc:2380]”] [advertise-client-urls=“[http://basic-pd-0.basic-pd-peer.ns.svc:2379]”]
[2024/07/11 12:38:52.339 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 100.74.135.16:2379->100.74.135.16:35694: use of closed network connection”]
[2024/07/11 12:38:52.339 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 100.74.135.16:2379->100.66.209.200:59794: use of closed network connection”]
[2024/07/11 12:38:52.339 +00:00] [INFO] [server.go:1494] [“skipped leadership transfer; local server is not leader”] [local-member-id=43c224d21a700f9b] [current-leader-member-id=0]
[2024/07/11 12:38:52.339 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:2379->127.0.0.1:46178: use of closed network connection”]
[2024/07/11 12:38:52.339 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 100.74.135.16:2379->100.74.135.32:42798: use of closed network connection”]
[2024/07/11 12:38:52.339 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 100.74.135.16:2379->100.108.11.230:53432: use of closed network connection”]
[2024/07/11 12:38:52.339 +00:00] [INFO] [peer.go:333] [“stopping remote peer”] [remote-peer-id=51c5dd9dd1f119ae]
[2024/07/11 12:38:52.339 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 100.74.135.16:2379->100.66.209.204:58928: use of closed network connection”]
[2024/07/11 12:38:52.339 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 100.74.135.16:2379->100.108.11.230:53380: use of closed network connection”]
[2024/07/11 12:38:52.339 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 100.74.135.16:2379->100.66.209.200:59310: use of closed network connection”]
[2024/07/11 12:38:52.339 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 100.74.135.16:2379->100.108.11.230:53412: use of closed network connection”]
[2024/07/11 12:38:52.339 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 100.74.135.16:2379->100.108.11.247:39956: use of closed network connection”]
[2024/07/11 12:38:52.339 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 100.74.135.16:2379->100.74.135.16:35692: use of closed network connection”]
[2024/07/11 12:38:52.339 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 100.74.135.16:2379->100.66.209.204:58930: use of closed network connection”]
[2024/07/11 12:38:52.339 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 100.74.135.16:2379->100.74.135.33:58116: use of closed network connection”]
[2024/07/11 12:38:52.339 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 100.74.135.16:2379->100.74.135.33:57852: use of closed network connection”]
[2024/07/11 12:38:52.339 +00:00] [WARN] [stream.go:291] [“closed TCP streaming connection with remote peer”] [stream-writer-type=“stream MsgApp v2”] [remote-peer-id=51c5dd9dd1f119ae]
[2024/07/11 12:38:52.339 +00:00] [WARN] [stream.go:301] [“stopped TCP streaming connection with remote peer”] [stream-writer-type=“stream MsgApp v2”] [remote-peer-id=51c5dd9dd1f119ae]
[2024/07/11 12:38:52.339 +00:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 100.74.135.16:2379->100.74.135.32:42756: use of closed network connection”]
[2024/07/11 12:38:52.339 +00:00] [WARN] [stream.go:291] [“closed TCP streaming connection with remote peer”] [stream-writer-type=“stream Message”] [remote-peer-id=51c5dd9dd1f119ae]
[2024/07/11 12:38:52.339 +00:00] [WARN] [stream.go:301] [“stopped TCP streaming connection with remote peer”] [stream-writer-type=“stream Message”] [remote-peer-id=51c5dd9dd1f119ae]
[2024/07/11 12:38:52.339 +00:00] [INFO] [pipeline.go:86] [“stopped HTTP pipelining with remote peer”] [local-member-id=43c224d21a700f9b] [remote-peer-id=51c5dd9dd1f119ae]
[2024/07/11 12:38:52.339 +00:00] [INFO] [stream.go:459] [“stopped stream reader with remote peer”] [stream-reader-type=“stream MsgApp v2”] [local-member-id=43c224d21a700f9b] [remote-peer-id=51c5dd9dd1f119ae]
[2024/07/11 12:38:52.340 +00:00] [INFO] [stream.go:459] [“stopped stream reader with remote peer”] [stream-reader-type=“stream Message”] [local-member-id=43c224d21a700f9b] [remote-peer-id=51c5dd9dd1f119ae]
[2024/07/11 12:38:52.340 +00:00] [INFO] [peer.go:340] [“stopped remote peer”] [remote-peer-id=51c5dd9dd1f119ae]
[2024/07/11 12:38:52.340 +00:00] [INFO] [peer.go:333] [“stopping remote peer”] [remote-peer-id=738369e98f51ba86]
[2024/07/11 12:38:52.340 +00:00] [WARN] [stream.go:291] [“closed TCP streaming connection with remote peer”] [stream-writer-type=“stream MsgApp v2”] [remote-peer-id=738369e98f51ba86]
[2024/07/11 12:38:52.340 +00:00] [WARN] [stream.go:301] [“stopped TCP streaming connection with remote peer”] [stream-writer-type=“stream MsgApp v2”] [remote-peer-id=738369e98f51ba86]
[2024/07/11 12:38:52.340 +00:00] [WARN] [stream.go:291] [“closed TCP streaming connection with remote peer”] [stream-writer-type=“stream Message”] [remote-peer-id=738369e98f51ba86]
[2024/07/11 12:38:52.340 +00:00] [WARN] [stream.go:301] [“stopped TCP streaming connection with remote peer”] [stream-writer-type=“stream Message”] [remote-peer-id=738369e98f51ba86]
[2024/07/11 12:38:52.340 +00:00] [INFO] [pipeline.go:86] [“stopped HTTP pipelining with remote peer”] [local-member-id=43c224d21a700f9b] [remote-peer-id=738369e98f51ba86]
[2024/07/11 12:38:52.340 +00:00] [INFO] [stream.go:459] [“stopped stream reader with remote peer”] [stream-reader-type=“stream MsgApp v2”] [local-member-id=43c224d21a700f9b] [remote-peer-id=738369e98f51ba86]
[2024/07/11 12:38:52.340 +00:00] [INFO] [stream.go:459] [“stopped stream reader with remote peer”] [stream-reader-type=“stream Message”] [local-member-id=43c224d21a700f9b] [remote-peer-id=738369e98f51ba86]
[2024/07/11 12:38:52.340 +00:00] [INFO] [peer.go:340] [“stopped remote peer”] [remote-peer-id=738369e98f51ba86]
[2024/07/11 12:38:52.353 +00:00] [INFO] [etcd.go:564] [“stopping serving peer traffic”] [address=“[::]:2380”]
[2024/07/11 12:38:52.353 +00:00] [INFO] [etcd.go:571] [“stopped serving peer traffic”] [address=“[::]:2380”]
[2024/07/11 12:38:52.353 +00:00] [INFO] [etcd.go:373] [“closed etcd server”] [name=basic-pd-0] [data-dir=/var/lib/pd] [advertise-peer-urls=“[http://basic-pd-0.basic-pd-peer.ns.svc:2380]”] [advertise-client-urls=“[http://basic-pd-0.basic-pd-peer.ns.svc:2379]”]
[2024/07/11 12:38:52.353 +00:00] [INFO] [manager.go:73] [“exit dashboard loop”]
[2024/07/11 12:38:52.353 +00:00] [INFO] [server.go:543] [“close server”]

Try raising the CPU and memory resource limits.

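I don't see anything in the logs pointing to a CPU or memory problem. On the same resources, the earlier deployment with TiDB v6.5.0 and the v1.4 operator worked fine.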
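To back this up with numbers, a saved PD log can be tallied by error type. The sketch below is only an illustration: the function name is invented and the patterns are the messages that appear in this thread, not an exhaustive list.

```shell
# Read a PD log on stdin and print how many lines contain each message.
pd_log_summary() {
  log="$(cat)"
  for p in 'network is unreachable' 'connection refused' \
           'context deadline exceeded' 'Got signal to exit'; do
    # grep -c counts matching lines; -F treats the pattern as a fixed string.
    printf '%s: %s\n' "$p" "$(printf '%s\n' "$log" | grep -c -F -- "$p")"
  done
}
```

Usage could look like kubectl logs -n ns basic-pd-0 pd | pd_log_summary. Counts dominated by network messages, with nothing about memory, would be consistent with the point above, though they do not prove it.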

This is from a high-performance environment with a single PD instance, and startup still fails intermittently:

ci:~ # kubectl logs -n ns basic-pd-0 pd -f
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name: basic-pd-0.basic-pd-peer.ns.svc
Address 1: 100.91.136.6
nslookup domain basic-pd-0.basic-pd-peer.ns.svc.svc success
starting pd-server …
/pd-server --data-dir=/var/lib/pd --name=basic-pd-0 --peer-urls=http://0.0.0.0:2380 --advertise-peer-urls=http://basic-pd-0.basic-pd-peer.ns.svc:2380 --client-urls=http://0.0.0.0:2379 --advertise-client-urls=http://basic-pd-0.basic-pd-peer.ns.svc:2379 --config=/etc/pd/pd.toml
[2024/07/12 02:54:48.942 +00:00] [INFO] [util.go:43] [“Welcome to Placement Driver (PD)”]
[2024/07/12 02:54:48.942 +00:00] [INFO] [util.go:44] [PD] [release-version=v6.5.8]
[2024/07/12 02:54:48.942 +00:00] [INFO] [util.go:45] [PD] [edition=Community]
[2024/07/12 02:54:48.942 +00:00] [INFO] [util.go:46] [PD] [git-hash=4506d63ba4fba7123ecc8277da7ef5f635efee90]
[2024/07/12 02:54:48.942 +00:00] [INFO] [util.go:47] [PD] [git-branch=heads/refs/tags/v6.5.8]
[2024/07/12 02:54:48.942 +00:00] [INFO] [util.go:48] [PD] [utc-build-time=“2024-01-25 10:03:20”]
[2024/07/12 02:54:48.942 +00:00] [INFO] [metricutil.go:83] [“disable Prometheus push client”]
[2024/07/12 02:54:48.942 +00:00] [INFO] [server.go:253] [“PD Config”] [config=“{"client-urls":"http://0.0.0.0:2379","peer-urls":"http://0.0.0.0:2380","advertise-client-urls":"http://basic-pd-0.basic-pd-peer.ns.svc:2379","advertise-peer-urls":"http://basic-pd-0.basic-pd-peer.ns.svc:2380","name":"basic-pd-0","data-dir":"/var/lib/pd","force-new-cluster":false,"enable-grpc-gateway":true,"initial-cluster":"basic-pd-0=http://basic-pd-0.basic-pd-peer.ns.svc:2380","initial-cluster-state":"new","initial-cluster-token":"pd-cluster","join":"","lease":3,"log":{"level":"info","format":"text","disable-timestamp":false,"file":{"filename":"","max-size":0,"max-days":0,"max-backups":0},"development":false,"disable-caller":false,"disable-stacktrace":false,"disable-error-verbose":true,"sampling":null,"error-output-path":""},"tso-save-interval":"3s","tso-update-physical-interval":"50ms","enable-local-tso":false,"metric":{"job":"basic-pd-0","address":"","interval":"15s"},"schedule":{"max-snapshot-count":64,"max-pending-peer-count":64,"max-merge-region-size":20,"max-merge-region-keys":0,"split-merge-interval":"1h0m0s","swtich-witness-interval":"1h0m0s","enable-one-way-merge":"false","enable-cross-table-merge":"true","patrol-region-interval":"10ms","max-store-down-time":"30m0s","max-store-preparing-time":"48h0m0s","leader-schedule-limit":4,"leader-schedule-policy":"count","region-schedule-limit":2048,"replica-schedule-limit":64,"merge-schedule-limit":8,"hot-region-schedule-limit":4,"hot-region-cache-hits-threshold":3,"store-limit":{},"tolerant-size-ratio":0,"low-space-ratio":0.8,"high-space-ratio":0.7,"region-score-formula-version":"v2","scheduler-max-waiting-operator":5,"enable-remove-down-replica":"true","enable-replace-offline-replica":"true","enable-make-up-replica":"true","enable-remove-extra-replica":"true","enable-location-replacement":"true","enable-debug-metrics":"false","enable-joint-consensus":"true","enable-tikv-split-region":"true","schedulers-v2":[{"type":"balance-region","ar
gs":null,"disable":false,"args-payload":""},{"type":"balance-leader","args":null,"disable":false,"args-payload":""},{"type":"hot-region","args":null,"disable":false,"args-payload":""},{"type":"split-bucket","args":null,"disable":false,"args-payload":""}],"schedulers-payload":null,"store-limit-mode":"manual","hot-regions-write-interval":"10m0s","hot-regions-reserved-days":7,"enable-diagnostic":"false","enable-witness":"false"},"replication":{"max-replicas":3,"location-labels":"","strictly-match-label":"false","enable-placement-rules":"true","enable-placement-rules-cache":"false","isolation-level":""},"pd-server":{"use-region-storage":"true","max-gap-reset-ts":"24h0m0s","key-type":"table","runtime-services":"","metric-storage":"","dashboard-address":"auto","trace-region-flow":"true","flow-round-by-digit":3,"min-resolved-ts-persistence-interval":"1s"},"cluster-version":"0.0.0","labels":{},"quota-backend-bytes":"8GiB","auto-compaction-mode":"periodic","auto-compaction-retention-v2":"1h","TickInterval":"500ms","ElectionInterval":"3s","PreVote":true,"max-request-bytes":157286400,"security":{"cacert-path":"","cert-path":"","key-path":"","cert-allowed-cn":null,"SSLCABytes":null,"SSLCertBytes":null,"SSLKEYBytes":null,"redact-info-log":false,"encryption":{"data-encryption-method":"plaintext","data-key-rotation-period":"168h0m0s","master-key":{"type":"plaintext","key-id":"","region":"","endpoint":"","path":""}}},"label-property":null,"WarningMsgs":null,"DisableStrictReconfigCheck":false,"HeartbeatStreamBindInterval":"1m0s","LeaderPriorityCheckInterval":"1m0s","dashboard":{"tidb-cacert-path":"","tidb-cert-path":"","tidb-key-path":"","public-path-prefix":"","internal-proxy":false,"enable-telemetry":false,"enable-experimental":false},"replication-mode":{"replication-mode":"majority","dr-auto-sync":{"label-key":"","primary":"","dr":"","primary-replicas":0,"dr-replicas":0,"wait-store-timeout":"1m0s","pause-region-split":"false"}}}”]
[2024/07/12 02:54:48.946 +00:00] [INFO] [server.go:228] [“register REST path”] [path=/pd/api/v1]
[2024/07/12 02:54:48.947 +00:00] [INFO] [server.go:228] [“register REST path”] [path=/pd/api/v2/]
[2024/07/12 02:54:48.947 +00:00] [INFO] [server.go:228] [“register REST path”] [path=/autoscaling]
[2024/07/12 02:54:48.947 +00:00] [INFO] [distro.go:51] [“Using distribution strings”] [strings={}]
[2024/07/12 02:54:48.948 +00:00] [INFO] [server.go:228] [“register REST path”] [path=/dashboard/api/]
[2024/07/12 02:54:48.948 +00:00] [INFO] [server.go:228] [“register REST path”] [path=/dashboard/]
[2024/07/12 02:54:48.948 +00:00] [INFO] [etcd.go:117] [“configuring peer listeners”] [listen-peer-urls=“[http://0.0.0.0:2380]”]
[2024/07/12 02:54:48.949 +00:00] [INFO] [etcd.go:127] [“configuring client listeners”] [listen-client-urls=“[http://0.0.0.0:2379]”]
[2024/07/12 02:54:48.949 +00:00] [INFO] [systimemon.go:30] [“start system time monitor”]
[2024/07/12 02:54:48.949 +00:00] [INFO] [etcd.go:611] [“pprof is enabled”] [path=/debug/pprof]
[2024/07/12 02:54:48.949 +00:00] [INFO] [etcd.go:305] [“starting an etcd server”] [etcd-version=3.4.21] [git-sha=“Not provided (use ./build instead of go build)”] [go-version=go1.19.13] [go-os=linux] [go-arch=amd64] [max-cpu-set=16] [max-cpu-available=16] [member-initialized=true] [name=basic-pd-0] [data-dir=/var/lib/pd] [wal-dir=] [wal-dir-dedicated=] [member-dir=/var/lib/pd/member] [force-new-cluster=false] [heartbeat-interval=500ms] [election-timeout=3s] [initial-election-tick-advance=true] [snapshot-count=100000] [snapshot-catchup-entries=5000] [initial-advertise-peer-urls=“[http://basic-pd-0.basic-pd-peer.ns.svc:2380]”] [listen-peer-urls=“[http://0.0.0.0:2380]”] [advertise-client-urls=“[http://basic-pd-0.basic-pd-peer.ns.svc:2379]”] [listen-client-urls=“[http://0.0.0.0:2379]”] [listen-metrics-urls=“”] [cors=“[]“] [host-whitelist=”[]”] [initial-cluster=] [initial-cluster-state=new] [initial-cluster-token=] [quota-backend-bytes=8589934592] [max-request-bytes=157286400] [max-concurrent-streams=4294967295] [pre-vote=true] [initial-corrupt-check=false] [corrupt-check-time-interval=0s] [auto-compaction-mode=periodic] [auto-compaction-retention=1h0m0s] [auto-compaction-interval=1h0m0s] [discovery-url=] [discovery-proxy=]
[2024/07/12 02:54:48.949 +00:00] [WARN] [server.go:297] [“exceeded recommended request limit”] [max-request-bytes=157286400] [max-request-size=“157 MB”] [recommended-request-bytes=10485760] [recommended-request-size=“10 MB”]
2024-07-12 02:54:48.949450 W | pkg/fileutil: check file permission: directory “/var/lib/pd” exist, but the permission is “drwxr-xr-x”. The recommended permission is “-rwx------” to prevent possible unprivileged access to the data.
[2024/07/12 02:54:48.952 +00:00] [INFO] [backend.go:80] [“opened backend db”] [path=/var/lib/pd/member/snap/db] [took=2.525235ms]
[2024/07/12 02:54:49.570 +00:00] [INFO] [server.go:462] [“recovered v2 store from snapshot”] [snapshot-index=600006] [snapshot-size=“18 kB”]
[2024/07/12 02:54:49.571 +00:00] [INFO] [kvstore.go:388] [“restored last compact revision”] [meta-bucket-name=meta] [meta-bucket-name-key=finishedCompactRev] [restored-compact-revision=597525]
[2024/07/12 02:54:49.573 +00:00] [INFO] [server.go:480] [“recovered v3 backend from snapshot”] [backend-size-bytes=2547712] [backend-size=“2.5 MB”] [backend-size-in-use-bytes=417792] [backend-size-in-use=“418 kB”]
[2024/07/12 02:54:49.784 +00:00] [INFO] [raft.go:586] [“restarting local member”] [cluster-id=a57d21dfafd146d9] [local-member-id=43c224d21a700f9b] [commit-index=606478]
[2024/07/12 02:54:49.784 +00:00] [INFO] [raft.go:1523] [“43c224d21a700f9b switched to configuration voters=(4882505430828322715)”]
[2024/07/12 02:54:49.784 +00:00] [INFO] [raft.go:706] [“43c224d21a700f9b became follower at term 29”]
[2024/07/12 02:54:49.784 +00:00] [INFO] [raft.go:389] [“newRaft 43c224d21a700f9b [peers: [43c224d21a700f9b], term: 29, commit: 606478, applied: 600006, lastindex: 606478, lastterm: 29]”]
[2024/07/12 02:54:49.784 +00:00] [INFO] [capability.go:76] [“enabled capabilities for version”] [cluster-version=3.4]
[2024/07/12 02:54:49.784 +00:00] [INFO] [cluster.go:256] [“recovered/added member from store”] [cluster-id=a57d21dfafd146d9] [local-member-id=43c224d21a700f9b] [recovered-remote-peer-id=43c224d21a700f9b] [recovered-remote-peer-urls=“[http://basic-pd-0.basic-pd-peer.ns.svc:2380]”]
[2024/07/12 02:54:49.784 +00:00] [INFO] [cluster.go:269] [“set cluster version from store”] [cluster-version=3.4]
[2024/07/12 02:54:49.785 +00:00] [WARN] [store.go:1379] [“simple token is not cryptographically signed”]
[2024/07/12 02:54:49.785 +00:00] [INFO] [kvstore.go:388] [“restored last compact revision”] [meta-bucket-name=meta] [meta-bucket-name-key=finishedCompactRev] [restored-compact-revision=597525]
[2024/07/12 02:54:49.787 +00:00] [INFO] [quota.go:126] [“enabled backend quota”] [quota-name=v3-applier] [quota-size-bytes=8589934592] [quota-size=“8.6 GB”]
[2024/07/12 02:54:49.787 +00:00] [INFO] [server.go:803] [“starting etcd server”] [local-member-id=43c224d21a700f9b] [local-server-version=3.4.21] [cluster-id=a57d21dfafd146d9] [cluster-version=3.4]
[2024/07/12 02:54:49.787 +00:00] [INFO] [server.go:682] [“started as single-node; fast-forwarding election ticks”] [local-member-id=43c224d21a700f9b] [forward-ticks=5] [forward-duration=2.5s] [election-ticks=6] [election-timeout=3s]
[2024/07/12 02:54:49.789 +00:00] [INFO] [etcd.go:247] [“now serving peer/client/metrics”] [local-member-id=43c224d21a700f9b] [initial-advertise-peer-urls=“[http://basic-pd-0.basic-pd-peer.ns.svc:2380]”] [listen-peer-urls=“[http://0.0.0.0:2380]”] [advertise-client-urls=“[http://basic-pd-0.basic-pd-peer.ns.svc:2379]”] [listen-client-urls=“[http://0.0.0.0:2379]”] [listen-metrics-urls=“”]
[2024/07/12 02:54:49.789 +00:00] [INFO] [etcd.go:585] [“serving peer traffic”] [address=“[::]:2380”]
[2024/07/12 02:54:51.785 +00:00] [INFO] [raft.go:929] [“43c224d21a700f9b is starting a new election at term 29”]
[2024/07/12 02:54:51.786 +00:00] [INFO] [raft.go:735] [“43c224d21a700f9b became pre-candidate at term 29”]
[2024/07/12 02:54:51.786 +00:00] [INFO] [raft.go:830] [“43c224d21a700f9b received MsgPreVoteResp from 43c224d21a700f9b at term 29”]
[2024/07/12 02:54:51.786 +00:00] [INFO] [raft.go:719] [“43c224d21a700f9b became candidate at term 30”]
[2024/07/12 02:54:51.786 +00:00] [INFO] [raft.go:830] [“43c224d21a700f9b received MsgVoteResp from 43c224d21a700f9b at term 30”]
[2024/07/12 02:54:51.786 +00:00] [INFO] [raft.go:771] [“43c224d21a700f9b became leader at term 30”]
[2024/07/12 02:54:51.786 +00:00] [INFO] [node.go:327] [“raft.node: 43c224d21a700f9b elected leader 43c224d21a700f9b at term 30”]
[2024/07/12 02:54:51.786 +00:00] [INFO] [server.go:2069] [“published local member to cluster through raft”] [local-member-id=43c224d21a700f9b] [local-member-attributes=“{Name:basic-pd-0 ClientURLs:[http://basic-pd-0.basic-pd-peer.ns.svc:2379]}”] [request-path=/0/members/43c224d21a700f9b/attributes] [cluster-id=a57d21dfafd146d9] [publish-timeout=11s]
[2024/07/12 02:54:51.787 +00:00] [INFO] [serve.go:145] [“serving client traffic insecurely; this is strongly discouraged!”] [address=“[::]:2379”]
2024-07-12 02:54:59.790273 W | etcdserver: could not get cluster response from http://basic-pd-0.basic-pd-peer.ns.svc:2380: Get “http://basic-pd-0.basic-pd-peer.ns.svc:2380/members”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[2024/07/12 02:54:59.790 +00:00] [ERROR] [etcdutil.go:71] [“failed to get cluster from remote”] [error=“[PD:etcd:ErrEtcdGetCluster]could not retrieve cluster information from the given URLs: could not retrieve cluster information from the given URLs”]
[2024/07/12 02:54:59.790 +00:00] [FATAL] [main.go:120] [“run server failed”] [error=“[PD:server:ErrCancelStartEtcd]etcd start canceled”] [stack=“main.main\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:120\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250”]

Across several attempts, when startup succeeds the log looks like this. It also reports ErrEtcdGetCluster, yet startup succeeds:
[2024/07/12 03:06:08.105 +00:00] [INFO] [serve.go:145] [“serving client traffic insecurely; this is strongly discouraged!”] [address=“[::]:2379”]
2024-07-12 03:06:15.109376 W | etcdserver: could not get cluster response from http://basic-pd-0.basic-pd-peer.ns.svc:2380: Get “http://basic-pd-0.basic-pd-peer.ns.svc:2380/members”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[2024/07/12 03:06:15.109 +00:00] [ERROR] [etcdutil.go:71] [“failed to get cluster from remote”] [error=“[PD:etcd:ErrEtcdGetCluster]could not retrieve cluster information from the given URLs: could not retrieve cluster information from the given URLs”]
[2024/07/12 03:06:15.109 +00:00] [INFO] [server.go:335] [“create etcd v3 client”] [endpoints=“[http://basic-pd-0.basic-pd-peer.ns.svc:2379]”] [cert=“{"cacert-path":"","cert-path":"","key-path":"","cert-allowed-cn":null,"SSLCABytes":null,"SSLCertBytes":null,"SSLKEYBytes":null,"redact-info-log":false,"encryption":{"data-encryption-method":"plaintext","data-key-rotation-period":"168h0m0s","master-key":{"type":"plaintext","key-id":"","region":"","endpoint":"","path":""}}}”]
[2024/07/12 03:06:15.112 +00:00] [INFO] [server.go:400] [“init cluster id”] [cluster-id=7385453281335235694]
[2024/07/12 03:06:15.126 +00:00] [INFO] [allocator_manager.go:262] [“delete the dc-location key previously written in etcd”] [server-id=4882505430828322715]
[2024/07/12 03:06:15.157 +00:00] [INFO] [history_buffer.go:147] [“start from history index”] [start-index=728800]
[2024/07/12 03:06:15.166 +00:00] [INFO] [server.go:1489] [“start to campaign pd leader”] [campaign-pd-leader-name=basic-pd-0]
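Both the failing and the successful run time out on the request to the pod's own peer URL (the /members path shown in the log). A rough way to test that path by hand from inside the container is sketched below. It assumes the container is named pd, as in the kubectl logs command above, and that the image's shell provides nslookup and wget; nslookup appears in the startup output, but wget is an assumption. The function names are hypothetical.

```shell
# Resolve a PD peer hostname from inside the PD container.
# $1 = namespace, $2 = pod name, $3 = hostname to resolve
pd_check_dns() {
  kubectl -n "$1" exec "$2" -c pd -- nslookup "$3"
}

# Fetch the member list from the peer port with a 5-second timeout.
# $1 = namespace, $2 = pod name, $3 = peer hostname
pd_check_members() {
  kubectl -n "$1" exec "$2" -c pd -- wget -q -O - -T 5 "http://$3:2380/members"
}
```

For example: pd_check_members ns basic-pd-0 basic-pd-0.basic-pd-peer.ns.svc. If this hangs or fails while PD is running, the problem would seem to be in pod networking or DNS rather than in PD itself.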

Check the TiDB Operator controller-manager logs.

 kubectl get all -n tidb-admin
NAME                                          READY   STATUS    RESTARTS      AGE
pod/tidb-controller-manager-f49f57768-6jgvl   1/1     Running   7 (31d ago)   88d

Then run kubectl logs on this pod.
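Spelled out, that could look like the sketch below. The namespace tidb-admin and the pod name come from the output above; the function names and the default of 200 lines are arbitrary choices for illustration.

```shell
# Print the most recent log lines from the operator pod.
# $1 = namespace, $2 = pod name, $3 = number of lines (default 200)
operator_logs() {
  kubectl -n "$1" logs "$2" --tail="${3:-200}"
}

# Keep only error-level lines; in this log format they start with "E" followed by the date.
operator_errors() {
  grep '^E[0-9]'
}
```

For example: operator_logs tidb-admin tidb-controller-manager-f49f57768-6jgvl | operator_errors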

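Resource limits: Kubernetes may have set resource limits on the PD container, causing the process to be killed.
Configuration problem: the PD configuration may be wrong, for example a misconfigured advertise-peer-urls or advertise-client-urls.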

I don't see any useful information in it.
I0717 06:17:40.035380 1 event.go:282] Event(v1.ObjectReference{Kind:“TidbCluster”, Namespace:“ns”, Name:“basic”, UID:“0037af48-b6e2-4c2e-9566-0d59d9a61482”, APIVersion:“pingcap.com/v1alpha1”, ResourceVersion:“4218041”, FieldPath:“”}): type: ‘Normal’ reason: ‘SuccessfulCreate’ create StatefulSet basic-pd in basic successful
I0717 06:17:40.081271 1 tidbcluster_control.go:71] TidbCluster: [ns/basic] updated successfully
I0717 06:17:40.081308 1 tidb_cluster_controller.go:141] TidbCluster: ns/basic, still need sync: TidbCluster: [ns/basic], waiting for PD cluster running, requeuing
E0717 06:17:40.353845 1 pd_member_manager.go:195] failed to sync TidbCluster: [ns/basic]'s status, error: Get “http://basic-pd.ns:2379/pd/api/v1/health”: dial tcp 10.99.203.32:2379: connect: connection refused, service ns/basic-pd has no endpoints
I0717 06:17:40.410977 1 tidbcluster_control.go:71] TidbCluster: [ns/basic] updated successfully
I0717 06:17:40.411026 1 tidb_cluster_controller.go:141] TidbCluster: ns/basic, still need sync: TidbCluster: [ns/basic], waiting for PD cluster running, requeuing
E0717 06:17:41.500252 1 pd_member_manager.go:195] failed to sync TidbCluster: [ns/basic]'s status, error: Get “http://basic-pd.ns:2379/pd/api/v1/health”: dial tcp 10.99.203.32:2379: connect: connection refused, service ns/basic-pd has no endpoints
I0717 06:17:41.541453 1 tidbcluster_control.go:71] TidbCluster: [ns/basic] updated successfully
I0717 06:17:41.541499 1 tidb_cluster_controller.go:141] TidbCluster: ns/basic, still need sync: TidbCluster: [ns/basic], waiting for PD cluster running, requeuing
E0717 06:17:42.592209 1 pd_member_manager.go:195] failed to sync TidbCluster: [ns/basic]'s status, error: Get “http://basic-pd.ns:2379/pd/api/v1/health”: dial tcp 10.99.203.32:2379: connect: connection refused, service ns/basic-pd has no endpoints
I0717 06:17:42.600158 1 tidb_cluster_controller.go:141] TidbCluster: ns/basic, still need sync: TidbCluster: [ns/basic], waiting for PD cluster running, requeuing
E0717 06:17:43.804230 1 pd_member_manager.go:195] failed to sync TidbCluster: [ns/basic]'s status, error: Get “http://basic-pd.ns:2379/pd/api/v1/health”: dial tcp 10.99.203.32:2379: connect: connection refused, service ns/basic-pd has no endpoints
I0717 06:17:43.875595 1 tidbcluster_control.go:71] TidbCluster: [ns/basic] updated successfully
I0717 06:17:43.875633 1 tidb_cluster_controller.go:141] TidbCluster: ns/basic, still need sync: TidbCluster: [ns/basic], waiting for PD cluster running, requeuing
E0717 06:17:44.992308 1 pd_member_manager.go:195] failed to sync TidbCluster: [ns/basic]'s status, error: Get “http://basic-pd.ns:2379/pd/api/v1/health”: dial tcp 10.99.203.32:2379: connect: connection refused
I0717 06:17:45.000498 1 tidb_cluster_controller.go:141] TidbCluster: ns/basic, still need sync: TidbCluster: [ns/basic], waiting for PD cluster running, requeuing
E0717 06:17:45.040106 1 pd_member_manager.go:195] failed to sync TidbCluster: [ns/basic]'s status, error: Get “http://basic-pd.ns:2379/pd/api/v1/health”: dial tcp 10.99.203.32:2379: connect: connection refused
I0717 06:17:45.058379 1 tidbcluster_control.go:71] TidbCluster: [ns/basic] updated successfully
E0717 06:17:45.058431 1 tidb_cluster_controller.go:143] TidbCluster: ns/basic, sync failed Get “http://basic-pd.ns:2379/pd/api/v1/stores”: dial tcp 10.99.203.32:2379: connect: connection refused, requeuing
E0717 06:17:45.100990 1 pd_member_manager.go:195] failed to sync TidbCluster: [ns/basic]'s status, error: Get “http://basic-pd.ns:2379/pd/api/v1/health”: dial tcp 10.99.203.32:2379: connect: connection refused
E0717 06:17:45.106713 1 tidb_cluster_controller.go:143] TidbCluster: ns/basic, sync failed Get “http://basic-pd.ns:2379/pd/api/v1/stores”: dial tcp 10.99.203.32:2379: connect: connection refused, requeuing
E0717 06:17:50.259802 1 pd_member_manager.go:195] failed to sync TidbCluster: [ns/basic]'s status, error: Get “http://basic-pd.ns:2379/pd/api/v1/health”: dial tcp 10.99.203.32:2379: connect: connection refused
E0717 06:17:50.265372 1 tidb_cluster_controller.go:143] TidbCluster: ns/basic, sync failed Get “http://basic-pd.ns:2379/pd/api/v1/stores”: dial tcp 10.99.203.32:2379: connect: connection refused, requeuing
E0717 06:18:02.596771 1 pd_member_manager.go:195] failed to sync TidbCluster: [ns/basic]'s status, error: Error response 503 URL http://basic-pd.ns:2379/pd/api/v1/health,body response: no leader
, service ns/basic-pd has no endpoints
I0717 06:18:02.614810 1 tidbcluster_control.go:71] TidbCluster: [ns/basic] updated successfully
I0717 06:18:02.614852 1 tidb_cluster_controller.go:141] TidbCluster: ns/basic, still need sync: TidbCluster: [ns/basic], waiting for PD cluster running, requeuing
E0717 06:18:02.664913 1 pd_member_manager.go:195] failed to sync TidbCluster: [ns/basic]'s status, error: Error response 503 URL http://basic-pd.ns:2379/pd/api/v1/health,body response: no leader
, service ns/basic-pd has no endpoints
I0717 06:18:02.704625 1 tidbcluster_control.go:71] TidbCluster: [ns/basic] updated successfully
I0717 06:18:02.704671 1 tidb_cluster_controller.go:141] TidbCluster: ns/basic, still need sync: TidbCluster: [ns/basic], waiting for PD cluster running, requeuing
E0717 06:18:02.755790 1 pd_member_manager.go:195] failed to sync TidbCluster: [ns/basic]'s status, error: Error response 503 URL http://basic-pd.ns:2379/pd/api/v1/health,body response: no leader
, service ns/basic-pd has no endpoints
I0717 06:18:02.763477 1 tidb_cluster_controller.go:141] TidbCluster: ns/basic, still need sync: TidbCluster: [ns/basic], waiting for PD cluster running, requeuing
E0717 06:18:09.518105 1 pd_member_manager.go:195] failed to sync TidbCluster: [ns/basic]'s status, error: Get “http://basic-pd.ns:2379/pd/api/v1/health”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
I0717 06:18:09.538329 1 tidbcluster_control.go:71] TidbCluster: [ns/basic] updated successfully
E0717 06:18:09.538370 1 tidb_cluster_controller.go:143] TidbCluster: ns/basic, sync failed Get “http://basic-pd.ns:2379/pd/api/v1/stores”: dial tcp 10.99.203.32:2379: connect: connection refused, requeuing
E0717 06:18:09.579276 1 pd_member_manager.go:195] failed to sync TidbCluster: [ns/basic]'s status, error: Get “http://basic-pd.ns:2379/pd/api/v1/health”: dial tcp 10.99.203.32:2379: connect: connection refused
E0717 06:18:09.584470 1 tidb_cluster_controller.go:143] TidbCluster: ns/basic, sync failed Get “http://basic-pd.ns:2379/pd/api/v1/stores”: dial tcp 10.99.203.32:2379: connect: connection refused, requeuing
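
The errors in this log fall into a few recurring kinds (connection refused, 503 "no leader", client timeout), and counting them per kind makes the sequence easier to read. Below is a minimal sketch, not part of TiDB Operator: the function names and category labels are made up for illustration, and the patterns are copied from the messages in the log above.

```python
import re
from collections import Counter

# Patterns copied from the operator log messages above.
# The category names on the left are illustrative, not TiDB Operator terms.
PATTERNS = [
    ("no_leader", re.compile(r"Error response 503 .*no leader")),
    ("timeout", re.compile(r"context deadline exceeded")),
    ("connection_refused", re.compile(r"connect: connection refused")),
]


def classify(line):
    """Return a category for one log line, or None if it is not an E (error) line."""
    if not line.startswith("E"):
        # I (info) lines and wrapped continuation lines are skipped.
        return None
    for name, pattern in PATTERNS:
        if pattern.search(line):
            return name
    return "other"


def summarize(lines):
    """Count error lines per category."""
    return Counter(c for c in map(classify, lines) if c is not None)
```

Feeding the saved operator log to `summarize` (for example `summarize(open("operator.log"))`, where the file name is hypothetical) returns a `Counter` keyed by category.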

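1. No special changes were made to the YAML; the current PD YAML has no resource limits configured.
2. advertise-peer-urls and advertise-client-urls both come from the YAML files that ship with TiDB Operator; nothing was customized.
3. We are trying an IPv4 deployment as well as an IPv6/IPv4 dual-stack deployment. So far only the IPv4 deployment has the problem.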

But there are E (ERROR) lines in your log. It looks like the operator is trying to fetch PD's health status and cannot get it?

Of course, PD keeps restarting. Does this mean tidb-controller-manager is the one triggering the restarts? Is there a way to make tidb-controller-manager wait longer?

Try creating a fresh cluster under a different cluster name.

This feels like it could be a network problem. Try going into the manager pod and running curl against this API.
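
If curl is not available in the pod, roughly the same check can be done with Python's standard library from any pod on the cluster network that has Python (whether the tidb-controller-manager image has curl or Python is not known here). This is only a sketch: `probe` is a hypothetical helper and its outcome labels are made up. It tells apart the three failures seen in the operator log: connection refused, an HTTP error such as 503, and a timeout.

```python
import socket
import urllib.error
import urllib.request

# Ignore any http_proxy settings so the request goes straight to the service.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url, timeout=3.0):
    """Fetch url once and return (outcome, detail). Labels are illustrative."""
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            return ("ok", resp.read().decode("utf-8", "replace"))
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (for example 503).
        return ("http_%d" % err.code, err.read().decode("utf-8", "replace"))
    except urllib.error.URLError as err:
        reason = err.reason
        if isinstance(reason, ConnectionRefusedError):
            return ("connection_refused", str(reason))
        if isinstance(reason, socket.timeout):
            return ("timeout", str(reason))
        # DNS failures, unreachable networks, and similar.
        return ("unreachable", str(reason))
    except (socket.timeout, TimeoutError) as err:
        # Timeout while reading the response.
        return ("timeout", str(err))
```

Calling `probe("http://basic-pd.ns:2379/pd/api/v1/health")` (the URL from the operator log) returns a pair such as `("http_503", ...)` or `("connection_refused", ...)`. If it keeps returning `connection_refused`, nothing is accepting connections on that port. An `http_503` result means PD answered with an error status, so the service is reachable over the network.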

https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/tips#kubernetes-上的-tidb-集群管理常用使用技巧 See whether this helps.