PD service fails to start after the PD node is restarted

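To improve efficiency, please provide the following information. A clearly described problem gets resolved faster:
[TiDB environment]
The cluster currently has 7 nodes.
Details are as follows:
[root@Tidb ~]# tiup cluster display jmcdw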
Starting component cluster: /root/.tiup/components/cluster/v1.5.2/tiup-cluster display jmcdw
Cluster type: tidb
Cluster name: jmcdw
Cluster version: v4.0.10
Deploy user: tidb
SSH type: builtin
ID                 Role          Host         Ports                            OS/Arch       Status  Data Dir                      Deploy Dir
--                 ----          ----         -----                            -------       ------  --------                      ----------
172.29.6.33:9093   alertmanager  172.29.6.33  9093/9094                        linux/x86_64  Up      /data/data/alertmanager-9093  /data/soft/alertmanager-9093
172.29.6.33:3000   grafana       172.29.6.33  3000                             linux/x86_64  Up      -                             /data/soft/grafana-3000
172.29.6.31:2379   pd            172.29.6.31  2379/2380                        linux/x86_64  Down    /data/data/pd-2379            /data/soft/pd-2379
172.29.6.33:9090   prometheus    172.29.6.33  9090                             linux/x86_64  Up      /data/data/prometheus-9090    /data/soft/prometheus-9090
172.29.6.32:4000   tidb          172.29.6.32  4000/10080                       linux/x86_64  Up      -                             /data/soft/tidb-4000
172.29.6.33:4000   tidb          172.29.6.33  4000/10080                       linux/x86_64  Up      -                             /data/soft/tidb-4000
172.29.6.37:9000   tiflash       172.29.6.37  9000/8123/3930/20170/20292/8234  linux/x86_64  N/A     /data/data/tiflash-9000       /data/soft/tidb/tiflash-9000
172.29.6.34:20160  tikv          172.29.6.34  20160/20180                      linux/x86_64  N/A     /data/data/tikv-20160         /data/soft/tikv-20160
172.29.6.35:20160  tikv          172.29.6.35  20160/20180                      linux/x86_64  N/A     /data/data/tikv-20160         /data/soft/tikv-20160
172.29.6.36:20160  tikv          172.29.6.36  20160/20180                      linux/x86_64  N/A     /data/data/tikv-20160         /data/soft/tikv-20160
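
To pick out the unhealthy instances from a listing like the one above, a small script can filter on the Status column. This is only a minimal sketch, assuming the whitespace-separated column layout shown in this thread; `unhealthy_instances` and `DISPLAY_ROWS` are hypothetical names, and the sample rows are copied from the output above.

```python
# Sample rows copied from the `tiup cluster display jmcdw` output in this thread.
DISPLAY_ROWS = """\
172.29.6.31:2379 pd 172.29.6.31 2379/2380 linux/x86_64 Down /data/data/pd-2379 /data/soft/pd-2379
172.29.6.32:4000 tidb 172.29.6.32 4000/10080 linux/x86_64 Up - /data/soft/tidb-4000
172.29.6.34:20160 tikv 172.29.6.34 20160/20180 linux/x86_64 N/A /data/data/tikv-20160 /data/soft/tikv-20160
"""


def unhealthy_instances(rows):
    """Return (id, role, status) for every instance row whose Status is not Up."""
    result = []
    for line in rows.splitlines():
        fields = line.split()
        # Instance rows have 8 columns and an ID of the form host:port;
        # this skips the header, separator, and blank lines.
        if len(fields) < 8 or ":" not in fields[0]:
            continue
        instance_id, role, status = fields[0], fields[1], fields[5]
        # A healthy PD may show extra flags after "Up", so match on the prefix.
        if not status.startswith("Up"):
            result.append((instance_id, role, status))
    return result


for instance_id, role, status in unhealthy_instances(DISPLAY_ROWS):
    print(f"{role:8} {instance_id:20} {status}")
```

In this thread the only instance reported as Down is PD. The N/A status on TiKV and TiFlash most likely just reflects that their state cannot be queried while PD is unreachable, though that reading is an assumption rather than something the output states.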

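[Overview] Scenario and problem summary

At 3 p.m. the NAS storage had a problem, and it recovered after 4 p.m. Once the storage recovered, TiDB returned to normal, but about ten minutes later the PD status became Down and the TiKV nodes showed status N/A.

[Background] Operations performed
Starting PD through tiup cluster reports an error. The details are as follows: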
Starting component pd
Starting instance 172.29.6.31:2379
Start instance 172.29.6.31:2379 success
Starting component node_exporter
Starting instance 172.29.6.31
Start 172.29.6.31 success
Starting component blackbox_exporter
Starting instance 172.29.6.31
Start 172.29.6.31 success

+ [ Serial ] - UpdateTopology: cluster=jmcdw
{"level":"warn","ts":"2021-07-12T17:18:49.862+0800","logger":"etcd-client","caller":"v3@v3.5.0/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00018c000/#initially=[172.29.6.31:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = "transport: Error while dialing dial tcp 172.29.6.31:2379: connect: connection refused""}

Error: context deadline exceeded

Verbose debug logs has been written to /root/.tiup/logs/tiup-cluster-debug-2021-07-12-17-18-50.log.
Error: run /root/.tiup/components/cluster/v1.5.2/tiup-cluster (wd:/root/.tiup/data/ScwKJ88) failed: exit status 1
I rebooted the operating system and then restarted PD, but the failure persists.
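
The "connection refused" in the error above means nothing accepted the TCP connection on the PD client port, even though tiup reported the instance as started. A quick way to confirm that independently of tiup is a plain TCP probe. The following is a minimal sketch using only the Python standard library; `port_is_open` is a hypothetical helper name, not part of any TiDB tooling.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake; the socket is closed
        # again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

For example, `port_is_open("172.29.6.31", 2379)` returning False would match the error shown here and would suggest looking at the PD process and its own logs rather than at tiup.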

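[Symptoms] Application and database symptoms
All business services are currently down, and the database cannot be connected to.

[Business impact]
The business reporting system is affected.

[TiDB version]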
Cluster version: v4.0.10

[Attachments]

  1. TiUP Cluster Display information

  2. TiUP Cluster Edit Config information

  3. TiDB Overview monitoring

  • Logs of the relevant modules (covering 1 hour before and after the problem)

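Check the PD log to see exactly which error it reports. Also, for a production environment it is recommended to deploy three PD nodes, because a single PD node is a single point of failure.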
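
The reasoning behind the three-node recommendation can be shown with simple arithmetic. PD embeds etcd (the PD log in this thread shows it starting an etcd server), and etcd needs a majority of members to be available. The sketch below only illustrates that majority rule; `quorum` and `tolerated_failures` are hypothetical helper names.

```python
def quorum(members):
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can fail while a majority is still available."""
    return members - quorum(members)


for n in (1, 3, 5):
    print(f"{n} PD node(s): quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
```

With one PD node no failure can be tolerated, which is the situation in this cluster. With three nodes the cluster keeps a majority after losing any one of them.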

One moment, I'll upload the PD log. Thanks.
pd_stderr.log (2.4 MB)

This is the pd.log log:
[2021/07/12 18:27:30.189 +08:00] [INFO] [util.go:42] ["Welcome to Placement Driver (PD)"]
[2021/07/12 18:27:30.190 +08:00] [INFO] [util.go:43] [PD] [release-version=v4.0.10]
[2021/07/12 18:27:30.190 +08:00] [INFO] [util.go:44] [PD] [edition=Community]
[2021/07/12 18:27:30.190 +08:00] [INFO] [util.go:45] [PD] [git-hash=560df52710293d9d67bd7b32503de0e53addfa11]
[2021/07/12 18:27:30.190 +08:00] [INFO] [util.go:46] [PD] [git-branch=heads/refs/tags/v4.0.10]
[2021/07/12 18:27:30.190 +08:00] [INFO] [util.go:47] [PD] [utc-build-time="2021-01-15 02:55:27"]
[2021/07/12 18:27:30.190 +08:00] [INFO] [metricutil.go:81] ["disable Prometheus push client"]
[2021/07/12 18:27:30.190 +08:00] [INFO] [server.go:216] ["PD Config"] [config="{"client-urls":"http://0.0.0.0:2379","peer-urls":"http://0.0.0.0:2380","advertise-client-urls":"http://172.29.6.31:2379","advertise-peer-urls":"http://172.29.6.31:2380","name":"pd-172.29.6.31-2379","data-dir":"/data/data/pd-2379","force-new-cluster":false,"enable-grpc-gateway":true,"initial-cluster":"pd-172.29.6.31-2379=http://172.29.6.31:2380","initial-cluster-state":"new","initial-cluster-token":"pd-cluster","join":"","lease":3,"log":{"level":"","format":"text","disable-timestamp":false,"file":{"filename":"/data/soft/pd-2379/log/pd.log","max-size":300,"max-days":0,"max-backups":0},"development":false,"disable-caller":false,"disable-stacktrace":false,"disable-error-verbose":true,"sampling":null},"tso-save-interval":"3s","metric":{"job":"pd-172.29.6.31-2379","address":"","interval":"15s"},"schedule":{"max-snapshot-count":3,"max-pending-peer-count":16,"max-merge-region-size":20,"max-merge-region-keys":200000,"split-merge-interval":"1h0m0s","enable-one-way-merge":"false","enable-cross-table-merge":"false","patrol-region-interval":"100ms","max-store-down-time":"30m0s","leader-schedule-limit":4,"leader-schedule-policy":"count","region-schedule-limit":2048,"replica-schedule-limit":64,"merge-schedule-limit":8,"hot-region-schedule-limit":4,"hot-region-cache-hits-threshold":3,"store-limit":{},"tolerant-size-ratio":0,"low-space-ratio":0.8,"high-space-ratio":0.7,"scheduler-max-waiting-operator":5,"enable-remove-down-replica":"true","enable-replace-offline-replica":"true","enable-make-up-replica":"true","enable-remove-extra-replica":"true","enable-location-replacement":"true","enable-debug-metrics":"false","schedulers-v2":[{"type":"balance-region","args":null,"disable":false,"args-payload":""},{"type":"balance-leader","args":null,"disable":false,"args-payload":""},{"type":"hot-region","args":null,"disable":false,"args-payload":""},{"type":"label","args":null,"disable":false,"args-payload":""}],"schedulers-payload":null,"store-limit-mode":"manual"},"replication":{"max-replicas":3,"location-labels":"","strictly-match-label":"false","enable-placement-rules":"true"},"pd-server":{"use-region-storage":"true","max-gap-reset-ts":"24h0m0s","key-type":"table","runtime-services":"","metric-storage":"","dashboard-address":"auto","trace-region-flow":"true"},"cluster-version":"0.0.0","quota-backend-bytes":"8GiB","auto-compaction-mode":"periodic","auto-compaction-retention-v2":"1h","TickInterval":"500ms","ElectionInterval":"3s","PreVote":true,"security":{"cacert-path":"","cert-path":"","key-path":"","cert-allowed-cn":null},"label-property":null,"WarningMsgs":null,"DisableStrictReconfigCheck":false,"HeartbeatStreamBindInterval":"1m0s","LeaderPriorityCheckInterval":"1m0s","dashboard":{"tidb-cacert-path":"","tidb-cert-path":"","tidb-key-path":"","public-path-prefix":"","internal-proxy":false,"enable-telemetry":true,"enable-experimental":false},"replication-mode":{"replication-mode":"majority","dr-auto-sync":{"label-key":"","primary":"","dr":"","primary-replicas":0,"dr-replicas":0,"wait-store-timeout":"1m0s","wait-sync-timeout":"1m0s"}},"enable-redact-log":false}"]
[2021/07/12 18:27:30.192 +08:00] [INFO] [server.go:189] ["register REST path"] [path=/pd/api/v1]
[2021/07/12 18:27:30.192 +08:00] [INFO] [server.go:189] ["register REST path"] [path=/swagger/]
[2021/07/12 18:27:30.193 +08:00] [INFO] [server.go:189] ["register REST path"] [path=/dashboard/api/]
[2021/07/12 18:27:30.193 +08:00] [INFO] [server.go:189] ["register REST path"] [path=/dashboard/]
[2021/07/12 18:27:30.194 +08:00] [INFO] [systime_mon.go:27] ["start system time monitor"]
[2021/07/12 18:27:30.194 +08:00] [INFO] [etcd.go:117] ["configuring peer listeners"] [listen-peer-urls="[http://0.0.0.0:2380]"]
[2021/07/12 18:27:30.194 +08:00] [INFO] [etcd.go:127] ["configuring client listeners"] [listen-client-urls="[http://0.0.0.0:2379]"]
[2021/07/12 18:27:30.194 +08:00] [INFO] [etcd.go:602] ["pprof is enabled"] [path=/debug/pprof]
[2021/07/12 18:27:30.195 +08:00] [INFO] [etcd.go:299] ["starting an etcd server"] [etcd-version=3.4.3] [git-sha="Not provided (use ./build instead of go build)"] [go-version=go1.13] [go-os=linux] [go-arch=amd64] [max-cpu-set=48] [max-cpu-available=48] [member-initialized=true] [name=pd-172.29.6.31-2379] [data-dir=/data/data/pd-2379] [wal-dir=] [wal-dir-dedicated=] [member-dir=/data/data/pd-2379/member] [force-new-cluster=false] [heartbeat-interval=500ms] [election-timeout=3s] [initial-election-tick-advance=true] [snapshot-count=100000] [snapshot-catchup-entries=5000] [initial-advertise-peer-urls="[http://172.29.6.31:2380]"] [listen-peer-urls="[http://0.0.0.0:2380]"] [advertise-client-urls="[http://172.29.6.31:2379]"] [listen-client-urls="[http://0.0.0.0:2379]"] [listen-metrics-urls="[]"] [cors="[]"] [host-whitelist="[]"] [initial-cluster=] [initial-cluster-state=new] [initial-cluster-token=] [quota-size-bytes=8589934592] [pre-vote=true] [initial-corrupt-check=false] [corrupt-check-time-interval=0s] [auto-compaction-mode=periodic] [auto-compaction-retention=1h0m0s] [auto-compaction-interval=1h0m0s] [discovery-url=] [discovery-proxy=]
[2021/07/12 18:27:45.941 +08:00] [INFO] [util.go:42] ["Welcome to Placement Driver (PD)"]
[2021/07/12 18:27:45.941 +08:00] [INFO] [util.go:43] [PD] [release-version=v4.0.10]
[2021/07/12 18:27:45.941 +08:00] [INFO] [util.go:44] [PD] [edition=Community]
[2021/07/12 18:27:45.941 +08:00] [INFO] [util.go:45] [PD] [git-hash=560df52710293d9d67bd7b32503de0e53addfa11]
[2021/07/12 18:27:45.941 +08:00] [INFO] [util.go:46] [PD] [git-branch=heads/refs/tags/v4.0.10]
[2021/07/12 18:27:45.941 +08:00] [INFO] [util.go:47] [PD] [utc-build-time="2021-01-15 02:55:27"]
[2021/07/12 18:27:45.941 +08:00] [INFO] [metricutil.go:81] ["disable Prometheus push client"]
[2021/07/12 18:27:45.941 +08:00] [INFO] [server.go:216] ["PD Config"] [config="{"client-urls":"http://0.0.0.0:2379","peer-urls":"http://0.0.0.0:2380","advertise-client-urls":"http://172.29.6.31:2379","advertise-peer-urls":"http://172.29.6.31:2380","name":"pd-172.29.6.31-2379","data-dir":"/data/data/pd-2379","force-new-cluster":false,"enable-grpc-gateway":true,"initial-cluster":"pd-172.29.6.31-2379=http://172.29.6.31:2380","initial-cluster-state":"new","initial-cluster-token":"pd-cluster","join":"","lease":3,"log":{"level":"","format":"text","disable-timestamp":false,"file":{"filename":"/data/soft/pd-2379/log/pd.log","max-size":300,"max-days":0,"max-backups":0},"development":false,"disable-caller":false,"disable-stacktrace":false,"disable-error-verbose":true,"sampling":null},"tso-save-interval":"3s","metric":{"job":"pd-172.29.6.31-2379","address":"","interval":"15s"},"schedule":{"max-snapshot-count":3,"max-pending-peer-count":16,"max-merge-region-size":20,"max-merge-region-keys":200000,"split-merge-interval":"1h0m0s","enable-one-way-merge":"false","enable-cross-table-merge":"false","patrol-region-interval":"100ms","max-store-down-time":"30m0s","leader-schedule-limit":4,"leader-schedule-policy":"count","region-schedule-limit":2048,"replica-schedule-limit":64,"merge-schedule-limit":8,"hot-region-schedule-limit":4,"hot-region-cache-hits-threshold":3,"store-limit":{},"tolerant-size-ratio":0,"low-space-ratio":0.8,"high-space-ratio":0.7,"scheduler-max-waiting-operator":5,"enable-remove-down-replica":"true","enable-replace-offline-replica":"true","enable-make-up-replica":"true","enable-remove-extra-replica":"true","enable-location-replacement":"true","enable-debug-metrics":"false","schedulers-v2":[{"type":"balance-region","args":null,"disable":false,"args-payload":""},{"type":"balance-leader","args":null,"disable":false,"args-payload":""},{"type":"hot-region","args":null,"disable":false,"args-payload":""},{"type":"label","args":null,"disable":false,"args-payload":""}],"schedulers-payload":null,"store-limit-mode":"manual"},"replication":{"max-replicas":3,"location-labels":"","strictly-match-label":"false","enable-placement-rules":"true"},"pd-server":{"use-region-storage":"true","max-gap-reset-ts":"24h0m0s","key-type":"table","runtime-services":"","metric-storage":"","dashboard-address":"auto","trace-region-flow":"true"},"cluster-version":"0.0.0","quota-backend-bytes":"8GiB","auto-compaction-mode":"periodic","auto-compaction-retention-v2":"1h","TickInterval":"500ms","ElectionInterval":"3s","PreVote":true,"security":{"cacert-path":"","cert-path":"","key-path":"","cert-allowed-cn":null},"label-property":null,"WarningMsgs":null,"DisableStrictReconfigCheck":false,"HeartbeatStreamBindInterval":"1m0s","LeaderPriorityCheckInterval":"1m0s","dashboard":{"tidb-cacert-path":"","tidb-cert-path":"","tidb-key-path":"","public-path-prefix":"","internal-proxy":false,"enable-telemetry":true,"enable-experimental":false},"replication-mode":{"replication-mode":"majority","dr-auto-sync":{"label-key":"","primary":"","dr":"","primary-replicas":0,"dr-replicas":0,"wait-store-timeout":"1m0s","wait-sync-timeout":"1m0s"}},"enable-redact-log":false}"]
[2021/07/12 18:27:45.943 +08:00] [INFO] [server.go:189] ["register REST path"] [path=/pd/api/v1]
[2021/07/12 18:27:45.943 +08:00] [INFO] [server.go:189] ["register REST path"] [path=/swagger/]
[2021/07/12 18:27:45.944 +08:00] [INFO] [server.go:189] ["register REST path"] [path=/dashboard/api/]
[2021/07/12 18:27:45.944 +08:00] [INFO] [server.go:189] ["register REST path"] [path=/dashboard/]
[2021/07/12 18:27:45.944 +08:00] [INFO] [etcd.go:117] ["configuring peer listeners"] [listen-peer-urls="[http://0.0.0.0:2380]"]
[2021/07/12 18:27:45.944 +08:00] [INFO] [systime_mon.go:27] ["start system time monitor"]
[2021/07/12 18:27:45.944 +08:00] [INFO] [etcd.go:127] ["configuring client listeners"] [listen-client-urls="[http://0.0.0.0:2379]"]
[2021/07/12 18:27:45.944 +08:00] [INFO] [etcd.go:602] ["pprof is enabled"] [path=/debug/pprof]
[2021/07/12 18:27:45.945 +08:00] [INFO] [etcd.go:299] ["starting an etcd server"] [etcd-version=3.4.3] [git-sha="Not provided (use ./build instead of go build)"] [go-version=go1.13] [go-os=linux] [go-arch=amd64] [max-cpu-set=48] [max-cpu-available=48] [member-initialized=true] [name=pd-172.29.6.31-2379] [data-dir=/data/data/pd-2379] [wal-dir=] [wal-dir-dedicated=] [member-dir=/data/data/pd-2379/member] [force-new-cluster=false] [heartbeat-interval=500ms] [election-timeout=3s] [initial-election-tick-advance=true] [snapshot-count=100000] [snapshot-catchup-entries=5000] [initial-advertise-peer-urls="[http://172.29.6.31:2380]"] [listen-peer-urls="[http://0.0.0.0:2380]"] [advertise-client-urls="[http://172.29.6.31:2379]"] [listen-client-urls="[http://0.0.0.0:2379]"] [listen-metrics-urls="[]"] [cors="[]"] [host-whitelist="[]"] [initial-cluster=] [initial-cluster-state=new] [initial-cluster-token=] [quota-size-bytes=8589934592] [pre-vote=true] [initial-corrupt-check=false] [corrupt-check-time-interval=0s] [auto-compaction-mode=periodic] [auto-compaction-retention=1h0m0s] [auto-compaction-interval=1h0m0s] [discovery-url=] [discovery-proxy=]
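
In the pasted pd.log the startup banner appears twice, and each run stops right after the "starting an etcd server" line. A small script can measure the gap between banners to check whether the process keeps restarting. This is a minimal sketch, assuming the timestamp format shown in the log above; `banner_times`, `restart_gaps`, and `SAMPLE_LOG` are hypothetical names, and reading a short gap as a crash-and-restart loop is an interpretation, not something confirmed in this thread.

```python
from datetime import datetime

BANNER = "Welcome to Placement Driver (PD)"

# Two banner lines copied from the pd.log excerpt above, plus one unrelated line.
SAMPLE_LOG = '''\
[2021/07/12 18:27:30.189 +08:00] [INFO] [util.go:42] ["Welcome to Placement Driver (PD)"]
[2021/07/12 18:27:30.194 +08:00] [INFO] [systime_mon.go:27] ["start system time monitor"]
[2021/07/12 18:27:45.941 +08:00] [INFO] [util.go:42] ["Welcome to Placement Driver (PD)"]
'''


def banner_times(log_text):
    """Return the timestamp of every PD startup banner found in log_text."""
    times = []
    for line in log_text.splitlines():
        if BANNER in line and line.startswith("["):
            stamp = line[1:line.index("]")]  # text inside the first [...]
            times.append(datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S.%f %z"))
    return times


def restart_gaps(times):
    """Seconds elapsed between consecutive startup banners."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


print(restart_gaps(banner_times(SAMPLE_LOG)))
```

A gap of only a few seconds between banners, with no later log lines in between, would suggest that PD exits during etcd startup and is started again. The reason for the exit would then have to come from pd_stderr.log, which is attached but not quoted in this thread.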

I tried pd-recover, and it reports the following error.

[root@Tidb-01 soft]# ./pd-recover -endpoints http://172.29.6.31:2379 -cluster-id 6983999671715335558 -alloc-id 10000
{"level":"warn","ts":"2021-07-12T19:51:19.694+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-5522029b-624e-4a43-8d47-f907508ac8e3/172.29.6.31:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.29.6.31:2379: connect: connection refused""}
context deadline exceeded

It has been recovered. Thanks.

OK. I recommend scaling out by two more PD nodes. Three nodes keep the PD cluster highly available.
