We have two PDs and one of them went down; now all the TiKV nodes are down as well and cannot be restarted.
Starting component tikv
Starting instance tikv 172.25.10.71:20160
Starting instance tikv 172.25.10.69:20160
Starting instance tikv 172.25.10.70:20160
retry error: operation timed out after 1m0s
tikv 172.25.10.71:20160 failed to start: timed out waiting for port 20160 to be started after 1m0s, please check the log of the instance
retry error: operation timed out after 1m0s
tikv 172.25.10.70:20160 failed to start: timed out waiting for port 20160 to be started after 1m0s, please check the log of the instance
retry error: operation timed out after 1m0s
tikv 172.25.10.69:20160 failed to start: timed out waiting for port 20160 to be started after 1m0s, please check the log of the instance
Error: failed to start: failed to start tikv: tikv 172.25.10.71:20160 failed to start: timed out waiting for port 20160 to be started after 1m0s, please check the log of the instance: timed out waiting for port 20160 to be started after 1m0s
Verbose debug logs has been written to /data/tools/tiup/logs/tiup-cluster-debug-2020-07-01-16-50-50.log.
Error: run /root/.tiup/components/cluster/v1.0.4/tiup-cluster
(wd:/root/.tiup/data/S3TfSOd) failed: exit status 1
With only two PDs, Raft requires writes to reach a majority of replicas, which for two nodes means both must acknowledge the write. So the whole cluster is affected. If you can repair the other PD and bring it back online, the cluster will work normally again.
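As a quick sanity check of that majority rule (standard Raft/etcd quorum arithmetic, nothing specific to this deployment):

```latex
% quorum for n voting PD members
\mathrm{quorum}(n) = \left\lfloor n/2 \right\rfloor + 1
% n = 2: quorum = 2, so losing one PD means no quorum and PD stops serving
% n = 3: quorum = 2, so a three-PD cluster tolerates one failure
```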
The problem is that we already scaled in (removed) that PD. How can we recover now?
- What command did you run to scale it in?
- Please share the tiup startup failure log /data/tools/tiup/logs/tiup-cluster-debug-2020-07-01-16-50-50.log, plus the tikv.log of 172.25.10.71:20160 and the pd.log. Thanks.
You can send just the logs from after the problem occurred. Thanks.
We scaled in with this command:
tiup cluster scale-in --node 10.0.1.4:9000
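For reference, the scale-in subcommand also takes the cluster name; here is a hedged sketch of the full syntax, with placeholder values rather than anything confirmed in this thread:

```shell
# Remove a node gracefully; <cluster-name> and <host:port> are placeholders.
tiup cluster scale-in <cluster-name> --node <host:port>

# --force skips the graceful offline process and is meant only for nodes that
# are permanently unrecoverable; it can leave stale metadata behind.
tiup cluster scale-in <cluster-name> --node <host:port> --force
```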
2020-07-01T16:50:50.696+0800 DEBUG retry error: operation timed out after 1m0s
2020-07-01T16:50:50.696+0800 ERROR tikv 172.25.10.71:20160 failed to start: timed out waiting for port 20160 to be started after 1m0s, please check the log of the instance
2020-07-01T16:50:50.807+0800 INFO SSHCommand {"host": "172.25.10.70", "port": "22", "cmd": "PATH=$PATH:/usr/bin:/usr/sbin ss -ltn", "stdout": "State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:111 *:*
LISTEN 0 128 [::]:22 [::]:*
LISTEN 0 100 [::1]:25 [::]:*
LISTEN 0 128 [::]:111 [::]:*
", "stderr": ""}
2020-07-01T16:50:50.807+0800 DEBUG retry error: operation timed out after 1m0s
2020-07-01T16:50:50.807+0800 ERROR tikv 172.25.10.70:20160 failed to start: timed out waiting for port 20160 to be started after 1m0s, please check the log of the instance
2020-07-01T16:50:50.846+0800 INFO SSHCommand {"host": "172.25.10.69", "port": "22", "cmd": "PATH=$PATH:/usr/bin:/usr/sbin ss -ltn", "stdout": "State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:111 *:*
LISTEN 0 128 [::]:22 [::]:*
LISTEN 0 100 [::1]:25 [::]:*
LISTEN 0 128 [::]:111 [::]:*
", "stderr": ""}
2020-07-01T16:50:50.846+0800 DEBUG retry error: operation timed out after 1m0s
2020-07-01T16:50:50.846+0800 ERROR tikv 172.25.10.69:20160 failed to start: timed out waiting for port 20160 to be started after 1m0s, please check the log of the instance
2020-07-01T16:50:50.846+0800 DEBUG TaskFinish {"task": "ClusterOperate: operation=StartOperation, options={Roles:[] Nodes:[] Force:false SSHTimeout:5 OptTimeout:60 APITimeout:300}", "error": "failed to start: failed to start tikv: \ttikv 172.25.10.71:20160 failed to start: timed out waiting for port 20160 to be started after 1m0s, please check the log of the instance: timed out waiting for port 20160 to be started after 1m0s", "errorVerbose": "timed out waiting for port 20160 to be started after 1m0s
github.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute
    github.com/pingcap/tiup@/pkg/cluster/module/wait_for.go:90
github.com/pingcap/tiup/pkg/cluster/meta.PortStarted
    github.com/pingcap/tiup@/pkg/cluster/meta/logic.go:116
github.com/pingcap/tiup/pkg/cluster/meta.(*instance).Ready
    github.com/pingcap/tiup@/pkg/cluster/meta/logic.go:146
github.com/pingcap/tiup/pkg/cluster/operation.startInstance
    github.com/pingcap/tiup@/pkg/cluster/operation/action.go:468
github.com/pingcap/tiup/pkg/cluster/operation.StartComponent.func1
    github.com/pingcap/tiup@/pkg/cluster/operation/action.go:504
golang.org/x/sync/errgroup.(*Group).Go.func1
    golang.org/x/sync@v0.0.0-20190911185100-cd5d95a43a6e/errgroup/errgroup.go:57
runtime.goexit
    runtime/asm_amd64.s:1357
tikv 172.25.10.71:20160 failed to start: timed out waiting for port 20160 to be started after 1m0s, please check the log of the instance
failed to start tikv
failed to start"}
2020-07-01T16:50:50.846+0800 INFO Execute command finished {"code": 1, "error": "failed to start: failed to start tikv: \ttikv 172.25.10.71:20160 failed to start: timed out waiting for port 20160 to be started after 1m0s, please check the log of the instance: timed out waiting for port 20160 to be started after 1m0s", "errorVerbose": "timed out waiting for port 20160 to be started after 1m0s
github.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute
    github.com/pingcap/tiup@/pkg/cluster/module/wait_for.go:90
github.com/pingcap/tiup/pkg/cluster/meta.PortStarted
    github.com/pingcap/tiup@/pkg/cluster/meta/logic.go:116
github.com/pingcap/tiup/pkg/cluster/meta.(*instance).Ready
    github.com/pingcap/tiup@/pkg/cluster/meta/logic.go:146
github.com/pingcap/tiup/pkg/cluster/operation.startInstance
    github.com/pingcap/tiup@/pkg/cluster/operation/action.go:468
github.com/pingcap/tiup/pkg/cluster/operation.StartComponent.func1
    github.com/pingcap/tiup@/pkg/cluster/operation/action.go:504
golang.org/x/sync/errgroup.(*Group).Go.func1
    golang.org/x/sync@v0.0.0-20190911185100-cd5d95a43a6e/errgroup/errgroup.go:57
runtime.goexit
    runtime/asm_amd64.s:1357
tikv 172.25.10.71:20160 failed to start: timed out waiting for port 20160 to be started after 1m0s, please check the log of the instance
failed to start tikv
failed to start"}
[2020/07/01 17:00:00.009 +08:00] [INFO] [util.rs:398] ["connecting to PD endpoint"] [endpoints=172.25.10.68:2379]
[2020/07/01 17:00:01.013 +08:00] [INFO] [] ["New connected subchannel at 0x7f157280d480 for subchannel 0x7f157d219d80"]
[2020/07/01 17:00:02.010 +08:00] [INFO] [util.rs:358] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { status: 4-DEADLINE_EXCEEDED, details: Some("Deadline Exceeded") }))"] [endpoints=172.25.10.68:2379]
[2020/07/01 17:00:02.310 +08:00] [INFO] [util.rs:398] ["connecting to PD endpoint"] [endpoints=172.25.10.68:2379]
[2020/07/01 17:00:02.311 +08:00] [INFO] [] ["New connected subchannel at 0x7f157280d540 for subchannel 0x7f157d219d80"]
[2020/07/01 17:00:04.311 +08:00] [INFO] [util.rs:358] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { status: 4-DEADLINE_EXCEEDED, details: Some("Deadline Exceeded") }))"] [endpoints=172.25.10.68:2379]
[2020/07/01 17:00:04.311 +08:00] [WARN] [client.rs:56] ["validate PD endpoints failed"] [err="Other("[components/pd_client/src/util.rs:389]: PD cluster failed to respond")"]
[2020/07/01 17:00:04.611 +08:00] [INFO] [util.rs:398] ["connecting to PD endpoint"] [endpoints=172.25.10.68:2379]
[2020/07/01 17:00:04.612 +08:00] [INFO] [] ["New connected subchannel at 0x7f157280d600 for subchannel 0x7f157d219d80"]
[2020/07/01 17:00:06.612 +08:00] [INFO] [util.rs:358] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { status: 4-DEADLINE_EXCEEDED, details: Some("Deadline Exceeded") }))"] [endpoints=172.25.10.68:2379]
[2020/07/01 17:00:06.912 +08:00] [INFO] [util.rs:398] ["connecting to PD endpoint"] [endpoints=172.25.10.68:2379]
[2020/07/01 17:00:06.913 +08:00] [INFO] [] ["New connected subchannel at 0x7f157280d6c0 for subchannel 0x7f157d219d80"]
[2020/07/01 17:00:08.913 +08:00] [INFO] [util.rs:358] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { status: 4-DEADLINE_EXCEEDED, details: Some("Deadline Exceeded") }))"] [endpoints=172.25.10.68:2379]
[2020/07/01 17:00:09.213 +08:00] [INFO] [util.rs:398] ["connecting to PD endpoint"] [endpoints=172.25.10.68:2379]
[2020/07/01 17:00:09.214 +08:00] [INFO] [] ["New connected subchannel at 0x7f157280d780 for subchannel 0x7f157d219d80"]
[2020/07/01 17:00:11.214 +08:00] [INFO] [util.rs:358] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { status: 4-DEADLINE_EXCEEDED, details: Some("Deadline Exceeded") }))"] [endpoints=172.25.10.68:2379]
[2020/07/01 17:01:23.145 +08:00] [INFO] [capability.go:76] ["enabled capabilities for version"] [cluster-version=3.4]
[2020/07/01 17:01:23.145 +08:00] [INFO] [cluster.go:256] ["recovered/added member from store"] [cluster-id=ac2c18eac0909447] [local-member-id=8d2218498c0e31e0] [recovered-remote-peer-id=8d2218498c0e31e0] [recovered-remote-peer-urls="[http://172.25.10.68:2380]"]
[2020/07/01 17:01:23.145 +08:00] [INFO] [cluster.go:256] ["recovered/added member from store"] [cluster-id=ac2c18eac0909447] [local-member-id=8d2218498c0e31e0] [recovered-remote-peer-id=f8f3343278fa576e] [recovered-remote-peer-urls="[http://172.25.48.124:2380]"]
[2020/07/01 17:01:23.145 +08:00] [INFO] [cluster.go:269] ["set cluster version from store"] [cluster-version=3.4]
[2020/07/01 17:01:23.319 +08:00] [INFO] [kvstore.go:378] ["restored last compact revision"] [meta-bucket-name=meta] [meta-bucket-name-key=finishedCompactRev] [restored-compact-revision=971233]
[2020/07/01 17:01:23.323 +08:00] [WARN] [store.go:1317] ["simple token is not cryptographically signed"]
[2020/07/01 17:01:23.474 +08:00] [INFO] [quota.go:126] ["enabled backend quota"] [quota-name=v3-applier] [quota-size-bytes=8589934592] [quota-size="8.6 GB"]
[2020/07/01 17:01:23.680 +08:00] [INFO] [peer.go:128] ["starting remote peer"] [remote-peer-id=f8f3343278fa576e]
[2020/07/01 17:01:23.680 +08:00] [INFO] [pipeline.go:71] ["started HTTP pipelining with remote peer"] [local-member-id=8d2218498c0e31e0] [remote-peer-id=f8f3343278fa576e]
[2020/07/01 17:01:23.680 +08:00] [INFO] [stream.go:166] ["started stream writer with remote peer"] [local-member-id=8d2218498c0e31e0] [remote-peer-id=f8f3343278fa576e]
[2020/07/01 17:01:23.680 +08:00] [INFO] [stream.go:166] ["started stream writer with remote peer"] [local-member-id=8d2218498c0e31e0] [remote-peer-id=f8f3343278fa576e]
[2020/07/01 17:01:23.681 +08:00] [INFO] [peer.go:134] ["started remote peer"] [remote-peer-id=f8f3343278fa576e]
[2020/07/01 17:01:23.681 +08:00] [INFO] [transport.go:327] ["added remote peer"] [local-member-id=8d2218498c0e31e0] [remote-peer-id=f8f3343278fa576e] [remote-peer-urls="[http://172.25.48.124:2380]"]
[2020/07/01 17:01:23.681 +08:00] [INFO] [stream.go:406] ["started stream reader with remote peer"] [stream-reader-type="stream MsgApp v2"] [local-member-id=8d2218498c0e31e0] [remote-peer-id=f8f3343278fa576e]
[2020/07/01 17:01:23.681 +08:00] [INFO] [stream.go:406] ["started stream reader with remote peer"] [stream-reader-type="stream Message"] [local-member-id=8d2218498c0e31e0] [remote-peer-id=f8f3343278fa576e]
[2020/07/01 17:01:23.681 +08:00] [INFO] [server.go:779] ["starting etcd server"] [local-member-id=8d2218498c0e31e0] [local-server-version=3.4.3] [cluster-id=ac2c18eac0909447] [cluster-version=3.4]
[2020/07/01 17:01:23.682 +08:00] [INFO] [server.go:680] ["starting initial election tick advance"] [election-ticks=6]
[2020/07/01 17:01:23.683 +08:00] [INFO] [etcd.go:241] ["now serving peer/client/metrics"] [local-member-id=8d2218498c0e31e0] [initial-advertise-peer-urls="[http://172.25.10.68:2380]"] [listen-peer-urls="[http://172.25.10.68:2380]"] [advertise-client-urls="[http://172.25.10.68:2379]"] [listen-client-urls="[http://172.25.10.68:2379]"] [listen-metrics-urls="[]"]
[2020/07/01 17:01:23.683 +08:00] [INFO] [etcd.go:576] ["serving peer traffic"] [address=172.25.10.68:2380]
Is there a group chat for discussing this? Our company has just upgraded to it.
- Run `tiup cluster display <cluster-name>` and share the topology information.
- Run `tiup ctl pd -u <pd-ip>:<pd-port> -i` and, at the prompt, execute `member`, `store`, and `config show all`, then share the output (both steps are sketched after this list). Thanks.
- Please also share the logs covering the period from before the problem until now; what has been posted so far is a bit limited, and uploading them as text files would be best. Thanks.
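Putting those commands together (the PD address and cluster name are placeholders, to be replaced with your own values):

```shell
# Topology overview; <cluster-name> is a placeholder for your cluster's name.
tiup cluster display <cluster-name>

# Interactive pd-ctl session against a PD endpoint (placeholder address);
# run the three commands below at its prompt.
tiup ctl pd -u http://<pd-ip>:2379 -i
#   member            -- PD members and current leader
#   store             -- status of every TiKV store
#   config show all   -- full PD configuration
```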
[root@jiankongserver tiup]# tiup cluster display tidb-produce
Starting component `cluster`: /root/.tiup/components/cluster/v1.0.4/tiup-cluster display tidb-produce
TiDB Cluster: tidb-produce
TiDB Version: v4.0.0
ID                  Role          Host          Ports                            OS/Arch       Status    Data Dir                      Deploy Dir
--                  ----          ----          -----                            -------       ------    --------                      ----------
172.25.10.72:9093   alertmanager  172.25.10.72  9093/9094                        linux/x86_64  inactive  /data/tidb/alertmanager-9093  /data/tools/tiup/alertmanager-9093
172.25.10.72:3000   grafana       172.25.10.72  3000                             linux/x86_64  inactive  -                             /data/tools/tiup/grafana-3000
172.25.10.68:2379   pd            172.25.10.68  2379/2380                        linux/x86_64  Down      /data/tidb/pd-2379            /data/tools/tiup/pd-2379
172.25.10.72:9090   prometheus    172.25.10.72  9090                             linux/x86_64  inactive  /data/tidb/prometheus-9090    /data/tools/tiup/prometheus-9090
172.25.10.67:8250   pump          172.25.10.67  8250                             linux/x86_64  Down      /data/tidb-data/pump-8249     /data/tidb-deploy/pump-8249
172.25.10.67:4000   tidb          172.25.10.67  4000/10080                       linux/x86_64  Down      -                             /data/tools/tiup/tidb-4000
172.25.10.72:9000   tiflash       172.25.10.72  9000/8123/3930/20170/20292/8234  linux/x86_64  Down      /data/tidb/tiflash-9000       /data/tools/tiup/tiflash-9000
172.25.10.69:20160  tikv          172.25.10.69  20160/20180                      linux/x86_64  Down      /data/tidb/tikv-20160         /data/tools/tiup/tikv-20160
172.25.10.70:20160  tikv          172.25.10.70  20160/20180                      linux/x86_64  Down      /data/tidb/tikv-20160         /data/tools/tiup/tikv-20160
172.25.10.71:20160  tikv          172.25.10.71  20160/20180                      linux/x86_64  Down      /data/tidb/tikv-20160         /data/tools/tiup/tikv-20160
Please check your private messages.
The root cause was that the server hosting one PD lost power, after which non-standard steps were taken and the configuration files were modified. Once power was restored to that PD server, the PD was added back and the cluster recovered.
I ran into the same problem. How exactly did you solve it?
After power was restored to the PD server, I manually added the PD node back, but TiKV reported a timeout error when starting.
When starting PD on its own, the newly added PD node's log reports: [FATAL] [storage.go:88] ["failed to read WAL, cannot be repaired"] [error="proto: illegal wireType 6"]
Judging from this error, a file was corrupted during the power loss. How many PDs do you have in total? If the remaining PDs can still form a majority and keep serving, we suggest removing the PD with corrupted data from the cluster, wiping its data, and then adding it back, i.e. scale in and then scale out (sketched below).
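A hedged sketch of that scale-in / scale-out flow with tiup; all names, addresses, and the topology file are placeholders, and this only applies while the remaining PDs still hold a majority:

```shell
# 1. Remove the PD whose local data is corrupted (placeholder address).
tiup cluster scale-in <cluster-name> --node <bad-pd-host>:2379

# 2. Describe the same host again in a minimal scale-out topology file,
#    e.g. pd-scale-out.yaml (hypothetical file name):
#      pd_servers:
#        - host: <bad-pd-host>

# 3. Add it back as a fresh, empty PD member.
tiup cluster scale-out <cluster-name> pd-scale-out.yaml
```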
There are only 2 PDs. After force-removing the broken one, starting the cluster with tiup reports the TiKV timeout. After power came back to the broken PD's server, I manually added the PD back; starting the cluster still reports the TiKV timeout, and starting PD on its own reports the error above. In this situation, is reinstalling the only option? Is there any way to export the data stored in TiKV?
With only two PDs and one of them broken, the cluster indeed cannot serve. You can use the pd-recover tool to build a new set of PDs (three nodes are recommended). It's best to back up your data before doing so. The tool's documentation is here: https://docs.pingcap.com/zh/tidb/stable/pd-recover (a rough sketch of the flow is below).
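A rough sketch of the pd-recover flow described in that doc; every ID, address, and path below is a placeholder to be looked up in your own environment:

```shell
# 1. Find the old cluster ID in an existing PD (or TiKV) log; the log path here
#    is an assumption -- use your actual deploy directory.
grep "init cluster id" <pd-deploy-dir>/log/pd.log

# 2. Deploy a brand-new PD cluster (three nodes recommended), then rebuild its
#    metadata with the old cluster ID and an alloc-id comfortably larger than
#    any ID the old cluster could have allocated.
pd-recover -endpoints http://<new-pd-ip>:2379 \
  -cluster-id <old-cluster-id> \
  -alloc-id <safely-large-id>

# 3. Restart PD, then bring the rest of the cluster back up.
tiup cluster restart <cluster-name>
```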
OK, thanks!