TiDB 4.0.0 deployed with TiUP: PD fails to start

The PD log file is empty. This is a new cluster. I deployed all three PD nodes on one machine, each on a different port.

Hello,

Please upload the topology file, the debug log, and the PD log file that contains the error.

Where is the debug log? There is no error log.

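The debug log is the log that tiup writes when an operation fails. The error output of the failed command shows where the .log file is located.

The PD log is in deploy dir/log, the same as with other deployment methods. The logs of the other nodes are in the corresponding location.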

DEBUG log:
2020-06-09T17:27:03.517+0800 INFO Starting cluster tidb-ceph…
2020-06-09T17:27:03.518+0800 INFO + [ Serial ] - SSHKeySet: privateKey=/home/mysql/.tiup/storage/cluster/clusters/tidb-ceph/ssh/id_rsa, publicKey=/home/mysql/.tiup/storage/cluster/clusters/tidb-ceph/ssh/id_rsa.pub
2020-06-09T17:27:03.518+0800 DEBUG TaskBegin {"task": "SSHKeySet: privateKey=/home/mysql/.tiup/storage/cluster/clusters/tidb-ceph/ssh/id_rsa, publicKey=/home/mysql/.tiup/storage/cluster/clusters/tidb-ceph/ssh/id_rsa.pub"}
2020-06-09T17:27:03.518+0800 DEBUG TaskFinish {"task": "SSHKeySet: privateKey=/home/mysql/.tiup/storage/cluster/clusters/tidb-ceph/ssh/id_rsa, publicKey=/home/mysql/.tiup/storage/cluster/clusters/tidb-ceph/ssh/id_rsa.pub"}
2020-06-09T17:27:03.518+0800 DEBUG TaskBegin {"task": "UserSSH: user=tidb, host=xxx.xxx.xx.52\nUserSSH: user=tidb, host=xxx.xx.xx.55\nUserSSH: user=tidb, host=xxx.xx.xx.12\nUserSSH: user=tidb, host=xxx.xxx.xx.41\nUserSSH: user=tidb, host=xxx.xxx.xx.47\nUserSSH: user=tidb, host=xxx.xxx.xx.38\nUserSSH: user=tidb, host=xxx.xxx.xx1.59\nUserSSH: user=tidb, host=xxx.xxx.xx.52\nUserSSH: user=tidb, host=xxx.xxx.xx.52\nUserSSH: user=tidb, host=xxx.xxx.xx.52"}
2020-06-09T17:27:03.518+0800 INFO + [Parallel] - UserSSH: user=tidb, host=xxx.xxx.xx.52
2020-06-09T17:27:03.518+0800 INFO + [Parallel] - UserSSH: user=tidb, host=xxx.xxx.xx.52
2020-06-09T17:27:03.518+0800 DEBUG TaskBegin {"task": "UserSSH: user=tidb, host=xxx.xxx.xx.52"}
2020-06-09T17:27:03.518+0800 INFO + [Parallel] - UserSSH: user=tidb, host=xxx.xxx.xx1.55
2020-06-09T17:27:03.518+0800 DEBUG TaskFinish {"task": "UserSSH: user=tidb, host=xxx.xxx.xx.52"}
2020-06-09T17:27:03.518+0800 DEBUG TaskBegin {"task": "UserSSH: user=tidb, host=xxx.xxx.xx.52"}
2020-06-09T17:27:03.518+0800 DEBUG TaskBegin {"task": "UserSSH: user=tidb, host=xxx.xxx.xx1.55"}
2020-06-09T17:27:03.518+0800 DEBUG TaskFinish {"task": "UserSSH: user=tidb, host=xxx.xxx.xx.52"}
2020-06-09T17:27:03.518+0800 DEBUG TaskFinish {"task": "UserSSH: user=tidb, host=xxx.xxx.xx1.55"}
2020-06-09T17:27:03.518+0800 INFO + [Parallel] - UserSSH: user=tidb, host=xxx.xxx.xx.47
2020-06-09T17:27:03.518+0800 INFO + [Parallel] - UserSSH: user=tidb, host=xxx.xxx.xx.41
…skipping… the instance
2020-06-09T17:28:04.728+0800 INFO SSHCommand {"host": "xxx.xxx.xx1.55", "port": "22", "cmd": "PATH=$PATH:/usr/bin:/usr/sbin ss -ltn", "stdout": "State Recv-Q Send-Q Local Address:Port Peer Address:Port \nLISTEN 0 100 :9422 : \nLISTEN 0 128 127.0.0.1:1999 : \nLISTEN 0 128 xxx.xxx.xx1.55:1999 : \nLISTEN 0 128 :22 : \nLISTEN 0 128 127.0.0.1:8600 : \nLISTEN 0 100 127.0.0.1:59996 : \nLISTEN 0 128 :15998 : \nLISTEN 0 128 :15999 : \nLISTEN 0 100 127.0.0.1:41855 : \nLISTEN 0 128 127.0.0.1:1991 : \nLISTEN 0 128 :::23211 ::: \nLISTEN 0 128 :::6604 ::: \nLISTEN 0 128 :::31949 ::: \nLISTEN 0 128 :::29741 ::: \nLISTEN 0 128 :::24877 :::* \nLISTEN 0 128 :::8301 :::* \nLISTEN 0 128 :::25198 :::* \nLISTEN 0 128 :::20270 :::* \nLISTEN 0 128 :::24559 :::* \nLISTEN 0 128 :::23663 :::* \nLISTEN 0 128 :::20240 :::* \nLISTEN 0 128 :::6608 :::* \nLISTEN 0 128 :::26801 :::* \nLISTEN 0 128 :::28659 :::* \nLISTEN 0 128 :::8500 :::* \nLISTEN 0 128 :::25846 :::* \nLISTEN 0 128 :::22 :::* \nLISTEN 0 128 :::21496 :::* \nLISTEN 51 50 :::6620 :::* \nLISTEN 0 128 :::29053 :::* \nLISTEN 0 128 :::26240 :::* \nLISTEN 0 128 :::25152 :::* \nLISTEN 0 128 :::25185 :::* \nLISTEN 0 128 :::20930 :::* \nLISTEN 0 128 :::22851 :::* \nLISTEN 0 128 :::29028 :::* \nLISTEN 0 128 :::30916 :::* \nLISTEN 0 128 :::27909 :::* \nLISTEN 0 128 :::4646 :::* \nLISTEN 0 128 :::31783 :::* \nLISTEN 0 128 :::25577 :::* \nLISTEN 0 128 :::29865 :::* \nLISTEN 0 128 :::21481 :::* \n", "stderr": ""}
2020-06-09T17:28:04.728+0800 DEBUG retry error: operation timed out after 1m0s
2020-06-09T17:28:04.728+0800 ERROR pd xxx.xxx.xx1.55:2379 failed to start: timed out waiting for port 2379 to be started after 1m0s, please check the log of the instance
2020-06-09T17:28:04.728+0800 DEBUG TaskFinish {"task": "ClusterOperate: operation=StartOperation, options={Roles:[] Nodes:[] Force:false SSHTimeout:5 OptTimeout:60 APITimeout:300}", "error": "failed to start: failed to start pd: \tpd xxx.xxx.xx.52:2379 failed to start: timed out waiting for port 2379 to be started after 1m0s, please check the log of the instance: timed out waiting for port 2379 to be started after 1m0s", "errorVerbose": "timed out waiting for port 2379 to be started after 1m0s\ngithub.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute\n\tgithub.com/pingcap/tiup@/pkg/cluster/module/wait_for.go:90\ngithub.com/pingcap/tiup/pkg/cluster/meta.PortStarted\n\tgithub.com/pingcap/tiup@/pkg/cluster/meta/logic.go:116\ngithub.com/pingcap/tiup/pkg/cluster/meta.(*instance).Ready\n\tgithub.com/pingcap/tiup@/pkg/cluster/meta/logic.go:146\ngithub.com/pingcap/tiup/pkg/cluster/operation.startInstance\n\tgithub.com/pingcap/tiup@/pkg/cluster/operation/action.go:468\ngithub.com/pingcap/tiup/pkg/cluster/operation.StartComponent.func1\n\tgithub.com/pingcap/tiup@/pkg/cluster/operation/action.go:504\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20190911185100-cd5d95a43a6e/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357\n\tpd xxx.xxx.xx.52:2379 failed to start: timed out waiting for port 2379 to be started after 1m0s, please check the log of the instance\nfailed to start pd\nfailed to start"}
2020-06-09T17:28:04.728+0800 INFO Execute command finished {"code": 1, "error": "failed to start: failed to start pd: \tpd xxx.xxx.xx.52:2379 failed to start: timed out waiting for port 2379 to be started after 1m0s, please check the log of the instance: timed out waiting for port 2379 to be started after 1m0s", "errorVerbose": "timed out waiting for port 2379 to be started after 1m0s\ngithub.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute\n\tgithub.com/pingcap/tiup@/pkg/cluster/module/wait_for.go:90\ngithub.com/pingcap/tiup/pkg/cluster/meta.PortStarted\n\tgithub.com/pingcap/tiup@/pkg/cluster/meta/logic.go:116\ngithub.com/pingcap/tiup/pkg/cluster/meta.(*instance).Ready\n\tgithub.com/pingcap/tiup@/pkg/cluster/meta/logic.go:146\ngithub.com/pingcap/tiup/pkg/cluster/operation.startInstance\n\tgithub.com/pingcap/tiup@/pkg/cluster/operation/action.go:468\ngithub.com/pingcap/tiup/pkg/cluster/operation.StartComponent.func1\n\tgithub.com/pingcap/tiup@/pkg/cluster/operation/action.go:504\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20190911185100-cd5d95a43a6e/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357\n\tpd xxx.xxx.xx.52:2379 failed to start: timed out waiting for port 2379 to be started after 1m0s, please check the log of the instance\nfailed to start pd\nfailed to start"}

Please upload the three logs as attachments.

They are all on the server, so downloading them is inconvenient. topology.yaml (2.6 KB) debug.log (9.0 KB)

After deploying three PD instances on the same server failed, I deployed PD on three separate servers, and it still failed.

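Please upload pd.log, which is in deploy dir / log. The debug log shows no obvious error. In the topology file, the global deploy_dir and data_dir differ from the locations where the nodes are actually deployed, but that does not affect startup.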

There is no log.

Jun 9 19:03:21 bjfk-staging-ls418 run_pd.sh: [2020/06/09 19:03:21.863 +08:00] [FATAL] [main.go:56] ["parse cmd flags error"] [error="log directory shouldn't be the subdirectory of data directory"] [errorVerbose="log directory shouldn't be the subdirectory of data directory\ngithub.com/pingcap/pd/v4/server/config.(*Config).Validate\n\t/home/jenkins/agent/workspace/build_pd_multi_branch_v4.0.0/go/src/github.com/pingcap/pd/server/config/config.go:330\ngithub.com/pingcap/pd/v4/server/config.(*Config).Adjust\n\t/home/jenkins/agent/workspace/build_pd_multi_branch_v4.0.0/go/src/github.com/pingcap/pd/server/config/config.go:396\ngithub.com/pingcap/pd/v4/server/config.(*Config).Parse\n\t/home/jenkins/agent/workspace/build_pd_multi_branch_v4.0.0/go/src/github.com/pingcap/pd/server/config/config.go:308\nmain.main\n\t/home/jenkins/agent/workspace/build_pd_multi_branch_v4.0.0/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:42\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"] [stack="github.com/pingcap/log.Fatal\n\t/home/jenkins/agent/workspace/build_pd_multi_branch_v4.0.0/go/pkg/mod/github.com/pingcap/log@v0.0.0-20200117041106-d28c14d3b1cd/global.go:59\nmain.main\n\t/home/jenkins/agent/workspace/build_pd_multi_branch_v4.0.0/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:56\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"]
Jun 9 19:03:21 bjfk-staging-ls418 systemd: pd-2379.service: main process exited, code=exited, status=1/FAILURE

It looks like a problem with the log and data directories.

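The message means that the log directory must not be set to a subdirectory of the data directory. Please reconfigure the directory locations. Thanks. You can refer to this configuration file: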

https://github.com/pingcap/docs-cn/blob/release-4.0/config-templates/complex-mini.yaml
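To catch this before starting the cluster, the directory rule can be checked locally. The following is a minimal Python sketch, not TiUP or PD code: the hosts and paths are made up for illustration, the dictionaries only mimic the host, client_port, data_dir and log_dir fields of pd_servers entries in a topology file, and treating identical log and data directories as a conflict is an assumption of this sketch.

```python
import os


def is_inside(child: str, parent: str) -> bool:
    """Return True if `child` equals `parent` or lies beneath it.

    Both paths are expected to be absolute; os.path.commonpath raises
    ValueError when absolute and relative paths are mixed.
    """
    child = os.path.normpath(child)
    parent = os.path.normpath(parent)
    return os.path.commonpath([child, parent]) == parent


def find_bad_log_dirs(instances):
    """Return "host:port" for every instance whose log_dir is inside data_dir."""
    bad = []
    for inst in instances:
        if is_inside(inst["log_dir"], inst["data_dir"]):
            bad.append(f'{inst["host"]}:{inst["client_port"]}')
    return bad


# Hypothetical PD entries; the addresses and paths are illustrative only.
pd_servers = [
    {"host": "10.0.1.1", "client_port": 2379,
     "data_dir": "/tidb-data/pd-2379",
     "log_dir": "/tidb-data/pd-2379/log"},      # log under data: rejected by PD
    {"host": "10.0.1.1", "client_port": 2381,
     "data_dir": "/tidb-data/pd-2381",
     "log_dir": "/tidb-deploy/pd-2381/log"},    # separate trees: fine
]

print(find_bad_log_dirs(pd_servers))  # ['10.0.1.1:2379']
```

Any entry this reports needs its log_dir moved outside data_dir before the cluster is started again.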

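OK. I suggest moving this check into TiUP. The PD log contains none of this information; I found it in the system messages, which is rather hard to discover.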
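For anyone who hits the same empty PD log, here is a rough Python sketch of pulling the startup failure out of system message lines like the one quoted above. The regular expression is based only on that quoted line, and the function name and sample lines are illustrative. Where the system messages live (for example /var/log/messages) depends on the distribution and is an assumption.

```python
import re

# Matches: [LEVEL] [file.go:NN] ["message"] optionally followed by [error="..."]
ENTRY = re.compile(
    r'\[(?P<level>FATAL|ERROR)\] \[(?P<src>[^\]]+)\] \["(?P<msg>[^"]*)"\]'
    r'(?: \[error="(?P<error>[^"]*)"\])?'
)


def extract_failures(lines, tag="run_pd.sh"):
    """Return (level, message, error) tuples from lines emitted by `tag`."""
    found = []
    for line in lines:
        if tag not in line:
            continue  # skip systemd and unrelated entries
        match = ENTRY.search(line)
        if match:
            found.append(
                (match.group("level"), match.group("msg"), match.group("error"))
            )
    return found


# Shortened copies of the two lines from the thread.
sample = [
    'Jun 9 19:03:21 bjfk-staging-ls418 run_pd.sh: [2020/06/09 19:03:21.863 +08:00] '
    '[FATAL] [main.go:56] ["parse cmd flags error"] '
    '[error="log directory shouldn\'t be the subdirectory of data directory"]',
    'Jun 9 19:03:21 bjfk-staging-ls418 systemd: pd-2379.service: '
    'main process exited, code=exited, status=1/FAILURE',
]

for level, msg, error in extract_failures(sample):
    print(level, msg, error, sep=" | ")
```

In practice the lines would come from the system log file or the service journal rather than from a hard-coded list.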

Thanks for the feedback.