Starting the cluster fails with playbook: start.yml; TASK: wait until the TiKV port is up

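To help us respond efficiently, please provide the following information when asking a question. Clearly described problems are answered first.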

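  • [TiDB version]: 2.1.8
  • [Problem description]: After a power outage the servers were restarted, and I cleared all logs under deploy/log. I then started the cluster with Ansible, and one of the servers reported an error: [192.168.3.130]: Ansible FAILED! => playbook: start.yml; TASK: wait until the TiKV port is up; message: {"changed": false, "elapsed": 300, "msg": "the TiKV port 20160 is not up"}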

I checked port 20160 on the 3.130 server, and it is not in use.
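A port that is "not in use" also has nothing listening on it, which is consistent with TiKV never having started. As a minimal sketch of the kind of check the failing task performs (the function name `port_is_listening` is made up for illustration and is not part of tidb-ansible), one can try to open a TCP connection to the port:

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


# Example call with the host and port from the error message in this thread:
# port_is_listening("192.168.3.130", 20160)
```

If this returns False right after a start attempt, the next place to look is the TiKV process and its logs rather than the network.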

Hello:

       1. Try adding -vvv when starting the cluster and post the output.
       2. Upload the ansible.log file from <deploy>/log on the control machine.

Error log:

2020-03-12 18:24:04,253 p=30667 u=root | META: ran handlers
2020-03-12 18:24:04,253 p=30667 u=root | META: ran handlers
2020-03-12 18:24:04,257 p=30667 u=root | PLAY [pump_servers] ********************
2020-03-12 18:24:04,257 p=30667 u=root | skipping: no hosts matched
2020-03-12 18:24:04,260 p=30667 u=root | PLAY [tidb_servers] ********************
2020-03-12 18:24:04,266 p=30667 u=root | PLAY [grafana_servers] ********************
2020-03-12 18:24:04,269 p=30667 u=root | [WARNING]: Could not create retry file '/home/tidb/tidb-ansible/retry_files/start.retry'. [Errno 13] Permission denied: u'/home/tidb/tidb-ansible/retry_files/start.retry'
2020-03-12 18:24:04,269 p=30667 u=root | PLAY RECAP ********************
2020-03-12 18:24:04,270 p=30667 u=root | 192.168.3.128 : ok=17 changed=0 unreachable=0 failed=0
2020-03-12 18:24:04,270 p=30667 u=root | 192.168.3.129 : ok=17 changed=0 unreachable=0 failed=0
2020-03-12 18:24:04,270 p=30667 u=root | 192.168.3.130 : ok=21 changed=0 unreachable=0 failed=1
2020-03-12 18:24:04,270 p=30667 u=root | localhost : ok=1 changed=0 unreachable=0 failed=0
2020-03-12 18:24:04,270 p=30667 u=root | ERROR MESSAGE SUMMARY ********************
2020-03-12 18:24:04,271 p=30667 u=root | [192.168.3.130]: Ansible FAILED! => playbook: start.yml; TASK: wait until the TiKV port is up; message: {

2020-03-12 18:24:04,271 p=30667 u=root |     "changed": false,
2020-03-12 18:24:04,271 p=30667 u=root |     "elapsed": 300,
2020-03-12 18:24:04,271 p=30667 u=root |     "invocation": {
2020-03-12 18:24:04,271 p=30667 u=root |         "module_args": {
2020-03-12 18:24:04,271 p=30667 u=root |             "active_connection_states": [
2020-03-12 18:24:04,271 p=30667 u=root |                 "ESTABLISHED",
2020-03-12 18:24:04,271 p=30667 u=root |                 "FIN_WAIT1",
2020-03-12 18:24:04,272 p=30667 u=root |                 "FIN_WAIT2",
2020-03-12 18:24:04,272 p=30667 u=root |                 "SYN_RECV",
2020-03-12 18:24:04,272 p=30667 u=root |                 "SYN_SENT",
2020-03-12 18:24:04,272 p=30667 u=root |                 "TIME_WAIT"
2020-03-12 18:24:04,272 p=30667 u=root |             ],
2020-03-12 18:24:04,272 p=30667 u=root |             "connect_timeout": 5,
2020-03-12 18:24:04,272 p=30667 u=root |             "delay": 0,
2020-03-12 18:24:04,272 p=30667 u=root |             "exclude_hosts": null,
2020-03-12 18:24:04,272 p=30667 u=root |             "host": "192.168.3.130",
2020-03-12 18:24:04,272 p=30667 u=root |             "msg": "the TiKV port 20160 is not up",
2020-03-12 18:24:04,273 p=30667 u=root |             "path": null,
2020-03-12 18:24:04,273 p=30667 u=root |             "port": 20160,
2020-03-12 18:24:04,273 p=30667 u=root |             "search_regex": null,
2020-03-12 18:24:04,273 p=30667 u=root |             "sleep": 1,
2020-03-12 18:24:04,273 p=30667 u=root |             "state": "started",
2020-03-12 18:24:04,273 p=30667 u=root |             "timeout": 300
2020-03-12 18:24:04,273 p=30667 u=root |         }
2020-03-12 18:24:04,273 p=30667 u=root |     },
2020-03-12 18:24:04,273 p=30667 u=root |     "msg": "the TiKV port 20160 is not up"
2020-03-12 18:24:04,273 p=30667 u=root | }
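
For a long ansible.log, the PLAY RECAP lines are the quickest way to see which host failed. A rough sketch that pulls the failed or unreachable hosts out of the log text, assuming the recap format shown above (the helper name `failed_hosts` is only illustrative):

```python
import re

# Matches recap lines such as:
#   ... | 192.168.3.130 : ok=21 changed=0 unreachable=0 failed=1
RECAP_RE = re.compile(
    r"(?P<host>\S+)\s*:\s*ok=(?P<ok>\d+)\s+changed=(?P<changed>\d+)\s+"
    r"unreachable=(?P<unreachable>\d+)\s+failed=(?P<failed>\d+)"
)


def failed_hosts(log_text: str) -> list:
    """Return hosts whose recap line reports failed > 0 or unreachable > 0."""
    hosts = []
    for match in RECAP_RE.finditer(log_text):
        broken = int(match.group("failed")) > 0 or int(match.group("unreachable")) > 0
        if broken and match.group("host") not in hosts:
            hosts.append(match.group("host"))
    return hosts
```

Run against the recap above, this would single out 192.168.3.130, which is the machine to log in to next.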

  1. The log shows: 2020-03-12 18:24:04,269 p=30667 u=root | [WARNING]: Could not create retry file '/home/tidb/tidb-ansible/retry_files/start.retry'. [Errno 13] Permission denied: u'/home/tidb/tidb-ansible/retry_files/start.retry'
  2. Are you starting and stopping the cluster as the tidb user or as root? Please try using the tidb user.
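
Mixing root and tidb runs can leave files under the tidb-ansible directory owned by the wrong user. A small sketch for listing such paths, assuming a Unix host (the function name `paths_not_owned_by` is hypothetical):

```python
import os


def paths_not_owned_by(root: str, uid: int) -> list:
    """Return root and every path below it whose owner is not the given uid."""
    mismatched = []
    if os.lstat(root).st_uid != uid:
        mismatched.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_uid != uid:
                mismatched.append(path)
    return mismatched


# Example call, assuming the tidb user exists on the control machine:
# import pwd
# paths_not_owned_by("/home/tidb/tidb-ansible", pwd.getpwnam("tidb").pw_uid)
```

An empty result means ownership is not the problem and the retry-file warning can be treated as a side issue.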

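I deleted /home/tidb/tidb-ansible/retry_files/start.retry and then started the cluster as the tidb user. The start still fails with the same error. The permission warning is gone, and in its place the log shows: to retry, use: --limit @/home/tidb/tidb-ansible/retry_files/start.retry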

2020-03-12 19:38:24,316 p=867 u=tidb | META: ran handlers
2020-03-12 19:38:24,316 p=867 u=tidb | META: ran handlers
2020-03-12 19:38:24,320 p=867 u=tidb | PLAY [pump_servers] ********************
2020-03-12 19:38:24,320 p=867 u=tidb | skipping: no hosts matched
2020-03-12 19:38:24,324 p=867 u=tidb | PLAY [tidb_servers] ********************
2020-03-12 19:38:24,329 p=867 u=tidb | PLAY [grafana_servers] ********************
2020-03-12 19:38:24,333 p=867 u=tidb | to retry, use: --limit @/home/tidb/tidb-ansible/retry_files/start.retry
2020-03-12 19:38:24,333 p=867 u=tidb | PLAY RECAP ********************
2020-03-12 19:38:24,333 p=867 u=tidb | 192.168.3.128 : ok=17 changed=0 unreachable=0 failed=0
2020-03-12 19:38:24,333 p=867 u=tidb | 192.168.3.129 : ok=17 changed=0 unreachable=0 failed=0
2020-03-12 19:38:24,334 p=867 u=tidb | 192.168.3.130 : ok=21 changed=0 unreachable=0 failed=1
2020-03-12 19:38:24,334 p=867 u=tidb | localhost : ok=1 changed=0 unreachable=0 failed=0
2020-03-12 19:38:24,334 p=867 u=tidb | ERROR MESSAGE SUMMARY ********************
2020-03-12 19:38:24,334 p=867 u=tidb | [192.168.3.130]: Ansible FAILED! => playbook: start.yml; TASK: wait until the TiKV port is up; message: {

2020-03-12 19:38:24,334 p=867 u=tidb |     "changed": false,
2020-03-12 19:38:24,334 p=867 u=tidb |     "elapsed": 300,
2020-03-12 19:38:24,334 p=867 u=tidb |     "invocation": {
2020-03-12 19:38:24,334 p=867 u=tidb |         "module_args": {
2020-03-12 19:38:24,334 p=867 u=tidb |             "active_connection_states": [
2020-03-12 19:38:24,335 p=867 u=tidb |                 "ESTABLISHED",
2020-03-12 19:38:24,335 p=867 u=tidb |                 "FIN_WAIT1",
2020-03-12 19:38:24,335 p=867 u=tidb |                 "FIN_WAIT2",
2020-03-12 19:38:24,335 p=867 u=tidb |                 "SYN_RECV",
2020-03-12 19:38:24,335 p=867 u=tidb |                 "SYN_SENT",
2020-03-12 19:38:24,335 p=867 u=tidb |                 "TIME_WAIT"
2020-03-12 19:38:24,335 p=867 u=tidb |             ],
2020-03-12 19:38:24,335 p=867 u=tidb |             "connect_timeout": 5,
2020-03-12 19:38:24,335 p=867 u=tidb |             "delay": 0,
2020-03-12 19:38:24,335 p=867 u=tidb |             "exclude_hosts": null,
2020-03-12 19:38:24,335 p=867 u=tidb |             "host": "192.168.3.130",
2020-03-12 19:38:24,335 p=867 u=tidb |             "msg": "the TiKV port 20160 is not up",
2020-03-12 19:38:24,335 p=867 u=tidb |             "path": null,
2020-03-12 19:38:24,336 p=867 u=tidb |             "port": 20160,
2020-03-12 19:38:24,336 p=867 u=tidb |             "search_regex": null,
2020-03-12 19:38:24,336 p=867 u=tidb |             "sleep": 1,
2020-03-12 19:38:24,336 p=867 u=tidb |             "state": "started",
2020-03-12 19:38:24,336 p=867 u=tidb |             "timeout": 300
2020-03-12 19:38:24,336 p=867 u=tidb |         }
2020-03-12 19:38:24,336 p=867 u=tidb |     },
2020-03-12 19:38:24,336 p=867 u=tidb |     "msg": "the TiKV port 20160 is not up"
2020-03-12 19:38:24,336 p=867 u=tidb | }

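  1. Please upload the ansible.log file from /log on the control machine.
  2. Start the cluster as the tidb user first and see what error it reports. That file had a permission problem earlier, so why did you delete it?
  3. The current message says 192.168.3.130 cannot start. On the 192.168.3.130 server, go to /scripts/ under the TiKV deploy directory, run ./start_tikv.sh as the tidb user, and then upload tikv.log and tikv_stderr.log from the /log directory.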

That file is generated on every start, so deleting it and changing its permissions both end in the same error. I ran ./start_tikv.sh as you suggested. I only see tikv_stderr.log, with the following content:

thread 'main' panicked at 'invalid auto generated configuration file "conf/tikv.toml", err No such file or directory (os error 2)', src/config.rs:1122:17
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::panicking::default_hook::{{closure}}
             at libstd/sys_common/backtrace.rs:71
   2: std::panicking::default_hook
             at libstd/panicking.rs:227
   3: std::panicking::rust_panic_with_hook
             at libstd/panicking.rs:475
   4: std::panicking::continue_panic_fmt
             at libstd/panicking.rs:390
   5: std::panicking::begin_panic_fmt
             at libstd/panicking.rs:345
   6: tikv_server::main
             at /home/jenkins/workspace/release_tidb_2.1-ga/tikv/:8
             at /checkout/src/libcore/option.rs:458
             at src/bin/tikv-server.rs:420
   7: std::rt::lang_start::{{closure}}
             at /checkout/src/libstd/rt.rs:74
   8: main
   9: __libc_start_main
  10:

[The same panic and stack backtrace repeat seven more times in tikv_stderr.log.]
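
When a stderr log is dominated by one panic repeated on every restart, counting the distinct panic messages makes it easier to read. A minimal sketch, assuming the Rust panic line format shown above and accepting either straight or curly quotes as pasted in this thread (`summarize_panics` is an illustrative name):

```python
import re
from collections import Counter

# Captures the panic message and its source location, e.g.
#   thread 'main' panicked at '<message>', src/config.rs:1122:17
PANIC_RE = re.compile(
    r"thread ['‘’]main['‘’] panicked at ['‘’](.*?)['‘’], (\S+:\d+:\d+)"
)


def summarize_panics(log_text: str) -> Counter:
    """Count each distinct (message, location) pair found in the log text."""
    return Counter(PANIC_RE.findall(log_text))


# Example call, with the log path left for the reader to fill in:
# with open("tikv_stderr.log", encoding="utf-8", errors="replace") as fh:
#     for (message, location), count in summarize_panics(fh.read()).items():
#         print(count, location, message)
```

A single entry with a high count, as here, points to one root cause (the missing conf/tikv.toml) rather than several independent failures.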

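  1. According to the error: thread 'main' panicked at 'invalid auto generated configuration file "conf/tikv.toml", err No such file or directory (os error 2)'
  2. Please check whether tikv.toml exists in the /conf/ directory on the TiKV server, and whether the file is readable and writable. You can test this yourself.
  3. How many nodes does your cluster have? Can the other TiKV instances start normally? Is the cluster deployed on Docker, or on multiple servers? Please upload the inventory.ini file, thanks. Could you upload it as an attachment? Files pasted inline tend to come out garbled.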
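
The existence and permission check in item 2 can be sketched as follows. It has to be run as the same user that starts TiKV (tidb in this thread) for the result to be meaningful; the function name `check_config` and the example path are assumptions:

```python
import os


def check_config(path: str) -> dict:
    """Report whether path is an existing file that the current user can read and write."""
    return {
        "exists": os.path.isfile(path),
        "readable": os.access(path, os.R_OK),
        "writable": os.access(path, os.W_OK),
    }


# Example call from the TiKV deploy directory:
# check_config("conf/tikv.toml")
```

All three values being False matches the "No such file or directory (os error 2)" in the panic message.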

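That directory has no tikv.toml file. TiKV starts normally on the other nodes, and those nodes do have this file. Can I copy tikv.toml directly from one of the other nodes to this one?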

  1. You can try copying it from another node and see whether TiKV starts. Was everything else in this directory cleared out as well?

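I copied tikv.toml from another node to this one, and now it reports a different error: thread 'main' panicked at 'invalid auto generated configuration file "/data/deploy/data/last_tikv.toml" I confirmed that last_tikv.toml exists and that the tidb user has permission on it. Attachment: tikv_stderr.log (50.8 KB)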

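  1. The error in the uploaded file reads: thread 'main' panicked at 'invalid auto generated configuration file "/data/deploy/data/last_tikv.toml", err expected an equals, found eof at line 289'
  2. I suggest comparing this file with the one on another node and copying one over from there. It looks as if this directory may have been damaged during the power outage.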

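Thank you very much. I copied this file from another node, changed it to use this machine's port, and TiKV started successfully. On close comparison, the files on the two machines really did not match.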
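
The comparison between the two nodes' files can also be done mechanically. A minimal sketch using the standard `difflib` module, assuming both files have already been read into strings (the function name and the file labels are made up):

```python
import difflib


def config_diff(healthy_text: str, suspect_text: str) -> list:
    """Return unified-diff lines between a healthy node's config and the suspect one."""
    return list(
        difflib.unified_diff(
            healthy_text.splitlines(),
            suspect_text.splitlines(),
            fromfile="healthy-node",
            tofile="failed-node",
            lineterm="",
        )
    )
```

Host-specific settings such as addresses and ports are expected to differ between nodes. Missing sections or a line that stops partway through are the signs of a damaged file.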

OK. Keep an eye on it for a while; the node should be catching up on data at this point.

OK, thank you.

:+1: