TiDB reports a timeout error when starting the service

ERROR MESSAGE SUMMARY ***********************************************************************************************************************************************
[192.168.1.26]: Ansible FAILED! => playbook: start.yml; TASK: wait for TiDB up; message: {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for search string TiDB in 192.168.1.26:10080"}

[192.168.1.27]: Ansible FAILED! => playbook: start.yml; TASK: wait for TiDB up; message: {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for search string TiDB in 192.168.1.27:10080"}

Please provide more complete information:

  1. Cluster version
  2. The operation you ran
  3. Are there any errors in the tidb / pd / tikv logs?

Cluster version: tidb-ansible-2.1.1
Operation: ansible-playbook start.yml
tidb error messages:

2020-01-15 14:26:29,318 p=9598 u=tidb | [192.168.1.26]: Ansible FAILED! => playbook: start.yml; TASK: wait for kafka_exporter up; message: {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for 192.168.1.26:9308"}

2020-01-15 14:26:29,318 p=9598 u=tidb | [192.168.1.21]: Ansible FAILED! => playbook: start.yml; TASK: wait for kafka_exporter up; message: {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for 192.168.1.21:9308"}

2020-01-15 14:26:29,318 p=9598 u=tidb | [192.168.1.27]: Ansible FAILED! => playbook: start.yml; TASK: wait for kafka_exporter up; message: {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for 192.168.1.27:9308"}

[tidb@tidb01 tidb-ansible-2.1.1]$ cat inventory.ini

# TiDB Cluster Part
[tidb_servers]
192.168.1.26
192.168.1.27
192.168.1.21

[tikv_servers]
192.168.1.22
192.168.1.23
192.168.1.24

[pd_servers]
192.168.1.21
192.168.1.26
192.168.1.27

[spark_master]

[spark_slaves]

[lightning_server]

[importer_server]

# Monitoring Part
# prometheus and pushgateway servers
[monitoring_servers]
192.168.1.21

[grafana_servers]
192.168.1.21

# node_exporter and blackbox_exporter servers
[monitored_servers]
192.168.1.21
192.168.1.22
192.168.1.23
192.168.1.26
192.168.1.27
192.168.1.24

[alertmanager_servers]
192.168.1.21

[kafka_exporter_servers]
192.168.1.26
192.168.1.27
192.168.1.21

# Binlog Part
[pump_servers]
pump1 ansible_host=192.168.1.26 deploy_dir=/u01/pump1
pump2 ansible_host=192.168.1.27 deploy_dir=/u01/pump2
pump3 ansible_host=192.168.1.21 deploy_dir=/u01/pump3

[drainer_servers]

# Group variables
[pd_servers:vars]
location_labels = ["zone","rack","host"]

# Global variables
[all:vars]
deploy_dir = /u01

# Connection
# ssh via normal user
ansible_user = tidb

cluster_name = develop-cluster

tidb_version = latest

# process supervision, [systemd, supervise]
process_supervision = systemd

timezone = Asia/Shanghai

enable_firewalld = False
# check NTP service
enable_ntpd = False
set_hostname = False

# binlog trigger
enable_binlog = True

# kafka cluster address for monitoring, example:
# kafka_addrs = "192.168.0.11:9092,192.168.0.12:9092,192.168.0.13:9092"
kafka_addrs = "192.168.1.21:9092,192.168.1.26:9092,192.168.1.27:9092"

# zookeeper address of kafka cluster for monitoring, example:
# zookeeper_addrs = "192.168.0.11:2181,192.168.0.12:2181,192.168.0.13:2181"
zookeeper_addrs = "192.168.1.21:2181,192.168.1.26:2181,192.168.1.27:2181"

# store slow query log into seperate file
enable_slow_query_log = True

# enable TLS authentication in the TiDB cluster
enable_tls = False

# KV mode
deploy_without_tidb = False

# Optional: Set if you already have a alertmanager server.
# Format: alertmanager_host:alertmanager_port
alertmanager_target = ""

grafana_admin_user = "admin"
grafana_admin_password = "admin"

# Collect diagnosis
collect_log_recent_hours = 2

enable_bandwidth_limit = True
# default: 10Mb/s, unit: Kbit/s
collect_bandwidth_limit = 10000

Both ansible-playbook bootstrap.yml and ansible-playbook deploy.yml completed without errors.

With binlog disabled and no Pump configured, deploying and starting the TiDB service both succeed.

Please check whether there is anything abnormal in the tidb logs.

Also, did the error occur when you deployed the cluster first, then enabled binlog and ran start? Did you follow https://pingcap.com/docs-cn/v2.1/reference/tidb-binlog/deploy/ for the procedure? Finally, please confirm the version; you can check it with tidb-server -V.

[tidb@tidb03 bin]$ ./tidb-server -V
Release Version: v4.0.0-alpha-1356-g5c874403f
Git Commit Hash: 5c874403fdf96dd4406e152b1fd39231a53f79b3
Git Branch: master
UTC Build Time: 2020-01-11 06:30:08
GoVersion: go1.13
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false

I redeployed following the official documentation. After the deployment finished, while configuring binlog, once Pump was configured I ran ansible-playbook rolling_update.yml --tags=tidb and got the following errors:

[192.168.1.26]: Ansible FAILED! => playbook: rolling_update.yml; TASK: wait for TiDB up; message: {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for search string TiDB in 192.168.1.26:10080"}
[192.168.1.27]: Ansible FAILED! => playbook: rolling_update.yml; TASK: wait for TiDB up; message: {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for search string TiDB in 192.168.1.27:10080"}

Please provide tidb.log so we can look at the TiDB-side errors; everything given so far is Ansible output.

[2020/01/15 17:04:42.960 +08:00] [INFO] [manager.go:316] ["get owner"] ["owner info"="[bindinfo] /tidb/bindinfo/owner ownerManager 4adc3ad9-77e3-4b9f-b632-a2cbda24a107"] [ownerID=4adc3ad9-77e3-4b9f-b632-a2cbda24a107]
[2020/01/15 17:04:42.960 +08:00] [INFO] [manager.go:316] ["get owner"] ["owner info"="[stats] /tidb/stats/owner ownerManager 4adc3ad9-77e3-4b9f-b632-a2cbda24a107"] [ownerID=4adc3ad9-77e3-4b9f-b632-a2cbda24a107]
[2020/01/15 17:08:02.269 +08:00] [INFO] [gc_worker.go:263] ["[gc worker] starts the whole job"] [uuid=5bea1895b54001a] [safePoint=413945984431423488] [concurrency=3]
[2020/01/15 17:08:02.270 +08:00] [INFO] [gc_worker.go:865] ["[gc worker] start resolve locks"] [uuid=5bea1895b54001a] [safePoint=413945984431423488] [concurrency=3]
[2020/01/15 17:08:02.270 +08:00] [INFO] [range_task.go:113] ["range task started"] [name=resolve-locks-runner] [startKey=] [endKey=] [concurrency=3]
[2020/01/15 17:08:02.283 +08:00] [INFO] [range_task.go:219] ["range task finished"] [name=resolve-locks-runner] [startKey=] [endKey=] ["cost time"=13.430645ms] ["completed regions"=22]
[2020/01/15 17:08:02.283 +08:00] [INFO] [gc_worker.go:886] ["[gc worker] finish resolve locks"] [uuid=5bea1895b54001a] [safePoint=413945984431423488] [regions=22]
[2020/01/15 17:08:13.267 +08:00] [INFO] [signal_posix.go:53] ["got signal to exit"] [signal=terminated]
[2020/01/15 17:08:13.267 +08:00] [INFO] [server.go:583] ["[server] graceful shutdown."]
[2020/01/15 17:08:13.267 +08:00] [ERROR] [http_status.go:303] ["start status/rpc server error"] [error="accept tcp [::]:10080: use of closed network connection"]
[2020/01/15 17:08:13.267 +08:00] [ERROR] [http_status.go:287] ["grpc server error"] [error="mux: listener closed"]
[2020/01/15 17:08:13.267 +08:00] [WARN] [manager.go:281] ["is not the owner"] ["owner info"="[ddl] /tidb/ddl/fg/owner ownerManager 4adc3ad9-77e3-4b9f-b632-a2cbda24a107"]
[2020/01/15 17:08:13.267 +08:00] [INFO] [manager.go:248] ["break campaign loop, context is done"] ["owner info"="[ddl] /tidb/ddl/fg/owner ownerManager 4adc3ad9-77e3-4b9f-b632-a2cbda24a107"]
[2020/01/15 17:08:13.267 +08:00] [ERROR] [http_status.go:298] ["http server error"] [error="http: Server closed"]
[2020/01/15 17:08:13.269 +08:00] [INFO] [ddl_worker.go:114] ["[ddl] DDL worker closed"] [worker="worker 1, tp general"] ["take time"=6.955µs]
[2020/01/15 17:08:13.269 +08:00] [INFO] [ddl_worker.go:114] ["[ddl] DDL worker closed"] [worker="worker 2, tp add index"] ["take time"=3.533µs]
[2020/01/15 17:08:13.269 +08:00] [INFO] [delete_range.go:123] ["[ddl] closing delRange"]
[2020/01/15 17:08:13.269 +08:00] [INFO] [session_pool.go:85] ["[ddl] closing sessionPool"]
[2020/01/15 17:08:13.269 +08:00] [INFO] [ddl.go:514] ["[ddl] DDL closed"] [ID=4adc3ad9-77e3-4b9f-b632-a2cbda24a107] ["take time"=2.02357ms]
[2020/01/15 17:08:13.269 +08:00] [INFO] [ddl.go:423] ["[ddl] stop DDL"] [ID=4adc3ad9-77e3-4b9f-b632-a2cbda24a107]
[2020/01/15 17:08:13.269 +08:00] [INFO] [manager.go:292] ["revoke session"] ["owner info"="[ddl] /tidb/ddl/fg/owner ownerManager 4adc3ad9-77e3-4b9f-b632-a2cbda24a107"] []
[2020/01/15 17:08:13.271 +08:00] [INFO] [manager.go:335] ["watcher is closed, no owner"] ["owner info"="[stats] ownerManager 4adc3ad9-77e3-4b9f-b632-a2cbda24a107 watch owner key /tidb/stats/owner/5eb96fa861fdf841"]
[2020/01/15 17:08:13.271 +08:00] [WARN] [manager.go:281] ["is not the owner"] ["owner info"="[bindinfo] /tidb/bindinfo/owner ownerManager 4adc3ad9-77e3-4b9f-b632-a2cbda24a107"]
[2020/01/15 17:08:13.271 +08:00] [INFO] [manager.go:239] ["etcd session is done, creates a new one"] ["owner info"="[bindinfo] /tidb/bindinfo/owner ownerManager 4adc3ad9-77e3-4b9f-b632-a2cbda24a107"]
[2020/01/15 17:08:13.271 +08:00] [INFO] [manager.go:243] ["break campaign loop, NewSession failed"] ["owner info"="[bindinfo] /tidb/bindinfo/owner ownerManager 4adc3ad9-77e3-4b9f-b632-a2cbda24a107"] [error="context canceled"]
[2020/01/15 17:08:13.271 +08:00] [WARN] [manager.go:281] ["is not the owner"] ["owner info"="[stats] /tidb/stats/owner ownerManager 4adc3ad9-77e3-4b9f-b632-a2cbda24a107"]
[2020/01/15 17:08:13.272 +08:00] [INFO] [manager.go:239] ["etcd session is done, creates a new one"] ["owner info"="[stats] /tidb/stats/owner ownerManager 4adc3ad9-77e3-4b9f-b632-a2cbda24a107"]
[2020/01/15 17:08:13.285 +08:00] [INFO] [domain.go:582] ["domain closed"] ["take time"=18.156247ms]
[2020/01/15 17:08:13.285 +08:00] [WARN] [manager.go:170] ["failed to new session to etcd"] [ownerInfo="[stats] /tidb/stats/owner ownerManager 4adc3ad9-77e3-4b9f-b632-a2cbda24a107"] [error="context canceled"]
[2020/01/15 17:08:13.285 +08:00] [INFO] [gc_worker.go:192] ["[gc worker] quit"] [uuid=5bea1895b54001a]

From the tidb log, the relevant errors are:
[2020/01/15 17:08:13.267 +08:00] [ERROR] [http_status.go:303] ["start status/rpc server error"] [error="accept tcp [::]:10080: use of closed network connection"]
[2020/01/15 17:08:13.267 +08:00] [ERROR] [http_status.go:287] ["grpc server error"] [error="mux: listener closed"]

Check whether the port is already occupied, and check the network.
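As a quick way to check the port before restarting, the sketch below probes whether anything is already listening on the TiDB status port (10080 in this thread). This is only an illustration using the standard library; run it on the TiDB host itself, or point it at the host's address.

```python
import socket

def port_in_use(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        # connect_ex returns 0 on a successful TCP connect,
        # i.e. some process is bound to and accepting on the port.
        return s.connect_ex((host, port)) == 0

# Before (re)starting tidb-server, 10080 should be free:
print(port_in_use("127.0.0.1", 10080))
```

Equivalently, `ss -lntp | grep 10080` on the host shows which process, if any, holds the port.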

I ran into a similar problem on 3.0.5. When the network was interrupted, tidb reported "listener stopped", but an hour and a half later it received a terminated signal. I already know why the network was interrupted; what I want to know now is: is this terminated signal sent by the TiDB cluster itself?

{"log":"[2020/12/28 11:52:21.993 +08:00] [ERROR] [server.go:364] [\"listener stopped, waiting for manual kill.\"] [stack=\"github.com/pingcap/tidb/server.(*Server).Run\n\tgithub.com/pingcap/tidb@/server/server.go:364\nmain.runServer\n\tgithub.com/pingcap/tidb@/tidb-server/main.go:575\nmain.main\n\tgithub.com/pingcap/tidb@/tidb-server/main.go:174\nruntime.main\n\truntime/proc.go:203\"]\n","stream":"stdout","time":"2020-12-28T03:52:21.994197339Z"}
{"log":"[2020/12/28 11:52:41.033 +08:00] [INFO] [signal_posix.go:54] [\"got signal to exit\"] [signal=terminated]\n","stream":"stdout","time":"2020-12-28T03:52:41.033619413Z"}

This signal is a Linux operating-system signal.
It should not be something the TiDB cluster sends itself; rather, the process received a terminated (SIGTERM) signal from the system and shut down in response.
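To illustrate: any process can receive SIGTERM from outside (a `kill <pid>`, systemd stopping the unit, a container runtime, the OOM-handling tooling, etc.), and what TiDB logs as "got signal to exit" in signal_posix.go is just its handler reacting to that delivery. A minimal sketch of the same pattern, not TiDB's actual code:

```python
import os
import signal

received = []

def on_term(signum, frame):
    # TiDB does the equivalent here: log the signal,
    # then begin a graceful shutdown.
    received.append(signal.Signals(signum).name)

signal.signal(signal.SIGTERM, on_term)

# Simulate an external `kill <pid>` by signalling ourselves:
os.kill(os.getpid(), signal.SIGTERM)

print(received)
```

So to find out who sent it, look outside the tidb-server process: systemd/supervise logs, the container runtime, or anything on the host that kills processes.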