TiDB 4.0.6 three-node cluster: importing a 40 GB table aborts before reaching ~10%; after restarting the cluster, TiDB on port 4000 won't come up

While importing a 40 GB table into TiDB with tidb-lightning, at roughly 5% done the TiKV cluster showed as unreachable. I shut down the whole cluster and started it again, but TiDB fails to start:
retry error: operation timed out after 2m0s
tidb 192.168.0.62:4000 failed to start: timed out waiting for port 4000 to be started after 2m0s, please check the log of the instance

Error: failed to start tidb: tidb 192.168.0.163:4000 failed to start: timed out waiting for port 4000 to be started after 2m0s, please check the log of the instance: timed out waiting for port 4000 to be started after 2m0s
Checking the log:
/data/tidb_cluster/tidb/deploy/tidb-4000/log
[root@62 log]# tail -f tidb.log
[2020/10/22 17:12:36.160 +08:00] [INFO] [client_batch.go:314] ["batchRecvLoop re-create streaming fail"] [target=192.168.0.163:20160] [error="context deadline exceeded"]
[2020/10/22 17:12:36.453 +08:00] [INFO] [client_batch.go:314] ["batchRecvLoop re-create streaming fail"] [target=192.168.0.161:20160] [error="context deadline exceeded"]
[2020/10/22 17:12:36.453 +08:00] [INFO] [client_batch.go:314] ["batchRecvLoop re-create streaming fail"] [target=192.168.0.161:20160] [error="context deadline exceeded"]

Please check the status of the TiKV nodes and look for error messages in their logs.
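As a quick way to act on that, a minimal sketch for filtering a TiKV log down to its warning/error lines. The sample data below reuses lines from this thread; in practice, point `LOG` at `/data/tidb_cluster/tikv/deploy/log/tikv.log` on each TiKV host.

```shell
# Sketch: keep only WARN/ERRO/FATAL lines from a TiKV log.
# The heredoc stands in for the real log file so the snippet is runnable.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
[2020/10/23 09:29:25.470 +08:00] [INFO] [server.rs:248] ["TiKV is ready to serve"]
[2020/10/23 09:29:25.472 +08:00] [WARN] [mod.rs:489] ["failed to register addr to pd"] ["status code"=400]
EOF
ERRS=$(grep -E '\[(WARN|ERRO|FATAL)\]' "$LOG")
echo "$ERRS"
rm -f "$LOG"
```

On a real deployment, `tail -n 500` the log first so a multi-hundred-MB file doesn't have to be scanned end to end.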

[tidb@161 ~]$ tiup cluster start luban
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.2.0/tiup-cluster start luban
Starting cluster luban…

  • [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/luban/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/luban/ssh/id_rsa.pub
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.161
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.163
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.163
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.161
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.62
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.161
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.161
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.62
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.62
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.163
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.161
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.161
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.64
  • [ Serial ] - StartCluster
    Starting component pd
    Starting instance pd 192.168.0.163:2379
    Starting instance pd 192.168.0.161:2379
    Starting instance pd 192.168.0.62:2379
    Start pd 192.168.0.161:2379 success
    Start pd 192.168.0.163:2379 success
    Start pd 192.168.0.62:2379 success
    Starting component node_exporter
    Starting instance 192.168.0.161
    Start 192.168.0.161 success
    Starting component blackbox_exporter
    Starting instance 192.168.0.161
    Start 192.168.0.161 success
    Starting component node_exporter
    Starting instance 192.168.0.62
    Start 192.168.0.62 success
    Starting component blackbox_exporter
    Starting instance 192.168.0.62
    Start 192.168.0.62 success
    Starting component node_exporter
    Starting instance 192.168.0.163
    Start 192.168.0.163 success
    Starting component blackbox_exporter
    Starting instance 192.168.0.163
    Start 192.168.0.163 success
    Starting component tikv
    Starting instance tikv 192.168.0.163:20160
    Starting instance tikv 192.168.0.161:20160
    Starting instance tikv 192.168.0.62:20610
    Start tikv 192.168.0.161:20160 success
    Start tikv 192.168.0.163:20160 success
    Start tikv 192.168.0.62:20610 success
    Starting component tidb
    Starting instance tidb 192.168.0.163:4000
    Starting instance tidb 192.168.0.161:4000
    Starting instance tidb 192.168.0.62:4000
    retry error: operation timed out after 2m0s
    tidb 192.168.0.62:4000 failed to start: timed out waiting for port 4000 to be started after 2m0s, please check the log of the instance
    retry error: operation timed out after 2m0s
    tidb 192.168.0.163:4000 failed to start: timed out waiting for port 4000 to be started after 2m0s, please check the log of the instance
    retry error: operation timed out after 2m0s
    tidb 192.168.0.161:4000 failed to start: timed out waiting for port 4000 to be started after 2m0s, please check the log of the instance

Error: failed to start tidb: tidb 192.168.0.62:4000 failed to start: timed out waiting for port 4000 to be started after 2m0s, please check the log of the instance: timed out waiting for port 4000 to be started after 2m0s

Verbose debug logs has been written to /home/tidb/logs/tiup-cluster-debug-2020-10-23-09-27-19.log.
Error: run /home/tidb/.tiup/components/cluster/v1.2.0/tiup-cluster (wd:/home/tidb/.tiup/data/SECRN16) failed: exit status 1
[tidb@161 ~]$

[tidb@161 ~]$ tiup cluster display luban
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.2.0/tiup-cluster display luban
Cluster type: tidb
Cluster name: luban
Cluster version: v4.0.6
SSH type: builtin
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir


192.168.0.161:9093 alertmanager 192.168.0.161 9093/9094 linux/x86_64 inactive /data/tidb_cluster/tidb/data/alertmanager-9093 /data/tidb_cluster/tidb/deploy/alertmanager-9093
192.168.0.161:3000 grafana 192.168.0.161 3000 linux/x86_64 inactive - /data/tidb_cluster/tidb/deploy/grafana-3000
192.168.0.161:2379 pd 192.168.0.161 2379/2380 linux/x86_64 Up|L /data/tidb_cluster/pd/data /data/tidb_cluster/pd/deploy
192.168.0.163:2379 pd 192.168.0.163 2379/2380 linux/x86_64 Up|UI /data/tidb_cluster/pd/data /data/tidb_cluster/pd/deploy
192.168.0.62:2379 pd 192.168.0.62 2379/2380 linux/x86_64 Up /data/tidb_cluster/pd/data /data/tidb_cluster/pd/deploy
192.168.0.161:9090 prometheus 192.168.0.161 9090 linux/x86_64 inactive /data/tidb_cluster/tidb/data/prometheus-8249 /data/tidb_cluster/tidb/deploy/prometheus-8249
192.168.0.161:4000 tidb 192.168.0.161 4000/10080 linux/x86_64 Down - /data/tidb_cluster/tidb/deploy/tidb-4000
192.168.0.163:4000 tidb 192.168.0.163 4000/10080 linux/x86_64 Down - /data/tidb_cluster/tidb/deploy/tidb-4000
192.168.0.62:4000 tidb 192.168.0.62 4000/10080 linux/x86_64 Down - /data/tidb_cluster/tidb/deploy/tidb-4000
192.168.0.64:9000 tiflash 192.168.0.64 9000/8123/3930/20170/20292/8234 linux/x86_64 Down /data/tidb_cluster/tiflash/data /data/tidb_cluster/tiflash/deploy
192.168.0.161:20160 tikv 192.168.0.161 20160/20180 linux/x86_64 Down /data/tidb_cluster/tikv/data /data/tidb_cluster/tikv/deploy
192.168.0.163:20160 tikv 192.168.0.163 20160/20180 linux/x86_64 Down /data/tidb_cluster/tikv/data /data/tidb_cluster/tikv/deploy
192.168.0.62:20610 tikv 192.168.0.62 20610/20810 linux/x86_64 Down /data/tidb_cluster/tikv/data /data/tidb_cluster/tikv/deploy
Total nodes: 13
[tidb@161 ~]$

[tidb@161 log]$ ll -h
total 986M
-rw-r--r-- 1 tidb tidb  59M Oct 23 09:29 tikv.log
-rw-r--r-- 1 tidb tidb 307K Oct 21 15:01 tikv.log.2020-10-21-15:01:51.883167907
-rw-r--r-- 1 tidb tidb  22M Oct 22 15:01 tikv.log.2020-10-22-15:01:52.588154726
-rw-r--r-- 1 tidb tidb 301M Oct 22 21:14 tikv.log.2020-10-22-21:14:34.102780654
-rw-r--r-- 1 tidb tidb 301M Oct 23 02:49 tikv.log.2020-10-23-02:49:05.798967126
-rw-r--r-- 1 tidb tidb 301M Oct 23 08:23 tikv.log.2020-10-23-08:23:57.150792551
-rw-r--r-- 1 tidb tidb    0 Oct 20 15:01 tikv_stderr.log
[tidb@161 log]$ tail -f tikv.log
[2020/10/23 09:29:25.448 +08:00] [INFO] [mod.rs:335] ["starting working thread"] [worker=backup-endpoint]
[2020/10/23 09:29:25.450 +08:00] [INFO] [mod.rs:335] ["starting working thread"] [worker=snap-handler]
[2020/10/23 09:29:25.450 +08:00] [INFO] [server.rs:223] ["listening on addr"] [addr=0.0.0.0:20160]
[2020/10/23 09:29:25.470 +08:00] [INFO] [server.rs:248] ["TiKV is ready to serve"]
[2020/10/23 09:29:25.472 +08:00] [WARN] [mod.rs:489] ["failed to register addr to pd"] [body=Body(Streaming)] ["status code"=400]
[2020/10/23 09:29:25.472 +08:00] [WARN] [mod.rs:489] ["failed to register addr to pd"] [body=Body(Streaming)] ["status code"=400]
[2020/10/23 09:29:25.473 +08:00] [WARN] [mod.rs:489] ["failed to register addr to pd"] [body=Body(Streaming)] ["status code"=400]
[2020/10/23 09:29:25.473 +08:00] [WARN] [mod.rs:489] ["failed to register addr to pd"] [body=Body(Streaming)] ["status code"=400]
[2020/10/23 09:29:25.473 +08:00] [WARN] [mod.rs:489] ["failed to register addr to pd"] [body=Body(Streaming)] ["status code"=400]
[2020/10/23 09:29:25.473 +08:00] [WARN] [mod.rs:499] ["failed to register addr to pd after 5 tries"]

Let me know if this is the information you need.
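Those repeated registration failures point at PD, so it is worth probing PD directly. A hedged sketch using PD's v1 HTTP API (`/pd/api/v1/health` and `/pd/api/v1/members` are real PD endpoints in v4.0) and the three PD endpoints from this topology; run the commented `curl` lines from a host that can actually reach the PD nodes:

```shell
# Build the health-check URLs for the three PD nodes in this cluster.
PD_HOSTS="192.168.0.161 192.168.0.62 192.168.0.163"
URLS=""
for h in $PD_HOSTS; do
  URLS="$URLS http://$h:2379/pd/api/v1/health"
done
echo "$URLS"
# From a machine that can reach PD:
#   for u in $URLS; do curl -s "$u"; echo; done
#   curl -s http://192.168.0.161:2379/pd/api/v1/members
```

If any health probe hangs or returns an error, that PD member (or the network path to it) is the first thing to investigate.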

At about 5% imported, the cluster went down on its own: TiKV showed as offline and the import stopped. After stopping the cluster, it would not start again; the startup errors are above, and the tidb-lightning import log is attached. The cluster is deployed on 4 VMs. [tidb@161 .tiup]$ more toplogy.yaml

# Global variables are applied to all deployments and used as the default value of

# the deployments if a specific deployment value is missing.

global:
  user: "tidb"
  ssh_port: 22
  deploy_dir: "/data/tidb_cluster/tidb/deploy"
  data_dir: "/data/tidb_cluster/tidb/data"

monitored:
  node_exporter_port: 9100
  blackbox_exporter_port: 9115
  deploy_dir: "/data/tidb_cluster/tidb/deploy/monitored-9100"
  data_dir: "/data/tidb_cluster/tidb/data/monitored-9100"
  log_dir: "/data/tidb_cluster/tidb/deploy/monitored-9100/log"

server_configs:
  tidb:
    log.slow-threshold: 300

pd_servers:

- host: 192.168.0.161
  ssh_port: 22
  name: "pd-1"
  client_port: 2379
  peer_port: 2380
  deploy_dir: "/data/tidb_cluster/pd/deploy"
  data_dir: "/data/tidb_cluster/pd/data"
  log_dir: "/data/tidb_cluster/pd/deploy/log"

- host:  192.168.0.62
  ssh_port: 22
  name: "pd-2"
  client_port: 2379
  peer_port: 2380
  deploy_dir: "/data/tidb_cluster/pd/deploy"
  data_dir: "/data/tidb_cluster/pd/data"
  log_dir: "/data/tidb_cluster/pd/deploy/log"

- host:  192.168.0.163
  ssh_port: 22
  name: "pd-3"
  client_port: 2379
  peer_port: 2380
  deploy_dir: "/data/tidb_cluster/pd/deploy"
  data_dir: "/data/tidb_cluster/pd/data"
  log_dir: "/data/tidb_cluster/pd/deploy/log"

tidb_servers:

  - host: 192.168.0.161
    port: 4000
    status_port: 10080
    deploy_dir: "/data/tidb_cluster/tidb/deploy/tidb-4000"
    log_dir: "/data/tidb_cluster/tidb/deploy/tidb-4000/log"

  - host: 192.168.0.62
    port: 4000
    status_port: 10080
    deploy_dir: "/data/tidb_cluster/tidb/deploy/tidb-4000"
    log_dir: "/data/tidb_cluster/tidb/deploy/tidb-4000/log"

  - host: 192.168.0.163
    port: 4000
    status_port: 10080
    deploy_dir: "/data/tidb_cluster/tidb/deploy/tidb-4000"
    log_dir: "/data/tidb_cluster/tidb/deploy/tidb-4000/log"

tikv_servers:

  - host: 192.168.0.161
    port: 20160
    status_port: 20180
    deploy_dir: "/data/tidb_cluster/tikv/deploy"
    data_dir: "/data/tidb_cluster/tikv/data"
    log_dir: "/data/tidb_cluster/tikv/deploy/log"

  - host: 192.168.0.62
    port: 20610
    status_port: 20810
    deploy_dir: "/data/tidb_cluster/tikv/deploy"
    data_dir: "/data/tidb_cluster/tikv/data"
    log_dir: "/data/tidb_cluster/tikv/deploy/log"

  - host: 192.168.0.163
    port: 20160
    status_port: 20180
    deploy_dir: "/data/tidb_cluster/tikv/deploy"
    data_dir: "/data/tidb_cluster/tikv/data"
    log_dir: "/data/tidb_cluster/tikv/deploy/log"

tiflash_servers:

  - host: 192.168.0.64
    deploy_dir: "/data/tidb_cluster/tiflash/deploy"
    data_dir: "/data/tidb_cluster/tiflash/data"
    ssh_port: 22
    tcp_port: 9000
    http_port: 8123
    flash_service_port: 3930
    flash_proxy_port: 20170
    flash_proxy_status_port: 20292
    metrics_port: 8234
    # The following configs are used to overwrite the server_configs.tiflash values.
    config:
      logger.level: "info"
    learner_config:
      log-level: "info"

monitoring_servers:

  - host: 192.168.0.161
    ssh_port: 22
    port: 9090
    deploy_dir: "/data/tidb_cluster/tidb/deploy/prometheus-8249"
    data_dir: "/data/tidb_cluster/tidb/data/prometheus-8249"
    log_dir: "/data/tidb_cluster/tidb/deploy/prometheus-8249/log"

grafana_servers:

  - host: 192.168.0.161
    port: 3000
    deploy_dir: /data/tidb_cluster/tidb/deploy/grafana-3000

alertmanager_servers:

  - host: 192.168.0.161
    ssh_port: 22
    web_port: 9093
    cluster_port: 9094
    deploy_dir: "/data/tidb_cluster/tidb/deploy/alertmanager-9093"
    data_dir: "/data/tidb_cluster/tidb/data/alertmanager-9093"
    log_dir: "/data/tidb_cluster/tidb/deploy/alertmanager-9093/log"

[tidb@161 .tiup]$

Deployment topology is above; attached: tidb-lightning.log.zip (75.2 KB)
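One thing worth checking after an aborted tidb-lightning run: with the importer backend, Lightning switches TiKV into "import" mode, and a crash can leave the cluster stuck there; the checkpoint for the failed table also blocks a clean rerun. A sketch of the recovery steps, assuming the `tidb-lightning-ctl` tool shipped with the same Lightning release (`-switch-mode` and the checkpoint flags are its documented options); the config path is a placeholder:

```shell
# Placeholder path; substitute your real tidb-lightning config file.
CFG=/path/to/tidb-lightning.toml
echo "using config: $CFG"

# Force TiKV back to normal mode after an aborted import
# (run where tidb-lightning was deployed):
#   tidb-lightning-ctl --config "$CFG" -switch-mode=normal

# Inspect, then clear, the failed table's checkpoint before rerunning:
#   tidb-lightning-ctl --config "$CFG" -checkpoint-dump=/tmp/ckpt
#   tidb-lightning-ctl --config "$CFG" -checkpoint-error-destroy='`db`.`table`'
```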

The error log shows that TiKV failed to register its address with PD. Please check whether the firewall and SELinux are disabled on the TiKV and PD nodes, and look for other errors in the PD logs.
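For completeness, the checks themselves, a sketch using standard CentOS/RHEL tooling (adjust for other distros); the port list comes from this cluster's topology:

```shell
# SELinux / firewalld status on each node (fallbacks in case the
# tooling is absent on a given host):
getenforce 2>/dev/null || echo "getenforce not available"
systemctl is-active firewalld 2>/dev/null || echo "firewalld not active/installed"

# Ports that must be reachable between nodes in this topology:
PORTS="2379 2380 4000 10080 20160 20180 20610 20810"
echo "$PORTS"
# e.g. from a TiKV node, probe the PD ports:
#   for p in 2379 2380; do nc -z -w2 192.168.0.161 "$p" && echo "port $p reachable"; done
```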

The firewall and SELinux are disabled on all four servers.

1. Please share the log from the current PD leader node;
2. Please provide the tidb, tikv, and pd logs covering the time window when the import failed, so we can see what may have brought the cluster down.

I hit the same thing: with a large data volume, the import brings the cluster down.

From the logs, the error is TiKV's ["failed to register addr to pd"] [body=Body(Streaming)] ["status code"=400], or similar.