TiDB 4.0.6 three-node cluster: importing a 40 GB table aborts before reaching ~10%; after restarting the cluster, TiDB on port 4000 won't come up

While importing a 40 GB table into TiDB with tidb-lightning, at roughly 5% done the TiKV cluster showed as unreachable. I shut down the whole cluster and started it again, but TiDB fails to start:
retry error: operation timed out after 2m0s
tidb 192.168.0.62:4000 failed to start: timed out waiting for port 4000 to be started after 2m0s, please check the log of the instance

Error: failed to start tidb: tidb 192.168.0.163:4000 failed to start: timed out waiting for port 4000 to be started after 2m0s, please check the log of the instance: timed out waiting for port 4000 to be started after 2m0s
Checking the log:
/data/tidb_cluster/tidb/deploy/tidb-4000/log
[root@62 log]# tail -f tidb.log
[2020/10/22 17:12:36.160 +08:00] [INFO] [client_batch.go:314] ["batchRecvLoop re-create streaming fail"] [target=192.168.0.163:20160] [error="context deadline exceeded"]
[2020/10/22 17:12:36.453 +08:00] [INFO] [client_batch.go:314] ["batchRecvLoop re-create streaming fail"] [target=192.168.0.161:20160] [error="context deadline exceeded"]
[2020/10/22 17:12:36.453 +08:00] [INFO] [client_batch.go:314] ["batchRecvLoop re-create streaming fail"] [target=192.168.0.161:20160] [error="context deadline exceeded"]

Please check the status of the TiKV nodes and look for error messages in their logs.
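As a quick way to act on that, a minimal sketch for filtering a TiKV log down to its warning/error lines. The sample data below reuses lines from this thread; in practice, point `LOG` at `/data/tidb_cluster/tikv/deploy/log/tikv.log` on each TiKV host.

```shell
# Sketch: keep only WARN/ERRO/FATAL lines from a TiKV log.
# The heredoc stands in for the real log file so the snippet is runnable.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
[2020/10/23 09:29:25.470 +08:00] [INFO] [server.rs:248] ["TiKV is ready to serve"]
[2020/10/23 09:29:25.472 +08:00] [WARN] [mod.rs:489] ["failed to register addr to pd"] ["status code"=400]
EOF
ERRS=$(grep -E '\[(WARN|ERRO|FATAL)\]' "$LOG")
echo "$ERRS"
rm -f "$LOG"
```

On a real deployment, `tail -n 500` the log first so a multi-hundred-MB file doesn't have to be scanned end to end.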

[tidb@161 ~]$ tiup cluster start luban
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.2.0/tiup-cluster start luban
Starting cluster luban…

  • [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/luban/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/luban/ssh/id_rsa.pub
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.161
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.163
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.163
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.161
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.62
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.161
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.161
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.62
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.62
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.163
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.161
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.161
  • [Parallel] - UserSSH: user=tidb, host=192.168.0.64
  • [ Serial ] - StartCluster
    Starting component pd
    Starting instance pd 192.168.0.163:2379
    Starting instance pd 192.168.0.161:2379
    Starting instance pd 192.168.0.62:2379
    Start pd 192.168.0.161:2379 success
    Start pd 192.168.0.163:2379 success
    Start pd 192.168.0.62:2379 success
    Starting component node_exporter
    Starting instance 192.168.0.161
    Start 192.168.0.161 success
    Starting component blackbox_exporter
    Starting instance 192.168.0.161
    Start 192.168.0.161 success
    Starting component node_exporter
    Starting instance 192.168.0.62
    Start 192.168.0.62 success
    Starting component blackbox_exporter
    Starting instance 192.168.0.62
    Start 192.168.0.62 success
    Starting component node_exporter
    Starting instance 192.168.0.163
    Start 192.168.0.163 success
    Starting component blackbox_exporter
    Starting instance 192.168.0.163
    Start 192.168.0.163 success
    Starting component tikv
    Starting instance tikv 192.168.0.163:20160
    Starting instance tikv 192.168.0.161:20160
    Starting instance tikv 192.168.0.62:20610
    Start tikv 192.168.0.161:20160 success
    Start tikv 192.168.0.163:20160 success
    Start tikv 192.168.0.62:20610 success
    Starting component tidb
    Starting instance tidb 192.168.0.163:4000
    Starting instance tidb 192.168.0.161:4000
    Starting instance tidb 192.168.0.62:4000
    retry error: operation timed out after 2m0s
    tidb 192.168.0.62:4000 failed to start: timed out waiting for port 4000 to be started after 2m0s, please check the log of the instance
    retry error: operation timed out after 2m0s
    tidb 192.168.0.163:4000 failed to start: timed out waiting for port 4000 to be started after 2m0s, please check the log of the instance
    retry error: operation timed out after 2m0s
    tidb 192.168.0.161:4000 failed to start: timed out waiting for port 4000 to be started after 2m0s, please check the log of the instance

Error: failed to start tidb: tidb 192.168.0.62:4000 failed to start: timed out waiting for port 4000 to be started after 2m0s, please check the log of the instance: timed out waiting for port 4000 to be started after 2m0s

Verbose debug logs has been written to /home/tidb/logs/tiup-cluster-debug-2020-10-23-09-27-19.log.
Error: run /home/tidb/.tiup/components/cluster/v1.2.0/tiup-cluster (wd:/home/tidb/.tiup/data/SECRN16) failed: exit status 1
[tidb@161 ~]$

[tidb@161 ~]$ tiup cluster display luban
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.2.0/tiup-cluster display luban
Cluster type: tidb
Cluster name: luban
Cluster version: v4.0.6
SSH type: builtin
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir


192.168.0.161:9093 alertmanager 192.168.0.161 9093/9094 linux/x86_64 inactive /data/tidb_cluster/tidb/data/alertmanager-9093 /data/tidb_cluster/tidb/deploy/alertmanager-9093
192.168.0.161:3000 grafana 192.168.0.161 3000 linux/x86_64 inactive - /data/tidb_cluster/tidb/deploy/grafana-3000
192.168.0.161:2379 pd 192.168.0.161 2379/2380 linux/x86_64 Up|L /data/tidb_cluster/pd/data /data/tidb_cluster/pd/deploy
192.168.0.163:2379 pd 192.168.0.163 2379/2380 linux/x86_64 Up|UI /data/tidb_cluster/pd/data /data/tidb_cluster/pd/deploy
192.168.0.62:2379 pd 192.168.0.62 2379/2380 linux/x86_64 Up /data/tidb_cluster/pd/data /data/tidb_cluster/pd/deploy
192.168.0.161:9090 prometheus 192.168.0.161 9090 linux/x86_64 inactive /data/tidb_cluster/tidb/data/prometheus-8249 /data/tidb_cluster/tidb/deploy/prometheus-8249
192.168.0.161:4000 tidb 192.168.0.161 4000/10080 linux/x86_64 Down - /data/tidb_cluster/tidb/deploy/tidb-4000
192.168.0.163:4000 tidb 192.168.0.163 4000/10080 linux/x86_64 Down - /data/tidb_cluster/tidb/deploy/tidb-4000
192.168.0.62:4000 tidb 192.168.0.62 4000/10080 linux/x86_64 Down - /data/tidb_cluster/tidb/deploy/tidb-4000
192.168.0.64:9000 tiflash 192.168.0.64 9000/8123/3930/20170/20292/8234 linux/x86_64 Down /data/tidb_cluster/tiflash/data /data/tidb_cluster/tiflash/deploy
192.168.0.161:20160 tikv 192.168.0.161 20160/20180 linux/x86_64 Down /data/tidb_cluster/tikv/data /data/tidb_cluster/tikv/deploy
192.168.0.163:20160 tikv 192.168.0.163 20160/20180 linux/x86_64 Down /data/tidb_cluster/tikv/data /data/tidb_cluster/tikv/deploy
192.168.0.62:20610 tikv 192.168.0.62 20610/20810 linux/x86_64 Down /data/tidb_cluster/tikv/data /data/tidb_cluster/tikv/deploy
Total nodes: 13
[tidb@161 ~]$

[tidb@161 log]$ ll -h
total 986M
-rw-r--r-- 1 tidb tidb  59M Oct 23 09:29 tikv.log
-rw-r--r-- 1 tidb tidb 307K Oct 21 15:01 tikv.log.2020-10-21-15:01:51.883167907
-rw-r--r-- 1 tidb tidb  22M Oct 22 15:01 tikv.log.2020-10-22-15:01:52.588154726
-rw-r--r-- 1 tidb tidb 301M Oct 22 21:14 tikv.log.2020-10-22-21:14:34.102780654
-rw-r--r-- 1 tidb tidb 301M Oct 23 02:49 tikv.log.2020-10-23-02:49:05.798967126
-rw-r--r-- 1 tidb tidb 301M Oct 23 08:23 tikv.log.2020-10-23-08:23:57.150792551
-rw-r--r-- 1 tidb tidb    0 Oct 20 15:01 tikv_stderr.log
[tidb@161 log]$ tail -f tikv.log
[2020/10/23 09:29:25.448 +08:00] [INFO] [mod.rs:335] ["starting working thread"] [worker=backup-endpoint]
[2020/10/23 09:29:25.450 +08:00] [INFO] [mod.rs:335] ["starting working thread"] [worker=snap-handler]
[2020/10/23 09:29:25.450 +08:00] [INFO] [server.rs:223] ["listening on addr"] [addr=0.0.0.0:20160]
[2020/10/23 09:29:25.470 +08:00] [INFO] [server.rs:248] ["TiKV is ready to serve"]
[2020/10/23 09:29:25.472 +08:00] [WARN] [mod.rs:489] ["failed to register addr to pd"] [body=Body(Streaming)] ["status code"=400]
[2020/10/23 09:29:25.472 +08:00] [WARN] [mod.rs:489] ["failed to register addr to pd"] [body=Body(Streaming)] ["status code"=400]
[2020/10/23 09:29:25.473 +08:00] [WARN] [mod.rs:489] ["failed to register addr to pd"] [body=Body(Streaming)] ["status code"=400]
[2020/10/23 09:29:25.473 +08:00] [WARN] [mod.rs:489] ["failed to register addr to pd"] [body=Body(Streaming)] ["status code"=400]
[2020/10/23 09:29:25.473 +08:00] [WARN] [mod.rs:489] ["failed to register addr to pd"] [body=Body(Streaming)] ["status code"=400]
[2020/10/23 09:29:25.473 +08:00] [WARN] [mod.rs:499] ["failed to register addr to pd after 5 tries"]

Let me know if this is the information you need.
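Those repeated registration failures point at PD, so it is worth probing PD directly. A hedged sketch using PD's v1 HTTP API (`/pd/api/v1/health` and `/pd/api/v1/members` are real PD endpoints in v4.0) and the three PD endpoints from this topology; run the commented `curl` lines from a host that can actually reach the PD nodes:

```shell
# Build the health-check URLs for the three PD nodes in this cluster.
PD_HOSTS="192.168.0.161 192.168.0.62 192.168.0.163"
URLS=""
for h in $PD_HOSTS; do
  URLS="$URLS http://$h:2379/pd/api/v1/health"
done
echo "$URLS"
# From a machine that can reach PD:
#   for u in $URLS; do curl -s "$u"; echo; done
#   curl -s http://192.168.0.161:2379/pd/api/v1/members
```

If any health probe hangs or returns an error, that PD member (or the network path to it) is the first thing to investigate.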

At about 5% imported, the cluster went down on its own: TiKV showed as offline and the import stopped. After stopping the cluster, it would not start again; the startup errors are above, and the tidb-lightning import log is attached. The cluster is deployed on 4 VMs. [tidb@161 .tiup]$ more toplogy.yaml

# Global variables are applied to all deployments and used as the default value of

# the deployments if a specific deployment value is missing.

global:
  user: "tidb"
  ssh_port: 22
  deploy_dir: "/data/tidb_cluster/tidb/deploy"
  data_dir: "/data/tidb_cluster/tidb/data"

monitored:
  node_exporter_port: 9100
  blackbox_exporter_port: 9115
  deploy_dir: "/data/tidb_cluster/tidb/deploy/monitored-9100"
  data_dir: "/data/tidb_cluster/tidb/data/monitored-9100"
  log_dir: "/data/tidb_cluster/tidb/deploy/monitored-9100/log"

server_configs:
  tidb:
    log.slow-threshold: 300

pd_servers:

- host: 192.168.0.161
  ssh_port: 22
  name: "pd-1"
  client_port: 2379
  peer_port: 2380
  deploy_dir: "/data/tidb_cluster/pd/deploy"
  data_dir: "/data/tidb_cluster/pd/data"
  log_dir: "/data/tidb_cluster/pd/deploy/log"

- host:  192.168.0.62
  ssh_port: 22
  name: "pd-2"
  client_port: 2379
  peer_port: 2380
  deploy_dir: "/data/tidb_cluster/pd/deploy"
  data_dir: "/data/tidb_cluster/pd/data"
  log_dir: "/data/tidb_cluster/pd/deploy/log"

- host:  192.168.0.163
  ssh_port: 22
  name: "pd-3"
  client_port: 2379
  peer_port: 2380
  deploy_dir: "/data/tidb_cluster/pd/deploy"
  data_dir: "/data/tidb_cluster/pd/data"
  log_dir: "/data/tidb_cluster/pd/deploy/log"

tidb_servers:

  - host: 192.168.0.161
    port: 4000
    status_port: 10080
    deploy_dir: "/data/tidb_cluster/tidb/deploy/tidb-4000"
    log_dir: "/data/tidb_cluster/tidb/deploy/tidb-4000/log"

  - host: 192.168.0.62
    port: 4000
    status_port: 10080
    deploy_dir: "/data/tidb_cluster/tidb/deploy/tidb-4000"
    log_dir: "/data/tidb_cluster/tidb/deploy/tidb-4000/log"

  - host: 192.168.0.163
    port: 4000
    status_port: 10080
    deploy_dir: "/data/tidb_cluster/tidb/deploy/tidb-4000"
    log_dir: "/data/tidb_cluster/tidb/deploy/tidb-4000/log"

tikv_servers:

  - host: 192.168.0.161
    port: 20160
    status_port: 20180
    deploy_dir: "/data/tidb_cluster/tikv/deploy"
    data_dir: "/data/tidb_cluster/tikv/data"
    log_dir: "/data/tidb_cluster/tikv/deploy/log"

  - host: 192.168.0.62
    port: 20610
    status_port: 20810
    deploy_dir: "/data/tidb_cluster/tikv/deploy"
    data_dir: "/data/tidb_cluster/tikv/data"
    log_dir: "/data/tidb_cluster/tikv/deploy/log"

  - host: 192.168.0.163
    port: 20160
    status_port: 20180
    deploy_dir: "/data/tidb_cluster/tikv/deploy"
    data_dir: "/data/tidb_cluster/tikv/data"
    log_dir: "/data/tidb_cluster/tikv/deploy/log"

tiflash_servers:

  - host: 192.168.0.64
    deploy_dir: "/data/tidb_cluster/tiflash/deploy"
    data_dir: "/data/tidb_cluster/tiflash/data"
    ssh_port: 22
    tcp_port: 9000
    http_port: 8123
    flash_service_port: 3930
    flash_proxy_port: 20170
    flash_proxy_status_port: 20292
    metrics_port: 8234
    # The following configs are used to overwrite the server_configs.tiflash values.
    config:
      logger.level: "info"
    learner_config:
      log-level: "info"

monitoring_servers:

  - host: 192.168.0.161
    ssh_port: 22
    port: 9090
    deploy_dir: "/data/tidb_cluster/tidb/deploy/prometheus-8249"
    data_dir: "/data/tidb_cluster/tidb/data/prometheus-8249"
    log_dir: "/data/tidb_cluster/tidb/deploy/prometheus-8249/log"

grafana_servers:

  - host: 192.168.0.161
    port: 3000
    deploy_dir: /data/tidb_cluster/tidb/deploy/grafana-3000

alertmanager_servers:

  - host: 192.168.0.161
    ssh_port: 22
    web_port: 9093
    cluster_port: 9094
    deploy_dir: "/data/tidb_cluster/tidb/deploy/alertmanager-9093"
    data_dir: "/data/tidb_cluster/tidb/data/alertmanager-9093"
    log_dir: "/data/tidb_cluster/tidb/deploy/alertmanager-9093/log"

[tidb@161 .tiup]$

Deployment topology is above; attached: tidb-lightning.log.zip (75.2 KB)
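One thing worth checking after an aborted tidb-lightning run: with the importer backend, Lightning switches TiKV into "import" mode, and a crash can leave the cluster stuck there; the checkpoint for the failed table also blocks a clean rerun. A sketch of the recovery steps, assuming the `tidb-lightning-ctl` tool shipped with the same Lightning release (`-switch-mode` and the checkpoint flags are its documented options); the config path is a placeholder:

```shell
# Placeholder path; substitute your real tidb-lightning config file.
CFG=/path/to/tidb-lightning.toml
echo "using config: $CFG"

# Force TiKV back to normal mode after an aborted import
# (run where tidb-lightning was deployed):
#   tidb-lightning-ctl --config "$CFG" -switch-mode=normal

# Inspect, then clear, the failed table's checkpoint before rerunning:
#   tidb-lightning-ctl --config "$CFG" -checkpoint-dump=/tmp/ckpt
#   tidb-lightning-ctl --config "$CFG" -checkpoint-error-destroy='`db`.`table`'
```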

The error log shows that TiKV failed to register its address with PD. Please check whether the firewall and SELinux are disabled on the TiKV and PD nodes, and look for other errors in the PD logs.
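For completeness, the checks themselves, a sketch using standard CentOS/RHEL tooling (adjust for other distros); the port list comes from this cluster's topology:

```shell
# SELinux / firewalld status on each node (fallbacks in case the
# tooling is absent on a given host):
getenforce 2>/dev/null || echo "getenforce not available"
systemctl is-active firewalld 2>/dev/null || echo "firewalld not active/installed"

# Ports that must be reachable between nodes in this topology:
PORTS="2379 2380 4000 10080 20160 20180 20610 20810"
echo "$PORTS"
# e.g. from a TiKV node, probe the PD ports:
#   for p in 2379 2380; do nc -z -w2 192.168.0.161 "$p" && echo "port $p reachable"; done
```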

The firewall and SELinux are disabled on all four servers.

1. Please share the log from the current PD leader node;
2. Please provide the tidb, tikv, and pd logs covering the time window when the import failed, so we can see what may have brought the cluster down.

I hit the same thing: with a large data volume, the import brings the cluster down.

From the logs, the error is TiKV's ["failed to register addr to pd"] [body=Body(Streaming)] ["status code"=400], or similar.