Scaling out a tidb node always fails with a timeout; how do I increase the timeout?

As the title says. tiup 4.0.0-rc2.

Scaling out tidb fails with a timeout, and start reports the same thing:

tidb 172.7.160.15:4000 failed to start: timed out waiting for port 4000 to be started after 1m0s, please check the log of the instance

Error: failed to start: failed to start tidb: tidb 172.7.160.15:4000 failed to start: timed out waiting for port 4000 to be started after 1m0s, please check the log of the instance: timed out waiting for port 4000 to be started after 1m0s
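For what it's worth, the 1m0s here is tiup-cluster's operation wait timeout (it shows up as OptTimeout:60 in the debug options later in this thread). Current tiup-cluster releases expose a --wait-timeout flag to raise it; whether the exact version used here already supports the flag is an assumption, so check tiup cluster start --help first.

# Hypothetical invocation; verify the flag exists in your tiup version first
tiup cluster start <cluster-name> --wait-timeout 300       # wait up to 300s instead of 60s
tiup cluster scale-out <cluster-name> scale.yaml --wait-timeout 300   # same idea, if supported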

Hello, please upload the debug log of the error and the tidb log, thanks.

debug.log (104.0 KB) tidb.log (27.5 KB)

One is the debug log of the reported error,
and the other is tidb.log from the .15 machine.

  1. Looking at the tidb.log:

[2020/05/28 18:33:11.650 +08:00] [WARN] [client_batch.go:223] ["init create streaming fail"] [target=172.7.160.231:20160] [error="context deadline exceeded"]
[2020/05/28 18:33:11.650 +08:00] [INFO] [region_cache.go:1523] ["[liveness] request kv status fail"] [store=172.7.160.231:20180] [error="Get http://172.7.160.231:20180/status: dial tcp 172.7.160.231:20180: connect: connection refused"]

Please check whether the TiKV at 172.7.160.231 is healthy, thanks.

  1. You can use the pd-ctl command to check the store status and confirm whether it is Up (normal), thanks. A sketch of the command follows below.
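A minimal sketch of that check, assuming PD listens on its default port 2379 (take the real endpoint from your topology):

# Query all store states through pd-ctl via tiup
tiup ctl pd -u http://<pd-ip>:2379 store
# In the JSON output, every store's "state_name" should be "Up";
# "Down", "Offline" or "Tombstone" means the store is not serving normally.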

It's not healthy.

And now the PD service is also unhealthy.

How can I reinstall without losing data?

Scaling out PD fails with the same error.

Right now all 3 PD nodes are in the Down state and unusable, and the tidb nodes have all been removed by scale-in.

Again: how can I reinstall without losing data?

  1. The problem probably already existed before the scale-out, right? Since you only uploaded the scale-out logs, we cannot confirm this yet.
  2. Please try restarting the cluster (see the sketch after this list), see where it reports errors, and try to recover the cluster first, thanks.
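The restart step would look roughly like this; on failure, tiup prints the path of a verbose debug log (later in this thread it lands under /home/tidb/logs/):

# Restart the whole cluster in one step
tiup cluster restart <cluster-name>
# ...or as an explicit stop followed by a start
tiup cluster stop <cluster-name>
tiup cluster start <cluster-name>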

I created a new cluster.

What if I import the 2 TiKV nodes from the old cluster into this new cluster? Thanks.

The old cluster is completely unusable now. It started with what was probably a disk problem; I planned to scale in and then scale back out, but ran into all kinds of errors, and now tidb and tikv are completely unusable.

The SST files still exist. How can I get the new cluster to use these SST files?

At first the service responded very slowly; the monitoring showed that I/O was saturated on all 5 TiKV nodes.

So I first scaled in 2 TiKV nodes. The scale-in reported a timeout, so I added --force to remove them forcibly.
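For context, the scale-in described above has roughly this shape (the node address is a placeholder). --force skips the normal region migration, which is what can leave data under-replicated afterwards:

# Forcibly remove a TiKV node without waiting for its regions to migrate: destructive
tiup cluster scale-in <cluster-name> --node <tikv-ip>:20160 --force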

Then I restarted the cluster, but the restart said tidb failed to start; I tried several times with the same result.

I figured I would scale in tidb and then scale it back out, but after the scale-in, tidb could not be installed again.

Then I upgraded tiup to v1.0.

I also manually re-established SSH mutual trust, but still cannot start or install tidb.

001.txt (194.5 KB)

So far I have only gotten one PD and one TiKV running; the status of another TiKV is unknown.

Originally there were 3 PD, 5 TiKV, and 3 tidb nodes.

One node reports: cluster ID mismatch, local 6813007363201430470 != remote 6831897245719706867

The newly created cluster works fine. Is there any way to make the new cluster use the old cluster's SST data?

Hi,

Let me confirm some information with you; please respond actively, since the information uploaded in this thread is still incomplete.

The current troubleshooting direction is to recover the old cluster as far as possible:

  1. display old-cluster-name,
  2. stop old-cluster-name and then start old-cluster-name; if any step reports an error, please upload a screenshot of the operation and the debug log. (The concrete commands are sketched below.)
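Concretely, with old-cluster-name standing in for the real cluster name:

tiup cluster display old-cluster-name   # topology and per-component status
tiup cluster stop old-cluster-name      # stop every component
tiup cluster start old-cluster-name     # bring them back up
# On failure tiup prints the debug log path, e.g.
# /home/tidb/logs/tiup-cluster-debug-<timestamp>.log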

The problem feels really complicated. I have been at this all night and may have run some wrong operations; there are a lot of problems. Could you help by looking at it remotely? I can give you the server details.

Hi,

Please just cooperate by uploading the information; we will do our best to help recover the cluster.

start.log (46.8 KB) stop.log (31.1 KB)

This is the log of the 221 TiKV, which reports the ID mismatch: 221tilv.log (16.0 KB)

Hi,

  1. Please send back the text output of the following command: tiup ctl pd -u http://pdip:pdport store
  2. Share the edit-config output and upload the meta information: cat /home/tidb/.tiup/storage/cluster/clusters/your-cluster-name/meta.yaml
  3. Regarding the failure to stop the cluster, please check whether a monitor-port directory exists under the monitor's deploy dir and data dir; if it does, please upload its log directory.
  4. Share the output of: ll /etc/systemd/system/node_exporter-port.serive
(All four checks are sketched together just below.)
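Filled in the way the rest of this thread suggests (cluster zhuashitidb, node_exporter on port 9100), the four checks would be roughly as follows. Note that systemd unit files end in ".service", so the ".serive" spelling in item 4 is presumably a typo, which would by itself explain a "No such file or directory" result:

# 1. Store states as reported by PD
tiup ctl pd -u http://<pd-ip>:2379 store
# 2. Cluster metadata kept by tiup on the control machine
cat /home/tidb/.tiup/storage/cluster/clusters/zhuashitidb/meta.yaml
# 3. Monitor directories under the deploy dir and data dir
ls -l /data/deploy/monitor-9100 /data/deploy/data/monitor-9100
# 4. The node_exporter systemd unit (".service", not ".serive")
ls -l /etc/systemd/system/node_exporter-9100.service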

editconfig.log (1.3 KB)

[tidb@tidb9 zhuashitidb]$ tiup cluster edit-config zhuashitidb
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.0.0/cluster edit-config zhuashitidb

global:
  user: tidb
  ssh_port: 23
  deploy_dir: /data/deploy
  data_dir: /data/deploy/data
  os: linux
  arch: amd64
monitored:
  node_exporter_port: 9100
  blackbox_exporter_port: 9115
  deploy_dir: /data/deploy/monitor-9100
  data_dir: /data/deploy/data/monitor-9100
  log_dir: /data/deploy/monitor-9100/log
server_configs:
  tidb: {}
  tikv:
    readpool.coprocessor.use-unified-pool: true
    readpool.storage.use-unified-pool: false
    readpool.unified.max-thread-count: 13
  pd: {}
  tiflash: {}
  tiflash-learner: {}
  pump: {}
  drainer: {}
  cdc: {}
tidb_servers: []
tikv_servers:
- host: 172.7.160.221
  ssh_port: 23
  port: 20160
  status_port: 20180
  deploy_dir: /data/deploy
  data_dir: /data/deploy/data
  log_dir: /data/deploy/log
  arch: amd64
  os: linux
- host: 172.7.160.235
  ssh_port: 23
  port: 20160
  status_port: 20180
  deploy_dir: /data/deploy
  data_dir: /data/deploy/data
  log_dir: /data/deploy/log
  arch: amd64
  os: linux
tiflash_servers: []
pd_servers:
- host: 172.7.160.216
  ssh_port: 23
  name: pd-172.7.160.216-2379

  1. From the failed 221 tikv: /data/deploy/monitor/log
blackbox_exporter.log
level=info ts=2020-05-28T13:56:08.81309114Z caller=main.go:213 msg="Starting blackbox_exporter" version="(version=0.12.0, branch=HEAD, revision=4a22506cf0cf139d9b2f9cde099f0012d9fcabde)"
level=info ts=2020-05-28T13:56:08.814162456Z caller=main.go:220 msg="Loaded config file"
level=info ts=2020-05-28T13:56:08.814379285Z caller=main.go:324 msg="Listening on address" address=:9115
level=info ts=2020-05-28T15:17:28.24879446Z caller=main.go:213 msg="Starting blackbox_exporter" version="(version=0.12.0, branch=HEAD, revision=4a22506cf0cf139d9b2f9cde099f0012d9fcabde)"
level=info ts=2020-05-28T15:17:28.250041259Z caller=main.go:220 msg="Loaded config file"
level=info ts=2020-05-28T15:17:28.250258375Z caller=main.go:324 msg="Listening on address" address=:9115
level=info ts=2020-05-29T02:23:31.958024887Z caller=main.go:233 msg="Reloaded config file"

[root@tidb221 log]# cat node_exporter.log
time="2020-05-28T21:56:08+08:00" level=info msg="Starting node_exporter (version=0.17.0, branch=HEAD, revision=f6f6194a436b9a63d0439abc585c76b19a206b21)" source="node_exporter.go:82"
time="2020-05-28T21:56:08+08:00" level=info msg="Build context (go=go1.11.2, user=root@322511e06ced, date=20181130-15:51:33)" source="node_exporter.go:83"
time="2020-05-28T21:56:08+08:00" level=info msg="Enabled collectors:" source="node_exporter.go:90"
time="2020-05-28T21:56:08+08:00" level=info msg=" - arp" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - bcache" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - bonding" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - conntrack" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - cpu" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - diskstats" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - edac" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - entropy" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - filefd" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - filesystem" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - hwmon" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - infiniband" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - interrupts" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - ipvs" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - loadavg" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - mdadm" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - meminfo" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - meminfo_numa" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - mountstats" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - netclass" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - netdev" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - netstat" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - nfs" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - nfsd" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - sockstat" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - stat" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - systemd" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - tcpstat" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - textfile" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - time" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - timex" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - uname" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - vmstat" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - xfs" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg=" - zfs" source="node_exporter.go:97"
time="2020-05-28T21:56:08+08:00" level=info msg="Listening on :9100" source="node_exporter.go:111"
time="2020-05-28T23:17:33+08:00" level=info msg="Starting node_exporter (version=0.17.0, branch=HEAD, revision=f6f6194a436b9a63d0439abc585c76b19a206b21)" source="node_exporter.go:82"
time="2020-05-28T23:17:33+08:00" level=info msg="Build context (go=go1.11.2, user=root@322511e06ced, date=20181130-15:51:33)" source="node_exporter.go:83"
time="2020-05-28T23:17:33+08:00" level=info msg="Enabled collectors:" source="node_exporter.go:90"
time="2020-05-28T23:17:33+08:00" level=info msg=" - arp" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - bcache" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - bonding" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - conntrack" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - cpu" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - diskstats" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - edac" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - entropy" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - filefd" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - filesystem" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - hwmon" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - infiniband" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - interrupts" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - ipvs" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - loadavg" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - mdadm" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - meminfo" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - meminfo_numa" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - mountstats" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - netclass" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - netdev" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - netstat" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - nfs" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - nfsd" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - sockstat" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - stat" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - systemd" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - tcpstat" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - textfile" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - time" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - timex" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - uname" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - vmstat" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - xfs" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg=" - zfs" source="node_exporter.go:97"
time="2020-05-28T23:17:33+08:00" level=info msg="Listening on :9100" source="node_exporter.go:111"


Hi,

Thank you very much for the information you provided; it is complete and well organized. For step 4, please replace port with the monitor's port, thanks. The point is to see whether the node_exporter service file exists.

After completing that step, please continue to provide the following information:

  1. In edit-config, delete the configuration for the 221 tikv, then run reload -R tikv and share the display result.
  2. Run the stop command and see whether it succeeds; if not, share the tikv error log and the debug log. (The sequence is sketched below.)
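A sketch of that sequence against this cluster (the name zhuashitidb comes from the session above):

tiup cluster edit-config zhuashitidb      # delete the 172.7.160.221 tikv_servers entry, then save
tiup cluster reload zhuashitidb -R tikv   # push the updated topology to the tikv role
tiup cluster display zhuashitidb          # confirm 221 is gone
tiup cluster stop zhuashitidb             # then retry the stop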
[root@tidb221 log]# ll /etc/systemd/system/node_exporter-9100.serive
ls: cannot access /etc/systemd/system/node_exporter-9100.serive: No such file or directory
[root@tidb221 log]# ll /etc/systemd/system/node_exporter-9115.serive
ls: cannot access /etc/systemd/system/node_exporter-9115.serive: No such file or directory

tidb9 is the control machine.
tidb221 is the TiKV that failed to start.

  1. The reload succeeded:
[tidb@tidb9 ~]$ tiup cluster display zhuashitidb
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.0.0/cluster display zhuashitidb
TiDB Cluster: zhuashitidb
TiDB Version: v4.0.0-rc.2
ID                   Role  Host           Ports        OS/Arch       Status     Data Dir           Deploy Dir
--                   ----  ----           -----        -------       ------     --------           ----------
172.7.160.216:2379   pd    172.7.160.216  2379/2380    linux/x86_64  Healthy|L  /data/deploy/data  /data/deploy
172.7.160.235:20160  tikv  172.7.160.235  20160/20180  linux/x86_64  Up         /data/deploy/data  /data/deploy

The reload in step 1 succeeded and display looks normal.

The stop failed:

[tidb@tidb9 ~]$ tiup cluster stop zhuashitidb
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.0.0/cluster stop zhuashitidb
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/zhuashitidb/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/zhuashitidb/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=172.7.160.235
+ [Parallel] - UserSSH: user=tidb, host=172.7.160.216
+ [ Serial ] - ClusterOperate: operation=StopOperation, options={Roles:[] Nodes:[] Force:false SSHTimeout:5 OptTimeout:60 APITimeout:300}
Stopping component tikv
	Stopping instance 172.7.160.235
	Stop tikv 172.7.160.235:20160 success
Stopping component node_exporter
Stopping component blackbox_exporter
Stopping component pd
	Stopping instance 172.7.160.216
	Stop pd 172.7.160.216:2379 success
Stopping component node_exporter
retry error: operation timed out after 1m0s
	pd 172.7.160.216:2379 failed to stop: timed out waiting for port 9100 to be stopped after 1m0s

Error: failed to stop: 	pd 172.7.160.216:2379 failed to stop: timed out waiting for port 9100 to be stopped after 1m0s: timed out waiting for port 9100 to be stopped after 1m0s

Verbose debug logs has been written to /home/tidb/logs/tiup-cluster-debug-2020-05-29-11-31-10.log.
Error: run `/home/tidb/.tiup/components/cluster/v1.0.0/cluster` (wd:/home/tidb/.tiup/data/S0LPdBy) failed: exit status 1

stop-error.log (39.6 KB)

Hi,

Now we need to restore the node_exporter_port.

  1. Cut the monitored configuration out of edit-config and paste it into a scale-out file (vi an empty file and put it in), so as to redeploy monitored and restore the basic information of the two monitoring components. (Give it a try; we have not tested anything like this before. A sketch follows after the snippet below.)
monitored:
  node_exporter_port: 9100
  blackbox_exporter_port: 9115
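As a sketch (the file name scale-monitor.yaml is arbitrary, and per the note above this path is untested):

# scale-monitor.yaml contains only the monitored section quoted above:
#   monitored:
#     node_exporter_port: 9100
#     blackbox_exporter_port: 9115
tiup cluster scale-out zhuashitidb scale-monitor.yaml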

I copied the monitor-9100 directory over from another node; now it starts successfully, and I scaled out 2 tidb nodes.

TiKV 235 works now, but the data on the TiKV 221 node reports a cluster ID mismatch. How should I put that data back into use? With only 235, isn't the data incomplete?

[tidb@tidb9 ~]$ tiup cluster display zhuashitidb
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.0.0/cluster display zhuashitidb
TiDB Cluster: zhuashitidb
TiDB Version: v4.0.0-rc.2
ID                   Role  Host           Ports        OS/Arch       Status     Data Dir           Deploy Dir
--                   ----  ----           -----        -------       ------     --------           ----------
172.7.160.216:2379   pd    172.7.160.216  2379/2380    linux/x86_64  Healthy|L  /data/deploy/data  /data/deploy
172.7.160.36:4000    tidb  172.7.160.36   4000/10080   linux/x86_64  Up         -                  /data/deploy
172.7.160.37:4000    tidb  172.7.160.37   4000/10080   linux/x86_64  Up         -                  /data/deploy
172.7.160.235:20160  tikv  172.7.160.235  20160/20180  linux/x86_64  Up         /data/deploy/data  /data/deploy