TiDB 3.1 deployed successfully, but startup fails with "the TiKV port 20160 is not up"

ansible.log (234.7 KB)

Starting the TiDB cluster fails with an error:
ansible-playbook start.yml -vvv

The full ansible.log is attached. Please help take a look.

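Hello,
Please check whether tikv.log on servers 6, 7, and 8 (the TiKV nodes) contains any detailed error messages.
We recommend deploying the TiDB cluster with TiUP and using TiDB v4.x, whose functionality fully covers v3.1.
Thanks for your cooperation.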
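
The log check suggested above could be done by filtering tikv.log for error-level entries. This is only a minimal sketch: it assumes the level is written as a bracketed field such as `[ERROR]` or `[FATAL]`, in the same style as the `[INFO]` line quoted later in this thread, and the function name `scan_tikv_log` is hypothetical.

```shell
#!/bin/sh
# scan_tikv_log FILE: print ERROR and FATAL entries from a TiKV log.
# Assumes the level appears as a bracketed field, e.g. "] [ERROR] ".
scan_tikv_log() {
    grep -E '\] \[(ERROR|FATAL)\] ' "$1"
}

# Example (log path taken from the run_tikv.sh trace in this thread):
# scan_tikv_log /data1/deploy/log/tikv.log | tail -n 20
```

An empty result means no error-level entries were logged. That does not rule out the process being killed from outside before it could log anything.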

tikv_6.log (489.3 KB) tikv_7.log (506.2 KB) tikv_8.log (523.1 KB)

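tikv_stderr.log is empty. I have uploaded tikv.log from servers 6, 7, and 8, and I did not see any errors in them.

For historical reasons and because of environment constraints, this deployment can only use Ansible.
We will use TiUP for later deployments. Please help resolve the current problem, thanks.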

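Hello,

[2020/10/08 15:32:24.498 +08:00] [INFO] [mod.rs:28] ["Release Version: 3.0.13"]

Please confirm how tidb version is configured in the inventory file, and check whether tidb-ansible has been updated to 3.1.

We recommend using the matching tidb-ansible version and setting tidb version to the correct value. The inventory file should be written by hand rather than copied from a previous version.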
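
To compare the version configured in the inventory with the version the binary actually reports, something like the following could be used. It is a rough sketch: the function names `inventory_version` and `log_release_version` are hypothetical, and it assumes the inventory contains a `tidb_version = v...` line and that the log contains a `Release Version: ...` entry like the one quoted above.

```shell
#!/bin/sh
# inventory_version FILE: print the tidb_version value (without the
# leading "v") from an inventory file.
inventory_version() {
    sed -n 's/^tidb_version[[:space:]]*=[[:space:]]*v\{0,1\}\(.*\)$/\1/p' "$1" | head -n 1
}

# log_release_version FILE: print the most recent "Release Version"
# value found in a TiKV log.
log_release_version() {
    sed -n 's/.*Release Version: \([0-9][0-9.]*\).*/\1/p' "$1" | tail -n 1
}

# Example:
# [ "$(inventory_version inventory.ini)" = "$(log_release_version tikv.log)" ] \
#     && echo "versions match"
```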

inventory:
ansible_user = tidb
tidb_version = v3.0.13

The version is 3.0.13. I made a mistake in the title and wrote 3.1; it is 3.0.
The inventory was written by hand, not copied, and the earlier deploy steps all succeeded.

[tidb@tidb1_pd1 tidb-ansible]$ cat inventory.ini
[tidb_servers]
192.168.56.3
192.168.56.4
192.168.56.5

[pd_servers]
192.168.56.3
192.168.56.4
192.168.56.5

[tikv_servers]
192.168.56.6
192.168.56.7
192.168.56.8

[spark_master]

[spark_slaves]

[lightning_server]

[importer_server]

[monitoring_servers]
192.168.56.3

[grafana_servers]
192.168.56.3

[monitored_servers]
192.168.56.3
192.168.56.4
192.168.56.5
192.168.56.6
192.168.56.7
192.168.56.8

[alertmanager_servers]
192.168.56.3

# Global variables
[all:vars]
deploy_dir = /data1/deploy

# ssh via normal user
cluster_name = hunter-cluster

# Connection
# ssh via normal user
ansible_user = tidb
tidb_version = v3.0.13

# process supervision, [systemd, supervise]
process_supervision = systemd
timezone = Asia/Shanghai
enable_firewalld = False

# check NTP service
enable_ntpd = True
set_hostname = False
enable_binlog = False
[tidb@tidb1_pd1 tidb-ansible]$

Please check whether the port is open. Does starting the service manually on the node succeed?
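
One way to check whether anything is listening on the TiKV port is to filter the output of `ss -lnt`. The helper below is a sketch only: `port_listening` is a hypothetical name, and it assumes the usual `ss` column layout in which the first field is the state and the fourth field is the local address and port.

```shell
#!/bin/sh
# port_listening PORT: read `ss -lnt` style output on stdin and succeed
# if a LISTEN socket has a local address ending in :PORT.
# Assumes field 1 is the state and field 4 is "address:port".
port_listening() {
    awk -v port="$1" '
        $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}

# Example, run on a TiKV node:
# ss -lnt | port_listening 20160 && echo "20160 is up" || echo "20160 is not up"
```

This only shows whether a process is listening locally. It does not tell you whether a firewall blocks the port from other hosts.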

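The port is not in use.

Does "starting manually on the node" mean starting pd, tikv, and tidb individually instead of running start.yml?

Running run_tikv.sh under deploy/scripts does not report an error: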
[tidb@tikv1 scripts]$ sh run_tikv.sh
sync …
real 0m0.009s
user 0m0.000s
sys 0m0.003s
ok

Killed
[tidb@tikv1 scripts]$

[tidb@tikv1 scripts]$ sh start_tikv.sh
[tidb@tikv1 scripts]$

Running start_tikv.sh directly shows no error, but TiKV does not start either.

Try running the following command and see what error it returns:

sh -x run_tikv.sh

[tidb@tikv1 scripts]$ sh -x run_tikv.sh
+ set -e
+ ulimit -n 1000000
+ cd /data1/deploy
+ export RUST_BACKTRACE=1
+ RUST_BACKTRACE=1
+ export TZ=/etc/localtime
+ TZ=/etc/localtime
+ echo -n 'sync … '
sync … ++ sync

real 0m0.003s
user 0m0.001s
sys 0m0.000s
+ stat=
+ echo ok
ok
+ echo

+ echo 3714
+ exec bin/tikv-server --addr 0.0.0.0:20160 --advertise-addr 192.168.56.6:20160 --status-addr 192.168.56.6:20180 --pd 192.168.56.3:2379,192.168.56.4:2379,192.168.56.5:2379 --data-dir /data1/deploy/data --config conf/tikv.toml --log-file /data1/deploy/log/tikv.log
Killed
[tidb@tikv1 scripts]$

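Check whether any new log entries have been written to /data1/deploy/log/tikv.log.
Run dmesg -T | grep tikv-server to see whether the kernel log has any tikv-server entries.

After running sh -x run_tikv.sh, I checked the tikv-server entries in dmesg. It looks like an OOM kill. Is there somewhere I can lower the memory limit?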

[Fri Oct 9 11:41:35 2020] tikv-server invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[Fri Oct 9 11:41:35 2020] tikv-server cpuset=/ mems_allowed=0
[Fri Oct 9 11:41:35 2020] CPU: 2 PID: 5018 Comm: tikv-server Not tainted 3.10.0-1127.el7.x86_64 #1
[Fri Oct 9 11:41:35 2020] [ 5018] 1001 5018 251433 130682 377 0 0 tikv-server
[Fri Oct 9 11:41:35 2020] Out of memory: Kill process 5018 (tikv-server) score 517 or sacrifice child
[Fri Oct 9 11:41:35 2020] Killed process 5018 (tikv-server), UID 1001, total-vm:1005732kB, anon-rss:522728kB, file-rss:0kB, shmem-rss:0kB
[Fri Oct 9 11:41:51 2020] tikv-server invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[Fri Oct 9 11:41:51 2020] tikv-server cpuset=/ mems_allowed=0
[Fri Oct 9 11:41:51 2020] CPU: 2 PID: 5042 Comm: tikv-server Not tainted 3.10.0-1127.el7.x86_64 #1
[Fri Oct 9 11:41:51 2020] [ 5042] 1001 5042 251433 131194 360 0 0 tikv-server
[Fri Oct 9 11:41:51 2020] Out of memory: Kill process 5042 (tikv-server) score 519 or sacrifice child
[Fri Oct 9 11:41:51 2020] Killed process 5042 (tikv-server), UID 1001, total-vm:1005732kB, anon-rss:524776kB, file-rss:0kB, shmem-rss:0kB
[Fri Oct 9 11:42:06 2020] [ 5066] 1001 5066 292393 128575 355 0 0 tikv-server
[Fri Oct 9 11:42:06 2020] Out of memory: Kill process 5066 (tikv-server) score 508 or sacrifice child
[Fri Oct 9 11:42:06 2020] Killed process 5066 (tikv-server), UID 1001, total-vm:1169572kB, anon-rss:514300kB, file-rss:0kB, shmem-rss:0kB
[Fri Oct 9 11:42:22 2020] tikv-server invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[Fri Oct 9 11:42:22 2020] tikv-server cpuset=/ mems_allowed=0
[Fri Oct 9 11:42:22 2020] CPU: 2 PID: 5103 Comm: tikv-server Not tainted 3.10.0-1127.el7.x86_64 #1
[Fri Oct 9 11:42:22 2020] [ 5103] 1001 5103 251433 129900 345 0 0 tikv-server
[Fri Oct 9 11:42:22 2020] Out of memory: Kill process 5103 (tikv-server) score 513 or sacrifice child
[Fri Oct 9 11:42:22 2020] Killed process 5103 (tikv-server), UID 1001, total-vm:1005732kB, anon-rss:519600kB, file-rss:0kB, shmem-rss:0kB
[Fri Oct 9 11:42:25 2020] [ 5127] 1001 5127 251433 131651 357 0 0 tikv-server
[Fri Oct 9 11:42:25 2020] Out of memory: Kill process 5127 (tikv-server) score 520 or sacrifice child
[Fri Oct 9 11:42:25 2020] Killed process 5127 (tikv-server), UID 1001, total-vm:1005732kB, anon-rss:526604kB, file-rss:0kB, shmem-rss:0kB
[Fri Oct 9 11:42:38 2020] [ 5150] 1001 5150 292393 129586 367 0 0 tikv-server
[Fri Oct 9 11:42:38 2020] Out of memory: Kill process 5150 (tikv-server) score 512 or sacrifice child
[Fri Oct 9 11:42:38 2020] Killed process 5150 (tikv-server), UID 1001, total-vm:1169572kB, anon-rss:518344kB, file-rss:0kB, shmem-rss:0kB
[tidb@tikv1 scripts]$
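
The kernel messages above can be condensed to one line per kill to make the memory footprint easier to read. This is a minimal sketch; `oom_summary` is a hypothetical helper, and it assumes the "Killed process ... anon-rss:...kB" line format shown in this output.

```shell
#!/bin/sh
# oom_summary: read `dmesg -T` output on stdin and print, for every
# "Killed process" line, the PID, process name, and anonymous RSS in MiB.
oom_summary() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*anon-rss:\([0-9][0-9]*\)kB.*/\1 \2 \3/p' |
    awk '{ printf "pid=%s name=%s anon_rss_mib=%d\n", $1, $2, $3 / 1024 }'
}

# Example:
# dmesg -T | grep tikv-server | oom_summary
```

For the first kill above (PID 5018, anon-rss:522728kB) this prints `pid=5018 name=tikv-server anon_rss_mib=510`, so tikv-server was holding roughly 510 MiB when the kernel killed it.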

Could you try running tikv-server on a server with more memory?

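I am building a test environment on virtual machines. The initial deploy reported many resource-limit failures, which I got past by modifying or commenting out checks in deploy.yml.

Can this be worked around by adjusting the configuration file? The VM currently has 1 GB of memory, and I can probably allocate at most about 2 GB.

The deploy also reported that the CPU and memory were insufficient. I got it to pass by disabling or adjusting the checks.

"changed": false, "msg": "This machine does not have sufficient RAM to run TiDB, at least 16000 MB."

"changed": false, "msg": "This machine does not have sufficient CPU to run TiDB, at least 8 cores."

Can this be worked around by changing the configuration?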

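These are two separate issues. For the deploy step, you can skip the relevant checks with --extra-vars "dev_mode=True" and complete the deployment.
The OOM at startup after a successful deployment cannot be skipped: tikv-server needs a certain level of hardware to run. We are also working on lowering the hardware requirements for running a TiDB cluster; you can look forward to that in v5.0.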
https://docs.pingcap.com/zh/tidb/stable/hardware-and-software-requirements
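
As an illustration of the deploy-step advice above, the command could be wrapped as shown below. The playbook name deploy.yml and the `dev_mode=True` variable come from this thread; the wrapper function `deploy_dev_mode` and the `DRY_RUN` switch are hypothetical additions for previewing the command.

```shell
#!/bin/sh
# deploy_dev_mode: run deploy.yml with the hardware checks skipped.
# Set DRY_RUN=1 to print the command instead of executing it.
deploy_dev_mode() {
    set -- ansible-playbook deploy.yml --extra-vars "dev_mode=True"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example, from the tidb-ansible directory:
# DRY_RUN=1 deploy_dev_mode    # preview
# deploy_dev_mode              # actually deploy
```

This only affects the pre-deployment checks. It does not reduce the memory tikv-server needs at runtime.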

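My tidb and pd nodes also have only 1 GB each, which is below the requirements in the table as well, but they started.

I have reduced tidb and pd to 512 MB and raised tikv to 2 GB. I will try again.

tidb/pd: 512 MB
tikv: 2 GB. It started.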
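
Before starting TiKV on a small test VM, a quick check of the installed memory can save a round of OOM kills. The sketch below reads `MemTotal` from /proc/meminfo; the function names are hypothetical, and any threshold passed to `has_min_memory` is an illustrative value based on what worked in this thread, not an official minimum.

```shell
#!/bin/sh
# mem_total_mib [FILE]: print MemTotal in MiB from a meminfo-format file
# (defaults to /proc/meminfo).
mem_total_mib() {
    awk '$1 == "MemTotal:" { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# has_min_memory MIN_MIB [FILE]: succeed if MemTotal is at least MIN_MIB.
has_min_memory() {
    [ "$(mem_total_mib "$2")" -ge "$1" ]
}

# Example (1800 is an illustrative threshold for a VM sized at about 2 GB,
# since the kernel reports slightly less than the configured size):
# has_min_memory 1800 || echo "less memory than the 2 GB VM that worked here"
```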

META: ran handlers
META: ran handlers

PLAY RECAP ****************************************************************************************
192.168.56.3 : ok=36 changed=13 unreachable=0 failed=0
192.168.56.4 : ok=15 changed=4 unreachable=0 failed=0
192.168.56.5 : ok=15 changed=4 unreachable=0 failed=0
192.168.56.6 : ok=14 changed=3 unreachable=0 failed=0
192.168.56.7 : ok=14 changed=3 unreachable=0 failed=0
192.168.56.8 : ok=14 changed=3 unreachable=0 failed=0
localhost : ok=7 changed=4 unreachable=0 failed=0

Congrats! All goes well. :-)

:call_me_hand: