TiDB 5.0 cluster fails to start after installation

Problem: blackbox_exporter fails to start after installation:
1) The start command hangs:
[tidb@host1 bin]$ tiup cluster start tidb-test
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.4.0/tiup-cluster start tidb-test
Starting cluster tidb-test…

  • [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-test/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-test/ssh/id_rsa.pub
  • [Parallel] - UserSSH: user=tidb, host=host1
  • [Parallel] - UserSSH: user=tidb, host=host1
  • [Parallel] - UserSSH: user=tidb, host=host2
  • [Parallel] - UserSSH: user=tidb, host=host1
  • [Parallel] - UserSSH: user=tidb, host=host1
  • [Parallel] - UserSSH: user=tidb, host=host3
  • [Parallel] - UserSSH: user=tidb, host=host3
  • [Parallel] - UserSSH: user=tidb, host=host1
  • [Parallel] - UserSSH: user=tidb, host=host2
  • [Parallel] - UserSSH: user=tidb, host=host3
  • [Parallel] - UserSSH: user=tidb, host=host1
  • [Parallel] - UserSSH: user=tidb, host=host2
  • [ Serial ] - StartCluster
    Starting component pd
    Starting instance pd host3:2379
    Starting instance pd host2:2379
    Starting instance pd host1:2379
    Start pd host1:2379 success
    Start pd host3:2379 success
    Start pd host2:2379 success
    Starting component node_exporter
    Starting instance host1
    Start host1 success
    Starting component blackbox_exporter
    Starting instance host1
    Error: failed to start: pd host1:2379, please check the instance's log(/data1/tidb-deploy/pd-2379/log) for more detail.: timed out waiting for port 9115 to be started after 2m0s

Verbose debug logs has been written to /home/tidb/.tiup/logs/tiup-cluster-debug-2021-04-13-12-24-03.log.
Error: run /home/tidb/.tiup/components/cluster/v1.4.0/tiup-cluster (wd:/home/tidb/.tiup/data/SUSp6Qz) failed: exit status 1

2) Neither /data1/tidb-deploy/pd-2379/log nor /home/tidb/.tiup/logs/tiup-cluster-debug-2021-04-13-12-24-03.log contains anything useful.
systemctl list-units shows blackbox_exporter-9115.service stuck in the activating state, unlike the other services, and I do not know why it will not start:
atd.service loaded active running Job spooling tools
auditd.service loaded active running Security Auditing Service
avahi-daemon.service loaded active running Avahi mDNS/DNS-SD Stack
blackbox_exporter-9115.service loaded activating auto-restart blackbox_exporter service
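A way to dig further when a unit sits in activating (auto-restart) like this, using standard systemd tooling and the unit name from the listing above:

systemctl status blackbox_exporter-9115.service         # last exit status and restart counter
journalctl -u blackbox_exporter-9115.service -n 50      # recent journal lines for the unit, including spawn errors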

3) Not sure whether this is specific to 5.0, but {deploy_path}/bin/node_exporter and {deploy_path}/scripts/run_node_exporter.sh do not exist.

Were there any errors when the cluster was deployed? If convenient, please share the cluster topology file.

Deploy reported no errors.
vi simple-mini.yaml

# Global variables are applied to all deployments and used as the default value of
# the deployments if a specific deployment value is missing.

global:
  user: "tidb"
  ssh_port: 22
  deploy_dir: "/data1/tidb-deploy"
  data_dir: "/data1/tidb-data"

pd_servers:
  - host: host1
  - host: host2
  - host: host3

tidb_servers:
  - host: host1
  - host: host2
  - host: host3

tikv_servers:
  - host: host1
  - host: host2
  - host: host3

monitoring_servers:
  - host: host1

grafana_servers:
  - host: host1

alertmanager_servers:
  - host: host1
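For reference, the environment report below is what TiUP's pre-deploy check prints; with the topology file and user above it can be reproduced with something like:

tiup cluster check ./simple-mini.yaml --user tidb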

Only some of the OS checks failed; not sure whether that is related:
Node   Check            Result  Message
----   -----            ------  -------
host1  os-version       Fail    os vendor ol not supported
host1  cpu-cores        Pass    number of CPU cores / threads: 8
host1  swap             Fail    swap is enabled, please disable it for best performance
host1  memory           Pass    memory size is 0MB
host1  network          Pass    network speed of enp0s3 is 1000MB
host1  epoll-exclusive  Fail    epoll exclusive is not supported
host1  selinux          Pass    SELinux is disabled
host1  thp              Pass    THP is disabled
host1  command          Fail    numactl not usable, bash: numactl: command not found
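A minimal sketch for clearing the two actionable failures above (swap and numactl), assuming an RHEL-family package manager on Oracle Linux:

sudo yum install -y numactl                   # provides the numactl command the check expects
sudo swapoff -a                               # disable swap on the running system
sudo sed -i '/ swap / s/^/#/' /etc/fstab      # keep swap disabled across reboots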

Please provide the PD logs, thanks!

pd.log:

[2021/04/14 01:13:02.331 +00:00] [INFO] [index.go:189] ["compact tree index"] [revision=28047]
[2021/04/14 01:13:02.347 +00:00] [INFO] [kvstore_compaction.go:55] ["finished scheduled compaction"] [compact-revision=28047] [took=14.964564ms]
[2021/04/14 02:13:02.363 +00:00] [INFO] [index.go:189] ["compact tree index"] [revision=29245]
[2021/04/14 02:13:02.378 +00:00] [INFO] [kvstore_compaction.go:55] ["finished scheduled compaction"] [compact-revision=29245] [took=14.902264ms]
[2021/04/14 03:13:02.437 +00:00] [INFO] [index.go:189] ["compact tree index"] [revision=30443]
[2021/04/14 03:13:02.453 +00:00] [INFO] [kvstore_compaction.go:55] ["finished scheduled compaction"] [compact-revision=30443] [took=15.009543ms]
[2021/04/14 04:13:02.459 +00:00] [INFO] [index.go:189] ["compact tree index"] [revision=31642]
[2021/04/14 04:13:02.473 +00:00] [INFO] [kvstore_compaction.go:55] ["finished scheduled compaction"] [compact-revision=31642] [took=13.942621ms]
[2021/04/14 05:13:02.487 +00:00] [INFO] [index.go:189] ["compact tree index"] [revision=32840]
[2021/04/14 05:13:02.502 +00:00] [INFO] [kvstore_compaction.go:55] ["finished scheduled compaction"] [compact-revision=32840] [took=15.088605ms]
[2021/04/14 06:13:02.510 +00:00] [INFO] [index.go:189] ["compact tree index"] [revision=34038]
[2021/04/14 06:13:02.526 +00:00] [INFO] [kvstore_compaction.go:55] ["finished scheduled compaction"] [compact-revision=34038] [took=15.310427ms]
[2021/04/14 06:30:04.212 +00:00] [WARN] [util.go:144] ["apply request took too long"] [took=132.155274ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/pd/6950436558676464170/dc-location\" range_end:\"/pd/6950436558676464170/dc-locatioo\" "] [response="range_response_count:0 size:6"] []
[2021/04/14 06:30:04.213 +00:00] [INFO] [trace.go:145] ["trace[1555040713] range"] [detail="{range_begin:/pd/6950436558676464170/dc-location; range_end:/pd/6950436558676464170/dc-locatioo; response_count:0; response_revision:35696; }"] [duration=132.500577ms] [start=2021/04/14 06:30:04.080 +00:00] [end=2021/04/14 06:30:04.213 +00:00] [steps="[\"trace[1555040713] 'agreement among raft nodes before linearized reading' (duration: 131.9924ms)\"]"]
[2021/04/14 06:31:29.429 +00:00] [WARN] [util.go:144] ["apply request took too long"] [took=1.350623355s] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/pd/6950436558676464170/config\" "] [response="range_response_count:1 size:3265"] []
[2021/04/14 06:31:29.429 +00:00] [INFO] [trace.go:145] ["trace[493913785] range"] [detail="{range_begin:/pd/6950436558676464170/config; range_end:; response_count:1; response_revision:35724; }"] [duration=1.350868085s] [start=2021/04/14 06:31:28.078 +00:00] [end=2021/04/14 06:31:29.429 +00:00] [steps="[\"trace[493913785] 'agreement among raft nodes before linearized reading' (duration: 1.35041662s)\"]"]
[2021/04/14 06:31:29.430 +00:00] [WARN] [etcdutil.go:118] ["kv gets too slow"] [request-key=/pd/6950436558676464170/config] [cost=1.352576725s] []
[2021/04/14 06:32:15.183 +00:00] [WARN] [util.go:144] ["apply request took too long"] [took=103.566014ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/pd/6950436558676464170/config\" "] [response="range_response_count:1 size:3265"] []
[2021/04/14 06:32:15.184 +00:00] [INFO] [trace.go:145] ["trace[1176880480] range"] [detail="{range_begin:/pd/6950436558676464170/config; range_end:; response_count:1; response_revision:35739; }"] [duration=103.913274ms] [start=2021/04/14 06:32:15.080 +00:00] [end=2021/04/14 06:32:15.184 +00:00] [steps="[\"trace[1176880480] 'agreement among raft nodes before linearized reading' (duration: 103.251442ms)\"]"]
[2021/04/14 07:13:02.536 +00:00] [INFO] [index.go:189] ["compact tree index"] [revision=35236]
[2021/04/14 07:13:02.552 +00:00] [INFO] [kvstore_compaction.go:55] ["finished scheduled compaction"] [compact-revision=35236] [took=15.001162ms]

[root@celvpvm07848 log]# cat pd_stderr.log
[2021/04/13 12:12:02.287 +00:00] [WARN] [retry_interceptor.go:61] ["retrying of unary invoker failed"] [target=endpoint://client-9a90da33-b853-46e6-a5bb-95b2b8de43e6/celvpvm07848.us.oracle.com:2379] [attempt=0] [error="rpc error: code = NotFound desc = etcdserver: requested lease not found"]

Please help to check

What operating system and version is the cluster deployed on?

Oracle Linux 7.6.
According to the installation guide it should be supported.

Have you tested on this environment before? Try deploying 4.0 first.
Directions to check:
(1) Whether the SSH settings were changed; for single-machine deployments the docs say the maximum number of SSH sessions needs to be raised (a sketch follows this list).
(2) Whether the required ports are open.
(3) See the post TIDB 入门运维基础视频教程(一)-- 快速体验 and double-check the configuration.
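A minimal sketch for point (1), assuming the stock sshd_config location on an RHEL-family system (adjust the value, or append the directive if it is missing):

sudo sed -i 's/^#\?MaxSessions.*/MaxSessions 30/' /etc/ssh/sshd_config   # raise the per-connection session limit used by TiUP's parallel SSH
sudo systemctl restart sshd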

MaxSessions 30 is already set.
Port 9115 is reachable.
After rebooting the server, the database is now up on host1, but the cluster as a whole still will not start:
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir


host1:9093 alertmanager host1 9093/9094 linux/x86_64 Up /data1/tidb-data/alertmanager-9093 /data1/tidb-deploy/alertmanager-9093
host1:3000 grafana host1 3000 linux/x86_64 Up - /data1/tidb-deploy/grafana-3000
host2:2379 pd host2 2379/2380 linux/x86_64 Up|L /data1/tidb-data/pd-2379 /data1/tidb-deploy/pd-2379
host1:2379 pd host1 2379/2380 linux/x86_64 Up /data1/tidb-data/pd-2379 /data1/tidb-deploy/pd-2379
host3:2379 pd host3 2379/2380 linux/x86_64 Up|UI /data1/tidb-data/pd-2379 /data1/tidb-deploy/pd-2379
host1:9090 prometheus host1 9090 linux/x86_64 Up /data1/tidb-data/prometheus-9090 /data1/tidb-deploy/prometheus-9090
host2:4000 tidb host2 4000/10080 linux/x86_64 Down - /data1/tidb-deploy/tidb-4000
host1:4000 tidb host1 4000/10080 linux/x86_64 Up - /data1/tidb-deploy/tidb-4000
host3:4000 tidb host3 4000/10080 linux/x86_64 Down - /data1/tidb-deploy/tidb-4000
host2:20160 tikv host2 20160/20180 linux/x86_64 N/A /data1/tidb-data/tikv-20160 /data1/tidb-deploy/tikv-20160
host1:20160 tikv host1 20160/20180 linux/x86_64 Up /data1/tidb-data/tikv-20160 /data1/tidb-deploy/tikv-20160
host3:20160 tikv host3 20160/20180 linux/x86_64 N/A /data1/tidb-data/tikv-20160 /data1/tidb-deploy/tikv-20160

Starting component blackbox_exporter
Starting instance celvpvm07848.us.oracle.com

Error: failed to start: pd celvpvm07848.us.oracle.com:2379, please check the instance's log(/data1/tidb-deploy/pd-2379/log) for more detail.: timed out waiting for port 9115 to be started after 2m0s

Verbose debug logs has been written to /home/tidb/.tiup/logs/tiup-cluster-debug-2021-04-15-09-10-31.log.
Error: run /home/tidb/.tiup/components/cluster/v1.4.0/tiup-cluster (wd:/home/tidb/.tiup/data/SUdjQF7) failed: exit status 1

Question: can the blackbox_exporter component be skipped? It does not seem very useful, and the start is stuck on it.
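A hedged workaround for that question: start only the core roles and leave the monitoring agents for later. Role names follow tiup cluster display; whether the agents are actually skipped may vary by tiup version:

tiup cluster start tidb-test -R pd,tikv,tidb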

The OS log shows that systemd fails to spawn /data1/tidb-deploy/monitor-9100/scripts/run_blackbox_exporter.sh, reporting an invalid argument at the CAPABILITIES step:

Apr 15 09:35:05 celvpvm07848 systemd: blackbox_exporter-9115.service: main process exited, code=exited, status=218/CAPABILITIES
Apr 15 09:35:05 celvpvm07848 systemd: Unit blackbox_exporter-9115.service entered failed state.
Apr 15 09:35:05 celvpvm07848 systemd: blackbox_exporter-9115.service failed.
Apr 15 09:35:20 celvpvm07848 systemd: blackbox_exporter-9115.service holdoff time over, scheduling restart.
Apr 15 09:35:20 celvpvm07848 systemd: Stopped blackbox_exporter service.
Apr 15 09:35:20 celvpvm07848 systemd: Started blackbox_exporter service.
Apr 15 09:35:20 celvpvm07848 systemd: Failed at step CAPABILITIES spawning /data1/tidb-deploy/monitor-9100/scripts/run_blackbox_exporter.sh: Invalid argument
Apr 15 09:35:20 celvpvm07848 systemd: blackbox_exporter-9115.service: main process exited, code=exited, status=218/CAPABILITIES
Apr 15 09:35:20 celvpvm07848 systemd: Unit blackbox_exporter-9115.service entered failed state.
Apr 15 09:35:20 celvpvm07848 systemd: blackbox_exporter-9115.service failed.
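To see exactly which directives the failing unit carries (including the capabilities line discussed below), the standard systemd command is enough:

systemctl cat blackbox_exporter-9115.service     # prints the unit file plus any drop-ins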

The shell script's parameters look fine:
cat /data1/tidb-deploy/monitor-9100/scripts/run_blackbox_exporter.sh
#!/bin/bash
set -e

# WARNING: This file was auto-generated. Do not edit!
# All your edit might be overwritten!

DEPLOY_DIR=/data1/tidb-deploy/monitor-9100
cd "${DEPLOY_DIR}" || exit 1

exec > >(tee -i -a "/data1/tidb-deploy/monitor-9100/log/blackbox_exporter.log")
exec 2>&1

EXPORTER_BIN=bin/blackbox_exporter/blackbox_exporter
if [ ! -f $EXPORTER_BIN ]; then
    EXPORTER_BIN=bin/blackbox_exporter
fi
exec $EXPORTER_BIN \
    --web.listen-address=":9115" \
    --log.level="info" \
    --config.file="conf/blackbox.yml"
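One way to rule the script itself out is to run it by hand as the deploy user (path taken from above); if blackbox_exporter comes up and listens on 9115, the problem lies in the systemd unit rather than in the script:

sudo -u tidb bash /data1/tidb-deploy/monitor-9100/scripts/run_blackbox_exporter.sh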

[root@celvpvm07848 conf]# cat blackbox.yml
modules:
  http_2xx:
    prober: http
    http:
      method: GET
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"

It looks like an OS kernel issue. I commented out this line in blackbox_exporter-9115.service:

#AmbientCapabilities=CAP_NET_RAW
Now blackbox_exporter starts.
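For reference, a sketch of how that workaround can be applied and picked up by systemd (the unit path is assumed to be the standard location TiUP installs to; a drop-in via systemctl edit would survive regeneration better than editing the file in place):

sudo sed -i 's/^AmbientCapabilities/#AmbientCapabilities/' /etc/systemd/system/blackbox_exporter-9115.service
sudo systemctl daemon-reload
sudo systemctl restart blackbox_exporter-9115.service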

The problem is solved, thanks!

Nice. Have you installed 4.0 on this system before, and does it hit the same problem? Mainly trying to confirm whether this is caused by 5.0 or by the Oracle Linux operating system.

I have not installed 4.0. It is mainly a kernel issue; the key error string is "status=218/CAPABILITIES". The OS version probably does not support AmbientCapabilities=CAP_NET_RAW.
A similar issue:
https://github.com/scylladb/scylla/issues/3582
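A quick, hedged check of whether the running kernel can honor that directive at all: ambient capabilities were added in Linux 4.3, so RHEL/OL 7's stock 3.10 kernel is a likely culprit, while newer UEK kernels should have them:

uname -r    # 3.10.x suggests no ambient-capability support; 4.x and later should have it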

:+1:

If you comment out CAP_NET_RAW, you will probably lose blackbox_exporter's ping probe.
Check the net.ipv4.ping_group_range setting: is the group of the user running blackbox_exporter inside that range?

sudo sysctl net.ipv4.ping_group_range

If it is not, and systemd cannot set the AmbientCapabilities parameter either, then blackbox_exporter's ping probe will not work.
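If the ICMP probe is still wanted without CAP_NET_RAW, a sketch of the sysctl route (the range below covers all group ids; pick a narrower one in practice and persist it under /etc/sysctl.d/ to survive reboots):

id -g tidb                                                  # group id of the user running blackbox_exporter
sudo sysctl -w net.ipv4.ping_group_range="0 2147483647"     # allow unprivileged ICMP (ping) sockets for these gids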

Reference: blackbox_exporter#permissions


Today I reinstalled tidb-v4.0.5 with tiup and hit the same problem; it was also resolved by commenting out #AmbientCapabilities=CAP_NET_RAW. An earlier upgrade to v4.0.5 did not show this problem. Comparing the two sides' configuration files, there are indeed differences. What is the AmbientCapabilities parameter actually for?

This falls under systemd and Linux capabilities; you can search for more details.


Hit the same error installing a single-machine cluster with tiup on OEL 7.9 (tiup 1.5.0, tidb v4.0.13); after commenting out the line it started successfully.