Problems encountered when migrating from ansible to tiup

Hi, we ran into a problem while migrating from ansible to tiup:
1) The current version is 3.1.0-beta.1.
2) When running tiup cluster reload test-cluster, one of the tikv nodes reports the error:
invalid configuration: default rocksdb not exist, buf raftdb exist

How can we resolve this? Thanks.

I looked into it; in similar reported cases the error is "default rocksdb exist, buf raftdb not exist", which is the opposite of yours. Could you please post a screenshot of the error? Thanks.

https://github.com/tikv/tikv/issues/2749

Please check tikv.toml and upload its contents as text.
Also run the command from the issue above on the tikv server and see whether any error is printed on screen.

  1. I have uploaded all of the configuration files. The problematic node is 10.12.5.221.
    ansible-imported-configs.tar.gz (29.2 KB)

  2. Also, I ran the following on node 221:

    tidb@nine:~$ /home/tidb/deploy/bin/tikv-server --config-check --config=/home/tidb/deploy/conf/tikv.toml --pd=""

    invalid configuration: default rocksdb not exist, buf raftdb exist

Thanks for the feedback. Could you provide the current cluster's server configuration?

Do you mean the network topology?

tidb@three:~/.tiup/storage/cluster/clusters/test-cluster/ansible-backup$ cat inventory.ini
## TiDB Cluster Part
[tidb_servers]
10.12.5.114
10.12.5.115

[tikv_servers]
10.12.5.230
10.12.5.220
10.12.5.221
10.12.5.223
10.12.5.224
10.12.5.226
10.12.5.227
10.12.5.228
10.12.5.229
10.12.5.231
10.12.5.233
10.12.5.214

[pd_servers]
10.12.5.113
10.12.5.114
10.12.5.115

[spark_master]

[spark_slaves]

[lightning_server]

[importer_server]

## Monitoring Part
# prometheus and pushgateway servers
[monitoring_servers]
10.12.5.232

[grafana_servers]
10.12.5.232

# node_exporter and blackbox_exporter servers
[monitored_servers]
10.12.5.232
10.12.5.113
10.12.5.114
10.12.5.115
10.12.5.230
10.12.5.220
10.12.5.221
10.12.5.223
10.12.5.226
10.12.5.224
10.12.5.227
10.12.5.228
10.12.5.229
10.12.5.231
10.12.5.233
10.12.5.214

[alertmanager_servers]
10.12.5.232

[kafka_exporter_servers]

## Binlog Part
[pump_servers]

[drainer_servers]

## Group variables
[pd_servers:vars]
# location_labels = ["zone","rack","host"]
location_labels=["host"]

## Global variables
[all:vars]
deploy_dir = /home/tidb/deploy

## Connection
# ssh via normal user
ansible_user = tidb

cluster_name = test-cluster

tidb_version = v3.1.0-beta.1

# process supervision, [systemd, supervise]
process_supervision = systemd

timezone = Asia/Shanghai

enable_firewalld = False
# check NTP service
enable_ntpd = True
set_hostname = False

## binlog trigger
enable_binlog = False
# kafka cluster address for monitoring, example:
# kafka_addrs = "192.168.0.11:9092,192.168.0.12:9092,192.168.0.13:9092"
kafka_addrs = ""
# zookeeper address of kafka cluster for monitoring, example:
# zookeeper_addrs = "192.168.0.11:2181,192.168.0.12:2181,192.168.0.13:2181"
zookeeper_addrs = ""

# enable TLS authentication in the TiDB cluster
enable_tls = False

# KV mode
deploy_without_tidb = False

# wait for region replication complete before start tidb-server.
wait_replication = True

# Optional: Set if you already have a alertmanager server.
# Format: alertmanager_host:alertmanager_port
alertmanager_target = ""

grafana_admin_user = "admin"
grafana_admin_password = "****"

### Collect diagnosis
collect_log_recent_hours = 2

enable_bandwidth_limit = True
# default: 10Mb/s, unit: Kbit/s
collect_bandwidth_limit = 10000

tidb@three:~/.tiup/storage/cluster/clusters/test-cluster/ansible-backup$

The toml files uploaded in the tar package above are all config files from the ansible backup directory.

We need to see the tikv.toml under the tikv conf directory on node 221.

Also please share the new configuration metadata file generated by tiup (editing this file directly is not recommended):
~/.tiup/storage/cluster/clusters/<cluster-name>/meta.yaml

The tikv.toml on node 221:
tikv.toml (906 bytes)

The meta.yaml on the control machine:
meta.yaml (5.5 KB)


Those two tikv configuration items are not supported in 3.1; they are 4.x features. Remove them with edit-config, reload, and then use start -N to check whether the tikv node starts successfully. You can also tail -f the tikv.log to monitor it.
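For reference, a rough command sequence might look like the following (a sketch only; it assumes the cluster name test-cluster from the inventory above and a log path of /home/tidb/deploy/log on the node, matching the deploy_dir in the inventory):

    # Open the cluster configuration for editing and delete the two unsupported tikv items
    tiup cluster edit-config test-cluster

    # Push the updated configuration to the nodes and restart the tikv role
    tiup cluster reload test-cluster -R tikv

    # On node 221, watch the TiKV log while the node comes back up
    tail -f /home/tidb/deploy/log/tikv.log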

Following your suggestion, I deleted the two parameters; the config now looks like this:

After reload, node 221 still shows an error, as shown in the figure below:

Logging in to node 221 and checking the tikv.log shows the following:

[2020/09/01 03:20:44.750 +00:00] [INFO] [raft.rs:924] ["[region 908204] 19861858 [term: 2881] received a MsgRequestVote message with higher term from 20236492 [term: 2882]"]

[2020/09/01 03:20:44.750 +00:00] [INFO] [raft.rs:723] ["[region 908204] 19861858 became follower at term 2882"]

[2020/09/01 03:20:44.750 +00:00] [INFO] [raft.rs:1108] ["[region 908204] 19861858 [logterm: 2880, index: 9508, vote: 0] cast MsgRequestVote for 20236492 [logterm: 2880, index: 9508] at term 2882"]

[2020/09/01 03:20:46.562 +00:00] [INFO] [raft.rs:1108] ["[region 2387464] 20607110 [logterm: 12671, index: 12624, vote: 20607110] cast MsgRequestPreVote for 17509099 [logterm: 12671, index: 12624] at term 12672"]

[2020/09/01 03:20:46.606 +00:00] [INFO] [raft.rs:924] ["[region 2387464] 20607110 [term: 12672] received a MsgRequestVote message with higher term from 17509099 [term: 12673]"]

[2020/09/01 03:20:46.606 +00:00] [INFO] [raft.rs:723] ["[region 2387464] 20607110 became follower at term 12673"]

[2020/09/01 03:20:46.606 +00:00] [INFO] [raft.rs:1108] ["[region 2387464] 20607110 [logterm: 12671, index: 12624, vote: 0] cast MsgRequestVote for 17509099 [logterm: 12671, index: 12624] at term 12673"]

[2020/09/01 03:20:48.658 +00:00] [INFO] [raft.rs:1108] ["[region 7121916] 20293864 [logterm: 12421, index: 15790, vote: 20293864] cast MsgRequestPreVote for 17554487 [logterm: 12421, index: 15790] at term 12422"]

[2020/09/01 03:20:48.660 +00:00] [INFO] [raft.rs:924] ["[region 7121916] 20293864 [term: 12422] received a MsgRequestVote message with higher term from 17554487 [term: 12423]"]

[2020/09/01 03:20:48.660 +00:00] [INFO] [raft.rs:723] ["[region 7121916] 20293864 became follower at term 12423"]

  ----------------- (some lines omitted) -----------------

[2020/09/01 03:21:10.254 +00:00] [WARN] [raft_client.rs:132] ["batch_raft/raft RPC finally fail"] [err="RpcFinished(Some(RpcStatus { status: Unavailable, details: Some(\"Connect Failed\") }))"] [to_addr=10.12.5.214:20160]

[2020/09/01 03:21:10.256 +00:00] [WARN] [raft_client.rs:207] ["send to 10.12.5.214:20160 fail, the gRPC connection could be broken"]

[2020/09/01 03:21:10.256 +00:00] [ERROR] [transport.rs:318] ["send raft msg err"] [err="Other(\"[src/server/raft_client.rs:216]: RaftClient send fail\")"]

[2020/09/01 03:21:10.256 +00:00] [INFO] [transport.rs:299] ["resolve store address ok"] [addr=10.12.5.214:20160] [store_id=17388737]

[2020/09/01 03:21:10.256 +00:00] [INFO] [raft_client.rs:50] ["server: new connection with tikv endpoint"] [addr=10.12.5.214:20160]

[2020/09/01 03:21:10.256 +00:00] [INFO] [subchannel.cc:878] ["Connect failed: {\"created\":\"@1598930470.256875206\",\"description\":\"Failed to connect to remote host: OS Error\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.7/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":207,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:10.12.5.214:20160\"}"]

[2020/09/01 03:21:10.257 +00:00] [INFO] [subchannel.cc:760] ["Subchannel 0x7efcc9a51200: Retry in 1000 milliseconds"]

[2020/09/01 03:21:10.257 +00:00] [WARN] [raft_client.rs:118] ["batch_raft RPC finished fail"] [err="RpcFinished(Some(RpcStatus { status: Unavailable, details: Some(\"Connect Failed\") }))"]

[2020/09/01 03:21:10.257 +00:00] [WARN] [raft_client.rs:132] ["batch_raft/raft RPC finally fail"] [err="RpcFinished(Some(RpcStatus { status: Unavailable, details: Some(\"Connect Failed\") }))"] [to_addr=10.12.5.214:20160]

Also, how exactly do I use start -N? Could you explain in a bit more detail?

Please upload the meta file and the tikv.toml file again so we can take a look.
What I meant is that you can simply start the faulty tikv node: tiup cluster start clustername -N tikv-ip:tikv-port (use the value shown in the first column of display).
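For example, a concrete invocation might look like this (a sketch assuming the cluster name test-cluster and TiKV's port 20160, which is what the logs above show):

    # List the nodes; the first column is the node ID in the form ip:port
    tiup cluster display test-cluster

    # Start only the faulty TiKV node
    tiup cluster start test-cluster -N 10.12.5.221:20160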

Please also run df and show the current disk usage.

  1. Files:
    tikv.toml (857 bytes)
    meta.yaml (5.4 KB)

  2. Starting the faulty node: I ran the command (shown in the blue box) on the control machine and it reported success, as shown in the screenshot below.

  3. Disk usage on node 221:
tidb@nine:~/deploy/conf$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       885G  701G  140G  84% /

Please share the output of display.
Just to confirm: reload did not succeed, but start can bring the node up, right?

Yes.

1. The current display output shows all nodes as online:

2. The reload of 221 still fails:

3. start is able to bring the node up:

start can succeed because start does not check the configuration. For reload, you can add the --ignore-config-check flag, which likewise skips the config-check result; with that the reload should succeed.
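For example (a sketch, again assuming the cluster name test-cluster):

    # Reload while skipping the configuration check
    tiup cluster reload test-cluster --ignore-config-check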

I ran into a problem:

Error: 	10.12.5.232 failed to restart: timed out waiting for port 9093 to be started after 2m0s: timed out waiting for port 9093 to be started after 2m0s

Verbose debug logs has been written to /home/tidb/tidb-enterprise-tools-latest-linux-amd64/logs/tiup-cluster-debug-2020-09-01-15-19-52.log.
Error: run `/home/tidb/.tiup/components/cluster/v1.1.0/tiup-cluster` (wd:/home/tidb/.tiup/data/S9LXvh5) failed: exit status 1

The corresponding log:
tiup-cluster-debug-2020-09-01-15-19-52.log (471.0 KB)

Are you using tiup-cluster v1.1.0? That version has a minor issue that was fixed last night. Please run tiup update cluster to upgrade to v1.1.1, then retry the operation.

Hi, my tiup version is currently v1.0.9. Do I need to upgrade first?

1.0.9 is fine. Could you log on to 10.12.5.232 and pull the alertmanager logs so we can take a look?
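A rough way to collect them (a sketch only; the systemd unit name is assumed to follow tiup's <component>-<port> convention and the log path is assumed to sit under the deploy_dir /home/tidb/deploy from the inventory above; adjust if your layout differs):

    # On 10.12.5.232: check whether anything is listening on the alertmanager port
    ss -lntp | grep 9093

    # Check the service status (unit name assumed)
    sudo systemctl status alertmanager-9093

    # Look for the alertmanager log under the deploy directory (path assumed)
    ls /home/tidb/deploy/log/
    tail -n 200 /home/tidb/deploy/log/alertmanager.log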