3 TiKV nodes: after scaling one in and back out, it fails to start

Cluster: 3 PD, 3 TiDB, 3 TiKV

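One of the TiKV nodes went down and could not be restarted. I followed the documented scale-in procedure strictly; once it finished, store 5 showed the Tombstone state. I then scaled out as documented, but the node fails to start after scale-out with the following error: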

[172.7.160.198]: Ansible FAILED! => playbook: start.yml; TASK: wait until the TiKV port is up; message: {"changed": false, "elapsed": 300, "msg": "the TiKV port 20160 is not up"}


The end of tikv.log:

[2020/04/18 00:38:32.463 +08:00] [INFO] [mod.rs:334] ["starting working thread"] [worker=addr-resolver]
[2020/04/18 00:38:32.501 +08:00] [INFO] [scheduler.rs:257] ["Scheduler::new is called to initialize the transaction scheduler"]
[2020/04/18 00:38:32.711 +08:00] [INFO] [scheduler.rs:278] ["Scheduler::new is finished, the transaction scheduler is initialized"]
[2020/04/18 00:38:32.711 +08:00] [INFO] [mod.rs:334] ["starting working thread"] [worker=gc-worker]
[2020/04/18 00:38:32.711 +08:00] [INFO] [mod.rs:722] ["Storage started."]
[2020/04/18 00:38:32.713 +08:00] [INFO] [server.rs:148] ["listening on addr"] [addr=0.0.0.0:20160]
[2020/04/18 00:38:32.713 +08:00] [INFO] [mod.rs:334] ["starting working thread"] [worker=region-collector-worker]
[2020/04/18 00:38:32.713 +08:00] [INFO] [node.rs:333] ["start raft store thread"] [store_id=19931]
[2020/04/18 00:38:32.713 +08:00] [INFO] [store.rs:800] ["start store"] [takes=36.867µs] [merge_count=0] [applying_count=0] [tombstone_count=0] [region_count=0] [store_id=19931]
[2020/04/18 00:38:32.713 +08:00] [INFO] [store.rs:852] ["cleans up garbage data"] [takes=11.403µs] [garbage_range_count=1] [store_id=19931]
[2020/04/18 00:38:32.714 +08:00] [INFO] [mod.rs:334] ["starting working thread"] [worker=split-check]
[2020/04/18 00:38:32.714 +08:00] [INFO] [mod.rs:334] ["starting working thread"] [worker=snapshot-worker]
[2020/04/18 00:38:32.714 +08:00] [INFO] [mod.rs:334] ["starting working thread"] [worker=raft-gc-worker]
[2020/04/18 00:38:32.714 +08:00] [INFO] [mod.rs:334] ["starting working thread"] [worker=compact-worker]
[2020/04/18 00:38:32.715 +08:00] [INFO] [future.rs:131] ["starting working thread"] [worker=pd-worker]
[2020/04/18 00:38:32.715 +08:00] [INFO] [mod.rs:334] ["starting working thread"] [worker=consistency-check]
[2020/04/18 00:38:32.715 +08:00] [INFO] [mod.rs:334] ["starting working thread"] [worker=cleanup-sst]
[2020/04/18 00:38:32.715 +08:00] [WARN] [store.rs:1118] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: \"Permission denied\" }"]
[2020/04/18 00:38:32.715 +08:00] [INFO] [node.rs:161] ["put store to PD"] [store="id: 19931 address: \"172.7.160.235:20160\" version: \"3.0.12\""]
[2020/04/18 00:38:32.716 +08:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some(\"duplicated store address: id:19931 address:\\\"172.7.160.235:20160\\\" version:\\\"3.0.12\\\" , already registered by id:1 address:\\\"172.7.160.235:20160\\\" version:\\\"3.0.12\\\" \") }))"]
[2020/04/18 00:38:32.716 +08:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some(\"duplicated store address: id:19931 address:\\\"172.7.160.235:20160\\\" version:\\\"3.0.12\\\" , already registered by id:1 address:\\\"172.7.160.235:20160\\\" version:\\\"3.0.12\\\" \") }))"]
[2020/04/18 00:38:32.716 +08:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some(\"duplicated store address: id:19931 address:\\\"172.7.160.235:20160\\\" version:\\\"3.0.12\\\" , already registered by id:1 address:\\\"172.7.160.235:20160\\\" version:\\\"3.0.12\\\" \") }))"]
[2020/04/18 00:38:32.716 +08:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some(\"duplicated store address: id:19931 address:\\\"172.7.160.235:20160\\\" version:\\\"3.0.12\\\" , already registered by id:1 address:\\\"172.7.160.235:20160\\\" version:\\\"3.0.12\\\" \") }))"]
[2020/04/18 00:38:32.717 +08:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some(\"duplicated store address: id:19931 address:\\\"172.7.160.235:20160\\\" version:\\\"3.0.12\\\" , already registered by id:1 address:\\\"172.7.160.235:20160\\\" version:\\\"3.0.12\\\" \") }))"]
[2020/04/18 00:38:32.717 +08:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some(\"duplicated store address: id:19931 address:\\\"172.7.160.235:20160\\\" version:\\\"3.0.12\\\" , already registered by id:1 address:\\\"172.7.160.235:20160\\\" version:\\\"3.0.12\\\" \") }))"]
[2020/04/18 00:38:32.717 +08:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some(\"duplicated store address: id:19931 address:\\\"172.7.160.235:20160\\\" version:\\\"3.0.12\\\" , already registered by id:1 address:\\\"172.7.160.235:20160\\\" version:\\\"3.0.12\\\" \") }))"]
[2020/04/18 00:38:32.717 +08:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some(\"duplicated store address: id:19931 address:\\\"172.7.160.235:20160\\\" version:\\\"3.0.12\\\" , already registered by id:1 address:\\\"172.7.160.235:20160\\\" version:\\\"3.0.12\\\" \") }))"]
[2020/04/18 00:38:32.718 +08:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some(\"duplicated store address: id:19931 address:\\\"172.7.160.235:20160\\\" version:\\\"3.0.12\\\" , already registered by id:1 address:\\\"172.7.160.235:20160\\\" version:\\\"3.0.12\\\" \") }))"]
[2020/04/18 00:38:32.718 +08:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some(\"duplicated store address: id:19931 address:\\\"172.7.160.235:20160\\\" version:\\\"3.0.12\\\" , already registered by id:1 address:\\\"172.7.160.235:20160\\\" version:\\\"3.0.12\\\" \") }))"]
[2020/04/18 00:38:32.718 +08:00] [FATAL] [server.rs:264] ["failed to start node: Other(\"[src/pd/util.rs:335]: fail to request\")"]

Did I do something wrong? Thanks!

Hello:
1. The error indicates that a store is already registered with this address.
2. Please run pd-ctl and post the member and store information.
https://pingcap.com/docs-cn/v3.0/reference/tools/pd-control/#下载安装包

I ran both before scaling out.
The store that was scaled in does not appear in the output.
The scaled-in IP is 172.7.160.198 and its store id is 5.

member

store

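Hello:
1. The error reads: duplicated store address: id:19931 address:"172.7.160.235:20160" version:"3.0.12", already registered by id:1. Do you intend to use 172.7.160.235:20160? That address is currently held by store 1.
2. What is configured in the inventory for the node being scaled out?
3. Please check the directory of the TiKV being scaled out for leftover data. If there is any, delete it and scale out again.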
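
To make the conflict in item 1 easier to read off the log, here is a minimal Python sketch that pulls the new store id, the contested address, and the id of the store already holding it out of a "duplicated store address" line. The names DUPLICATE_RE and parse_duplicate_store are hypothetical, and the pattern assumes the message layout seen in the tikv.log excerpt above; other TiKV or PD versions may word it differently.

```python
import re

# Pattern for the PD rejection text as it appears in the tikv.log excerpt.
# The log escapes quotes with backslashes, so both are skipped before the address.
DUPLICATE_RE = re.compile(
    r'duplicated store address: id:(?P<new_id>\d+) address:[\\"]*'
    r'(?P<address>[^\\"\s]+)'
    r'.*?already registered by id:(?P<old_id>\d+)'
)


def parse_duplicate_store(line):
    """Return (new_store_id, address, existing_store_id), or None if no match."""
    m = DUPLICATE_RE.search(line)
    if m is None:
        return None
    return int(m.group("new_id")), m.group("address"), int(m.group("old_id"))
```

Applied to the ERROR lines in the log above, this gives (19931, "172.7.160.235:20160", 1): the newly started store 19931 is announcing the address that store 1 already holds.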

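235 is a TiKV node that is running normally.
There are three TiKV nodes, with IPs ending in 235, 220 and 198. 198 failed while the other two are fine. I scaled 198 in and then scaled it out again, and that is when the error appeared.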

For example, my TiKV on 198 is deployed under /data/deploy. Does "leftover data" mean all the files under that deploy directory?

If PD and TiDB are not also deployed in this directory, mv the data directory somewhere else as a backup first, then try again.

After backing it up, I should redo the scale-in and scale-out, right?

ansible-playbook rolling_update_monitor.yml --tags=prometheus


After running this command, the TiDB cluster will restart and SQL queries will be unavailable in the meantime, is that right?

You have already scaled in, so just run the scale-out. This command only updates monitoring; it does not restart the cluster.

I deleted the original TiKV deploy directory, but it still fails to start with the same error.
bushu.txt (50.2 KB)

Please upload inventory.ini, along with log/ansible.log from the installation directory on the control machine. Thanks.

ansible.log (116.2 KB) inventory.txt (2.0 KB)

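It looks like the node being scaled out is 198, which should be fine. Please move everything under the deploy directory somewhere else and try scaling out again. Thanks.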

The logs I just uploaded were produced after renaming /data/deploy to /data/deploy.bak and scaling out again.

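  1. The log contains ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: \"Permission denied\" }"]. Please try creating a directory and a file by hand under the deploy directory to see whether that succeeds. Thanks.
  2. Was there a TiKV log on 198 during the scale-out? If so, please upload it. Thanks.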

Which directories and files should I create? Could I copy everything except the db directory from another TiKV node and then reset the permissions?

I checked on the TiKV node: those directories and files were created successfully.

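Hello:

    1. Please run curl http://ip:port/pd/api/v1/stores?state=2 (with the PD's ip and port) and post the output.
    2. Was this cluster deployed before and then reinstalled? Or were any directories copied over from other nodes?
    3. As I understand the current issue, you scaled in 198 and then scaled out 198 again, correct? This 198 and 235
    4. Try deleting the whole deploy directory.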
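
The check in item 1 can also be scripted. The following is a rough sketch, not an official client: fetch_stores and stores_at are hypothetical helper names, the endpoint is the one given in item 1, and the JSON field names (stores, store, id, address, state_name) are assumed to match what PD returns on this version.

```python
import json
from urllib.request import urlopen


def fetch_stores(pd_url, state=None):
    """GET /pd/api/v1/stores from PD, optionally filtered by ?state=N."""
    url = pd_url.rstrip("/") + "/pd/api/v1/stores"
    if state is not None:
        url += "?state=%d" % state
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def stores_at(stores_doc, address):
    """Return (id, state_name) for every store registered under `address`."""
    return [
        (item["store"]["id"], item["store"].get("state_name", ""))
        for item in stores_doc.get("stores", [])
        if item["store"].get("address") == address
    ]
```

For example, stores_at(fetch_stores("http://ip:port"), "172.7.160.235:20160") would list which store ids PD currently associates with the address named in the error, where ip:port is the PD endpoint.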

1.

[tidb@tidb9 tidb-ansible]$ curl http://172.7.160.216:2379/pd/api/v1/stores?state=2
{
  "count": 1,
  "stores": [
    {
      "store": {
        "id": 5,
        "address": "172.7.160.198:20160",
        "state": 2,
        "version": "3.0.12",
        "state_name": "Tombstone"
      },
      "status": {
        "leader_weight": 1,
        "region_weight": 1,
        "start_ts": "1970-01-01T08:00:00+08:00"
      }
    }
  ]
}

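2 / 3. The original deployment was 3 PD, 3 TiDB and 3 TiKV (235, 220, 198). Only 198 failed and could not be restarted, so I scaled it in and out several times, but it still will not start.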

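4. This is how I did the most recent installation; the error is still the same:

a/ Deleted the tidb user and its home directory on 198, and deleted the deploy directory.

b/ Confirmed that the firewall and SELinux are disabled.

c/ Manually set up SSH mutual trust from the control machine. After that, ssh 172.7.160.198 did not prompt for a password, which confirms the trust works.

d/ Scaled out following this document: https://pingcap.com/docs-cn/v3.0/how-to/scale/with-ansible/

e/ The last step, start.yml -l, reported a failure: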

fatal: [172.7.160.198]: FAILED! => {"changed": false, "elapsed": 300, "msg": "the TiKV port 20160 is not up"}
	to retry, use: --limit @/home/tidb/tidb-ansible/retry_files/start.retry

PLAY RECAP *********************************************************************************************************************************************************************************
172.7.160.198              : ok=10   changed=0    unreachable=0    failed=1   


ERROR MESSAGE SUMMARY **********************************************************************************************************************************************************************
[172.7.160.198]: Ansible FAILED! => playbook: start.yml; TASK: wait until the TiKV port is up; message: {"changed": false, "elapsed": 300, "msg": "the TiKV port 20160 is not up"}

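Also: the SSD that hosts 198 has a few bad sectors, and the replacement disk will not be installed for a few days. Could that be the cause?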

Please upload the complete TiKV log. Thanks.

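After this most recent scale-in and scale-out there are no log files under /data/deploy/log, and no db either. The attachments contain the log files backed up from the previous attempt.
The error was the same every time. db.tar.gz (8.5 KB) tikv.tar.gz (52.0 KB)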

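  1. Did this last deploy succeed? Apart from the missing tikv.log, do the other directories, especially those under /data, have content?
  2. Please look at tikv/scripts/start_tikv.sh in the installation directory, try starting TiKV with it, and check whether tikv.log then contains any error messages. Thanks.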
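
When checking tikv.log after such a manual start, a short script can condense the noise. This is only a sketch: ENTRY_RE and summarize_problems are hypothetical names, and the pattern assumes the [timestamp] [LEVEL] [file:line] layout seen in the log excerpt earlier in this thread. It collapses repeated WARN/ERROR/FATAL entries into one row each with a count.

```python
import re
from collections import Counter

# [timestamp] [LEVEL] [file:line] rest-of-entry, as in the tikv.log excerpt.
ENTRY_RE = re.compile(
    r'^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<src>[^\]]+)\] (?P<rest>.*)$'
)


def summarize_problems(lines, levels=("WARN", "ERROR", "FATAL")):
    """Group entries of the given levels, ignoring timestamps.

    Returns a list of (count, level, source, rest) in first-seen order.
    """
    counts = Counter()
    for line in lines:
        m = ENTRY_RE.match(line.rstrip("\n"))
        if m and m.group("level") in levels:
            counts[(m.group("level"), m.group("src"), m.group("rest"))] += 1
    return [(n, level, src, rest) for (level, src, rest), n in counts.items()]
```

A possible use, assuming the log lives at /data/deploy/log/tikv.log as mentioned above: open the file with open(path, encoding="utf-8", errors="replace") and pass the file object to summarize_problems. On the excerpt posted at the top of this thread, the ten identical "request failed" entries would collapse into a single row with count 10.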