TiKV data files lost

  • [TiDB version]: v3.0.12
  • [Problem description]: A TiKV storage node went down and its SST files are missing.
    This is a 5-node cluster with 3 TiKV storage nodes. The server hosting one of the TiKV nodes rebooted unexpectedly, and after the reboot TiKV reports missing SST files. How can this be resolved? The error log is as follows (a quick check for the missing files is sketched below the log):
    [2020/06/23 14:52:56.181 +08:00] [FATAL] [server.rs:165] ["failed to create kv engine: RocksDb Corruption: Can't access /000467.sst: IO error: while stat a file for size: /data/tidb-data/tikv-20160/db/000467.sst: No such file or directory\ Can't access /000325.sst: IO error: while stat a file for size: /data/tidb-data/tikv-20160/db/000325.sst: No such file or directory\ "]
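Before anything else, it is worth confirming on the affected server that the files named in the log are really gone from disk (a minimal sketch; the paths are taken from the error message above):

# Check the two SSTs named in the FATAL log
ls -l /data/tidb-data/tikv-20160/db/000467.sst \
      /data/tidb-data/tikv-20160/db/000325.sst

# Count how many SST files survive in the RocksDB directory
ls /data/tidb-data/tikv-20160/db/ | grep -c '\.sst$'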
  1. How many replicas are configured? Could you share the output of config show all and store from pd-ctl? Is sync-log set to true or false?
  2. The file cannot be found under the db directory? Can the operating system recover it? Is the plan now to accept the loss of this data and restore the node?
  1. The default configuration with 3 replicas is used, and sync-log is true. The output of config show all and store is as follows:
    Starting component ctl: /root/.tiup/components/ctl/v3.0.1/ctl pd -- -u http://172.16.10.151:2379 config show all
    {
      "client-urls": "http://172.16.10.151:2379",
      "peer-urls": "http://172.16.10.151:2380",
      "advertise-client-urls": "http://172.16.10.151:2379",
      "advertise-peer-urls": "http://10.1.1.151:2380",
      "name": "pd-172.16.10.151-2379",
      "data-dir": "/data/dfs01/tidb-data/pd-2379",
      "force-new-cluster": false,
      "enable-grpc-gateway": true,
      "initial-cluster": "pd-172.16.10.151-2379=http://172.16.10.151:2380",
      "initial-cluster-state": "new",
      "join": "",
      "lease": 3,
      "log": {
        "level": "",
        "format": "text",
        "disable-timestamp": false,
        "file": {
          "filename": "/data/dfs00/tidb-deploy/pd-2379/log/pd.log",
          "log-rotate": true,
          "max-size": 300,
          "max-days": 0,
          "max-backups": 0
        },
        "development": false,
        "disable-caller": false,
        "disable-stacktrace": false,
        "disable-error-verbose": true,
        "sampling": null
      },
      "log-file": "",
      "log-level": "",
      "tso-save-interval": "3s",
      "metric": {
        "job": "pd-172.16.10.151-2379",
        "address": "",
        "interval": "15s"
      },
      "schedule": {
        "max-snapshot-count": 3,
        "max-pending-peer-count": 16,
        "max-merge-region-size": 20,
        "max-merge-region-keys": 200000,
        "split-merge-interval": "1h0m0s",
        "enable-one-way-merge": "false",
        "patrol-region-interval": "100ms",
        "max-store-down-time": "30m0s",
        "leader-schedule-limit": 4,
        "region-schedule-limit": 2048,
        "replica-schedule-limit": 64,
        "merge-schedule-limit": 8,
        "hot-region-schedule-limit": 4,
        "hot-region-cache-hits-threshold": 3,
        "store-balance-rate": 15,
        "tolerant-size-ratio": 0,
        "low-space-ratio": 0.8,
        "high-space-ratio": 0.6,
        "scheduler-max-waiting-operator": 3,
        "disable-raft-learner": "false",
        "disable-remove-down-replica": "false",
        "disable-replace-offline-replica": "false",
        "disable-make-up-replica": "false",
        "disable-remove-extra-replica": "false",
        "disable-location-replacement": "false",
        "disable-namespace-relocation": "false",
        "schedulers-v2": [
          {
            "type": "balance-region",
            "args": null,
            "disable": false
          },
          {
            "type": "balance-leader",
            "args": null,
            "disable": false
          },
          {
            "type": "hot-region",
            "args": null,
            "disable": false
          },
          {
            "type": "label",
            "args": null,
            "disable": false
          }
        ]
      },
      "replication": {
        "max-replicas": 3,
        "location-labels": "",
        "strictly-match-label": "false"
      },
      "namespace": {},
      "pd-server": {
        "use-region-storage": "true"
      },
      "cluster-version": "3.0.12",
      "quota-backend-bytes": "0B",
      "auto-compaction-mode": "periodic",
      "auto-compaction-retention-v2": "1h",
      "TickInterval": "500ms",
      "ElectionInterval": "3s",
      "PreVote": true,
      "security": {
        "cacert-path": "",
        "cert-path": "",
        "key-path": ""
      },
      "label-property": {},
      "WarningMsgs": [
        "Config contains undefined item: replication.enable-placement-rules"
      ],
      "namespace-classifier": "table",
      "LeaderPriorityCheckInterval": "1m0s"
    }
    store info:
    Starting component ctl: /root/.tiup/components/ctl/v3.0.1/ctl pd -- -u http://172.16.10.151:2379 store
    {
      "count": 3,
      "stores": [
        {
          "store": {
            "id": 4,
            "address": "172.16.10.153:20160",
            "version": "3.0.12",
            "state_name": "Up"
          },
          "status": {
            "capacity": "29.1TiB",
            "available": "28.38TiB",
            "leader_count": 3220,
            "leader_weight": 1,
            "leader_score": 197103,
            "leader_size": 197103,
            "region_count": 5286,
            "region_weight": 1,
            "region_score": 390688,
            "region_size": 390688,
            "start_ts": "2020-06-22T16:41:52+08:00",
            "last_heartbeat_ts": "2020-06-24T14:04:31.264975924+08:00",
            "uptime": "45h22m39.264975924s"
          }
        },
        {
          "store": {
            "id": 29027,
            "address": "172.16.10.155:20160",
            "state": 1,
            "version": "3.0.12",
            "state_name": "Offline"
          },
          "status": {
            "leader_weight": 1,
            "region_count": 3821,
            "region_weight": 1,
            "region_score": 258874,
            "region_size": 258874,
            "start_ts": "1970-01-01T08:00:00+08:00"
          }
        },
        {
          "store": {
            "id": 1,
            "address": "172.16.10.154:20160",
            "version": "3.0.12",
            "state_name": "Up"
          },
          "status": {
            "capacity": "29.1TiB",
            "available": "28.18TiB",
            "leader_count": 2066,
            "leader_weight": 1,
            "leader_score": 193585,
            "leader_size": 193585,
            "region_count": 5286,
            "region_weight": 1,
            "region_score": 390688,
            "region_size": 390688,
            "start_ts": "2020-06-22T16:43:23+08:00",
            "last_heartbeat_ts": "2020-06-24T14:04:36.252147314+08:00",
            "uptime": "45h21m13.252147314s"
          }
        }
      ]
    }
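To track decommission progress without reading the full JSON, the store output can be filtered, for example with jq (a sketch; it assumes jq is installed and that tiup's "Starting component" banner goes to stderr rather than stdout):

# Per-store summary: id, address, state, and regions left to migrate
tiup ctl pd -- -u http://172.16.10.151:2379 store \
  | jq '.stores[] | {id: .store.id, address: .store.address,
                     state: .store.state_name, regions: .status.region_count}'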
  2. The file cannot be found under the db directory. TiDB itself still works normally; we want to bring the TiKV service on the failed node back up. We tried handling it by scaling in the failed node and then scaling it out again, but it keeps reporting the following error:
    [2020/06/24 14:07:07.627 +08:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some(\"duplicated store address: id:36001 address:\\\"172.16.10.155:20160\\\" version:\\\"3.0.12\\\" , already registered by id:29027 address:\\\"172.16.10.155:20160\\\" state:Offline version:\\\"3.0.12\\\" \") }))"]
    Following the official docs, I ran store delete 29027 against the 155 node with pd-ctl. It reported success, but the store is still listed when I query store.
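For reference, the delete and a follow-up check look like this (a sketch using the PD address and store ID from this thread; note that store delete only marks the store Offline, and it stays listed until all of its regions have been migrated away):

# Ask PD to decommission store 29027; this marks it Offline, not deleted
tiup ctl pd -- -u http://172.16.10.151:2379 store delete 29027

# Re-check the store: it disappears only after region_count drains to 0
# and PD turns it into Tombstone
tiup ctl pd -- -u http://172.16.10.151:2379 store 29027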
  1. Store 29027 has not finished migrating all of its regions yet.


  1. You can try forcing it offline by adding the --force parameter to scale-in.
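A minimal sketch of the forced scale-in (<cluster-name> is a placeholder for the name shown by tiup cluster list; --force removes the node from the topology without waiting for data migration):

# Forcibly remove the failed TiKV node from the cluster topology
tiup cluster scale-in <cluster-name> --node 172.16.10.155:20160 --force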

Using --force gives the same result: the failed store is still in the store list and its information is unchanged.

  1. Set this store's state directly to Tombstone:
curl -X POST 'http://<pd-address>/pd/api/v1/store/<store_id>/state?state=Tombstone'
  2. Remove the Tombstone store in pd-ctl:
store remove-tombstone
  3. Scale out again.
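Filled in with the PD address and store ID from this thread, the whole sequence would look like the sketch below (verify the store ID with store first; scale-out.yaml is a hypothetical topology file describing the node to re-add):

# 1. Force the stuck Offline store into Tombstone via PD's HTTP API
curl -X POST 'http://172.16.10.151:2379/pd/api/v1/store/29027/state?state=Tombstone'

# 2. Purge Tombstone stores from PD's store list
tiup ctl pd -- -u http://172.16.10.151:2379 store remove-tombstone

# 3. Scale the node back out
tiup cluster scale-out <cluster-name> scale-out.yaml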

After the scale-in, the store was draining too slowly, which is why the duplicated store address error kept appearing. Following the guidance on slow store decommissioning, I adjusted the corresponding parameters and the drain sped up. The cluster is back to normal now, thanks. I'm just not sure whether this way of recovering is the correct method :joy:
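For reference, the scheduling limits usually raised to speed up an Offline store's drain are set through pd-ctl (a sketch; the thread does not record the exact values used, so the numbers here are only illustrative):

# Allow more concurrent replica-replacement operators
tiup ctl pd -- -u http://172.16.10.151:2379 config set replica-schedule-limit 128
# Allow more concurrent region-balance operators
tiup ctl pd -- -u http://172.16.10.151:2379 config set region-schedule-limit 4096
# Raise the per-store operator rate (operators per minute per store)
tiup ctl pd -- -u http://172.16.10.151:2379 config set store-balance-rate 30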

Yes, if the store can be decommissioned normally, then simply speeding it up is the right approach, and it is the better one. Thanks for the feedback.