TiKV node not working and cannot start

To get a faster response, please provide the following information when asking a question; clearly described problems can be prioritized.

  • 【TiKV configuration】: per-node spec: 8 cores / 16 GB RAM / 4 TB SSD; replica count: 2
  • 【TiDB version】: Release Version: v3.0.5
  • 【Problem description】: [FATAL] [server.rs:100] ["panic_mark_file /data/deploy/data/panic_mark_file exists, there must be something wrong with the db."]

The failure occurred in the early hours of the morning. No operations were being performed on the server at the time; the server did not crash and there was no power outage. It happened for no apparent reason.

On the TiKV node I checked this file and found it was empty. After deleting it manually and starting the node, it reported the following:

[FATAL] [server.rs:168] ["failed to create kv engine: RocksDb Corruption: L5 have overlapping ranges '7A7480000000000000FF335F698000000000FF0000010131393039FF32383030FF313130FF3130303338FF3633FF343338353234FF00FF00000000000000F7FF0000000000000000F7FA42053B90B7FFFD' seq:84190335923, type:1 vs. '7A7480000000000000FF335F698000000000FF0000010131393039FF32383030FF313130FF3130383231FF3634FF333034383936FF00FF00000000000000F7FF0000000000000000F7FA4205300157FFFE' seq:84204503948, type:1"]

Please upload the TiKV log from the failed node, the {{data-dir}}/db/LOG file, and all files under {{data-dir}}/db/ whose names start with MANIFEST. Thanks.
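For convenience, the requested files could be bundled on the failed node with something like the following (a sketch assuming the data directory is /data/deploy/data, as in the panic message above; adjust the path to your deployment):

# pack the RocksDB LOG and MANIFEST files for upload
tar czf tikv-db-meta.tar.gz /data/deploy/data/db/LOG /data/deploy/data/db/MANIFEST-*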

tikv.log (81.9 KB)
Log file attached.

Please also upload the logs requested above. In addition, we need the log from when the panic first occurred, so could you upload the complete TiKV log?

MANIFEST-2860269.zip (3.5 MB) tikv(1).zip (2.8 MB)

Please also send {{data-dir}}/db/LOG. Thanks.

LOG.tar.gz (60.8 KB)

Please run ./pd-ctl -i -u http://127.0.0.1:2379 and execute config show all.
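For reference, the same dump can be taken without the interactive shell; a sketch, assuming pd-ctl sits in the current directory and PD is listening on 127.0.0.1:2379:

# -d runs a single command non-interactively and exits
./pd-ctl -u http://127.0.0.1:2379 -d config show all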

"client-urls": "http://10.32.3.76:2379", "peer-urls": "http://10.32.3.76:2380", "advertise-client-urls": "http://10.32.3.76:2379", "advertise-peer-urls": "http://10.32.3.76:2380", "name": "pd_backtidbserver2", "data-dir": "/data/deploy/data.pd", "force-new-cluster": false, "enable-grpc-gateway": true, "initial-cluster": "pd_backtidbserver1=http://10.32.3.75:2380,pd_backtidbserver2=http://10.32.3.76:2380,pd_backtidbserver3=http://10.32.3.77:2380", "initial-cluster-state": "new", "join": "", "lease": 3, "log": { "level": "info", "format": "text", "disable-timestamp": false, "file": { "filename": "/data/deploy/log/pd.log", "log-rotate": true, "max-size": 300, "max-days": 0, "max-backups": 0 }, "development": false, "disable-caller": false, "disable-stacktrace": false, "disable-error-verbose": true, "sampling": null }, "log-file": "", "log-level": "", "tso-save-interval": "3s", "metric": { "job": "pd_backtidbserver2", "address": "", "interval": "15s" }, "schedule": { "max-snapshot-count": 3, "max-pending-peer-count": 16, "max-merge-region-size": 20, "max-merge-region-keys": 200000, "split-merge-interval": "1h0m0s", "enable-one-way-merge": "false", "patrol-region-interval": "100ms", "max-store-down-time": "30m0s", "leader-schedule-limit": 4, "region-schedule-limit": 4, "replica-schedule-limit": 8, "merge-schedule-limit": 8, "hot-region-schedule-limit": 4, "hot-region-cache-hits-threshold": 3, "store-balance-rate": 15, "tolerant-size-ratio": 5, "low-space-ratio": 0.8, "high-space-ratio": 0.6, "scheduler-max-waiting-operator": 3, "disable-raft-learner": "false", "disable-remove-down-replica": "false", "disable-replace-offline-replica": "false", "disable-make-up-replica": "false", "disable-remove-extra-replica": "false", "disable-location-replacement": "false", "disable-namespace-relocation": "false", "schedulers-v2": [ { "type": "balance-region", "args": null, "disable": false }, { "type": "balance-leader", "args": null, "disable": false }, { "type": "hot-region", "args": null, "disable": false }, { "type": "label", "args": null, "disable": false } ] }, "replication": { "max-replicas": 2, "location-labels": "", "strictly-match-label": "false" }, "namespace": {}, "pd-server": { "use-region-storage": "true" }, "cluster-version": "3.0.5", "quota-backend-bytes": "0B", "auto-compaction-mode": "periodic", "auto-compaction-retention-v2": "1h", "TickInterval": "500ms", "ElectionInterval": "3s", "PreVote": true, "security": { "cacert-path": "", "cert-path": "", "key-path": "" }, "label-property": {}, "WarningMsgs": null, "namespace-classifier": "table", "LeaderPriorityCheckInterval": "1m0s" }

Could you also confirm whether the cluster had 2 replicas or 3 replicas before the problem occurred? Thanks.

1. ./pd-ctl -u http://<pd-server ip>:<pd_client_port> store --jq=".stores[].store | { id, address, state_name}" (replace the IP and port, install jq, and send back the output)
2. cat tikv.yml | grep sync-log

/home/tidb/tidb-ansible/resources/bin/pd-ctl -u http://10.32.3.75:2379 store --jq=".stores[].store | { id, address, state_name}"
{"id":1,"address":"10.32.3.78:20160","state_name":"Up"}
{"id":4,"address":"10.32.3.79:20160","state_name":"Up"}
{"id":5,"address":"10.32.3.80:20160","state_name":"Offline"}

cat tikv.yml | grep sync-log returns nothing (the file has no such entry).

cat tikv.toml | grep sync-log
sync-log = false
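For context, sync-log is a raftstore option: with it set to false, Raft log writes are not fsynced on every write, so an unexpected machine failure can lose the most recent writes, which matters for a 2-replica cluster. A minimal tikv.toml sketch with it enabled (at some write-latency cost) would be:

[raftstore]
# fsync Raft log entries on every write instead of relying on the OS to flush them
sync-log = true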

1. Disable scheduling: ./pd-ctl -u <pd-server ip>:<pd_client_port> -i
config set leader-schedule-limit 0
config set region-schedule-limit 0
config set replica-schedule-limit 0
config set merge-schedule-limit 0
config set hot-region-schedule-limit 0
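According to the config show all output above, the current values are leader-schedule-limit 4, region-schedule-limit 4, replica-schedule-limit 8, merge-schedule-limit 8 and hot-region-schedule-limit 4, so once recovery is finished scheduling can presumably be restored in the same pd-ctl session with:

config set leader-schedule-limit 4
config set region-schedule-limit 4
config set replica-schedule-limit 8
config set merge-schedule-limit 8
config set hot-region-schedule-limit 4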

  2. Use pd-ctl to check the Regions that have at least half of their replicas on the failed node, and record their IDs. Requirement: PD is running.
./pd-ctl -u <pd-server ip>:<pd_client_port> -d region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1,4) then . else empty end) | length>=$total-length)}'
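In this cluster the failed node appears to be store 5 (the Offline store in the listing above), so the store-ID filter would presumably be adjusted to just that store, e.g.:

./pd-ctl -u <pd-server ip>:<pd_client_port> -d region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(5) then . else empty end) | length>=$total-length)}'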

Hi, I'm syncing with you from another account. What I have done so far: I ran the delete operation for the TiKV node that fails to start: pd-ctl -u "http://ip:2379" -d store delete 5. Watching operator show, the background shows remove peer operators for the deleted node, but from yesterday noon until now the node has stayed in the Offline state.
My original plan was to wait for the offlined node to become Tombstone and then run the steps you provided, but it looks like that won't finish any time soon. Next I would like to add a new TiKV node and set the replica count to 3. What should I do now to restore cluster availability as quickly as possible while keeping the data safe?

  1. First, use pd-ctl to check how many Regions are down to 1 replica:
» region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length != 3)}"
  2. The replica count can be adjusted through pd-ctl by setting the max-replicas parameter (see the sketch after point 3 below).

  3. If any Regions are down to 1 replica, you will need to use unsafe-recovery to force those single-replica Regions back into service:

https://pingcap.com/docs-cn/stable/reference/tools/tikv-control/#强制-region-从多副本失败状态恢复服务
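A sketch of the replica-count adjustment mentioned in point 2, using the max-replicas setting from the replication section of the config shown earlier (currently 2; raising it to 3 once a third TiKV node is available):

./pd-ctl -u <pd-server ip>:<pd_client_port> -d config set max-replicas 3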

  1. Check the monitoring to see whether the region and leader counts on the node being taken offline are decreasing. If they are, the offline process is proceeding normally.

  2. To add a new TiKV node, follow the scale-out procedure in the official docs (a rough sketch of the Ansible steps is given after the link):

https://pingcap.com/docs-cn/stable/how-to/scale/with-ansible/#扩容-tidbtikv-节点
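A rough sketch of the Ansible flow behind that link, assuming the new host is added under [tikv_servers] in inventory.ini (treat this as an outline and follow the linked document for your version):

# after adding the new host's IP under [tikv_servers] in inventory.ini:
ansible-playbook bootstrap.yml -l <new_tikv_ip>
ansible-playbook deploy.yml -l <new_tikv_ip>
ansible-playbook start.yml -l <new_tikv_ip>
# refresh monitoring so Prometheus picks up the new node
ansible-playbook rolling_update_monitor.yml --tags=prometheus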

Does select(length != 3) mean Regions whose replica count is not equal to 3? Our configured replica count is 2, and the node being taken offline won't start at all, so we can't see its region and leader counts.

Let me revise that:

» region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length = 1)}"

Now when adding a new node, it reports the following error:
[2020/01/16 11:08:08.551 +08:00] [INFO] [region_cache.go:393] ["switch region peer to next due to send request fail"] [current="region ID: 158589, meta: id:158589 end_key:"mDDLJobLi\377st\000\000\000\000\000\000\371\000\000\000\000\000\000\000l\200\000\000\000\000\000\000\000" region_epoch:<conf_ver:12 version:3 > peers:<id:158590 store_id:5 > peers:<id:158591 store_id:4 > , peer: id:158590 store_id:5 , addr: 10.32.3.80:20160, idx: 0"] [needReload=false] [error="context deadline exceeded"] [errorVerbose="context deadline exceeded\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.4/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/errors@v0.11.4/juju_adaptor.go:15\ngithub.com/pingcap/tidb/store/tikv.sendBatchRequest\n\tgithub.com/pingcap/tidb@/store/tikv/client_batch.go:543\ngithub.com/pingcap/tidb/store/tikv.(*rpcClient).SendRequest\n\tgithub.com/pingcap/tidb@/store/tikv/client.go:281\ngithub.com/pingcap/tidb/store/tikv.(*RegionRequestSender).sendReqToRegion\n\tgithub.com/pingcap/tidb@/store/tikv/region_request.go:145\ngithub.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReqCtx\n\tgithub.com/pingcap/tidb@/store/tikv/region_request.go:116\ngithub.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReq\n\tgithub.com/pingcap/tidb@/store/tikv/region_request.go:72\ngithub.com/pingcap/tidb/store/tikv.(*tikvSnapshot).get\n\tgithub.com/pingcap/tidb@/store/tikv/snapshot.go:305\ngithub.com/pingcap/tidb/store/tikv.(*tikvSnapshot).Get\n\tgithub.com/pingcap/tidb@/store/tikv/snapshot.go:265\ngithub.com/pingcap/tidb/kv.(*unionStore).Get\n\tgithub.com/pingcap/tidb@/kv/union_store.go:194\ngithub.com/pingcap/tidb/store/tikv.(*tikvTxn).Get\n\tgithub.com/pingcap/tidb@/store/tikv/txn.go:133\ngithub.com/pingcap/tidb/structure.(*TxStructure).Get\n\tgithub.com/pingcap/tidb@/structure/string.go:35\ngithub.com/pingcap/tidb/structure.(*TxStructure).GetInt64\n\tgithub.com/pingcap/tidb@/structure/string.go:44\ngithub.com/pingcap/tidb/meta.(*Meta).GetBootstrapVersion\n\tgithub.com/pingcap/tidb@/meta/meta.go:697\ngithub.com/pingcap/tidb/session.getStoreBootstrapVersion.func1\n\tgithub.com/pingcap/tidb@/session/session.go:1631\ngithub.com/pingcap/tidb/kv.RunInNewTxn\n\tgithub.com/pingcap/tidb@/kv/txn.go:50\ngithub.com/pingcap/tidb/session.getStoreBootstrapVersion\n\tgithub.com/pingcap/tidb@/session/session.go:1628\ngithub.com/pingcap/tidb/session.BootstrapSession\n\tgithub.com/pingcap/tidb@/session/session.go:1469\nmain.createStoreAndDomain\n\tgithub.com/pingcap/tidb@/tidb-server/main.go:205\nmain.main\n\tgithub.com/pingcap/tidb@/tidb-server/main.go:171\nruntime.main\n\truntime/proc.go:203\nruntime.goexit\n\truntime/asm_amd64.s:1357"]

[tidb@backtidbserver1 tidb-ansible]$ /home/tidb/tidb-ansible/resources/bin/pd-ctl -u "http://10.32.3.75:2379" region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length = 1)}"
jq: error (at <stdin>:1650283): Invalid path expression with result 2
exit status 5
Running this just reports the error above.

You only have 2 replicas and one node cannot start, so the affected Regions cannot elect a leader and the delete store operation may never finish. Recover the Regions first; don't rush into adding or removing nodes.

First check the Regions whose replica count != 2:
./pd-ctl -u <pd-server ip>:<pd_client_port> -d region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length != 2)}"
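As a side note, the jq error earlier comes from using a single = (assignment in jq) instead of the comparison operator ==; if you also want to list Regions that are down to a single replica, the check would presumably be written as:

./pd-ctl -u <pd-server ip>:<pd_client_port> -d region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length == 1)}"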

On all instances that were not affected by the failure, you can remove every Peer located on the failed node from all Regions. Requirements: run on the machines that did not fail, with TiKV shut down. Note: older versions may not have the -s and --all-regions parameters.
tikv-ctl --db /path/to/tikv-data/db unsafe-recover remove-fail-stores -s <s1,s2> --all-regions

You need to stop TiKV first:

cd /scripts
./stop_tikv.sh

Then run the recovery. Your store 5 is Offline, so execute the following command first to recover the Regions:
tikv-ctl --db /data/db unsafe-recover remove-fail-stores -s 5 --all-regions
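Putting the last replies together, the per-node sequence on each surviving TiKV instance (10.32.3.78 and 10.32.3.79) would look roughly like the following; this is a sketch, and the --db path must point at that node's actual RocksDB directory (e.g. /data/deploy/data/db based on the panic message earlier):

# 1. stop TiKV on this node (scripts directory under the deploy directory)
./stop_tikv.sh
# 2. remove every peer belonging to the failed store 5 from all Regions on this node
tikv-ctl --db /data/deploy/data/db unsafe-recover remove-fail-stores -s 5 --all-regions
# 3. restart TiKV after the command has completed on both nodes
./start_tikv.sh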