TiKV node not working and cannot start

To get a faster response, please provide the following information when asking a question; clearly described problems can be prioritized.

  • 【TiKV configuration】: per-node spec: 8 cores / 16 GB RAM / 4 TB SSD; replica count: 2
  • 【TiDB version】: Release Version: v3.0.5
  • 【Problem description】: [FATAL] [server.rs:100] ["panic_mark_file /data/deploy/data/panic_mark_file exists, there must be something wrong with the db."]

The failure occurred in the early hours of the morning. No operations were being performed on the server at the time; the server did not crash and there was no power outage. It happened for no apparent reason.

On the TiKV node I checked this file and found it was empty. After deleting it manually and starting the node, it reported the following:

[FATAL] [server.rs:168] ["failed to create kv engine: RocksDb Corruption: L5 have overlapping ranges '7A7480000000000000FF335F698000000000FF0000010131393039FF32383030FF313130FF3130303338FF3633FF343338353234FF00FF00000000000000F7FF0000000000000000F7FA42053B90B7FFFD' seq:84190335923, type:1 vs. '7A7480000000000000FF335F698000000000FF0000010131393039FF32383030FF313130FF3130383231FF3634FF333034383936FF00FF00000000000000F7FF0000000000000000F7FA4205300157FFFE' seq:84204503948, type:1"]

Please upload the TiKV log from the failed node, the {{data-dir}}/db/LOG file, and all files under {{data-dir}}/db/ whose names start with MANIFEST. Thanks.
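For convenience, the requested files could be bundled on the failed node with something like the following (a sketch assuming the data directory is /data/deploy/data, as in the panic message above; adjust the path to your deployment):

# pack the RocksDB LOG and MANIFEST files for upload
tar czf tikv-db-meta.tar.gz /data/deploy/data/db/LOG /data/deploy/data/db/MANIFEST-*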

tikv.log (81.9 KB)
Log file attached.

Please also upload the logs requested above. In addition, we need the log from when the panic first occurred, so could you upload the complete TiKV log?

MANIFEST-2860269.zip (3.5 MB) tikv(1).zip (2.8 MB)

Please also send {{data-dir}}/db/LOG. Thanks.

LOG.tar.gz (60.8 KB)

Please run ./pd-ctl -i -u http://127.0.0.1:2379 and execute config show all.
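For reference, the same dump can be taken without the interactive shell; a sketch, assuming pd-ctl sits in the current directory and PD is listening on 127.0.0.1:2379:

# -d runs a single command non-interactively and exits
./pd-ctl -u http://127.0.0.1:2379 -d config show all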

"client-urls": "http://10.32.3.76:2379", "peer-urls": "http://10.32.3.76:2380", "advertise-client-urls": "http://10.32.3.76:2379", "advertise-peer-urls": "http://10.32.3.76:2380", "name": "pd_backtidbserver2", "data-dir": "/data/deploy/data.pd", "force-new-cluster": false, "enable-grpc-gateway": true, "initial-cluster": "pd_backtidbserver1=http://10.32.3.75:2380,pd_backtidbserver2=http://10.32.3.76:2380,pd_backtidbserver3=http://10.32.3.77:2380", "initial-cluster-state": "new", "join": "", "lease": 3, "log": { "level": "info", "format": "text", "disable-timestamp": false, "file": { "filename": "/data/deploy/log/pd.log", "log-rotate": true, "max-size": 300, "max-days": 0, "max-backups": 0 }, "development": false, "disable-caller": false, "disable-stacktrace": false, "disable-error-verbose": true, "sampling": null }, "log-file": "", "log-level": "", "tso-save-interval": "3s", "metric": { "job": "pd_backtidbserver2", "address": "", "interval": "15s" }, "schedule": { "max-snapshot-count": 3, "max-pending-peer-count": 16, "max-merge-region-size": 20, "max-merge-region-keys": 200000, "split-merge-interval": "1h0m0s", "enable-one-way-merge": "false", "patrol-region-interval": "100ms", "max-store-down-time": "30m0s", "leader-schedule-limit": 4, "region-schedule-limit": 4, "replica-schedule-limit": 8, "merge-schedule-limit": 8, "hot-region-schedule-limit": 4, "hot-region-cache-hits-threshold": 3, "store-balance-rate": 15, "tolerant-size-ratio": 5, "low-space-ratio": 0.8, "high-space-ratio": 0.6, "scheduler-max-waiting-operator": 3, "disable-raft-learner": "false", "disable-remove-down-replica": "false", "disable-replace-offline-replica": "false", "disable-make-up-replica": "false", "disable-remove-extra-replica": "false", "disable-location-replacement": "false", "disable-namespace-relocation": "false", "schedulers-v2": [ { "type": "balance-region", "args": null, "disable": false }, { "type": "balance-leader", "args": null, "disable": false }, { "type": "hot-region", "args": null, "disable": false }, { "type": "label", "args": null, "disable": false } ] }, "replication": { "max-replicas": 2, "location-labels": "", "strictly-match-label": "false" }, "namespace": {}, "pd-server": { "use-region-storage": "true" }, "cluster-version": "3.0.5", "quota-backend-bytes": "0B", "auto-compaction-mode": "periodic", "auto-compaction-retention-v2": "1h", "TickInterval": "500ms", "ElectionInterval": "3s", "PreVote": true, "security": { "cacert-path": "", "cert-path": "", "key-path": "" }, "label-property": {}, "WarningMsgs": null, "namespace-classifier": "table", "LeaderPriorityCheckInterval": "1m0s" }

Could you also confirm whether the cluster had 2 replicas or 3 replicas before the problem occurred? Thanks.

1. ./pd-ctl -u http://<pd-server ip>:<pd_client_port> store --jq=".stores[].store | { id, address, state_name}" (replace the IP and port, install jq, and send back the output)
2. cat tikv.yml | grep sync-log

/home/tidb/tidb-ansible/resources/bin/pd-ctl -u http://10.32.3.75:2379 store --jq=".stores[].store | { id, address, state_name}"
{"id":1,"address":"10.32.3.78:20160","state_name":"Up"}
{"id":4,"address":"10.32.3.79:20160","state_name":"Up"}
{"id":5,"address":"10.32.3.80:20160","state_name":"Offline"}

cat tikv.yml | grep sync-log returns nothing (the file has no such entry).

cat tikv.toml | grep sync-log
sync-log = false
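For context, sync-log is a raftstore option: with it set to false, Raft log writes are not fsynced on every write, so an unexpected machine failure can lose the most recent writes, which matters for a 2-replica cluster. A minimal tikv.toml sketch with it enabled (at some write-latency cost) would be:

[raftstore]
# fsync Raft log entries on every write instead of relying on the OS to flush them
sync-log = true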

1. Disable scheduling: ./pd-ctl -u <pd-server ip>:<pd_client_port> -i
config set leader-schedule-limit 0
config set region-schedule-limit 0
config set replica-schedule-limit 0
config set merge-schedule-limit 0
config set hot-region-schedule-limit 0
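According to the config show all output above, the current values are leader-schedule-limit 4, region-schedule-limit 4, replica-schedule-limit 8, merge-schedule-limit 8 and hot-region-schedule-limit 4, so once recovery is finished scheduling can presumably be restored in the same pd-ctl session with:

config set leader-schedule-limit 4
config set region-schedule-limit 4
config set replica-schedule-limit 8
config set merge-schedule-limit 8
config set hot-region-schedule-limit 4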

  2. Use pd-ctl to check the Regions that have at least half of their replicas on the failed node, and record their IDs. Requirement: PD is running.
./pd-ctl -u <pd-server ip>:<pd_client_port> -d region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1,4) then . else empty end) | length>=$total-length)}'
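In this cluster the failed node appears to be store 5 (the Offline store in the listing above), so the store-ID filter would presumably be adjusted to just that store, e.g.:

./pd-ctl -u <pd-server ip>:<pd_client_port> -d region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(5) then . else empty end) | length>=$total-length)}'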

Hi, I'm syncing with you from another account. What I have done so far: I ran the delete operation for the TiKV node that fails to start: pd-ctl -u "http://ip:2379" -d store delete 5. Watching operator show, the background shows remove peer operators for the deleted node, but from yesterday noon until now the node has stayed in the Offline state.
My original plan was to wait for the offlined node to become Tombstone and then run the steps you provided, but it looks like that won't finish any time soon. Next I would like to add a new TiKV node and set the replica count to 3. What should I do now to restore cluster availability as quickly as possible while keeping the data safe?

  1. First, use pd-ctl to check how many Regions are down to 1 replica:
» region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length != 3)}"
  2. The replica count can be adjusted through pd-ctl by setting the max-replicas parameter (see the sketch after point 3 below).

  3. If any Regions are down to 1 replica, you will need to use unsafe-recovery to force those single-replica Regions back into service:

https://pingcap.com/docs-cn/stable/reference/tools/tikv-control/#强制-region-从多副本失败状态恢复服务
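A sketch of the replica-count adjustment mentioned in point 2, using the max-replicas setting from the replication section of the config shown earlier (currently 2; raising it to 3 once a third TiKV node is available):

./pd-ctl -u <pd-server ip>:<pd_client_port> -d config set max-replicas 3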

  1. Check the monitoring to see whether the region and leader counts on the node being taken offline are decreasing. If they are, the offline process is proceeding normally.

  2. To add a new TiKV node, follow the scale-out procedure in the official docs (a rough sketch of the Ansible steps is given after the link):

https://pingcap.com/docs-cn/stable/how-to/scale/with-ansible/#扩容-tidbtikv-节点
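A rough sketch of the Ansible flow behind that link, assuming the new host is added under [tikv_servers] in inventory.ini (treat this as an outline and follow the linked document for your version):

# after adding the new host's IP under [tikv_servers] in inventory.ini:
ansible-playbook bootstrap.yml -l <new_tikv_ip>
ansible-playbook deploy.yml -l <new_tikv_ip>
ansible-playbook start.yml -l <new_tikv_ip>
# refresh monitoring so Prometheus picks up the new node
ansible-playbook rolling_update_monitor.yml --tags=prometheus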

Does select(length != 3) mean Regions whose replica count is not equal to 3? Our configured replica count is 2, and the node being taken offline won't start at all, so we can't see its region and leader counts.

Let me revise that:

» region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length = 1)}"

Now when adding a new node, it reports the following error:
[2020/01/16 11:08:08.551 +08:00] [INFO] [region_cache.go:393] ["switch region peer to next due to send request fail"] [current="region ID: 158589, meta: id:158589 end_key:"mDDLJobLi\377st\000\000\000\000\000\000\371\000\000\000\000\000\000\000l\200\000\000\000\000\000\000\000" region_epoch:<conf_ver:12 version:3 > peers:<id:158590 store_id:5 > peers:<id:158591 store_id:4 > , peer: id:158590 store_id:5 , addr: 10.32.3.80:20160, idx: 0"] [needReload=false] [error="context deadline exceeded"] [errorVerbose="context deadline exceeded\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.4/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/errors@v0.11.4/juju_adaptor.go:15\ngithub.com/pingcap/tidb/store/tikv.sendBatchRequest\n\tgithub.com/pingcap/tidb@/store/tikv/client_batch.go:543\ngithub.com/pingcap/tidb/store/tikv.(*rpcClient).SendRequest\n\tgithub.com/pingcap/tidb@/store/tikv/client.go:281\ngithub.com/pingcap/tidb/store/tikv.(*RegionRequestSender).sendReqToRegion\n\tgithub.com/pingcap/tidb@/store/tikv/region_request.go:145\ngithub.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReqCtx\n\tgithub.com/pingcap/tidb@/store/tikv/region_request.go:116\ngithub.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReq\n\tgithub.com/pingcap/tidb@/store/tikv/region_request.go:72\ngithub.com/pingcap/tidb/store/tikv.(*tikvSnapshot).get\n\tgithub.com/pingcap/tidb@/store/tikv/snapshot.go:305\ngithub.com/pingcap/tidb/store/tikv.(*tikvSnapshot).Get\n\tgithub.com/pingcap/tidb@/store/tikv/snapshot.go:265\ngithub.com/pingcap/tidb/kv.(*unionStore).Get\n\tgithub.com/pingcap/tidb@/kv/union_store.go:194\ngithub.com/pingcap/tidb/store/tikv.(*tikvTxn).Get\n\tgithub.com/pingcap/tidb@/store/tikv/txn.go:133\ngithub.com/pingcap/tidb/structure.(*TxStructure).Get\n\tgithub.com/pingcap/tidb@/structure/string.go:35\ngithub.com/pingcap/tidb/structure.(*TxStructure).GetInt64\n\tgithub.com/pingcap/tidb@/structure/string.go:44\ngithub.com/pingcap/tidb/meta.(*Meta).GetBootstrapVersion\n\tgithub.com/pingcap/tidb@/meta/meta.go:697\ngithub.com/pingcap/tidb/session.getStoreBootstrapVersion.func1\n\tgithub.com/pingcap/tidb@/session/session.go:1631\ngithub.com/pingcap/tidb/kv.RunInNewTxn\n\tgithub.com/pingcap/tidb@/kv/txn.go:50\ngithub.com/pingcap/tidb/session.getStoreBootstrapVersion\n\tgithub.com/pingcap/tidb@/session/session.go:1628\ngithub.com/pingcap/tidb/session.BootstrapSession\n\tgithub.com/pingcap/tidb@/session/session.go:1469\nmain.createStoreAndDomain\n\tgithub.com/pingcap/tidb@/tidb-server/main.go:205\nmain.main\n\tgithub.com/pingcap/tidb@/tidb-server/main.go:171\nruntime.main\n\truntime/proc.go:203\nruntime.goexit\n\truntime/asm_amd64.s:1357"]

[tidb@backtidbserver1 tidb-ansible]$ /home/tidb/tidb-ansible/resources/bin/pd-ctl -u "http://10.32.3.75:2379" region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length = 1)}"
jq: error (at <stdin>:1650283): Invalid path expression with result 2
exit status 5
Running this just reports the error above.

You only have 2 replicas and one node cannot start, so the affected Regions cannot elect a leader and the delete store operation may never finish. Recover the Regions first; don't rush into adding or removing nodes.

First check the Regions whose replica count != 2:
./pd-ctl -u <pd-server ip>:<pd_client_port> -d region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length != 2)}"
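As a side note, the jq error earlier comes from using a single = (assignment in jq) instead of the comparison operator ==; if you also want to list Regions that are down to a single replica, the check would presumably be written as:

./pd-ctl -u <pd-server ip>:<pd_client_port> -d region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length == 1)}"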

On all instances that were not affected by the failure, you can remove every Peer located on the failed node from all Regions. Requirements: run on the machines that did not fail, with TiKV shut down. Note: older versions may not have the -s and --all-regions parameters.
tikv-ctl --db /path/to/tikv-data/db unsafe-recover remove-fail-stores -s <s1,s2> --all-regions

You need to stop TiKV first:

cd /scripts
./stop_tikv.sh

Then run the recovery. Your store 5 is Offline, so execute the following command first to recover the Regions:
tikv-ctl --db /data/db unsafe-recover remove-fail-stores -s 5 --all-regions
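Putting the last replies together, the per-node sequence on each surviving TiKV instance (10.32.3.78 and 10.32.3.79) would look roughly like the following; this is a sketch, and the --db path must point at that node's actual RocksDB directory (e.g. /data/deploy/data/db based on the panic message earlier):

# 1. stop TiKV on this node (scripts directory under the deploy directory)
./stop_tikv.sh
# 2. remove every peer belonging to the failed store 5 from all Regions on this node
tikv-ctl --db /data/deploy/data/db unsafe-recover remove-fail-stores -s 5 --all-regions
# 3. restart TiKV after the command has completed on both nodes
./start_tikv.sh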