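TiDB is deployed on Rancher and had been running normally. Today TiKV suddenly fails to start, and TiDB cannot start either.
Probably few people are familiar with this setup. Please post the error log.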
[2023/11/09 10:07:22.513 +08:00] [WARN] [store.rs:1211] [“set thread priority for raftstore failed”] [error=“Os { code: 13, kind: PermissionDenied, message: "Permission denied" }”]
[2023/11/09 10:07:22.513 +08:00] [INFO] [node.rs:174] [“put store to PD”] [store=“id: 1 address: "tidb-qas-cluster-tikv-2.tidb-qas-cluster-tikv-peer.wanda-db.svc:20160" version: "4.0.4" status_address: "0.0.0.0:20180" git_hash: "28e3d44b00700137de4fa933066ab83e5f8306cf" start_timestamp: 1699495634 deploy_path: "/"”]
[2023/11/09 10:07:22.513 +08:00] [INFO] [mod.rs:335] [“starting working thread”] [worker=cdc]
[2023/11/09 10:07:22.513 +08:00] [INFO] [future.rs:136] [“starting working thread”] [worker=waiter-manager]
[2023/11/09 10:07:22.513 +08:00] [INFO] [future.rs:136] [“starting working thread”] [worker=deadlock-detector]
[2023/11/09 10:07:22.513 +08:00] [INFO] [mod.rs:335] [“starting working thread”] [worker=backup-endpoint]
[2023/11/09 10:07:22.513 +08:00] [INFO] [] [“Failed to add :: listener, the environment may not support IPv6: {"created":"@1699495642.312002333","description":"Address family not supported by protocol","errno":97,"file":"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.5.3/grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":406,"os_error":"Address family not supported by protocol","syscall":"socket","target_address":"[::]:20160"}”]
[2023/11/09 10:07:22.513 +08:00] [INFO] [mod.rs:335] [“starting working thread”] [worker=snap-handler]
[2023/11/09 10:07:22.513 +08:00] [INFO] [server.rs:223] [“listening on addr”] [addr=0.0.0.0:20160]
[2023/11/09 10:07:22.514 +08:00] [INFO] [server.rs:248] [“TiKV is ready to serve”]
[2023/11/09 10:07:22.514 +08:00] [WARN] [mod.rs:489] [“failed to register addr to pd”] [body=Body(Streaming)] [“status code”=400]
[2023/11/09 10:07:22.514 +08:00] [INFO] [util.rs:412] [“connecting to PD endpoint”] [endpoints= ]
[2023/11/09 10:07:22.514 +08:00] [INFO] [] [“New connected subchannel at 0x7fa20b418a80 for subchannel 0x7fa20b439540”]
[2023/11/09 10:07:22.514 +08:00] [INFO] [util.rs:412] [“connecting to PD endpoint”] [endpoints= ]
[2023/11/09 10:07:22.514 +08:00] [INFO] [util.rs:477] [“connected to PD leader”] [endpoints= ]
[2023/11/09 10:07:22.514 +08:00] [INFO] [util.rs:188] [“heartbeat sender and receiver are stale, refreshing …”]
[2023/11/09 10:07:22.514 +08:00] [WARN] [util.rs:207] [“updating PD client done”] [spend=47.145199ms]
[2023/11/09 10:07:22.514 +08:00] [WARN] [mod.rs:489] [“failed to register addr to pd”] [body=Body(Streaming)] [“status code”=400]
[2023/11/09 10:07:22.514 +08:00] [WARN] [mod.rs:489] [“failed to register addr to pd”] [body=Body(Streaming)] [“status code”=400]
[2023/11/09 10:07:22.514 +08:00] [WARN] [mod.rs:489] [“failed to register addr to pd”] [body=Body(Streaming)] [“status code”=400]
[2023/11/09 10:07:22.514 +08:00] [WARN] [mod.rs:489] [“failed to register addr to pd”] [body=Body(Streaming)] [“status code”=400]
[2023/11/09 10:07:22.514 +08:00] [WARN] [mod.rs:499] [“failed to register addr to pd after 5 tries”]
[2023/11/09 10:07:22.795 +08:00] [FATAL] [lib.rs:481] [“entries[6:5580] is unavailable from storage, raft_id: 92010, region_id: 92009”] [backtrace="stack backtrace:\n 0: tikv_util::set_panic_hook::{{closure}}\n at components/tikv_util/src/lib.rs:480\n 1: std::panicking::rust_panic_with_hook\n at src/libstd/panicking.rs:475\n 2: rust_begin_unwind\n at src/libstd/panicking.rs:375\n 3: std::panicking::begin_panic_fmt\n at src/libstd/panicking.rs:326\n 4: raft::raft_log::RaftLog::slice\n at home/jenkins/agent/workspace/ld_tikv_multi_branch_release-4.0/tikv/<::std::macros::panic macros>:9\n 5: raft::raft_log::RaftLog::next_entries_since\n at rust/git/checkouts/raft-rs-841f8a6db665c5c0/b5f5830/src/raft_log.rs:362\n raft::raw_node::Ready::new\n at rust/git/checkouts/raft-rs-841f8a6db665c5c0/b5f5830/src/raw_node.rs:129\n raft::raw_node::RawNode::ready_since\n at rust/git/checkouts/raft-rs-841f8a6db665c5c0/b5f5830/src/raw_node.rs:346\n
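Judging from the log, something seems to be wrong with one of your Regions. Check Region 92009.
How do I check it?
The FATAL line shows that raft log entries for the Region are unavailable from storage. Look up the raft_id and region_id it reports (raft_id: 92010, region_id: 92009).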
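If the log is long, a small script can pull those IDs out for you. This is only a minimal sketch: the regular expression is written against the exact panic message shown in the FATAL line above, and the function name `find_unavailable_regions` is made up for illustration. Other TiKV versions may word the message differently.

```python
import re

# Matches the panic message from the FATAL line in this thread.
# Assumption: the wording is the same as in the log pasted above.
PANIC_RE = re.compile(
    r"entries\[(?P<low>\d+):(?P<high>\d+)\] is unavailable from storage, "
    r"raft_id: (?P<raft_id>\d+), region_id: (?P<region_id>\d+)"
)


def find_unavailable_regions(log_text):
    """Return one dict per matching panic message, with integer values."""
    return [
        {name: int(value) for name, value in match.groupdict().items()}
        for match in PANIC_RE.finditer(log_text)
    ]


# Shortened copy of the FATAL line from the log above.
sample = (
    '[2023/11/09 10:07:22.795 +08:00] [FATAL] [lib.rs:481] '
    '["entries[6:5580] is unavailable from storage, '
    'raft_id: 92010, region_id: 92009"]'
)
hits = find_unavailable_regions(sample)
print(hits)
```

The region_id is the value to pass to pd-ctl or tikv-ctl; the raft_id identifies the peer on this store.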
https://docs.pingcap.com/zh/tidb/stable/tidb-troubleshooting-map#服务不可用
If the Region is broken and you cannot repair it, take that Region replica offline.
How do I take a Region offline?
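Set a Region replica to the tombstone state
The tombstone command is typically used when sync-log is not enabled and a power failure has caused the Raft state machine to lose some writes. On a single TiKV instance, it sets the replicas of some Regions to the Tombstone state so that those Regions are skipped on restart. This avoids a startup failure caused by the damaged Raft state machines of those replicas. These Regions should have enough healthy replicas on other TiKV instances so that they can keep serving reads and writes through the Raft mechanism.
In general, you can first remove the Region replica on PD with the remove-peer command: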
pd-ctl>> operator add remove-peer <region_id> <store_id>
Then use tikv-ctl on that TiKV instance to mark the Region replica as tombstone, so that its health check is skipped at startup:
tikv-ctl --data-dir /path/to/tikv tombstone -p 127.0.0.1:2379 -r <region_id>
success!
However, in some cases where the replica cannot easily be removed from PD, you can specify the --force option of tikv-ctl to forcibly set it to tombstone:
tikv-ctl --data-dir /path/to/tikv tombstone -p 127.0.0.1:2379 -r <region_id>,<region_id> --force
success!
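Note
- This command only supports local mode.
- The argument of the -p option specifies the PD endpoints, without the http prefix. The PD endpoints are specified so that tikv-ctl can ask PD whether the replica can safely be switched to the Tombstone state.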
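To avoid typos when assembling the command, something like the following sketch may help. It only builds and prints the argument list and does not run tikv-ctl. The helper names `strip_scheme` and `tombstone_argv` are hypothetical, and the flags are copied from the commands quoted above. Since the cluster in the log runs 4.0.4 while the quoted page is from the stable docs, confirm the flag names with `tikv-ctl --help` for your version first.

```python
def strip_scheme(endpoint):
    """Drop a leading http:// or https://, since -p expects a bare host:port."""
    for prefix in ("http://", "https://"):
        if endpoint.startswith(prefix):
            return endpoint[len(prefix):]
    return endpoint


def tombstone_argv(data_dir, pd_endpoint, region_ids, force=False):
    """Build the argument list for the tikv-ctl tombstone command quoted above.

    Assumption: flag names are taken verbatim from the quoted docs excerpt.
    """
    if not region_ids:
        raise ValueError("at least one region id is required")
    argv = [
        "tikv-ctl", "--data-dir", data_dir, "tombstone",
        "-p", strip_scheme(pd_endpoint),
        "-r", ",".join(str(r) for r in region_ids),
    ]
    if force:
        argv.append("--force")
    return argv


# Placeholder path and PD address, as in the quoted commands.
argv = tombstone_argv("/path/to/tikv", "http://127.0.0.1:2379", [92009])
print(" ".join(argv))
```

Because the command works in local mode only, run the printed command on the host or container that holds the TiKV data directory, with that TiKV instance stopped.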
Reference page
https://docs.pingcap.com/zh/tidb/stable/tikv-control#设置一个-region-副本为-tombstone-状态
If this clearly solved your problem, remember to mark the best answer.
If it is solved, please select the best answer.