TiKV node cannot be taken offline even though leader_count and region_count are already 0

Wow, you're still at it this late. Thanks for your hard work.

tiup ctl:v5.2.0 pd -u 172.21.11.59:2379 region store 89241
Starting component ctl: /home/tidb/.tiup/components/ctl/v5.2.0/ctl pd -u 172.21.11.59:2379 region store 89241
{
  "count": 2,
  "regions": [
    {
      "id": 18201,
      "start_key": "7480000000000005FFF05F728000000000FF16DDD60000000000FA",
      "end_key": "7480000000000005FFF05F728000000000FF1C56DC0000000000FA",
      "epoch": {
        "conf_ver": 17,
        "version": 1412
      },
      "peers": [
        {
          "id": 94548,
          "store_id": 89241,
          "role_name": "Voter"
        },
        {
          "id": 120830,
          "store_id": 103455,
          "role_name": "Voter"
        },
        {
          "id": 151290,
          "store_id": 135592,
          "role_name": "Voter"
        }
      ],
      "leader": {
        "role_name": "Voter"
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 0,
      "approximate_keys": 0
    },
    {
      "id": 18277,
      "start_key": "7480000000000005FFF05F728000000000FF2CC2A00000000000FA",
      "end_key": "7480000000000005FFF05F728000000000FF323C000000000000FA",
      "epoch": {
        "conf_ver": 17,
        "version": 1416
      },
      "peers": [
        {
          "id": 100011,
          "store_id": 89241,
          "role_name": "Voter"
        },
        {
          "id": 119934,
          "store_id": 103455,
          "role_name": "Voter"
        },
        {
          "id": 151291,
          "store_id": 135592,
          "role_name": "Voter"
        }
      ],
      "leader": {
        "role_name": "Voter"
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 0,
      "approximate_keys": 0
    }
  ]
}

From the information above, the reason store 89241 never finishes going offline is that two regions (region ids 18201 and 18277) are still on it, but neither region has a leader. We need to find out why; please provide the TiKV logs involving these two regions.
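Something along these lines should pull out the relevant entries on each TiKV node (a sketch; the log path assumes the tiup deployment layout seen elsewhere in this thread, so adjust it to your topology):

# search each TiKV node's log for raft/election activity on the two stuck regions
grep -E "region_id=(18201|18277)" /data1/tidb-deploy/tikv-20160/log/tikv.log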

We've been firefighting these past two days. AWS was doing routine machine maintenance; we added new nodes and started migrating the data off, but they shut the machines down before the migration finished, and since they use local NVMe disks, all the data on them was lost. Then we discovered the GC task had been stopped for half a year, so we're now running manual GC/compaction: /home/tidb/tikv-ctl --host 127.0.0.1:20160 compact &. When it rains, it pours. As for those two regions, let me see if I can just force-delete them; we don't need the data.

That's rough :sweat_smile: Is this cluster a production environment?

Yeah, it's our big data cluster... sigh.

tiup ctl:v5.2.0 pd -u 172.21.11.59:2379 operator add remove-peer 18201 103455
Starting component ctl: /home/tidb/.tiup/components/ctl/v5.2.0/ctl pd -u 172.21.11.59:2379 operator add remove-peer 18201 103455
Failed! [500] "cannot build operator for region with no leader"

Are store 103455 and store 135592, where the other two replicas of these regions live, still stores in the current cluster? If possible, please still provide the TiKV logs involving these two regions so we can analyze why no leader can be elected; an unsafe recover may be needed.
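The store states can be checked with the same pd-ctl invocation used above, e.g.:

# confirm both stores are still registered and in Up state
tiup ctl:v5.2.0 pd -u 172.21.11.59:2379 store 103455
tiup ctl:v5.2.0 pd -u 172.21.11.59:2379 store 135592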

OK, thanks.
Below is what a search for region 18201 turns up:

[2021/10/14 03:00:18.099 +00:00] [INFO] [endpoint.rs:382] ["deregister observe region"] [observe_id=ObserveID(33779)] [region_id=1427722] [store_id=Some(103455)]
[2021/10/14 03:00:18.099 +00:00] [INFO] [endpoint.rs:309] ["register observe region"] [region="id: 1427722 start_key: 7480000000000022FF085F728000000020FFCAAA630000000000FA end_key: 7480000000000022FF085F728000000020FFD01B950000000000FA region_epoch { conf_ver: 27 version: 5116 } peers { id: 1427723 store_id: 103455 } peers { id: 1427725 store_id: 151301 } peers { id: 4398788 store_id: 918785 }"]
[2021/10/14 03:00:18.105 +00:00] [INFO] [raft.rs:1517] ["starting a new election"] [term=51] [raft_id=120830] [region_id=18201]
[2021/10/14 03:00:18.105 +00:00] [INFO] [raft.rs:1142] ["became pre-candidate at term 51"] [term=51] [raft_id=120830] [region_id=18201]
[2021/10/14 03:00:18.105 +00:00] [INFO] [raft.rs:1271] ["broadcasting vote request"] [to="[94548]"] [log_index=63] [log_term=51] [term=51] [type=MsgRequestPreVote] [raft_id=120830] [region_id=18201]
[2021/10/14 03:00:18.138 +00:00] [INFO] [pd.rs:1146] ["try to merge"] [merge="target { id: 1083746 start_key: 7480000000000029FF485F698000000000FF0000030380000178FF7F24994A03800000FF00958EBAF2000000FC end_key: 7480000000000029FF485F698000000000FF0000030380000178FF7F3F034003800000FF0095AC1B55000000FC region_epoch { conf_ver: 23 version: 5677 } peers { id: 1083747 store_id: 103455 } peers { id: 1083749 store_id: 151301 } peers { id: 1311171 store_id: 918785 } }"] [region_id=3349587]
[2021/10/14 03:00:18.169 +00:00] [INFO] [raft.rs:1336] ["received a message with higher term from 4398896"] ["msg type"=MsgRequestVote] [message_term=6] [term=5] [from=4398896] [raft_id=4398895] [region_id=4398894]
[2021/10/14 03:00:18.169 +00:00] [INFO] [raft.rs:1092] ["became follower at term 6"] [term=6] [raft_id=4398895] [region_id=4398894]
[2021/10/14 03:00:18.169 +00:00] [INFO] [raft.rs:1532] ["[logterm: 5, index: 5, vote: 0] cast vote for 4398896 [logterm: 5, index: 5] at term 6"] ["msg type"=MsgRequestVote] [term=6] [msg_index=5] [msg_term=5] [from=4398896] [vote=0] [log_index=5] [log_term=5] [raft_id=4398895] [region_id=4398894]

Note that on September 3, when I first posted, the region count on store 89241 was already 0.

Now look at region 18201 and region 18277 through pd-ctl: do they still contain peer information for store 89241?
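That is, something like:

tiup ctl:v5.2.0 pd -u 172.21.11.59:2379 region 18201
tiup ctl:v5.2.0 pd -u 172.21.11.59:2379 region 18277

and check whether a peer with "store_id": 89241 still shows up in the output.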

They do :joy:

Feels like we're stuck in a loop.

./tikv-ctl --data-dir /data1/tidb-deploy/data/tikv-20160/db/ --config /data1/tidb-deploy/tikv-20160/conf/tikv.toml unsafe-recover remove-fail-stores -s 89241 --all-regions
[2021/10/14 03:34:42.292 +00:00] [INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[2021/10/14 03:34:42.292 +00:00] [INFO] [mod.rs:479] ["encryption is disabled."]
[2021/10/14 03:34:42.293 +00:00] [WARN] [config.rs:587] ["compaction guard is disabled due to region info provider not available"]
[2021/10/14 03:34:42.293 +00:00] [WARN] [config.rs:682] ["compaction guard is disabled due to region info provider not available"]
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', cmd/tikv-ctl/src/main.rs:121:57
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Did you stop TiKV first here, and run unsafe recover on every TiKV node?

The TiKV version is 5.2.1. At the conference in Beijing back in July, Tang Liu said this could be repaired online.

Er, online unsafe recover is not supported yet; the feature is still under development. You need to stop the TiKV nodes...
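For reference, the documented offline flow is roughly the sketch below (cluster name is a placeholder; please verify the exact flags against the tikv-ctl docs for your version before running anything):

# 1. stop every TiKV instance holding a replica of the affected regions
tiup cluster stop <cluster-name> -R tikv
# 2. on each surviving TiKV node, drop the failed store from the raft membership;
#    note the documented examples point --data-dir at the store's data directory,
#    not the db/ subdirectory
./tikv-ctl --data-dir /data1/tidb-deploy/data/tikv-20160 unsafe-recover remove-fail-stores -s 89241 -r 18201,18277
# 3. restart TiKV and confirm the two regions elect a leader again
tiup cluster start <cluster-name> -R tikv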

Shutting them down has quite a big impact; is there a graceful way to do it?
Manually scheduling the leaders away and then stopping the node: could that be integrated into the stop command?

Not at the moment. Apart from the replica on the unavailable store 89241, region 18201 and region 18277 each have two more replicas, on store 103455 and store 135592. If those two stores are in Up state, the majority requirement is satisfied and a leader should normally be electable, yet no leader is being elected. So first confirm that the store states are normal, and then collect the complete region logs from the TiKV nodes so we can analyze why no leader can be elected.
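The manual flow you described would look roughly like this (a sketch; <store-id> is the node about to be stopped and <cluster-name> is a placeholder):

# evict all leaders from the node first
tiup ctl:v5.2.0 pd -u 172.21.11.59:2379 scheduler add evict-leader-scheduler <store-id>
# once the store's leader_count drops to 0, stop just that instance
tiup cluster stop <cluster-name> -N <tikv-host>:20160
# after the node is back up, remove the scheduler
tiup ctl:v5.2.0 pd -u 172.21.11.59:2379 scheduler remove evict-leader-scheduler-<store-id>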

OK. I'll move the leaders off now, shut down, then run unsafe recover.

Sigh. During the AWS maintenance a few days ago, one machine with local NVMe disks went down. Over the last two days we added new nodes and have been migrating regions over; the migration only mostly finished tonight, but there are still leftover regions, so I'll unsafe recover them all in one go.

After shutting down, it still can't be deleted:

export RUST_BACKTRACE=full
[tidb@eu-bigdata-tidb-tikv-02 ~]$ ./tikv-ctl --data-dir /data1/tidb-deploy/data/tikv-20160/db/ unsafe-recover remove-fail-stores -s 135592 --all-regions
[2021/10/14 12:19:03.787 +00:00] [INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[2021/10/14 12:19:03.787 +00:00] [INFO] [mod.rs:479] ["encryption is disabled."]
[2021/10/14 12:19:03.789 +00:00] [WARN] [config.rs:587] ["compaction guard is disabled due to region info provider not available"]
[2021/10/14 12:19:03.789 +00:00] [WARN] [config.rs:682] ["compaction guard is disabled due to region info provider not available"]
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', cmd/tikv-ctl/src/main.rs:121:57
stack backtrace:
0: 0x55b512b71203 - std::backtrace_rs::backtrace::libunwind::trace::h99dbb39dca18857d
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/../../backtrace/src/backtrace/libunwind.rs:90:5
1: 0x55b512b71203 - std::backtrace_rs::backtrace::trace_unsynchronized::h832861927e9cfedf
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: 0x55b512b71203 - std::sys_common::backtrace::_print_fmt::h3d18154c77dcf310
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:67:5
3: 0x55b512b71203 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::he312f4ad5b9bb346
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:46:22
4: 0x55b5127659fc - core::fmt::write::h9a6d9c74526a6c1b
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/fmt/mod.rs:1115:17
5: 0x55b512b6f544 - std::io::Write::write_fmt::h6aced00850e8186f
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/io/mod.rs:1665:15
6: 0x55b512b702bb - std::sys_common::backtrace::_print::h65d996766de40da4
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:49:5
7: 0x55b512b702bb - std::sys_common::backtrace::print::h40df9727e635f303
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:36:9
8: 0x55b512b702bb - std::panicking::default_hook::{{closure}}::hd2da4327dea91a51
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:208:50
9: 0x55b512b6f14a - std::panicking::default_hook::h3d55120ad6ada158
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:225:9
10: 0x55b512b6f14a - std::panicking::rust_panic_with_hook::hf85dd0bb545e3b55
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:622:17
11: 0x55b512b89128 - std::panicking::begin_panic_handler::{{closure}}::h736ae969434da9fa
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:519:13
12: 0x55b512b8909c - std::sys_common::backtrace::__rust_end_short_backtrace::h6133bb80b1d6c3e0
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:141:18
13: 0x55b512b8904d - rust_begin_unwind
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:515:5
14: 0x55b5123f1a70 - core::panicking::panic_fmt::hcf5f6d96e1dd7099
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/panicking.rs:92:14
15: 0x55b5123f1da2 - core::result::unwrap_failed::he898b02f57993c42
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/result.rs:1599:5
16: 0x55b5126104be - core::result::Result<T,E>::unwrap::h739ccad6819ded2e
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/result.rs:1281:23
17: 0x55b5126104be - tikv_ctl::new_debug_executor::hfcc27cbfe9010899
at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/cmd/tikv-ctl/src/main.rs:121:19
18: 0x55b51266d752 - tikv_ctl::main::hf405c4e76e87aca5
at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/cmd/tikv-ctl/src/main.rs:2089:9
19: 0x55b5125280e3 - core::ops::function::FnOnce::call_once::hafb3e66f5d667bfd
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/ops/function.rs:227:5
20: 0x55b5125280e3 - std::sys_common::backtrace::__rust_begin_short_backtrace::h19cb421d8d2633ef
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:125:18
21: 0x55b51268ab24 - main
22: 0x7f38c15c313a - __libc_start_main
23: 0x55b5124a7067 -
24: 0x0 -

:sweat_smile: unsafe recover is a high-risk operation. Only run it after confirming that a majority of replicas have been lost and cannot be recovered. Please be very cautious about running it in production; otherwise you face unknown risks.