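【TiDB environment】Production
【TiDB version】v5.3.0
【Problem】Regions without a leader cause GC to fail
【Reproduction steps】
【Symptoms and impact】
Regions without a leader cause GC to fail, so disk space is never reclaimed.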
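You can refer to this article: once you have confirmed that the problem regions hold no data, clean them up with the tombstone region method.
I have read that article and already followed it.
I ran it on "store_id": 4 and it reported success. The other two stores can no longer start because of failures.
The region still exists.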
Starting component ctl: /home/maintain/.tiup/components/ctl/v5.3.0/ctl pd -u http://10.11.46.23:2379 region 20521929
{
  "id": 20521929,
  "start_key": "748000000000000AFF045F698000000000FF0000030380000000FF0000000003800000FF0000000000038000FF0000000000000380FF0000001A8A4ABE00FE",
  "end_key": "748000000000000AFF045F698000000000FF0000030380000000FF0000000003800000FF0000000065038000FF0000000000020380FF0000001930881600FE",
  "epoch": {
    "conf_ver": 3539,
    "version": 8621
  },
  "peers": [
    {
      "id": 27764108,
      "store_id": 4,
      "role_name": "Voter"
    },
    {
      "id": 28149959,
      "store_id": 28149807,
      "role_name": "Voter"
    },
    {
      "id": 28149992,
      "store_id": 28149806,
      "role_name": "Voter"
    }
  ],
  "leader": {
    "role_name": "Voter"
  },
  "written_bytes": 0,
  "read_bytes": 0,
  "written_keys": 0,
  "read_keys": 0,
  "approximate_size": 0,
  "approximate_keys": 0
}
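The output above can be checked mechanically. The following is a minimal sketch, assuming that a healthy region's "leader" object carries the leader peer's "id" and "store_id", and that their absence (as in this output) means PD currently knows of no leader. The helper name has_leader is made up for illustration and is not part of pd-ctl.

```python
import json


def has_leader(region: dict) -> bool:
    """Return True if the region JSON names a concrete leader peer.

    Assumption: a region with a leader reports the leader's peer id and
    store_id; a bare {"role_name": "Voter"} means no leader is known.
    """
    leader = region.get("leader") or {}
    return bool(leader.get("id")) and bool(leader.get("store_id"))


# Trimmed copy of the `region 20521929` output from the thread.
sample = json.loads("""
{
  "id": 20521929,
  "peers": [
    {"id": 27764108, "store_id": 4, "role_name": "Voter"},
    {"id": 28149959, "store_id": 28149807, "role_name": "Voter"},
    {"id": 28149992, "store_id": 28149806, "role_name": "Voter"}
  ],
  "leader": {"role_name": "Voter"}
}
""")

print(has_leader(sample))  # False
```

The same function could be run over every entry of a "regions" list to find the leaderless ones.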
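What error do the failed stores report?
The failed stores ran on cloud hosts that were taken offline a long time ago ("start_ts": "2021-09-15T23:32:00+08:00"), but for some reason they never moved to the Tombstone state. My guess is that these regions are also the cause of that.
You can treat the hosts of these two stores as no longer existing.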
Starting component ctl: /home/maintain/.tiup/components/ctl/v5.3.0/ctl pd -u http://10.11.46.23:2379 store 28149808
{
  "store": {
    "id": 28149808,
    "address": "alibjf-op-tidb-server5-vm:4000",
    "state": 1,
    "version": "4.0.8",
    "status_address": "alibjf-op-tidb-server5-vm:10080",
    "git_hash": "83091173e960e5a0f5f417e921a0801d2f6635ae",
    "start_timestamp": 1631719920,
    "deploy_path": "/home/shared/tidb-deploy/tidb-4000/bin",
    "last_heartbeat": 1631720260888908076,
    "state_name": "Offline"
  },
  "status": {
    "capacity": "0B",
    "available": "0B",
    "used_size": "0B",
    "leader_count": 0,
    "leader_weight": 0,
    "leader_score": 0,
    "leader_size": 0,
    "region_count": 0,
    "region_weight": 0,
    "region_score": 0,
    "region_size": 0,
    "slow_score": 0,
    "start_ts": "2021-09-15T23:32:00+08:00",
    "last_heartbeat_ts": "2021-09-15T23:37:40.888908076+08:00",
    "uptime": "5m40.888908076s"
  }
}
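The raw timestamps in this store output are easier to read once converted. The sketch below assumes "start_timestamp" is in seconds and "last_heartbeat" is in nanoseconds since the Unix epoch, which is consistent with the "start_ts", "last_heartbeat_ts" and "uptime" fields shown. The function name heartbeat_summary and the fixed +08:00 offset are illustrative choices.

```python
from datetime import datetime, timedelta, timezone

# The thread's timestamps are printed in +08:00, so the same offset is used here.
CST = timezone(timedelta(hours=8))


def heartbeat_summary(store: dict) -> dict:
    """Summarise when a store was last seen, from the "store" object."""
    start_s = store["start_timestamp"]               # seconds since epoch
    last_s = store["last_heartbeat"] // 1_000_000_000  # nanoseconds -> seconds
    return {
        "state": store["state_name"],
        "last_heartbeat": datetime.fromtimestamp(last_s, CST).isoformat(),
        "uptime_seconds": last_s - start_s,
    }


# Values copied from the `store 28149808` output in the thread.
store = {
    "id": 28149808,
    "start_timestamp": 1631719920,
    "last_heartbeat": 1631720260888908076,
    "state_name": "Offline",
}

summary = heartbeat_summary(store)
print(summary["state"], summary["last_heartbeat"], summary["uptime_seconds"])
```

For this store the result is an uptime of 340 seconds (5m40s) and a last heartbeat at 2021-09-15T23:37:40+08:00, so the store has been Offline without a heartbeat ever since.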
Run pd-ctl region store xxxx to see which regions are on these offline stores.
Starting component ctl: /home/maintain/.tiup/components/ctl/v5.3.0/ctl pd -u http://10.11.46.23:2379 region store 28149808
{
  "count": 2,
  "regions": [
    {
      "id": 19006903,
      "start_key": "748000000000000AFF0E5F7280000001B6FF5A4AB80000000000FA",
      "end_key": "748000000000000AFF1200000000000000F8",
      "epoch": {
        "conf_ver": 9737,
        "version": 12680
      },
      "peers": [
        {
          "id": 27941117,
          "store_id": 26634932,
          "role_name": "Voter"
        },
        {
          "id": 28150112,
          "store_id": 28149806,
          "role_name": "Voter"
        },
        {
          "id": 28150747,
          "store_id": 28149808,
          "role_name": "Voter"
        }
      ],
      "leader": {
        "role_name": "Voter"
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 0,
      "approximate_keys": 0
    },
    {
      "id": 24703116,
      "start_key": "748000000000001EFF955F728000000000FF06AAC70000000000FA",
      "end_key": "748000000000001EFF955F728000000000FF07133B0000000000FA",
      "epoch": {
        "conf_ver": 3052,
        "version": 23116
      },
      "peers": [
        {
          "id": 26705681,
          "store_id": 26634932,
          "role_name": "Voter"
        },
        {
          "id": 28150999,
          "store_id": 28149807,
          "role_name": "Voter"
        },
        {
          "id": 28158402,
          "store_id": 28149808,
          "role_name": "Voter"
        }
      ],
      "leader": {
        "role_name": "Voter"
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 0,
      "approximate_keys": 0
    }
  ]
}
Starting component ctl: /home/maintain/.tiup/components/ctl/v5.3.0/ctl pd -u http://10.11.46.23:2379 region store 28149807
{
  "count": 2,
  "regions": [
    {
      "id": 20521929,
      "start_key": "748000000000000AFF045F698000000000FF0000030380000000FF0000000003800000FF0000000000038000FF0000000000000380FF0000001A8A4ABE00FE",
      "end_key": "748000000000000AFF045F698000000000FF0000030380000000FF0000000003800000FF0000000065038000FF0000000000020380FF0000001930881600FE",
      "epoch": {
        "conf_ver": 3539,
        "version": 8621
      },
      "peers": [
        {
          "id": 27764108,
          "store_id": 4,
          "role_name": "Voter"
        },
        {
          "id": 28149959,
          "store_id": 28149807,
          "role_name": "Voter"
        },
        {
          "id": 28149992,
          "store_id": 28149806,
          "role_name": "Voter"
        }
      ],
      "leader": {
        "role_name": "Voter"
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 0,
      "approximate_keys": 0
    },
    {
      "id": 24703116,
      "start_key": "748000000000001EFF955F728000000000FF06AAC70000000000FA",
      "end_key": "748000000000001EFF955F728000000000FF07133B0000000000FA",
      "epoch": {
        "conf_ver": 3052,
        "version": 23116
      },
      "peers": [
        {
          "id": 26705681,
          "store_id": 26634932,
          "role_name": "Voter"
        },
        {
          "id": 28150999,
          "store_id": 28149807,
          "role_name": "Voter"
        },
        {
          "id": 28158402,
          "store_id": 28149808,
          "role_name": "Voter"
        }
      ],
      "leader": {
        "role_name": "Voter"
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 0,
      "approximate_keys": 0
    }
  ]
}
Starting component ctl: /home/maintain/.tiup/components/ctl/v5.3.0/ctl pd -u http://10.11.46.23:2379 region store 28149806
{
  "count": 2,
  "regions": [
    {
      "id": 20521929,
      "start_key": "748000000000000AFF045F698000000000FF0000030380000000FF0000000003800000FF0000000000038000FF0000000000000380FF0000001A8A4ABE00FE",
      "end_key": "748000000000000AFF045F698000000000FF0000030380000000FF0000000003800000FF0000000065038000FF0000000000020380FF0000001930881600FE",
      "epoch": {
        "conf_ver": 3539,
        "version": 8621
      },
      "peers": [
        {
          "id": 27764108,
          "store_id": 4,
          "role_name": "Voter"
        },
        {
          "id": 28149959,
          "store_id": 28149807,
          "role_name": "Voter"
        },
        {
          "id": 28149992,
          "store_id": 28149806,
          "role_name": "Voter"
        }
      ],
      "leader": {
        "role_name": "Voter"
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 0,
      "approximate_keys": 0
    },
    {
      "id": 19006903,
      "start_key": "748000000000000AFF0E5F7280000001B6FF5A4AB80000000000FA",
      "end_key": "748000000000000AFF1200000000000000F8",
      "epoch": {
        "conf_ver": 9737,
        "version": 12680
      },
      "peers": [
        {
          "id": 27941117,
          "store_id": 26634932,
          "role_name": "Voter"
        },
        {
          "id": 28150112,
          "store_id": 28149806,
          "role_name": "Voter"
        },
        {
          "id": 28150747,
          "store_id": 28149808,
          "role_name": "Voter"
        }
      ],
      "leader": {
        "role_name": "Voter"
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 0,
      "approximate_keys": 0
    }
  ]
}
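Putting the three outputs together shows why none of these regions can elect a leader: Raft needs a majority of voters, and each region has only one peer on a store that still exists. Here is a rough sketch of that check. It assumes stores 28149806, 28149807 and 28149808 are all gone for good (the thread mentions failed stores whose hosts no longer exist, and these are the three IDs queried); the function name live_peer_report is hypothetical.

```python
def live_peer_report(regions: dict, dead_stores: set) -> dict:
    """For each region, list surviving peer stores and whether they form a majority."""
    report = {}
    for region_id, peer_stores in regions.items():
        live = [s for s in peer_stores if s not in dead_stores]
        quorum = len(peer_stores) // 2 + 1  # Raft majority of the voters
        report[region_id] = {
            "live_stores": live,
            "has_quorum": len(live) >= quorum,
        }
    return report


# Peer placement copied from the pd-ctl outputs in the thread.
regions = {
    20521929: [4, 28149807, 28149806],
    19006903: [26634932, 28149806, 28149808],
    24703116: [26634932, 28149807, 28149808],
}
# Assumption: all three of these stores are permanently lost.
dead_stores = {28149806, 28149807, 28149808}

report = live_peer_report(regions, dead_stores)
for region_id, info in report.items():
    print(region_id, info)
```

Under that assumption every region keeps 1 of 3 voters, so none has a quorum, and the only stores holding surviving peers are 4 and 26634932.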
Use the tombstone region method to handle the regions on these offline stores first.
Yesterday I already ran it once on "store_id": 4 and "store_id": 26634932, but PD still shows these regions.
Does it need to be run on every TiKV node?
This is what I ran on "store_id": 4 and "store_id": 26634932:
./tikv-ctl --data-dir "/ssd/tidb-data/tikv-20160" tombstone -p 10.11.46.23:2379 -r 19006903,24703116,20521929 --force
After that, PD still has these regions.
I did not dare to use unsafe recovery in production.
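My understanding is that it has to be run on every TiKV node where these regions have peers.
Is there any way to make GC skip these regions?
Definitely not. You have to handle them one by one.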
I found a solution.
If you are sure the region is no longer in use, you can rebuild it with recreate-region, and the rest of the process can then continue.
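The thread does not show the exact recreate-region invocation. As a rough sketch only, the snippet below prints one candidate command per region without running anything. It assumes recreate-region accepts the same --data-dir, -p and -r options as the tombstone command quoted earlier and that it takes one region ID per call; neither is confirmed in this thread, so check them against the help output of your tikv-ctl version first. The function name recreate_commands is made up for illustration.

```python
import shlex


def recreate_commands(data_dir: str, pd_addr: str, region_ids: list) -> list:
    """Render (not run) one tikv-ctl recreate-region command per region.

    Assumption: the flags mirror the tombstone invocation from the thread;
    they are NOT taken from an invocation shown in this discussion.
    """
    commands = []
    for region_id in region_ids:
        argv = [
            "./tikv-ctl",
            "--data-dir", data_dir,
            "recreate-region",
            "-p", pd_addr,
            "-r", str(region_id),
        ]
        commands.append(shlex.join(argv))
    return commands


# Data directory, PD address and region IDs as they appear in the thread.
commands = recreate_commands(
    "/ssd/tidb-data/tikv-20160",
    "10.11.46.23:2379",
    [19006903, 24703116, 20521929],
)
for line in commands:
    print(line)
```

Printing the commands first gives a chance to review them before touching a production node; as the poster says, this only applies when the regions are confirmed to be unused.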
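When operating on TiKV, try to plan for evicting and migrating region leaders up front to reduce this kind of problem; otherwise you end up with a pile of operations to make up for it.
Frankly, the official documentation on taking nodes offline is not very complete. Quite a few people run into all sorts of problems during scale-in, and I have already seen many such cases.
Yes, understood. That is how the documentation keeps getting better~
In fact, neither tombstone region nor recreate region counts as a normal operation. There used to be a command for manually setting a store to tombstone, but it was later removed officially. The most serious problem is still that Raft cannot elect a leader, and on top of that the commands give different results on different regions.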