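【TiDB environment】
【Overview】Scenario and problem summary
A TiKV store that is being taken offline is stuck in the Pending Offline state. Its region count fell steadily at first, but "region_count" has now stopped at 1 and will not go any lower.
【Background】Operations performed
I changed the cluster's GC life time. I am not sure whether that matters, but the store is still stuck after I changed the GC time back.
【Symptoms】Application and database behavior
【Business impact】
【TiDB version】: v4.0.9
【Attachments】
I have seen this thread, but that case is not quite the same as mine.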
Could you provide the TiKV logs from this node?
Run this in pd-ctl:
region store 189492
and find the corresponding region_id.
Then run
scheduler show
to see which schedulers are currently running.
tikv-ctl --host 192.168.241.56:20160 raft region -r ${region_id}
This shows the state of that region.
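The steps above can be chained in a small script. This is only a minimal Python sketch, not an official tool: it assumes the JSON printed by `region store` has a top-level "regions" list with an "id" per entry, and the helper names `region_ids` and `tikv_ctl_command` as well as the sample region ID are made up for illustration.

```python
import json
from typing import List


def region_ids(region_store_json: str) -> List[int]:
    """Return the region IDs found in the JSON printed by `region store <store_id>`."""
    data = json.loads(region_store_json)
    return [region["id"] for region in data.get("regions", [])]


def tikv_ctl_command(host: str, region_id: int) -> List[str]:
    """Build the tikv-ctl call from the steps above for a single region."""
    return ["tikv-ctl", "--host", host, "raft", "region", "-r", str(region_id)]


if __name__ == "__main__":
    # Hypothetical sample; in practice this text would be pasted from pd-ctl.
    sample = '{"count": 1, "regions": [{"id": 1001}]}'
    for rid in region_ids(sample):
        # Print the command instead of running it, so nothing touches the cluster.
        print(" ".join(tikv_ctl_command("192.168.241.56:20160", rid)))
```

The script only prints the tikv-ctl command line, so it is safe to try against pasted output before running anything on the cluster.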
Nothing abnormal shows up in either the TiKV or the PD logs:
[root@b56 tikv]# tail -50 /disk1/tikv/log/tikv.log
[2021/11/04 08:02:04.190 +08:00] [INFO] [util.rs:190] ["heartbeat sender and receiver are stale, refreshing ..."]
[2021/11/04 08:02:04.190 +08:00] [WARN] [util.rs:209] ["updating PD client done"] [spend=4.093136ms]
[2021/11/04 08:07:27.064 +08:00] [INFO] [gc_manager.rs:416] ["gc_worker: start auto gc"] [safe_point=428862518923624448]
[2021/11/04 08:07:27.064 +08:00] [INFO] [gc_manager.rs:456] ["gc_worker: finished auto gc"] [processed_regions=0]
[2021/11/04 08:12:04.191 +08:00] [INFO] [util.rs:419] ["connecting to PD endpoint"] [endpoints=http://192.168.241.26:12379]
[2021/11/04 08:12:04.192 +08:00] [INFO] [] ["New connected subchannel at 0x7f6a89a3c130 for subchannel 0x7f6a89a11d00"]
[2021/11/04 08:12:04.193 +08:00] [INFO] [util.rs:419] ["connecting to PD endpoint"] [endpoints=http://192.168.241.49:12379]
[2021/11/04 08:12:04.195 +08:00] [INFO] [util.rs:484] ["connected to PD leader"] [endpoints=http://192.168.241.49:12379]
[2021/11/04 08:12:04.195 +08:00] [INFO] [util.rs:190] ["heartbeat sender and receiver are stale, refreshing ..."]
[2021/11/04 08:12:04.195 +08:00] [WARN] [util.rs:209] ["updating PD client done"] [spend=4.443234ms]
[2021/11/04 08:22:04.196 +08:00] [INFO] [util.rs:419] ["connecting to PD endpoint"] [endpoints=http://192.168.241.26:12379]
[2021/11/04 08:22:04.198 +08:00] [INFO] [] ["New connected subchannel at 0x7f6a89a3c130 for subchannel 0x7f6a89a11d00"]
[2021/11/04 08:22:04.200 +08:00] [INFO] [util.rs:419] ["connecting to PD endpoint"] [endpoints=http://192.168.241.49:12379]
[2021/11/04 08:22:04.202 +08:00] [INFO] [util.rs:484] ["connected to PD leader"] [endpoints=http://192.168.241.49:12379]
[2021/11/04 08:22:04.202 +08:00] [INFO] [util.rs:190] ["heartbeat sender and receiver are stale, refreshing ..."]
[2021/11/04 08:22:04.202 +08:00] [WARN] [util.rs:209] ["updating PD client done"] [spend=6.334641ms]
[2021/11/04 08:32:04.203 +08:00] [INFO] [util.rs:419] ["connecting to PD endpoint"] [endpoints=http://192.168.241.26:12379]
[2021/11/04 08:32:04.204 +08:00] [INFO] [] ["New connected subchannel at 0x7f6a89a3c130 for subchannel 0x7f6a89a11d00"]
[2021/11/04 08:32:04.205 +08:00] [INFO] [util.rs:419] ["connecting to PD endpoint"] [endpoints=http://192.168.241.49:12379]
[2021/11/04 08:32:04.207 +08:00] [INFO] [util.rs:484] ["connected to PD leader"] [endpoints=http://192.168.241.49:12379]
[2021/11/04 08:32:04.207 +08:00] [INFO] [util.rs:190] ["heartbeat sender and receiver are stale, refreshing ..."]
[2021/11/04 08:32:04.207 +08:00] [WARN] [util.rs:209] ["updating PD client done"] [spend=4.166262ms]
[2021/11/04 08:37:27.128 +08:00] [INFO] [gc_manager.rs:416] ["gc_worker: start auto gc"] [safe_point=428862990782824448]
[2021/11/04 08:37:27.128 +08:00] [INFO] [gc_manager.rs:456] ["gc_worker: finished auto gc"] [processed_regions=0]
[2021/11/04 08:42:04.208 +08:00] [INFO] [util.rs:419] ["connecting to PD endpoint"] [endpoints=http://192.168.241.26:12379]
[2021/11/04 08:42:04.209 +08:00] [INFO] [] ["New connected subchannel at 0x7f6a89a3c130 for subchannel 0x7f6a89a11d00"]
[2021/11/04 08:42:04.210 +08:00] [INFO] [util.rs:419] ["connecting to PD endpoint"] [endpoints=http://192.168.241.49:12379]
[2021/11/04 08:42:04.212 +08:00] [INFO] [util.rs:484] ["connected to PD leader"] [endpoints=http://192.168.241.49:12379]
[2021/11/04 08:42:04.212 +08:00] [INFO] [util.rs:190] ["heartbeat sender and receiver are stale, refreshing ..."]
[2021/11/04 08:42:04.212 +08:00] [WARN] [util.rs:209] ["updating PD client done"] [spend=3.89979ms]
[2021/11/04 08:52:04.213 +08:00] [INFO] [util.rs:419] ["connecting to PD endpoint"] [endpoints=http://192.168.241.26:12379]
[2021/11/04 08:52:04.214 +08:00] [INFO] [] ["New connected subchannel at 0x7f6a89a3c130 for subchannel 0x7f6a89a11d00"]
[2021/11/04 08:52:04.215 +08:00] [INFO] [util.rs:419] ["connecting to PD endpoint"] [endpoints=http://192.168.241.49:12379]
[2021/11/04 08:52:04.217 +08:00] [INFO] [util.rs:484] ["connected to PD leader"] [endpoints=http://192.168.241.49:12379]
[2021/11/04 08:52:04.217 +08:00] [INFO] [util.rs:190] ["heartbeat sender and receiver are stale, refreshing ..."]
[2021/11/04 08:52:04.217 +08:00] [WARN] [util.rs:209] ["updating PD client done"] [spend=4.194713ms]
[2021/11/04 09:02:04.218 +08:00] [INFO] [util.rs:419] ["connecting to PD endpoint"] [endpoints=http://192.168.241.26:12379]
[2021/11/04 09:02:04.219 +08:00] [INFO] [] ["New connected subchannel at 0x7f6a89a3c130 for subchannel 0x7f6a89a11d00"]
[2021/11/04 09:02:04.221 +08:00] [INFO] [util.rs:419] ["connecting to PD endpoint"] [endpoints=http://192.168.241.49:12379]
[2021/11/04 09:02:04.222 +08:00] [INFO] [util.rs:484] ["connected to PD leader"] [endpoints=http://192.168.241.49:12379]
[2021/11/04 09:02:04.222 +08:00] [INFO] [util.rs:190] ["heartbeat sender and receiver are stale, refreshing ..."]
[2021/11/04 09:02:04.222 +08:00] [WARN] [util.rs:209] ["updating PD client done"] [spend=4.170834ms]
[2021/11/04 09:07:27.193 +08:00] [INFO] [gc_manager.rs:416] ["gc_worker: start auto gc"] [safe_point=428863462642024448]
[2021/11/04 09:07:27.193 +08:00] [INFO] [gc_manager.rs:456] ["gc_worker: finished auto gc"] [processed_regions=0]
[2021/11/04 09:12:04.223 +08:00] [INFO] [util.rs:419] ["connecting to PD endpoint"] [endpoints=http://192.168.241.26:12379]
[2021/11/04 09:12:04.224 +08:00] [INFO] [] ["New connected subchannel at 0x7f6a89a3c130 for subchannel 0x7f6a89a11d00"]
[2021/11/04 09:12:04.225 +08:00] [INFO] [util.rs:419] ["connecting to PD endpoint"] [endpoints=http://192.168.241.49:12379]
[2021/11/04 09:12:04.227 +08:00] [INFO] [util.rs:484] ["connected to PD leader"] [endpoints=http://192.168.241.49:12379]
[2021/11/04 09:12:04.227 +08:00] [INFO] [util.rs:190] ["heartbeat sender and receiver are stale, refreshing ..."]
[2021/11/04 09:12:04.227 +08:00] [WARN] [util.rs:209] ["updating PD client done"] [spend=4.297281ms]
[root@b56 tikv]#
[tidb@b16 ~]$ pd-ctl -u 192.168.241.24:12379
» region store 189492
{
  "count": 1,
  "regions": [
    {
      "id": 4549271,
      "start_key": "7480000000000008FFB65F698000000000FF0000060380000000FF0000000103800000FF0851CA3863000000FC",
      "end_key": "7480000000000008FFB65F698000000000FF0000060380000000FF0000000103800000FF095675FCF2000000FC",
      "epoch": {
        "conf_ver": 18776,
        "version": 8239
      },
      "peers": [
        {
          "id": 6203802,
          "store_id": 3353181
        },
        {
          "id": 6193654,
          "store_id": 326765
        },
        {
          "id": 6193745,
          "store_id": 189492
        },
        {
          "id": 6193932,
          "store_id": 503256,
          "is_learner": true
        }
      ],
      "leader": {
        "id": 6193654,
        "store_id": 326765
      },
      "down_peers": [
        {
          "peer": {
            "id": 6193932,
            "store_id": 503256,
            "is_learner": true
          },
          "down_seconds": 21893575
        }
      ],
      "pending_peers": [
        {
          "id": 6193932,
          "store_id": 503256,
          "is_learner": true
        }
      ],
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 57,
      "approximate_keys": 892697
    }
  ]
}
» scheduler show
[
  "balance-hot-region-scheduler",
  "balance-leader-scheduler",
  "balance-region-scheduler",
  "label-scheduler"
]
»
[tidb@b16 ~]$ tikv-ctl --host 192.168.241.56:20160 raft region -r 4549271
region id: 4549271
region state key: \001\003\000\000\000\000\000Ej\227\001
region state: Some(region {id: 4549271 start_key: "t\200\000\000\000\000\000\010\377\266_i\200\000\000\000\000\377\000\000\006\003\200\000\000\000\377\000\000\000\001\003\200\000\000\377\010Q\3128c\000\000\000\374" end_key: "t\200\000\000\000\000\000\010\377\266_i\200\000\000\000\000\377\000\000\006\003\200\000\000\000\377\000\000\000\001\003\200\000\000\377\tVu\374\362\000\000\000\374" region_epoch {conf_ver: 18776 version: 8239} peers {id: 6203802 store_id: 3353181} peers {id: 6193654 store_id: 326765} peers {id: 6193745 store_id: 189492} peers {id: 6193932 store_id: 503256 is_learner: true}})
raft state key: \001\002\000\000\000\000\000Ej\227\002
raft state: Some(hard_state {term: 1098 vote: 6193654 commit: 23877700} last_index: 23877700)
apply state key: \001\002\000\000\000\000\000Ej\227\003
apply state: Some(applied_index: 23877700 truncated_state {index: 23852168 term: 1098})
[tidb@b16 ~]$
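To make the region output above easier to read, here is a rough Python sketch that lists each peer with its role and whether PD reports it as down or pending. It assumes the field names exactly as they appear in the pd-ctl output; `summarize_peers` and the trimmed `REGION_JSON` copy are illustrative only.

```python
import json

# Trimmed copy of the region from the pd-ctl output above (keys and counters omitted).
REGION_JSON = """
{
  "id": 4549271,
  "peers": [
    {"id": 6203802, "store_id": 3353181},
    {"id": 6193654, "store_id": 326765},
    {"id": 6193745, "store_id": 189492},
    {"id": 6193932, "store_id": 503256, "is_learner": true}
  ],
  "leader": {"id": 6193654, "store_id": 326765},
  "down_peers": [
    {"peer": {"id": 6193932, "store_id": 503256, "is_learner": true},
     "down_seconds": 21893575}
  ],
  "pending_peers": [
    {"id": 6193932, "store_id": 503256, "is_learner": true}
  ]
}
"""


def summarize_peers(region: dict) -> list:
    """Return one row per peer: store, role, and the down/pending flags PD reports."""
    down = {d["peer"]["id"]: d["down_seconds"] for d in region.get("down_peers", [])}
    pending = {p["id"] for p in region.get("pending_peers", [])}
    leader_id = region.get("leader", {}).get("id")
    rows = []
    for peer in region["peers"]:
        if peer["id"] == leader_id:
            role = "leader"
        elif peer.get("is_learner"):
            role = "learner"
        else:
            role = "follower"
        rows.append({
            "store_id": peer["store_id"],
            "role": role,
            "down_seconds": down.get(peer["id"]),  # None if PD does not report it down
            "pending": peer["id"] in pending,
        })
    return rows


if __name__ == "__main__":
    for row in summarize_peers(json.loads(REGION_JSON)):
        print(row)
```

On this data the only peer flagged as down or pending is the learner on store 503256. The peer on store 189492, the store being taken offline, is an ordinary follower with no flags.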
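What is the current state of store 503256, which holds one of this region's peers? From the status it looks down. Is that store a TiFlash node?
No. Our cluster has no TiFlash, only TiKV.
Please check whether store 503256 is in a healthy state.
Store 503256 looks normal to me. Is this the right way to check?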
[tidb@b16 ~]$ pd-ctl -u 192.168.241.24:12379
» store 503256
{
  "store": {
    "id": 503256,
    "address": "192.168.241.11:20160",
    "version": "4.0.9",
    "status_address": "192.168.241.11:20180",
    "git_hash": "18dec72b12eafdc40a463eee8f6c32594ee4a9ff",
    "start_timestamp": 1612784076,
    "deploy_path": "/disk1/tikv/bin",
    "last_heartbeat": 1636006907178636325,
    "state_name": "Up"
  },
  "status": {
    "capacity": "1.791TiB",
    "available": "932.3GiB",
    "used_size": "874.4GiB",
    "leader_count": 15030,
    "leader_weight": 1,
    "leader_score": 15030,
    "leader_size": 1012691,
    "region_count": 44821,
    "region_weight": 1,
    "region_score": 3002792,
    "region_size": 3002792,
    "start_ts": "2021-02-08T19:34:36+08:00",
    "last_heartbeat_ts": "2021-11-04T14:21:47.178636325+08:00",
    "uptime": "6450h47m11.178636325s"
  }
}
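As a sanity check on the numbers above, the store's raw `last_heartbeat` value and the `down_seconds` reported for the learner peer can be converted to readable units. This is just an illustrative Python sketch; it assumes `last_heartbeat` is a Unix timestamp in nanoseconds (which agrees with the `last_heartbeat_ts` field), and the function names are invented here.

```python
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))  # the +08:00 offset used in the output above


def heartbeat_time(last_heartbeat_ns: int) -> datetime:
    """Convert a nanosecond Unix timestamp to a +08:00 datetime (microsecond precision)."""
    seconds, nanos = divmod(last_heartbeat_ns, 1_000_000_000)
    return datetime.fromtimestamp(seconds, tz=CST) + timedelta(microseconds=nanos // 1000)


def down_days(down_seconds: int) -> float:
    """Express a down_seconds value in days."""
    return down_seconds / 86400


if __name__ == "__main__":
    print(heartbeat_time(1636006907178636325).isoformat())
    print(round(down_days(21893575), 1), "days")
```

The heartbeat converts to 14:21:47 on 2021-11-04 (+08:00), so the store itself is reporting in normally, while the learner peer of region 4549271 on that store has been marked down for roughly 253 days.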
1. This node's state looks normal. Please search that TiKV's logs for region 4549271 and see whether there are any related messages.
2. If the logs contain nothing about that region, try removing the region's peer from store 189492 through pd-ctl:
operator add remove-peer 4549271 189492
or force the TiKV node offline:
tiup cluster scale-in {cluster-name} -N {ip:port} --force
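For anyone scripting the two options from this reply, the following is a small Python sketch that only assembles the command lines and prints them. It assumes pd-ctl is run in single-command mode with `-u`, as in the session earlier in the thread; the function names and the placeholder cluster name are hypothetical.

```python
import shlex
from typing import List


def remove_peer_command(pd_addr: str, region_id: int, store_id: int) -> List[str]:
    """pd-ctl call that asks PD to remove the region's peer from the given store."""
    return ["pd-ctl", "-u", pd_addr, "operator", "add", "remove-peer",
            str(region_id), str(store_id)]


def force_scale_in_command(cluster_name: str, node: str) -> List[str]:
    """tiup call that forces the TiKV node offline (the second option in the reply)."""
    return ["tiup", "cluster", "scale-in", cluster_name, "-N", node, "--force"]


if __name__ == "__main__":
    # Dry run: print the commands for review instead of executing them.
    print(shlex.join(remove_peer_command("192.168.241.24:12379", 4549271, 189492)))
    # "my-cluster" is a placeholder; the real cluster name is not given in the thread.
    print(shlex.join(force_scale_in_command("my-cluster", "192.168.241.56:20160")))
```

Printing rather than executing keeps the order from the reply: try remove-peer first, and fall back to the forced scale-in only if that does not help.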
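One more question: what might have caused this? Could it be leftover historical data? The system was deployed by my predecessor and I only took it over recently.
I am not sure whether communication between the PD leader and store 503256 was temporarily interrupted while the last region was being transferred to store 503256, leaving the migration stuck. Without the corresponding logs, the root cause is hard to confirm.
OK, thanks~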