Evict leader does not take effect when upgrading TiKV to 5.0

To help resolve the issue faster, please provide the following information; a clear problem description speeds things up:

【Overview】Upgrading a TiDB 4.0.13 cluster to v5.0.3

【Background】10 TiKV nodes

【Symptom】The operation started at 22:00 at night. Of the 10 TiKV nodes in total, 6 had already been upgraded to 5.0.3. When upgrading the 7th TiKV node, PD executed evict leader, but the TiKV leader count did not drop to 0.

【Business impact】The cluster cannot be upgraded.

【TiDB version】v4.0.13

【Attachments】

At 22:50, PD executed evict leader. After waiting 10 minutes the leader count still had not dropped, so we ran remove evict-leader-scheduler on PD and continued upgrading the other TiKV nodes. Once only 18.158 was left, we tried upgrading it again at 23:18, and this time it succeeded.
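For reference, the manual eviction described above corresponds to pd-ctl commands roughly like the following (a minimal sketch; <store-id> stands in for the store ID of the 18.158 node, which is not shown in this thread, and the exact name accepted by scheduler remove can differ between PD versions):

# add an evict-leader scheduler for the store that is about to be upgraded
pd-ctl -u http://pd:2479 scheduler add evict-leader-scheduler <store-id>
# if eviction stalls, remove the scheduler again before moving on
pd-ctl -u http://pd:2479 scheduler remove evict-leader-scheduler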

Logs from the abnormal TiKV node 123.59.18.158:

Link: Baidu Netdisk (the shared link no longer exists), extraction code: 98y2


From the logs, communication between this TiKV node and the other nodes was interrupted starting at 22:25, so it is expected that evict leader did not take effect during that window. After the node recovered at 22:37, evict leader took effect and executed normally. The last timestamp in the log is 22:38, and it is unclear what happened after that. So the key question is why this TiKV node lost contact with the other nodes at 22:25 in the first place.

The only manual operation was the evict leader via PD, after which we watched the leader count. Let us know what other information is needed and we can provide it.
I have upgraded four 4.0 clusters to 5.0, and during every upgrade some TiKV node hit this evict leader failure.

A normal upgrade does not require manually evicting leaders. What was the motivation for doing the evict leader by hand?
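(For comparison, a tiup-driven upgrade evicts leaders automatically as part of the rolling restart; a minimal sketch, assuming a tiup-managed cluster with the hypothetical name mycluster:)

tiup cluster upgrade mycluster v5.0.3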

Last night we upgraded another 4.0 cluster and the problem reproduced.
The cluster upgrade started at 19:40 and finished around 21:00; during that window, evict leader failed on 18.141.
The PD operations were as follows:
[root@emarsys105016 logs]# date && pd-ctl -u http://pd:2479 scheduler add evict-leader-scheduler 459653
Tue Jul 6 20:06:09 CST 2021
Success!

[root@emarsys105016 logs]# date && pd-ctl -u http://pd:2479 store 459653
Tue Jul 6 20:08:34 CST 2021
{
  "store": {
    "id": 459653,
    "address": "123.59.18.141:20160",
    "labels": [
      {
        "key": "rack",
        "value": "5F-A8-01"
      }
    ],
    "version": "4.0.13",
    "status_address": "123.59.18.141:20180",
    "git_hash": "a448d617f79ddf545be73931525bb41af0f790f3",
    "start_timestamp": 1623481939,
    "deploy_path": "/",
    "last_heartbeat": 1625573310245948904,
    "state_name": "Up"
  },
  "status": {
    "capacity": "892.6GiB",
    "available": "623.3GiB",
    "used_size": "51.59GiB",
    "leader_count": 1374,
    "leader_weight": 4,
    "leader_score": 343.5,
    "leader_size": 92893,
    "region_count": 5163,
    "region_weight": 4,
    "region_score": 86283.75,
    "region_size": 345135,
    "start_ts": "2021-06-12T07:12:19Z",
    "last_heartbeat_ts": "2021-07-06T12:08:30.245948904Z",
    "uptime": "580h56m11.245948904s"
  }
}

[root@emarsys105016 logs]# date && pd-ctl -u http://pd:2479 store 459653
Tue Jul 6 20:13:01 CST 2021
{
  "store": {
    "id": 459653,
    "address": "123.59.18.141:20160",
    "labels": [
      {
        "key": "rack",
        "value": "5F-A8-01"
      }
    ],
    "version": "4.0.13",
    "status_address": "123.59.18.141:20180",
    "git_hash": "a448d617f79ddf545be73931525bb41af0f790f3",
    "start_timestamp": 1623481939,
    "deploy_path": "/",
    "last_heartbeat": 1625573580283732716,
    "state_name": "Up"
  },
  "status": {
    "capacity": "892.6GiB",
    "available": "625.1GiB",
    "used_size": "51.5GiB",
    "leader_count": 1344,
    "leader_weight": 4,
    "leader_score": 336,
    "leader_size": 90629,
    "region_count": 5162,
    "region_weight": 4,
    "region_score": 86080.25,
    "region_size": 344321,
    "start_ts": "2021-06-12T07:12:19Z",
    "last_heartbeat_ts": "2021-07-06T12:13:00.283732716Z",
    "uptime": "581h0m41.283732716s"
  }
}

[root@emarsys105016 logs]# date && pd-ctl -u http://pd:2479 store 459653
Tue Jul 6 20:15:02 CST 2021
{
  "store": {
    "id": 459653,
    "address": "123.59.18.141:20160",
    "labels": [
      {
        "key": "rack",
        "value": "5F-A8-01"
      }
    ],
    "version": "4.0.13",
    "status_address": "123.59.18.141:20180",
    "git_hash": "a448d617f79ddf545be73931525bb41af0f790f3",
    "start_timestamp": 1623481939,
    "deploy_path": "/",
    "last_heartbeat": 1625573700334779338,
    "state_name": "Up"
  },
  "status": {
    "capacity": "892.6GiB",
    "available": "625GiB",
    "used_size": "51.61GiB",
    "leader_count": 1345,
    "leader_weight": 4,
    "leader_score": 336.25,
    "leader_size": 90625,
    "region_count": 5162,
    "region_weight": 4,
    "region_score": 86004.5,
    "region_size": 344018,
    "start_ts": "2021-06-12T07:12:19Z",
    "last_heartbeat_ts": "2021-07-06T12:15:00.334779338Z",
    "uptime": "581h2m41.334779338s"
  }
}

After waiting 10 minutes the leader count had not changed, so we moved on to the other TiKV nodes.

Logs from the abnormal TiKV node:
tikvlog.gz (2.1 MB)

The procedure is: evict the leader via PD, wait for the leader count to drop to 0, then upgrade that node.
This is a manual upgrade; we are not using tiup.
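A rough sketch of how that wait can be scripted against pd-ctl (assuming jq is available; the 30-second interval is arbitrary, and 459653 is the store ID from the example above):

# poll the store until its leader_count reaches 0, then proceed with upgrading that node
while true; do
  leaders=$(pd-ctl -u http://pd:2479 store 459653 | jq '.status.leader_count')
  echo "$(date) leader_count=${leaders}"
  [ "${leaders}" -eq 0 ] && break
  sleep 30
done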
