One TiKV node has 0 Region leaders

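To help us respond efficiently, please provide as much background as possible when asking; clearly described questions are answered first. Please try to include the following:

  • OS version & kernel version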

CentOS7.6 3.10.0-1062.4.1.el7.x86_64

  • TiDB version

v3.0.5

  • Disk model

SSD

  • Cluster topology

Servers: 3

TiKV nodes: 3

PD nodes: 1

TiDB nodes: 1

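  • Data volume & Region count & replica count

  • Problem description (what I did)

This is my test environment. One of the 3 TiKV nodes has zero Region leaders.

I tried to move leaders with the operator add transfer-leader command, but the transfer failed: node 192.168.113.20 cannot accept any leader transfer.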

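My test environment has a large table, trips, with 9 Regions in total. The leader of Region 2001 was on node 4 (192.168.113.21). I tried to transfer this Region's leader to node 1 (192.168.113.20), but it ended up on node 5 (192.168.113.22) instead. Why is that?

Attached are the logs from 192.168.113.20 (node 1) and 192.168.113.22 (node 5):

TiKV log on 192.168.113.20: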

[2019/10/31 17:47:08.165 +08:00] [INFO] [raft.rs:1662] ["[region 2001] 2004 [term 15] received MsgTimeoutNow from 2002 and starts an election to get leadership."]
[2019/10/31 17:47:08.165 +08:00] [INFO] [raft.rs:1094] ["[region 2001] 2004 is starting a new election at term 15"]
[2019/10/31 17:47:08.165 +08:00] [INFO] [raft.rs:743] ["[region 2001] 2004 became candidate at term 16"]
[2019/10/31 17:47:08.165 +08:00] [INFO] [raft.rs:858] ["[region 2001] 2004 received MsgRequestVoteResponse from 2004 at term 16"]
[2019/10/31 17:47:08.165 +08:00] [INFO] [raft.rs:832] ["[region 2001] 2004 [logterm: 15, index: 15] sent MsgRequestVote request to 2002 at term 16"]
[2019/10/31 17:47:08.165 +08:00] [INFO] [raft.rs:832] ["[region 2001] 2004 [logterm: 15, index: 15] sent MsgRequestVote request to 2003 at term 16"]
[2019/10/31 17:47:08.171 +08:00] [INFO] [raft.rs:858] ["[region 2001] 2004 received MsgRequestVoteResponse from 2003 at term 16"]
[2019/10/31 17:47:08.171 +08:00] [INFO] [raft.rs:1587] ["[region 2001] 2004 [quorum:2] has received 2 MsgRequestVoteResponse votes and 0 vote rejections"]
[2019/10/31 17:47:08.171 +08:00] [INFO] [raft.rs:793] ["[region 2001] 2004 became leader at term 16"]
[2019/10/31 17:47:10.271 +08:00] [INFO] [pd.rs:566] ["try to transfer leader"] [to_peer="id: 2003 store_id: 5"] [from_peer="id: 2004 store_id: 1"] [region_id=2001]
[2019/10/31 17:47:10.271 +08:00] [INFO] [peer.rs:1762] ["transfer leader"] [peer="id: 2003 store_id: 5"] [peer_id=2004] [region_id=2001]
[2019/10/31 17:47:10.271 +08:00] [INFO] [raft.rs:1294] ["[region 2001] 2004 [term 16] starts to transfer leadership to 2003"]
[2019/10/31 17:47:10.271 +08:00] [INFO] [raft.rs:1304] ["[region 2001] 2004 sends MsgTimeoutNow to 2003 immediately as 2003 already has up-to-date log"]
[2019/10/31 17:47:10.273 +08:00] [INFO] [raft.rs:924] ["[region 2001] 2004 [term: 16] received a MsgRequestVote message with higher term from 2003 [term: 17]"]
[2019/10/31 17:47:10.273 +08:00] [INFO] [raft.rs:723] ["[region 2001] 2004 became follower at term 17"]
[2019/10/31 17:47:10.273 +08:00] [INFO] [raft.rs:1108] ["[region 2001] 2004 [logterm: 16, index: 16, vote: 0] cast MsgRequestVote for 2003 [logterm: 16, index: 16] at term 17"]

TiKV log on 192.168.113.22:

[2019/10/31 17:47:08.173 +08:00] [INFO] [raft.rs:924] ["[region 2001] 2003 [term: 15] received a MsgRequestVote message with higher term from 2004 [term: 16]"]
[2019/10/31 17:47:08.173 +08:00] [INFO] [raft.rs:723] ["[region 2001] 2003 became follower at term 16"]
[2019/10/31 17:47:08.173 +08:00] [INFO] [raft.rs:1108] ["[region 2001] 2003 [logterm: 15, index: 15, vote: 0] cast MsgRequestVote for 2004 [logterm: 15, index: 15] at term 16"]
[2019/10/31 17:47:10.278 +08:00] [INFO] [raft.rs:1662] ["[region 2001] 2003 [term 16] received MsgTimeoutNow from 2004 and starts an election to get leadership."]
[2019/10/31 17:47:10.278 +08:00] [INFO] [raft.rs:1094] ["[region 2001] 2003 is starting a new election at term 16"]
[2019/10/31 17:47:10.278 +08:00] [INFO] [raft.rs:743] ["[region 2001] 2003 became candidate at term 17"]
[2019/10/31 17:47:10.278 +08:00] [INFO] [raft.rs:858] ["[region 2001] 2003 received MsgRequestVoteResponse from 2003 at term 17"]
[2019/10/31 17:47:10.278 +08:00] [INFO] [raft.rs:832] ["[region 2001] 2003 [logterm: 16, index: 16] sent MsgRequestVote request to 2004 at term 17"]
[2019/10/31 17:47:10.278 +08:00] [INFO] [raft.rs:832] ["[region 2001] 2003 [logterm: 16, index: 16] sent MsgRequestVote request to 2002 at term 17"]
[2019/10/31 17:47:10.280 +08:00] [INFO] [raft.rs:858] ["[region 2001] 2003 received MsgRequestVoteResponse from 2004 at term 17"]
[2019/10/31 17:47:10.280 +08:00] [INFO] [raft.rs:1587] ["[region 2001] 2003 [quorum:2] has received 2 MsgRequestVoteResponse votes and 0 vote rejections"]
[2019/10/31 17:47:10.280 +08:00] [INFO] [raft.rs:793] ["[region 2001] 2003 became leader at term 17"]
[2019/10/31 17:49:19.854 +08:00] [INFO] [gc_worker.rs:861] ["gc_worker: start auto gc"] [safe_point=412225269236236288]
[2019/10/31 17:49:19.861 +08:00] [INFO] [gc_worker.rs:901] ["gc_worker: finished auto gc"] [processed_regions=7]
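
To make the sequence in the logs easier to follow, the role changes and the transfer request can be pulled out with a small script. This is only a minimal sketch in Python: the regular expressions are assumptions based on the log lines posted here, not an official TiKV log parser, and the sample lines are copied from the 192.168.113.20 log above.

```python
import re

# Three lines copied verbatim from the 192.168.113.20 log above (region 2001).
LOG_LINES = [
    '[2019/10/31 17:47:08.171 +08:00] [INFO] [raft.rs:793] ["[region 2001] 2004 became leader at term 16"]',
    '[2019/10/31 17:47:10.271 +08:00] [INFO] [pd.rs:566] ["try to transfer leader"] [to_peer="id: 2003 store_id: 5"] [from_peer="id: 2004 store_id: 1"] [region_id=2001]',
    '[2019/10/31 17:47:10.273 +08:00] [INFO] [raft.rs:723] ["[region 2001] 2004 became follower at term 17"]',
]

# Patterns are assumptions derived from the lines above.
TIMESTAMP = re.compile(r'^\[(\S+ \S+) \S+\]')
ROLE_CHANGE = re.compile(
    r'\[region (\d+)\] (\d+) became (leader|follower|candidate) at term (\d+)'
)
PD_TRANSFER = re.compile(
    r'\[pd\.rs:\d+\] \["try to transfer leader"\] '
    r'\[to_peer="id: (\d+) store_id: (\d+)"\]'
)


def summarize(lines):
    """Return (timestamp, description) pairs for role changes and transfer requests."""
    events = []
    for line in lines:
        ts = TIMESTAMP.match(line)
        when = ts.group(1) if ts else "?"
        role = ROLE_CHANGE.search(line)
        transfer = PD_TRANSFER.search(line)
        if role:
            _region, peer, state, term = role.groups()
            events.append((when, f"peer {peer} became {state} at term {term}"))
        elif transfer:
            peer, store = transfer.groups()
            events.append(
                (when, f"transfer leader requested to peer {peer} on store {store}")
            )
    return events


for when, what in summarize(LOG_LINES):
    print(when, what)
```

Read this way, the log shows that peer 2004 on store 1 did become leader at term 16, and that about two seconds later a transfer request logged from pd.rs moved the leader to peer 2003 on store 5.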

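  • Keywords

TiKV, Region

What I would like to know is: why can't that TiKV node become the leader of any Region?

Please use pd-ctl to check the output of scheduler show and of member. For pd-ctl usage, refer to the pd-ctl documentation.

  • Output of scheduler show: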
[
  "balance-region-scheduler",
  "balance-leader-scheduler",
  "balance-hot-region-scheduler",
  "label-scheduler",
  "evict-leader-scheduler-1"
]
  • Output of member:
{
  "header": {
    "cluster_id": 6753054008701595112
  },
  "members": [
    {
      "name": "pd_tidb-pd-tikv01",
      "member_id": 3528673005875425082,
      "peer_urls": [
        "http://192.168.113.20:2380"
      ],
      "client_urls": [
        "http://192.168.113.20:2379"
      ]
    }
  ],
  "leader": {
    "name": "pd_tidb-pd-tikv01",
    "member_id": 3528673005875425082,
    "peer_urls": [
      "http://192.168.113.20:2380"
    ],
    "client_urls": [
      "http://192.168.113.20:2379"
    ]
  },
  "etcd_leader": {
    "name": "pd_tidb-pd-tikv01",
    "member_id": 3528673005875425082,
    "peer_urls": [
      "http://192.168.113.20:2380"
    ],
    "client_urls": [
      "http://192.168.113.20:2379"
    ]
  }
}

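Thanks for the hint, I have found the cause. An error occurred while running the rolling update for TiKV, so that scheduler (evict-leader-scheduler-1 in the output above) was not removed properly. After deleting this scheduler, everything returned to normal.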
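
For anyone hitting the same symptom, the check can be sketched in a few lines of Python. This is a hedged illustration, not an official tool: it assumes the scheduler show output is available as JSON text, and it assumes that the numeric suffix of an evict-leader-scheduler-N entry is the ID of the store being evicted (consistent with store 1 being 192.168.113.20 in the logs above). The function name evicted_stores is hypothetical.

```python
import json

# `scheduler show` output as posted in this thread.
SCHEDULER_SHOW = """
[
  "balance-region-scheduler",
  "balance-leader-scheduler",
  "balance-hot-region-scheduler",
  "label-scheduler",
  "evict-leader-scheduler-1"
]
"""

PREFIX = "evict-leader-scheduler-"


def evicted_stores(scheduler_show_json):
    """Map store ID -> scheduler name for every evict-leader scheduler found.

    Assumption: the suffix after the prefix is the store ID.
    """
    stores = {}
    for name in json.loads(scheduler_show_json):
        if name.startswith(PREFIX):
            suffix = name[len(PREFIX):]
            if suffix.isdigit():
                stores[int(suffix)] = name
    return stores


for store_id, name in evicted_stores(SCHEDULER_SHOW).items():
    print(f"store {store_id} will not keep leaders while {name} exists")
    # pd-ctl command used to delete a scheduler by name:
    print(f"  to remove it in pd-ctl: scheduler remove {name}")
```

If a leftover evict-leader scheduler is listed for a store that is not being upgraded or taken offline, removing it lets the balance-leader scheduler move leaders back to that store.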