Question: how to troubleshoot frequent TiKV leader switching?

TiDB version:

MySQL [test]> select tidb_version()\G
*************************** 1. row ***************************
tidb_version(): Release Version: v4.0.4
Edition: Community
Git Commit Hash: c61fc7247e9f6bc773761946d5b5294d3f2699a5
Git Branch: heads/refs/tags/v4.0.4
UTC Build Time: 2020-07-31 07:50:19
GoVersion: go1.13
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
1 row in set (0.02 sec)

MySQL [test]>
Cluster configuration:


Problem description: the leader count on one of the TiKV nodes is unstable and switches frequently.

Please share the Region and Leader monitoring panels,
and post the output of pd-ctl `store` and `scheduler show`.

» store
{
  "count": 3,
  "stores": [
    {
      "store": {
        "id": 1,
        "address": "10.205.115.176:20160",
        "version": "4.0.4",
        "status_address": "10.205.115.176:20180",
        "git_hash": "28e3d44b00700137de4fa933066ab83e5f8306cf",
        "start_timestamp": 1596434810,
        "deploy_path": "/data/deploy/tikv-20160/bin",
        "last_heartbeat": 1597374325297816963,
        "state_name": "Up"
      },
      "status": {
        "capacity": "590.5GiB",
        "available": "236GiB",
        "used_size": "302.1GiB",
        "leader_count": 8945,
        "leader_weight": 1,
        "leader_score": 8945,
        "leader_size": 702553,
        "region_count": 17881,
        "region_weight": 1,
        "region_score": 1400735,
        "region_size": 1400735,
        "start_ts": "2020-08-03T14:06:50+08:00",
        "last_heartbeat_ts": "2020-08-14T11:05:25.297816963+08:00",
        "uptime": "260h58m35.297816963s"
      }
    },
    {
      "store": {
        "id": 4,
        "address": "10.205.115.177:20160",
        "version": "4.0.4",
        "status_address": "10.205.115.177:20180",
        "git_hash": "28e3d44b00700137de4fa933066ab83e5f8306cf",
        "start_timestamp": 1596765346,
        "deploy_path": "/data/deploy/tikv-20160/bin",
        "last_heartbeat": 1597374299396790129,
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "590.5GiB",
        "available": "250.3GiB",
        "used_size": "302.9GiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 17881,
        "region_weight": 1,
        "region_score": 1399857,
        "region_size": 1399857,
        "start_ts": "2020-08-07T09:55:46+08:00",
        "last_heartbeat_ts": "2020-08-14T11:04:59.396790129+08:00",
        "uptime": "169h9m13.396790129s"
      }
    },
    {
      "store": {
        "id": 5,
        "address": "10.205.115.178:20160",
        "version": "4.0.4",
        "status_address": "10.205.115.178:20180",
        "git_hash": "28e3d44b00700137de4fa933066ab83e5f8306cf",
        "start_timestamp": 1596434999,
        "deploy_path": "/data/deploy/tikv-20160/bin",
        "last_heartbeat": 1597374323235548464,
        "state_name": "Up"
      },
      "status": {
        "capacity": "590.5GiB",
        "available": "247.7GiB",
        "used_size": "302.1GiB",
        "leader_count": 8936,
        "leader_weight": 1,
        "leader_score": 8936,
        "leader_size": 697304,
        "region_count": 17881,
        "region_weight": 1,
        "region_score": 1400735,
        "region_size": 1400735,
        "start_ts": "2020-08-03T14:09:59+08:00",
        "last_heartbeat_ts": "2020-08-14T11:05:23.235548464+08:00",
        "uptime": "260h55m24.235548464s"
      }
    }
  ]
}
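With only three stores the imbalance above is easy to spot by eye (store 4 is Disconnected and holds 0 leaders while the other two hold ~8,900 each), but on a larger cluster a small script helps. A minimal sketch in Python, assuming the JSON from `pd-ctl store` has been saved as a string; the field names are taken directly from the output above:

```python
import json

def summarize_stores(store_json: str):
    """Summarize state and leader distribution from `pd-ctl store` output."""
    data = json.loads(store_json)
    rows = []
    for s in data["stores"]:
        meta, status = s["store"], s["status"]
        rows.append({
            "id": meta["id"],
            "address": meta["address"],
            "state": meta["state_name"],
            "leaders": status["leader_count"],
            "regions": status["region_count"],
        })
    # Flag stores that are not Up, or that hold no leaders despite
    # carrying region replicas (a sign leaders were evicted/transferred).
    suspects = [r for r in rows
                if r["state"] != "Up" or (r["leaders"] == 0 and r["regions"] > 0)]
    return rows, suspects
```

Run against the first `store` dump above, this would flag store 4 (Disconnected, leader_count 0, region_count 17881).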

» scheduler show 
[
  "balance-hot-region-scheduler",
  "balance-leader-scheduler",
  "balance-region-scheduler",
  "label-scheduler"
]

»  

The cluster was originally deployed as v4.0.0-rc.1 and was upgraded using TiUP.

The state_name of store 4 is Disconnected; that server appears to have been offline for more than 30 minutes now. Please upload the TiKV log from store 4.

TiKV + PD logs:
Link: https://pan.baidu.com/s/1yRb9O_6PN1HCy1pZH15-nQ extraction code: 8d3z
Sometimes the store output looks normal; here is output captured just now:

» store
{
  "count": 3,
  "stores": [
    {
      "store": {
        "id": 5,
        "address": "10.205.115.178:20160",
        "version": "4.0.4",
        "status_address": "10.205.115.178:20180",
        "git_hash": "28e3d44b00700137de4fa933066ab83e5f8306cf",
        "start_timestamp": 1596434999,
        "deploy_path": "/data/deploy/tikv-20160/bin",
        "last_heartbeat": 1597378272100867999,
        "state_name": "Up"
      },
      "status": {
        "capacity": "590.5GiB",
        "available": "250.2GiB",
        "used_size": "302.3GiB",
        "leader_count": 5970,
        "leader_weight": 1,
        "leader_score": 5970,
        "leader_size": 467130,
        "region_count": 17901,
        "region_weight": 1,
        "region_score": 1404812,
        "region_size": 1404812,
        "start_ts": "2020-08-03T14:09:59+08:00",
        "last_heartbeat_ts": "2020-08-14T12:11:12.100867999+08:00",
        "uptime": "262h1m13.100867999s"
      }
    },
    {
      "store": {
        "id": 1,
        "address": "10.205.115.176:20160",
        "version": "4.0.4",
        "status_address": "10.205.115.176:20180",
        "git_hash": "28e3d44b00700137de4fa933066ab83e5f8306cf",
        "start_timestamp": 1596434810,
        "deploy_path": "/data/deploy/tikv-20160/bin",
        "last_heartbeat": 1597378275658833581,
        "state_name": "Up"
      },
      "status": {
        "capacity": "590.5GiB",
        "available": "235.7GiB",
        "used_size": "302.5GiB",
        "leader_count": 5968,
        "leader_weight": 1,
        "leader_score": 5968,
        "leader_size": 466997,
        "region_count": 17901,
        "region_weight": 1,
        "region_score": 1404812,
        "region_size": 1404812,
        "start_ts": "2020-08-03T14:06:50+08:00",
        "last_heartbeat_ts": "2020-08-14T12:11:15.658833581+08:00",
        "uptime": "262h4m25.658833581s"
      }
    },
    {
      "store": {
        "id": 4,
        "address": "10.205.115.177:20160",
        "version": "4.0.4",
        "status_address": "10.205.115.177:20180",
        "git_hash": "28e3d44b00700137de4fa933066ab83e5f8306cf",
        "start_timestamp": 1596765346,
        "deploy_path": "/data/deploy/tikv-20160/bin",
        "last_heartbeat": 1597378281193497378,
        "state_name": "Up"
      },
      "status": {
        "capacity": "590.5GiB",
        "available": "252.3GiB",
        "used_size": "303.2GiB",
        "leader_count": 5963,
        "leader_weight": 1,
        "leader_score": 5963,
        "leader_size": 470685,
        "region_count": 17901,
        "region_weight": 1,
        "region_score": 1404812,
        "region_size": 1404812,
        "start_ts": "2020-08-07T09:55:46+08:00",
        "last_heartbeat_ts": "2020-08-14T12:11:21.193497378+08:00",
        "uptime": "170h15m35.193497378s"
      }
    }
  ]
}

»  

Looking at the monitoring screenshot above, CPU on 115.177 is very high, and only on that node, accompanied by intermittent Disconnected states. Check whether there was a write hotspot at the time: a hotspot on TiKV can keep the heartbeat between PD and TiKV from proceeding normally.
Please briefly describe the workload at the time. Was there concentrated writing to certain tables?
Use the following steps to upload the TiKV trouble-shooting screenshots:



(1) Install this Chrome extension: https://chrome.google.com/webstore/detail/full-page-screen-capture/fdpohaocaechififmbbbbbknoalclacl

(2) With the mouse focus on the Grafana Dashboard, press ? to show all shortcuts; press d then E to expand the Panels of all Rows, then wait a while for the page to finish loading.

(3) Use the full-page-screen-capture extension to capture and save the screenshot.


Workload description:
We are currently writing test data into 4 tables, with only one session writing. Before, even when no session was writing, the leader count on the 115.177 machine sometimes still switched frequently.
Write traffic shown in the Dashboard:

Please keep a record of the operations below and post the complete results.
Confirm the hot write Regions via SQL:

SQL> select * from information_schema.TIDB_HOT_REGIONS where type = 'write'\G

Note down the region_id values, then map each Region ID to a table or index:

$ curl http://{TiDBIP}:10080/regions/{RegionId}
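The curl call above returns JSON describing which table or index the Region covers. A minimal sketch for pulling the table names out of that response; note the `frames` field and its keys are my assumption about the v4.0 status-API response shape, so verify against your deployment before relying on it:

```python
import json
import urllib.request

def region_tables(region_json: str):
    """Extract table/index names from a /regions/{id} status-API response.

    ASSUMPTION: the response carries a "frames" list whose entries have
    db_name / table_name / is_record / index_name keys; check your version.
    """
    detail = json.loads(region_json)
    out = []
    for f in detail.get("frames", []):
        name = "{}.{}".format(f.get("db_name"), f.get("table_name"))
        if not f.get("is_record") and f.get("index_name"):
            name += " (index {})".format(f["index_name"])
        out.append(name)
    return out

def fetch_region(tidb_ip: str, region_id: int) -> str:
    # Same call as the curl command above; tidb_ip is a placeholder.
    url = "http://{}:10080/regions/{}".format(tidb_ip, region_id)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode()
```

Feeding each hot region_id from TIDB_HOT_REGIONS through this tells you which tables the write hotspot concentrates on.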

One more finding: long-running pings between 10.205.115.176 and 10.205.115.178 stay below 1 ms in both directions, but pings from either node to 10.205.115.177 sometimes exceed 1 ms. Could this be a factor?
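To quantify the latency difference rather than eyeballing ping output, the rtt summary line can be parsed into numbers and compared across targets. A small sketch, assuming the Linux `ping` summary format `rtt min/avg/max/mdev = ... ms`:

```python
import re

def parse_rtt(ping_output: str):
    """Parse the rtt summary line of Linux `ping` into floats (milliseconds).

    Returns a dict with min/avg/max/mdev, or None if no summary line found.
    """
    m = re.search(
        r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms",
        ping_output)
    if not m:
        return None
    keys = ("min", "avg", "max", "mdev")
    return dict(zip(keys, (float(g) for g in m.groups())))
```

Running `ping -c 100 10.205.115.177` from each of the other two nodes and comparing the parsed `max` and `mdev` values would show whether the latency spikes correlate with the Disconnected episodes.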

You will also need your network colleagues to take a look at the network environment, because this problem can have two causes: one is the network, the other is load. From the monitoring, though, the load issue is clearly the more prominent one, so start by tackling the hotspot problem first.