Why does enabling TiFlash show a large number of miss-peer regions?

Running the tiup check command reports a whole pile of miss-peer regions, which is really confusing:

tiup cluster check <cluster-name> --cluster

This is expected. After a TiFlash replica is set for a table, it takes time to replicate the data. The TiFlash replica joins each Raft group as a learner, so while replication is in progress those regions are reported as miss-peer. As replication gradually completes, this count shrinks and you will see learner-peer-region-count grow.
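For illustration, a minimal sketch of the step that triggers this behaviour (the table name, host, and user below are placeholders, not taken from this cluster):

# Request one TiFlash replica for a table. PD then adds a learner peer to every
# region of that table, and those regions count as miss-peer in tiup check
# until the learner catches up.
mysql -h <tidb-host> -P 4000 -u root -e "ALTER TABLE test.orders SET TIFLASH REPLICA 1;"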

But after TiFlash replication has fully completed, the miss-peer count stops decreasing and just stays flat, so tiup cluster check still reports a large number of miss-peer regions.

Please post the output of tiup cluster check, and of tiup cluster display as well.

I. display output

tiup cluster display bi_prod
Starting component `cluster`: /root/.tiup/components/cluster/v1.7.0/tiup-cluster display bi_prod
Cluster type:       tidb
Cluster name:       bi_prod
Cluster version:    v5.2.2
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://10.5.17.99:2379/dashboard
ID                Role          Host         Ports                            OS/Arch       Status  Data Dir                 Deploy Dir
--                ----          ----         -----                            -------       ------  --------                 ----------
10.5.17.100:9093  alertmanager  10.5.17.100  9093/9094                        linux/x86_64  Up      /data/alertmanager/data  /data/alertmanager
10.5.17.100:3000  grafana       10.5.17.100  3000                             linux/x86_64  Up      -                        /data/grafana
10.5.17.97:2379   pd            10.5.17.97   2379/2380                        linux/x86_64  Up      /data/pd/data            /data/pd
10.5.17.98:2379   pd            10.5.17.98   2379/2380                        linux/x86_64  Up|L    /data/pd/data            /data/pd
10.5.17.99:2379   pd            10.5.17.99   2379/2380                        linux/x86_64  Up|UI   /data/pd/data            /data/pd
10.5.17.100:9090  prometheus    10.5.17.100  9090                             linux/x86_64  Up      /data/prometheus/data    /data/prometheus
10.5.17.97:4000   tidb          10.5.17.97   4000/10080                       linux/x86_64  Up      -                        /data/tidb
10.5.17.98:4000   tidb          10.5.17.98   4000/10080                       linux/x86_64  Up      -                        /data/tidb
10.5.17.99:4000   tidb          10.5.17.99   4000/10080                       linux/x86_64  Up      -                        /data/tidb
10.5.17.100:9000  tiflash       10.5.17.100  9000/8123/3930/20170/20292/8234  linux/x86_64  Up      /data/tiflash/data       /data/tiflash
10.5.17.97:20160  tikv          10.5.17.97   20160/20180                      linux/x86_64  Up      /data/disk3/tikv/store   /data/disk3/tikv
10.5.17.97:20161  tikv          10.5.17.97   20161/20181                      linux/x86_64  Up      /data/disk4/tikv/store   /data/disk4/tikv
10.5.17.98:20160  tikv          10.5.17.98   20160/20180                      linux/x86_64  Up      /data/disk3/tikv/store   /data/disk3/tikv
10.5.17.98:20161  tikv          10.5.17.98   20161/20181                      linux/x86_64  Up      /data/disk4/tikv/store   /data/disk4/tikv
10.5.17.99:20160  tikv          10.5.17.99   20160/20180                      linux/x86_64  Up      /data/disk3/tikv/store   /data/disk3/tikv
10.5.17.99:20161  tikv          10.5.17.99   20161/20181                      linux/x86_64  Up      /data/disk4/tikv/store   /data/disk4/tikv
Total nodes: 16

II. tiup check output
The full output is too long, so here is just the summary:

Checking region status of the cluster bi_prod...
Regions are not fully healthy: 85 miss-peer
Please fix unhealthy regions before other operations.

III. PD Region Health monitoring panel

1. How many regions does the cluster have in total? 657 empty regions feels a bit high; consider merging them.
2. Has the cluster been scaled in or out recently?
3. Use pd-ctl to check exactly which regions are miss-peer (example below): https://docs.pingcap.com/zh/tidb/stable/pd-control#region-check-miss-peer--extra-peer--down-peer--pending-peer--offline-peer--empty-region--hist-size--hist-keys
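A possible way to run these checks with pd-ctl through tiup (the PD address is taken from the display output above; the ctl version should match the cluster version):

# List the regions PD currently flags as miss-peer
tiup ctl:v5.2.2 pd -u http://10.5.17.99:2379 region check miss-peer

# Count the empty regions mentioned in point 1
tiup ctl:v5.2.2 pd -u http://10.5.17.99:2379 region check empty-region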

OK, I'll consider running a merge.

1. The total peer count is 12777.

2. No scaling operations were done recently, only an upgrade from 5.0.1 to 5.2.2.

3. There are 85 miss-peer regions. Only a few are listed below; the others look much the same.

 {
      "id": 199243,
      "start_key": "7480000000000003FF7E5F728000000001FF81A7130000000000FA",
      "end_key": "7480000000000003FF7E5F728000000001FF87DB280000000000FA",
      "epoch": {
        "conf_ver": 141,
        "version": 460
      },
      "peers": [
        {
          "id": 199244,
          "store_id": 6,
          "role_name": "Voter"
        },
        {
          "id": 199245,
          "store_id": 8,
          "role_name": "Voter"
        },
        {
          "id": 199246,
          "store_id": 1,
          "role_name": "Voter"
        },
        {
          "id": 199247,
          "store_id": 3009,
          "role": 1,
          "role_name": "Learner",
          "is_learner": true
        }
      ],
      "leader": {
        "id": 199246,
        "store_id": 1,
        "role_name": "Voter"
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 99,
      "approximate_keys": 420643
    },
    {
      "id": 199716,
      "start_key": "7480000000000003FF7E5F728000000001FFFD766B0000000000FA",
      "end_key": "7480000000000003FF7E5F728000000002FF03A45F0000000000FA",
      "epoch": {
        "conf_ver": 141,
        "version": 480
      },
      "peers": [
        {
          "id": 199717,
          "store_id": 6,
          "role_name": "Voter"
        },
        {
          "id": 199718,
          "store_id": 8,
          "role_name": "Voter"
        },
        {
          "id": 199719,
          "store_id": 1,
          "role_name": "Voter"
        },
        {
          "id": 199720,
          "store_id": 3009,
          "role": 1,
          "role_name": "Learner",
          "is_learner": true
        }
      ],
      "leader": {
        "id": 199717,
        "store_id": 6,
        "role_name": "Voter"
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 97,
      "approximate_keys": 434508
    },
    {
      "id": 44054,
      "start_key": "7480000000000003FF7E5F728000000000FF0405860000000000FA",
      "end_key": "7480000000000003FF7E5F728000000000FF093E520000000000FA",
      "epoch": {
        "conf_ver": 150,
        "version": 399
      },
      "peers": [
        {
          "id": 44067,
          "store_id": 9,
          "role_name": "Voter"
        },
        {
          "id": 44066,
          "store_id": 10,
          "role_name": "Voter"
        },
        {
          "id": 197394,
          "store_id": 8,
          "role_name": "Voter"
        },
        {
          "id": 197889,
          "store_id": 3009,
          "role": 1,
          "role_name": "Learner",
          "is_learner": true
        }
      ],
      "leader": {
        "id": 44066,
        "store_id": 10,
        "role_name": "Voter"
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 75,
      "approximate_keys": 312221
    },
    {
      "id": 198388,
      "start_key": "7480000000000003FF7E5F728000000000FF774F830000000000FA",
      "end_key": "7480000000000003FF7E5F728000000000FF7D7AEE0000000000FA",
      "epoch": {
        "conf_ver": 141,
        "version": 417
      },
      "peers": [
        {
          "id": 198389,
          "store_id": 6,
          "role_name": "Voter"
        },
        {
          "id": 198390,
          "store_id": 8,
          "role_name": "Voter"
        },
        {
          "id": 198391,
          "store_id": 1,
          "role_name": "Voter"
        },
        {
          "id": 198392,
          "store_id": 3009,
          "role": 1,
          "role_name": "Learner",
          "is_learner": true
        }
      ],
      "leader": {
        "id": 198390,
        "store_id": 8,
        "role_name": "Voter"
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 96,
      "approximate_keys": 403065
    },
    {
      "id": 199556,
      "start_key": "7480000000000003FF7E5F728000000001FFCBEDE10000000000FA",
      "end_key": "7480000000000003FF7E5F728000000001FFD2210B0000000000FA",
      "epoch": {
        "conf_ver": 141,
        "version": 472
      },
      "peers": [
        {
          "id": 199557,
          "store_id": 6,
          "role_name": "Voter"
        },
        {
          "id": 199558,
          "store_id": 8,
          "role_name": "Voter"
        },
        {
          "id": 199559,
          "store_id": 1,
          "role_name": "Voter"
        },
        {
          "id": 199560,
          "store_id": 3009,
          "role": 1,
          "role_name": "Learner",
          "is_learner": true
        }
      ],
      "leader": {
        "id": 199558,
        "store_id": 8,
        "role_name": "Voter"
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 92,
      "approximate_keys": 386275
    }

1. Did the miss-peer regions already exist before the upgrade, or did they only appear afterwards?
2. Take a look at this case study for troubleshooting: https://github.com/pingcap/tidb-map/blob/master/maps/diagnose-case-study/case801.md

1. The miss-peer regions seem to have appeared after the upgrade; to be more precise, tiup check started reporting miss-peer regions after TiFlash was enabled.

2. Disk space is sufficient, so it does not look like that issue.

I tried disabling TiFlash replication and the miss-peer count did drop:

 tiup cluster check bi_prod --cluster

Checking region status of the cluster bi_prod...
Regions are not fully healthy: 30 miss-peer
Please fix unhealthy regions before other operations

Could it be that a large table is still building its TiFlash replica, slowing down replica creation for other regions? See whether you can reproduce it.

No. The PROGRESS column in information_schema.tiflash_replica is 1 for every table.
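Something like the following query shows that per-table progress (the tidb address is taken from the display output above; the user is an assumption):

mysql -h 10.5.17.97 -P 4000 -u root -e "SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS FROM information_schema.tiflash_replica;"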

It is indeed a bit strange: after re-enabling TiFlash replication, the miss-peer regions have not come back, at least not so far.

:joy: How odd... Would it be convenient to upload the TiFlash logs?

The logs are a bit awkward to upload :joy:

Let me take back what I said earlier. I checked the monitoring, and the miss-peer regions actually existed before the upgrade already; two more were added after the upgrade.

Once miss-peer drops to 0, look at region id=199243 again and compare it with the earlier dump to see what has changed.
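For example, non-interactively through tiup (PD address from the display output; the ctl version should match the cluster version):

tiup ctl:v5.2.2 pd -u http://10.5.17.99:2379 region 199243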

The leader id changed (and, comparing with the earlier dump, so did the learner peer id, from 199247 to 1246993):

» region 199243
{
  "id": 199243,
  "start_key": "7480000000000003FF7E5F728000000001FF81A7130000000000FA",
  "end_key": "7480000000000003FF7E5F728000000001FF87DB280000000000FA",
  "epoch": {
    "conf_ver": 143,
    "version": 460
  },
  "peers": [
    {
      "id": 199244,
      "store_id": 6,
      "role_name": "Voter"
    },
    {
      "id": 199245,
      "store_id": 8,
      "role_name": "Voter"
    },
    {
      "id": 199246,
      "store_id": 1,
      "role_name": "Voter"
    },
    {
      "id": 1246993,
      "store_id": 3009,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    }
  ],
  "leader": {
    "id": 199244,
    "store_id": 6,
    "role_name": "Voter"
  },
  "written_bytes": 0,
  "read_bytes": 0,
  "written_keys": 0,
  "read_keys": 0,
  "approximate_size": 99,
  "approximate_keys": 420655
}

Has the problem been resolved?

The problem is resolved, but the cause is still unclear :joy:

Please post your placement rules.
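They can be dumped with pd-ctl, for example (PD address from the display output; ctl version matching the cluster):

tiup ctl:v5.2.2 pd -u http://10.5.17.99:2379 config placement-rules show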

One more question: how many TiFlash replicas were configured when the problem occurred?

Judging from the learner count and miss-peer count at the time, I suspect the configured TiFlash replica count exceeded the number of TiFlash nodes.
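If that suspicion is right, the tables to look at would be the ones whose replica count exceeds the single TiFlash node shown in the display output. A sketch of that check (the user is an assumption):

mysql -h 10.5.17.97 -P 4000 -u root -e "SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT FROM information_schema.tiflash_replica WHERE REPLICA_COUNT > 1;"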