Why does enabling TiFlash show a large number of miss-peer regions?
Running the tiup check command reports a whole pile of miss-peer regions, which is really confusing.
tiup cluster check <cluster-name> --cluster
This is expected behavior. After you configure TiFlash replicas for a table, it takes time to sync them. A TiFlash replica joins the Raft group as a learner, so while the sync is in progress those regions are reported as miss-peer. As the sync gradually completes, the miss-peer count shrinks and you will see learner-peer-region-count grow.
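For context, a minimal sketch of how TiFlash replication is typically turned on for a table; the table name test.t1 and the login details are placeholders, not values from this cluster:
# Ask TiDB to maintain one TiFlash (learner) replica of test.t1; PD then schedules
# learner peers onto the TiFlash store, and until they catch up the affected regions
# are counted as miss-peer by the health check.
mysql -h 10.5.17.97 -P 4000 -u root -p -e "ALTER TABLE test.t1 SET TIFLASH REPLICA 1;"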
After TiFlash has finished syncing completely, the miss-peer count stops decreasing and stays fixed, so tiup cluster check still reports a large number of miss-peer regions.
Could you please post the output of tiup cluster check and of tiup cluster display?
1. display output
tiup cluster display bi_prod
Starting component `cluster`: /root/.tiup/components/cluster/v1.7.0/tiup-cluster display bi_prod
Cluster type: tidb
Cluster name: bi_prod
Cluster version: v5.2.2
Deploy user: tidb
SSH type: builtin
Dashboard URL: http://10.5.17.99:2379/dashboard
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir
-- ---- ---- ----- ------- ------ -------- ----------
10.5.17.100:9093 alertmanager 10.5.17.100 9093/9094 linux/x86_64 Up /data/alertmanager/data /data/alertmanager
10.5.17.100:3000 grafana 10.5.17.100 3000 linux/x86_64 Up - /data/grafana
10.5.17.97:2379 pd 10.5.17.97 2379/2380 linux/x86_64 Up /data/pd/data /data/pd
10.5.17.98:2379 pd 10.5.17.98 2379/2380 linux/x86_64 Up|L /data/pd/data /data/pd
10.5.17.99:2379 pd 10.5.17.99 2379/2380 linux/x86_64 Up|UI /data/pd/data /data/pd
10.5.17.100:9090 prometheus 10.5.17.100 9090 linux/x86_64 Up /data/prometheus/data /data/prometheus
10.5.17.97:4000 tidb 10.5.17.97 4000/10080 linux/x86_64 Up - /data/tidb
10.5.17.98:4000 tidb 10.5.17.98 4000/10080 linux/x86_64 Up - /data/tidb
10.5.17.99:4000 tidb 10.5.17.99 4000/10080 linux/x86_64 Up - /data/tidb
10.5.17.100:9000 tiflash 10.5.17.100 9000/8123/3930/20170/20292/8234 linux/x86_64 Up /data/tiflash/data /data/tiflash
10.5.17.97:20160 tikv 10.5.17.97 20160/20180 linux/x86_64 Up /data/disk3/tikv/store /data/disk3/tikv
10.5.17.97:20161 tikv 10.5.17.97 20161/20181 linux/x86_64 Up /data/disk4/tikv/store /data/disk4/tikv
10.5.17.98:20160 tikv 10.5.17.98 20160/20180 linux/x86_64 Up /data/disk3/tikv/store /data/disk3/tikv
10.5.17.98:20161 tikv 10.5.17.98 20161/20181 linux/x86_64 Up /data/disk4/tikv/store /data/disk4/tikv
10.5.17.99:20160 tikv 10.5.17.99 20160/20180 linux/x86_64 Up /data/disk3/tikv/store /data/disk3/tikv
10.5.17.99:20161 tikv 10.5.17.99 20161/20181 linux/x86_64 Up /data/disk4/tikv/store /data/disk4/tikv
Total nodes: 16
2. tiup check output
The output is too long, so only the summary is posted:
Checking region status of the cluster bi_prod...
Regions are not fully healthy: 85 miss-peer
Please fix unhealthy regions before other operations.
3. PD Region Health monitoring
1. How many regions does the cluster have in total? 657 empty regions feels like a lot; consider merging them.
2. Have you done any scale-out or scale-in recently?
3. Use pd-ctl to check exactly which regions are miss-peer (see the sketch after this list): https://docs.pingcap.com/zh/tidb/stable/pd-control#region-check-miss-peer--extra-peer--down-peer--pending-peer--offline-peer--empty-region--hist-size--hist-keys
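A rough sketch of the pd-ctl checks mentioned above, assuming pd-ctl is run through tiup ctl against the PD endpoint shown in the display output; adjust the version and address for your deployment:
# List the regions currently flagged as miss-peer
tiup ctl:v5.2.2 pd -u http://10.5.17.99:2379 region check miss-peer
# List the empty regions (the 657 mentioned above) that are candidates for merging
tiup ctl:v5.2.2 pd -u http://10.5.17.99:2379 region check empty-region
# Optionally allow empty regions from different tables to be merged
tiup ctl:v5.2.2 pd -u http://10.5.17.99:2379 config set enable-cross-table-merge true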
OK, I will look into doing the merge.
1. The total peer count is 12777.
2. No scale-out or scale-in has been done recently, only an upgrade from 5.0.1 to 5.2.2.
3. There are 85 miss-peer regions; only a few of them are listed below, and the rest look similar.
{
"id": 199243,
"start_key": "7480000000000003FF7E5F728000000001FF81A7130000000000FA",
"end_key": "7480000000000003FF7E5F728000000001FF87DB280000000000FA",
"epoch": {
"conf_ver": 141,
"version": 460
},
"peers": [
{
"id": 199244,
"store_id": 6,
"role_name": "Voter"
},
{
"id": 199245,
"store_id": 8,
"role_name": "Voter"
},
{
"id": 199246,
"store_id": 1,
"role_name": "Voter"
},
{
"id": 199247,
"store_id": 3009,
"role": 1,
"role_name": "Learner",
"is_learner": true
}
],
"leader": {
"id": 199246,
"store_id": 1,
"role_name": "Voter"
},
"written_bytes": 0,
"read_bytes": 0,
"written_keys": 0,
"read_keys": 0,
"approximate_size": 99,
"approximate_keys": 420643
},
{
"id": 199716,
"start_key": "7480000000000003FF7E5F728000000001FFFD766B0000000000FA",
"end_key": "7480000000000003FF7E5F728000000002FF03A45F0000000000FA",
"epoch": {
"conf_ver": 141,
"version": 480
},
"peers": [
{
"id": 199717,
"store_id": 6,
"role_name": "Voter"
},
{
"id": 199718,
"store_id": 8,
"role_name": "Voter"
},
{
"id": 199719,
"store_id": 1,
"role_name": "Voter"
},
{
"id": 199720,
"store_id": 3009,
"role": 1,
"role_name": "Learner",
"is_learner": true
}
],
"leader": {
"id": 199717,
"store_id": 6,
"role_name": "Voter"
},
"written_bytes": 0,
"read_bytes": 0,
"written_keys": 0,
"read_keys": 0,
"approximate_size": 97,
"approximate_keys": 434508
},
{
"id": 44054,
"start_key": "7480000000000003FF7E5F728000000000FF0405860000000000FA",
"end_key": "7480000000000003FF7E5F728000000000FF093E520000000000FA",
"epoch": {
"conf_ver": 150,
"version": 399
},
"peers": [
{
"id": 44067,
"store_id": 9,
"role_name": "Voter"
},
{
"id": 44066,
"store_id": 10,
"role_name": "Voter"
},
{
"id": 197394,
"store_id": 8,
"role_name": "Voter"
},
{
"id": 197889,
"store_id": 3009,
"role": 1,
"role_name": "Learner",
"is_learner": true
}
],
"leader": {
"id": 44066,
"store_id": 10,
"role_name": "Voter"
},
"written_bytes": 0,
"read_bytes": 0,
"written_keys": 0,
"read_keys": 0,
"approximate_size": 75,
"approximate_keys": 312221
},
{
"id": 198388,
"start_key": "7480000000000003FF7E5F728000000000FF774F830000000000FA",
"end_key": "7480000000000003FF7E5F728000000000FF7D7AEE0000000000FA",
"epoch": {
"conf_ver": 141,
"version": 417
},
"peers": [
{
"id": 198389,
"store_id": 6,
"role_name": "Voter"
},
{
"id": 198390,
"store_id": 8,
"role_name": "Voter"
},
{
"id": 198391,
"store_id": 1,
"role_name": "Voter"
},
{
"id": 198392,
"store_id": 3009,
"role": 1,
"role_name": "Learner",
"is_learner": true
}
],
"leader": {
"id": 198390,
"store_id": 8,
"role_name": "Voter"
},
"written_bytes": 0,
"read_bytes": 0,
"written_keys": 0,
"read_keys": 0,
"approximate_size": 96,
"approximate_keys": 403065
},
{
"id": 199556,
"start_key": "7480000000000003FF7E5F728000000001FFCBEDE10000000000FA",
"end_key": "7480000000000003FF7E5F728000000001FFD2210B0000000000FA",
"epoch": {
"conf_ver": 141,
"version": 472
},
"peers": [
{
"id": 199557,
"store_id": 6,
"role_name": "Voter"
},
{
"id": 199558,
"store_id": 8,
"role_name": "Voter"
},
{
"id": 199559,
"store_id": 1,
"role_name": "Voter"
},
{
"id": 199560,
"store_id": 3009,
"role": 1,
"role_name": "Learner",
"is_learner": true
}
],
"leader": {
"id": 199558,
"store_id": 8,
"role_name": "Voter"
},
"written_bytes": 0,
"read_bytes": 0,
"written_keys": 0,
"read_keys": 0,
"approximate_size": 92,
"approximate_keys": 386275
}
1. Did the miss-peer regions already exist before the upgrade, or did they only appear afterwards?
2. Take a look at this case study for troubleshooting: https://github.com/pingcap/tidb-map/blob/master/maps/diagnose-case-study/case801.md
1. The miss-peer regions should have appeared only after the upgrade; more precisely, the miss-peer regions show up in tiup check only after TiFlash was enabled.
2. Disk space is sufficient, so it does not seem to be that problem.
I tried disabling TiFlash sync and found that the miss-peer count decreases.
tiup cluster check bi_prod --cluster
Checking region status of the cluster bi_prod...
Regions are not fully healthy: 30 miss-peer
Please fix unhealthy regions before other operations
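(For reference, sync for a table is turned off by setting its TiFlash replica count back to 0, roughly as below; the table name and login details are placeholders:)
mysql -h 10.5.17.97 -P 4000 -u root -p -e "ALTER TABLE test.t1 SET TIFLASH REPLICA 0;"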
Could it be that a large table is still creating its TiFlash replica, which slows down replica creation for other regions? See whether you can reproduce it.
No, the PROGRESS field in information_schema.tiflash_replica is already 1 for every table.
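For reference, a query along these lines shows whether any table is still syncing; the connection details are placeholders:
# Any row returned here is a table whose TiFlash replica has not finished replicating
mysql -h 10.5.17.97 -P 4000 -u root -p -e "SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS FROM information_schema.tiflash_replica WHERE PROGRESS < 1;"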
It is indeed a bit strange: after re-enabling TiFlash sync, the miss-peer regions no longer appear, at least not so far.
That is odd... Would it be convenient to upload the TiFlash logs?
Once miss-peer drops to 0, check region id=199243 again and compare it with the earlier output to see what has changed.
The leader id changed (the learner peer id and conf_ver changed as well):
» region 199243
{
"id": 199243,
"start_key": "7480000000000003FF7E5F728000000001FF81A7130000000000FA",
"end_key": "7480000000000003FF7E5F728000000001FF87DB280000000000FA",
"epoch": {
"conf_ver": 143,
"version": 460
},
"peers": [
{
"id": 199244,
"store_id": 6,
"role_name": "Voter"
},
{
"id": 199245,
"store_id": 8,
"role_name": "Voter"
},
{
"id": 199246,
"store_id": 1,
"role_name": "Voter"
},
{
"id": 1246993,
"store_id": 3009,
"role": 1,
"role_name": "Learner",
"is_learner": true
}
],
"leader": {
"id": 199244,
"store_id": 6,
"role_name": "Voter"
},
"written_bytes": 0,
"read_bytes": 0,
"written_keys": 0,
"read_keys": 0,
"approximate_size": 99,
"approximate_keys": 420655
}
Has the problem been resolved?
The problem is resolved, but the root cause is still unclear.
Please post the placement rules.
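(If it helps, the rules can be dumped with pd-ctl, for example as below; version and endpoint assumed from the display output:)
tiup ctl:v5.2.2 pd -u http://10.5.17.99:2379 config placement-rules show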
May I ask how many TiFlash replicas were set when the problem occurred?
Judging from the learner count and miss-peer count at the time, I suspect the configured TiFlash replica count exceeded the number of TiFlash nodes.
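A quick way to cross-check that suspicion, as a sketch under the same assumptions as above (login details are placeholders, and the grep pattern depends on the exact JSON formatting of the store output):
# Highest TiFlash replica count configured on any table
mysql -h 10.5.17.97 -P 4000 -u root -p -e "SELECT MAX(REPLICA_COUNT) FROM information_schema.tiflash_replica;"
# Number of TiFlash stores known to PD (TiFlash stores carry the label engine=tiflash)
tiup ctl:v5.2.2 pd -u http://10.5.17.99:2379 store | grep -c '"value": "tiflash"'
# If the first number is larger than the second, PD can never place enough learners,
# and those regions stay miss-peer.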