Leader jitter problem

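Last night we scaled in host 48 and, at the same time, scaled out 3 TiKV instances on one physical machine, as shown in the figure below: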

But this morning we still saw jitter.

During the jitter window, does disk write latency on the nodes whose leader count dropped rise at the same time?

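Looking at it that way, yes, it does rise, so it still looks like the VM disks are the problem. Yesterday and tonight we are bringing 3 physical machines online, 9 TiKV nodes in total. Next week we will scale in all the VM-based TiKV nodes, and we will check the effect once that is done.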


OK, sounds good.

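I'm a bit confused and quite worried.
The physical-machine nodes that went online only last week are also jittering this badly. In the same time window, the 6 remaining VM nodes had very stable leaders.
Physical-machine nodes: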


VM nodes:

Looking at the affected time points, it seems there were large batch writes.


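  1. Was there in fact a large volume of writes from the business side at that time?
  2. Does the cluster have both TiKV nodes deployed on physical machines and TiKV nodes deployed on VMs? What is the reason for deploying it this way?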

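1. Lately I have been focusing on catching the large batch writes, to see whether I can find which scheduled job causes them.
2. The whole cluster used to run on VMs, but VM disk IO has always been a problem. Last week we scaled out 9 nodes on physical machines (3 physical machines, 3 TiKV nodes each). Tonight we plan to take the remaining 6 VM-based TiKV nodes offline. (Before last week there were 12 VM-based TiKV nodes; 6 were taken offline last week, leaving 6.)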

Also, if TiDB and PD run on VMs while TiKV runs on physical machines, does this deployment carry any hidden risks?

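Mixing VMs and physical machines is fine in theory. It is only when the TiKV nodes themselves are split across physical machines and VMs that troubleshooting has one more layer to consider, which adds complexity.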

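Please confirm whether the large batch writes are happening, and also check the disk performance. According to monitoring, disk bandwidth peaks at 200 MB.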

OK, I'll keep observing and see whether I can find the problem. Thanks.

:handshake:

Can this still be rescued? It has been like this the whole time, and from my side I don't see any large batch writes.

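  1. Is this from today? Is this the only instance that jitters badly?
  2. Check the Dashboard slow log for the corresponding time window to see whether any large SQL writes or queries show up.
  3. Please send the log of this TiKV instance for 10:35–10:50. Thanks.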

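1. Yes, this is from today. This instance jitters the most, and other instances jitter too. The pattern is the same as before on the VMs: one instance jitters especially badly, and a few others jitter fairly badly.
2. As for slow SQL, the top 4 statements run once every hour on the hour, and this SQL has been live since January 28 this year.
This TiDB cluster has many scheduled jobs. Before March, IO was consistently high but leaders were very stable. The leader jitter only started this month.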


3. The logs are attached:
tikv.log.tar.gz (4.4 MB)

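  1. The logs show many regions re-electing leaders. Please run tiup cluster display and share the result. What does the topology look like?
  2. The slow log suggests there are many large SQL statements, so check whether the network is saturated. Please also upload the Overview monitoring.
  3. What is the hardware configuration, including the NICs?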
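
One rough way to check whether a NIC is saturated without a dashboard is to sample the kernel's per-interface byte counters twice and compare the resulting rate with the link speed. The sketch below assumes a Linux host that exposes /proc/net/dev. The helper names, the 5-second window, and the interface name in the usage note are illustrative and not taken from this cluster.

```python
import time


def parse_net_dev(text):
    """Return {iface: (rx_bytes, tx_bytes)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # field 0 is received bytes, field 8 is transmitted bytes
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def utilization(bytes_before, bytes_after, seconds, link_gbps):
    """Fraction of link capacity used over the sampling window."""
    bits_per_second = (bytes_after - bytes_before) * 8 / seconds
    return bits_per_second / (link_gbps * 1e9)


def sample(iface, link_gbps, seconds=5):
    """Sample one interface twice; return (rx, tx) utilization fractions."""
    with open("/proc/net/dev") as f:
        rx0, tx0 = parse_net_dev(f.read())[iface]
    time.sleep(seconds)
    with open("/proc/net/dev") as f:
        rx1, tx1 = parse_net_dev(f.read())[iface]
    return (utilization(rx0, rx1, seconds, link_gbps),
            utilization(tx0, tx1, seconds, link_gbps))
```

Calling `sample("bond1", 10)` on a TiKV host would return the receive and transmit utilization of a hypothetical 10 Gbit/s bond. Values close to 1.0 during the jitter window would point at the network rather than the disks.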

1. Topology: TiKV runs on physical machines; all other hosts are VMs.


2.
3. TiKV: 9 nodes on 3 physical machines. Each physical machine has 72 cores and 768 GB of memory. The disks are SSDs, 1.7 TB per node.
The NIC models are as follows:
lshw -c network
WARNING: you should run this program as super-user.
*-network:0
description: Ethernet interface
product: 82599ES 10-Gigabit SFI/SFP+ Network Connection
vendor: Intel Corporation
physical id: 0
bus info: pci@0000:19:00.0
logical name: ens3f0
version: 01
serial: 74:3a:20:2b:c3:d0
size: 10Gbit/s
capacity: 10Gbit/s
width: 64 bits
clock: 33MHz
capabilities: bus_master cap_list rom ethernet physical fibre 10000bt-fd
configuration: autonegotiation=off broadcast=yes driver=ixgbe driverversion=5.1.0-k-rh7.7 duplex=full firmware=0x800006db latency=0 link=yes multicast=yes port=fibre slave=yes speed=10Gbit/s
resources: irq:120 memory:a9800000-a9ffffff ioport:4020(size=32) memory:aa804000-aa807fff memory:aa400000-aa7fffff memory:381ffff00000-381fffffffff memory:381fffe00000-381fffefffff
*-network:1
description: Ethernet interface
product: 82599ES 10-Gigabit SFI/SFP+ Network Connection
vendor: Intel Corporation
physical id: 0.1
bus info: pci@0000:19:00.1
logical name: ens3f1
version: 01
serial: 74:3a:20:2b:c3:d2
capacity: 10Gbit/s
width: 64 bits
clock: 33MHz
capabilities: bus_master cap_list rom ethernet physical fibre 10000bt-fd
configuration: autonegotiation=off broadcast=yes driver=ixgbe driverversion=5.1.0-k-rh7.7 firmware=0x800006db latency=0 link=no multicast=yes port=fibre
resources: irq:215 memory:a9000000-a97fffff ioport:4000(size=32) memory:aa800000-aa803fff memory:aa000000-aa3fffff memory:381fffd00000-381fffdfffff memory:381fffc00000-381fffcfffff
*-network:0
description: Ethernet interface
product: 82599ES 10-Gigabit SFI/SFP+ Network Connection
vendor: Intel Corporation
physical id: 0
bus info: pci@0000:5e:00.0
logical name: ens1f0
version: 01
serial: 74:3a:20:2b:c3:d0
size: 10Gbit/s
capacity: 10Gbit/s
width: 64 bits
clock: 33MHz
capabilities: bus_master cap_list rom ethernet physical fibre 10000bt-fd
configuration: autonegotiation=off broadcast=yes driver=ixgbe driverversion=5.1.0-k-rh7.7 duplex=full firmware=0x800006db latency=0 link=yes multicast=yes port=fibre slave=yes speed=10Gbit/s
resources: irq:290 memory:c4800000-c4ffffff ioport:9020(size=32) memory:c5804000-c5807fff memory:c5400000-c57fffff memory:383ffff00000-383fffffffff memory:383fffe00000-383fffefffff
*-network:1
description: Ethernet interface
product: 82599ES 10-Gigabit SFI/SFP+ Network Connection
vendor: Intel Corporation
physical id: 0.1
bus info: pci@0000:5e:00.1
logical name: ens1f1
version: 01
serial: 74:3a:20:2b:8b:7e
capacity: 10Gbit/s
width: 64 bits
clock: 33MHz
capabilities: bus_master cap_list rom ethernet physical fibre 10000bt-fd
configuration: autonegotiation=off broadcast=yes driver=ixgbe driverversion=5.1.0-k-rh7.7 firmware=0x800006db latency=0 link=no multicast=yes port=fibre
resources: irq:355 memory:c4000000-c47fffff ioport:9000(size=32) memory:c5800000-c5803fff memory:c5000000-c53fffff memory:383fffd00000-383fffdfffff memory:383fffc00000-383fffcfffff
*-network:0
description: Ethernet interface
product: I350 Gigabit Network Connection
vendor: Intel Corporation
physical id: 0
bus info: pci@0000:86:00.0
logical name: ens4f0
version: 01
serial: 74:3a:20:2b:17:e8
capacity: 1Gbit/s
width: 32 bits
clock: 33MHz
capabilities: bus_master cap_list rom ethernet physical tp 10bt 10bt-fd 100bt 100bt-fd 1000bt-fd autonegotiation
configuration: autonegotiation=on broadcast=yes driver=igb driverversion=5.6.0-k firmware=1.63, 0x800009fc latency=0 link=no multicast=yes port=twisted pair
resources: irq:119 memory:df000000-df7fffff ioport:d060(size=32) memory:e080c000-e080ffff memory:e0400000-e07fffff memory:385ffffe0000-385fffffffff memory:385ffffc0000-385ffffdffff
*-network:1
description: Ethernet interface
product: I350 Gigabit Network Connection
vendor: Intel Corporation
physical id: 0.1
bus info: pci@0000:86:00.1
logical name: ens4f1
version: 01
serial: 74:3a:20:2b:17:e9
size: 1Gbit/s
capacity: 1Gbit/s
width: 32 bits
clock: 33MHz
capabilities: bus_master cap_list rom ethernet physical tp 10bt 10bt-fd 100bt 100bt-fd 1000bt-fd autonegotiation
configuration: autonegotiation=on broadcast=yes driver=igb driverversion=5.6.0-k duplex=full firmware=1.63, 0x800009fc ip=10.71.97.10 latency=0 link=yes multicast=yes port=twisted pair speed=1Gbit/s
resources: irq:131 memory:de800000-deffffff ioport:d040(size=32) memory:e0808000-e080bfff memory:e0000000-e03fffff memory:385ffffa0000-385ffffbffff memory:385ffff80000-385ffff9ffff
*-network:2
description: Ethernet interface
product: I350 Gigabit Network Connection
vendor: Intel Corporation
physical id: 0.2
bus info: pci@0000:86:00.2
logical name: ens4f2
version: 01
serial: 74:3a:20:2b:17:ea
capacity: 1Gbit/s
width: 32 bits
clock: 33MHz
capabilities: bus_master cap_list rom ethernet physical tp 10bt 10bt-fd 100bt 100bt-fd 1000bt-fd autonegotiation
configuration: autonegotiation=on broadcast=yes driver=igb driverversion=5.6.0-k firmware=1.63, 0x800009fc latency=0 link=no multicast=yes port=twisted pair
resources: irq:141 memory:de000000-de7fffff ioport:d020(size=32) memory:e0804000-e0807fff memory:dfc00000-dfffffff memory:385ffff60000-385ffff7ffff memory:385ffff40000-385ffff5ffff
*-network:3
description: Ethernet interface
product: I350 Gigabit Network Connection
vendor: Intel Corporation
physical id: 0.3
bus info: pci@0000:86:00.3
logical name: ens4f3
version: 01
serial: 74:3a:20:2b:17:eb
size: 1Gbit/s
capacity: 1Gbit/s
width: 32 bits
clock: 33MHz
capabilities: bus_master cap_list rom ethernet physical tp 10bt 10bt-fd 100bt 100bt-fd 1000bt-fd autonegotiation
configuration: autonegotiation=on broadcast=yes driver=igb driverversion=5.6.0-k duplex=full firmware=1.63, 0x800009fc latency=0 link=yes multicast=yes port=twisted pair speed=1Gbit/s
resources: irq:216 memory:dd800000-ddffffff ioport:d000(size=32) memory:e0800000-e0803fff memory:df800000-dfbfffff memory:385ffff20000-385ffff3ffff memory:385ffff00000-385ffff1ffff
*-network:0
description: Ethernet interface
physical id: 1
logical name: bond1
serial: 74:3a:20:2b:c3:d0
capabilities: ethernet physical
configuration: autonegotiation=off broadcast=yes driver=bonding driverversion=3.7.1 duplex=full firmware=2 link=yes master=yes multicast=yes
*-network:1
description: Ethernet interface
physical id: 2
logical name: bond1.139
serial: 74:3a:20:2b:c3:d0
capabilities: ethernet physical
configuration: autonegotiation=off broadcast=yes driver=802.1Q VLAN Support driverversion=1.8 duplex=full firmware=N/A ip=10.71.99.228 link=yes multicast=yes
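
Output like this is long, so a small parser can help pick out which interfaces actually carry a link. The following is a minimal sketch that assumes the text format shown above (sections starting with `*-network`, a `logical name:` line, and a `configuration:` line of key=value pairs). The function name is made up for illustration.

```python
import re


def active_links(lshw_text):
    """Return [(logical_name, speed, ip)] for interfaces reporting link=yes."""
    results = []
    # each interface section starts with a "*-network" marker line
    sections = re.split(r"^\s*\*-network.*$", lshw_text, flags=re.M)[1:]
    for section in sections:
        name = re.search(r"logical name:\s*(\S+)", section)
        config = re.search(r"configuration:\s*(.*)", section)
        if not name or not config:
            continue
        # the configuration line is a space-separated list of key=value pairs;
        # values containing spaces are truncated, which is fine for link/speed/ip
        settings = dict(re.findall(r"(\w+)=(\S+)", config.group(1)))
        if settings.get("link") == "yes":
            results.append(
                (name.group(1), settings.get("speed"), settings.get("ip"))
            )
    return results
```

In the output above, the interfaces reporting link=yes are ens3f0 and ens1f0 (10Gbit/s, both bond slaves), ens4f1 and ens4f3 (1Gbit/s), and bond1 and bond1.139. Two of them carry an IP: ens4f1 (10.71.97.10) and bond1.139 (10.71.99.228). The thread does not say which address TiKV listens on, so it is worth confirming that traffic uses the 10 Gbit/s bond rather than the 1 Gbit/s port.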

Analyzing and narrowing it down.

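After the Region hibernation switch (Hibernate Region) was enabled, the heartbeat rate between inactive regions dropped from 100K ops to 30K ops, and the leader jitter problem was resolved.
[Figure: enabling the Region hibernation switch]
[Figure: Raft messages before enabling]
[Figure: Raft messages after enabling]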
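
The drop from 100K to 30K ops is roughly a 70% reduction in heartbeat traffic. The sketch below does that arithmetic and adds a back-of-the-envelope model of why heartbeat load grows with the number of non-hibernated regions. The model is an assumption, not TiKV's actual accounting: it counts one heartbeat per follower peer per interval, and the default replica count and 2-second interval are illustrative values. As far as I know the switch corresponds to the `raftstore.hibernate-regions` setting in the TiKV configuration, but the thread itself does not name it.

```python
def reduction(before_ops, after_ops):
    """Fractional drop in message rate."""
    return (before_ops - after_ops) / before_ops


def estimated_heartbeat_ops(active_regions, replicas=3, interval_s=2.0):
    """Rough cluster-wide heartbeats per second from non-hibernated regions.

    Assumes each region leader sends one heartbeat to every follower
    once per interval (a simplification; responses are not counted).
    """
    return active_regions * (replicas - 1) / interval_s
```

For example, `reduction(100_000, 30_000)` gives 0.7. Under the simplified model, the heartbeat rate is proportional to the number of regions that stay awake, which is why hibernating idle regions cuts the Raft message load.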
