TiKV drops leaders under high load after upgrading from 6.5 to 7.5

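GreenGuan · December 31, 2024 10:01

During our business peak (around 4 a.m.), we saw no leader drops on 6.5.7, but they started happening after we upgraded to 7.5.4. We upgraded in place and changed no configuration, and TiKV reported no OOM errors at the time of the errors. Our TiDB cluster runs multiple TiKV instances per machine, with 4 TiKV instances on each host.

A similar problem happened once before, and we worked around it by modifying the SQL, but this time it occurred again. I'm worried that unless we find the root cause we can't safely upgrade any further, and I'm not sure whether 8.1 or 8.5 has the same issue. Could someone help investigate?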

The-Fallen-Angel · December 31, 2024 10:16

With heavy inserts, this may be caused by an I/O problem, for example slow disk response leading to overload.

WalterWj · December 31, 2024 10:17

Please take a screenshot of this part:

GreenGuan · December 31, 2024 10:27

有猫万事足 · December 31, 2024 14:38

github.com

tikv/pd/blob/master/pkg/schedule/schedulers/evict_slow_trend.go#L536


      
		return
	}

	if !checkStoreSlowerThanOthers(cluster, store) {
		log.Info("evict-slow-trend-scheduler failed to confirm candidate: it's not slower than others", zap.Uint64("store-id", store.GetID()))
		storeSlowTrendActionStatusGauge.WithLabelValues("candidate", "none_not_slower").Inc()
		return
	}

	storeSlowTrendActionStatusGauge.WithLabelValues("candidate", "add").Inc()
	log.Info("evict-slow-trend-scheduler captured candidate", zap.Uint64("store-id", store.GetID()))
	return store
}

func checkStoresAreUpdated(cluster sche.SchedulerCluster, slowStoreID uint64, slowStoreRecordTS time.Time) bool {
	stores := cluster.GetStores()
	if len(stores) <= 1 {
		return false
	}
	expected := (len(stores) + 1) / 2
	updatedStores := 0

Around the time the leaders dropped, is there a line like this in the PD log?

evict-slow-trend-scheduler captured candidate
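
One way to run this check is to scan the PD log for the message and, optionally, narrow it to the hour of the leader drop. The sketch below is only an illustration: the function name `find_lines`, the log path given on the command line, and the timestamp prefix are all hypothetical, and it does nothing more than plain substring matching.

```python
import sys
from typing import Iterable, List

# Message logged by PD's evict-slow-trend scheduler (quoted in the post above).
NEEDLE = "evict-slow-trend-scheduler captured candidate"


def find_lines(lines: Iterable[str], needle: str = NEEDLE,
               time_prefix: str = "") -> List[str]:
    """Return the lines that contain `needle`.

    If `time_prefix` is non-empty, a line must also contain it. This is a
    crude way to limit the search to one hour or minute; the exact text to
    pass depends on how your log timestamps are written (an assumption here).
    """
    return [ln.rstrip("\n") for ln in lines
            if needle in ln and time_prefix in ln]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python scan_pd_log.py <path-to-pd.log> [time-prefix]
    prefix = sys.argv[2] if len(sys.argv) > 2 else ""
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        for hit in find_lines(f, time_prefix=prefix):
            print(hit)
```

If nothing is printed for the window around the leader drop, this scheduler did not log a captured candidate in that log file, and the cause is likely elsewhere.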

dfzxc · January 1, 2025 13:21

What kind of problem can this line help pinpoint?

GreenGuan · January 2, 2025 01:44

No, there is no such line.

h5n1 · January 2, 2025 02:56

Check TiKV-Details → PD → Store Slow Score, and also the disk performance at the time of the leader drop.

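GreenGuan · January 3, 2025 01:57

The slow score did increase. What factors can cause it to rise? I checked the host's CPU, network, and disks and found no bottleneck. The only thing I noticed is that one of the stores in the multi-instance deployment has a write hotspot.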

h5n1 · January 3, 2025 02:01

Did you check this store's disk? In cases like this, a disk problem is the most common cause. What type of disk are you using?

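GreenGuan · January 3, 2025 02:07

The disk is not full; if it were, it probably would not have recovered on its own. This is a multi-instance-per-host deployment, and every disk is an NVMe 2.0 SSD.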

WalterWj · January 3, 2025 03:31

How about manually evicting the leaders from 239 23180 and then observing?
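
A manual eviction like this is normally done with pd-ctl's `scheduler add evict-leader-scheduler <store-id>`. The sketch below only builds and prints the command lines; it does not run them. The PD address and the store ID are placeholders (the store ID of the suspect instance has to be looked up first, for example with pd-ctl's `store` command), and the removal syntax should be checked against the pd-ctl documentation for your version.

```python
import shlex
from typing import List


def pd_ctl(pd_addr: str, cluster_version: str, *args: str) -> List[str]:
    """Build a `tiup ctl:<version> pd -u <addr> ...` argument list."""
    return ["tiup", f"ctl:{cluster_version}", "pd", "-u", pd_addr, *args]


def evict_leader_cmd(pd_addr: str, store_id: int,
                     cluster_version: str = "v7.5.4") -> List[str]:
    """Command that asks PD to move all leaders off the given store."""
    return pd_ctl(pd_addr, cluster_version,
                  "scheduler", "add", "evict-leader-scheduler", str(store_id))


def stop_evict_cmd(pd_addr: str, store_id: int,
                   cluster_version: str = "v7.5.4") -> List[str]:
    """Command that removes the eviction so leaders can return to the store."""
    return pd_ctl(pd_addr, cluster_version,
                  "scheduler", "remove", f"evict-leader-scheduler-{store_id}")


if __name__ == "__main__":
    PD_ADDR = "http://127.0.0.1:2379"  # placeholder: your PD endpoint
    STORE_ID = 1                       # placeholder: store ID of the suspect TiKV
    print(shlex.join(evict_leader_cmd(PD_ADDR, STORE_ID)))
    print(shlex.join(stop_evict_cmd(PD_ADDR, STORE_ID)))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere. If the leader drops stop while the store is evicted, that points at this one instance rather than at the cluster as a whole.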

terry0219 · January 3, 2025 09:29

Take a look at TiKV-Details → Cluster → IO utilization.