Extremely high load on a single TiKV node

【Cluster Information】

  • Cluster version: v5.1.1
  • TiKV nodes: 5

【Problems Encountered】

1. Over the last two days the TiDB slow-query log has been growing fast; some SQL statements that used to execute quickly suddenly spiked to several hundred seconds.

2. One TiKV node is under extremely high load; even logging into that server is painfully laggy.

3. That node's log is being flooded with entries like these:

[2022/06/14 16:22:15.516 +08:00] [INFO] [tracker.rs:255] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=0] [perf_stats.block_read_byte=48736] [perf_stats.block_read_count=2] [perf_stats.block_cache_hit_count=21] [scan.range.first="Some(start: 7480000000000002485F72800000008FEC1E34 end: 7480000000000002485F72800000008FEC1E35)"] [scan.ranges=2] [scan.total=2] [scan.processed=2] [scan.is_desc=false] [tag=select] [table_id=584] [txn_start_ts=433899413354053884] [total_suspend_time=0ns] [total_process_time=77ms] [handler_build_time=0ns] [wait_time.snapshot=0ns] [wait_time.schedule=2.101s] [wait_time=2.101s] [total_lifetime=2.178s] [remote_host=ipv4:192.168.1.20:54728] [region_id=1335237651]
[2022/06/14 16:22:15.516 +08:00] [INFO] [tracker.rs:255] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=0] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=12] [scan.range.first="Some(start: 7480000000000002485F728000000091AA7A6F end: 7480000000000002485F728000000091AA7A70)"] [scan.ranges=1] [scan.total=1] [scan.processed=1] [scan.is_desc=false] [tag=select] [table_id=584] [txn_start_ts=433899409027366962] [total_suspend_time=0ns] [total_process_time=0ns] [handler_build_time=0ns] [wait_time.snapshot=0ns] [wait_time.schedule=2.178s] [wait_time=2.178s] [total_lifetime=2.178s] [remote_host=ipv4:192.168.1.21:60418] [region_id=1336868458]
[2022/06/14 16:22:15.517 +08:00] [INFO] [tracker.rs:255] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=0] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=23] [scan.range.first="Some(start: 7480000000000002485F728000000091E59469 end: 7480000000000002485F728000000091E5946A)"] [scan.ranges=2] [scan.total=2] [scan.processed=2] [scan.is_desc=false] [tag=select] [table_id=584] [txn_start_ts=433899404584026210] [total_suspend_time=0ns] [total_process_time=1ms] [handler_build_time=0ns] [wait_time.snapshot=2.178s] [wait_time.schedule=2.691s] [wait_time=4.869s] [total_lifetime=4.87s] [remote_host=ipv4:192.168.1.21:60434] [region_id=1337111165]
[2022/06/14 16:22:15.517 +08:00] [INFO] [tracker.rs:255] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=0] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=12] [scan.range.first="Some(start: 7480000000000002485F72800000009460CDF0 end: 7480000000000002485F72800000009460CDF1)"] [scan.ranges=1] [scan.total=1] [scan.processed=1] [scan.is_desc=false] [tag=select] [table_id=584] [txn_start_ts=433899414361997446] [total_suspend_time=0ns] [total_process_time=0ns] [handler_build_time=0ns] [wait_time.snapshot=0ns] [wait_time.schedule=2.175s] [wait_time=2.175s] [total_lifetime=2.175s] [remote_host=ipv4:192.168.1.21:60422] [region_id=1339316288]
[2022/06/14 16:22:15.520 +08:00] [INFO] [tracker.rs:255] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=2] [perf_stats.block_read_byte=48436] [perf_stats.block_read_count=2] [perf_stats.block_cache_hit_count=22] [scan.range.first="Some(start: 7480000000000002485F7280000000932F25FB end: 7480000000000002485F7280000000932F25FD)"] [scan.ranges=2] [scan.total=4] [scan.processed=3] [scan.is_desc=false] [tag=select] [table_id=584] [txn_start_ts=433899407166144574] [total_suspend_time=0ns] [total_process_time=1ms] [handler_build_time=0ns] [wait_time.snapshot=0ns] [wait_time.schedule=2.163s] [wait_time=2.163s] [total_lifetime=2.164s] [remote_host=ipv4:192.168.1.20:54730] [region_id=1338184470]

4. After restarting that TiKV node, it is still just as slow.

5. After shutting the node down outright, the TiDB cluster instantly came back to life and the slow-query log stopped flooding.

【Questions】

1. What do these log entries mean, and why is only this node flooding them while the other nodes are not?

2. After the node is stopped (down), regions should in theory migrate to the other nodes (the down threshold is configured to 1 hour), but after waiting a while the region counts on the other nodes showed no upward trend (see the pd-ctl sketch after this list).

3. An SSD has now been mounted on that node and I plan to move the TiKV data directory onto it. What is the most convenient way to do that?
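
For question 2, whether PD has really marked that store as down and started moving its regions off can be checked with pd-ctl. A minimal sketch, assuming tiup is installed; the PD address is a placeholder:

  # List every store with its state (Up/Down/Offline) and region count,
  # to see whether regions are actually leaving the problem store
  tiup ctl:v5.1.1 pd -u http://<pd-host>:2379 store

  # Show how long PD waits before evacuating a down store
  # (max-store-down-time, default 30m)
  tiup ctl:v5.1.1 pd -u http://<pd-host>:2379 config show | grep max-store-down-time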

Sounds like a hotspot problem.

You have N nodes, but because the table structure and configuration don't meet the conditions for scattering requests, all of the reads and writes land on the same node (that one node gets worked to death while the others sit idle). Your logs point the same way: total_process_time is near zero while wait_time.schedule is over 2s, so requests are queuing on that node's Coprocessor pool rather than being slow to execute.

You can refer to:
https://docs.pingcap.com/zh/tidb/stable/troubleshoot-hot-spot-issues#tidb-热点问题处理

  1. First deal with the hot tables: change the schema where it needs changing, and pre-split/scatter up front where that is called for (see the sketch after this list)
  2. Then fix hot-index problems, which can trigger "TiKV server is busy" errors
  3. Locate the slow queries and optimize them away
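
A minimal sketch of the usual scattering knobs (the table names t and t2 are placeholders, and the numbers are illustrative rather than tuned values):

  -- Scatter writes for a table that uses the implicit _tidb_rowid
  -- (i.e. one without an integer clustered primary key)
  ALTER TABLE t SHARD_ROW_ID_BITS = 4;

  -- For new tables, randomize the auto-assigned primary key instead
  CREATE TABLE t2 (id BIGINT AUTO_RANDOM PRIMARY KEY, v VARCHAR(64));

  -- Pre-split a hot table into multiple regions ahead of the writes
  SPLIT TABLE t BETWEEN (0) AND (1000000) REGIONS 16;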

A TiKV node doesn't support multiple data disks. If the machine has enough resources, you can run several TiKV instances on a single host and give each its own disk for data; the simplest way to do that is a scale-out (a topology sketch follows below).
Reference docs:

How do you deploy multiple instances on a single machine with tiup?
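
A rough sketch of a scale-out topology for two TiKV instances on one host (the IP, ports, data paths, cluster name, and label value below are all placeholders; each instance needs its own ports and data_dir, plus a host-level label so PD will not place peer replicas of one region on the same machine):

  # scale-out.yaml
  tikv_servers:
    - host: 192.168.1.30
      port: 20160
      status_port: 20180
      data_dir: /ssd1/tikv-20160
      config:
        server.labels: { host: host30 }
    - host: 192.168.1.30
      port: 20161
      status_port: 20181
      data_dir: /ssd2/tikv-20161
      config:
        server.labels: { host: host30 }

  # PD's replication.location-labels must include "host" for the
  # labels above to take effect.
  tiup cluster scale-out <cluster-name> scale-out.yaml

This also suggests the cleanest answer to question 3: scale out a new TiKV instance on the SSD, then scale in the old one with tiup cluster scale-in <cluster-name> --node <old-tikv-ip>:20160 and let PD migrate the regions, instead of copying the data directory by hand.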

I got an error when setting labels online:
