High IO utilization on TiKV nodes

[TiDB environment] Production / test / POC: production
[TiDB version] 5.3
[Problem] IO utilization on the TiKV nodes is too high
[Symptoms and impact]
The production TiDB cluster currently has 5 TiKV nodes; some are 4 cores / 8 GB RAM and some 16 cores / 32 GB RAM. The data disks are AWS gp3 volumes (SSD) with a baseline of 3000 IOPS. Right now every TiKV node sits at around 1500 IOPS, read/write throughput stays around 10 MB/s, and latency is around 1 s. The problem is very similar to the one described in this blog post: http://laddyq.com/article/36733.html. Following that post, I added commit=60 and data=writeback to the disk mount options and set sync-log=false in the TiKV configuration, but it made no difference.
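For reference, this is roughly what those mount options look like as an /etc/fstab entry; the device name and mount point below are placeholders for illustration, not the actual values from this cluster:

# Hypothetical fstab line for the TiKV data volume (check the real device with lsblk)
/dev/nvme1n1  /tidb/data  ext4  defaults,noatime,commit=60,data=writeback  0 2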

TiKV version
# tiup tikv:v5.3.0 --version
Starting component tikv: /root/.tiup/components/tikv/v5.3.0/tikv-server --version
TiKV
Release Version: 5.3.0
Edition: Community
Git Commit Hash: 6c1424706f3d5885faa668233f34c9f178302f36
Git Commit Branch: heads/refs/tags/v5.3.0
UTC Build Time: 2021-11-19 16:24:14
Rust Version: rustc 1.56.0-nightly (2faabf579 2021-07-27)
Enable Features: jemalloc mem-profiling portable sse protobuf-codec test-engines-rocksdb cloud-aws cloud-gcp
Profile: dist_release

TiUP Cluster Display information

TiKV node host monitoring

Real-time peak TiDB QPS is only 300

TiKV node IO status

TiUP Cluster Edit Config information

global:
  user: tidb
  ssh_port: 22
  ssh_type: builtin
  deploy_dir: /tidb/deploy
  data_dir: /tidb/data
  os: linux
  arch: amd64
monitored:
  node_exporter_port: 9100
  blackbox_exporter_port: 9115
  deploy_dir: /tidb/deploy/monitor-9100
  data_dir: /tidb/data/monitor-9100
  log_dir: /tidb/deploy/monitor-9100/log
server_configs:
  tidb:
    log.enable-timestamp: true
    log.file.max-backups: 3
    log.file.max-days: 3
    log.level: info
    oom-action: log
  tikv:
    raftdb.allow-concurrent-memtable-write: true
    raftdb.max-background-jobs: 4
    raftstore.apply-pool-size: 3
    raftstore.store-pool-size: 3
    readpool.storage.normal-concurrency: 3
    readpool.unified.max-thread-count: 8
    readpool.unified.min-thread-count: 3
    rocksdb.max-background-jobs: 4
    server.grpc-concurrency: 2
    storage.scheduler-worker-pool-size: 10
  pd:
    log.file.max-backups: 3
    log.file.max-days: 3
    log.level: INFO
    metric.interval: 15s
    schedule.max-merge-region-keys: 200000
    schedule.max-merge-region-size: 20
    schedule.patrol-region-interval: 15ms
  tiflash: {}
  tiflash-learner: {}
  pump: {}
  drainer: {}
  cdc: {}
tidb_servers:
- host: 172.23.16.120
  ssh_port: 22
  port: 4000
  status_port: 10080
  deploy_dir: /tidb/deploy/tidb-01
  log_dir: /tidb/deploy/tidb-01/log
  arch: amd64
  os: linux
tikv_servers:
- host: 172.23.25.132
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /tidb/deploy/tikv-03
  data_dir: /tidb/data/tikv-03
  log_dir: /tidb/deploy/tikv-03/log
  config:
    raftdb.allow-concurrent-memtable-write: true
    raftdb.max-background-jobs: 12
    raftstore.apply-pool-size: 9
    raftstore.store-pool-size: 9
    readpool.storage.normal-concurrency: 9
    readpool.unified.max-thread-count: 16
    readpool.unified.min-thread-count: 9
    rocksdb.max-background-jobs: 8
    server.grpc-concurrency: 8
  arch: amd64
  os: linux
- host: 172.23.8.223
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /tidb/deploy/tikv-04
  data_dir: /tidb/data/tikv-04
  log_dir: /tidb/deploy/tikv-04/log
  arch: amd64
  os: linux
- host: 172.23.12.99
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /tidb/deploy/tikv-05
  data_dir: /tidb/data/tikv-05
  log_dir: /tidb/deploy/tikv-05/log
  arch: amd64
  os: linux
- host: 172.23.17.81
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /tidb/deploy/tikv-07
  data_dir: /tidb/data/tikv-07
  log_dir: /tidb/log/tikv-07/
  config:
    raftdb.allow-concurrent-memtable-write: true
    raftdb.max-background-jobs: 12
    raftstore.apply-pool-size: 9
    raftstore.store-pool-size: 9
    raftstore.sync-log: false
    readpool.storage.normal-concurrency: 9
    readpool.unified.max-thread-count: 16
    readpool.unified.min-thread-count: 8
    rocksdb.defaultcf.max-write-buffer-number: 12
    rocksdb.defaultcf.write-buffer-size: 1024MB
    rocksdb.max-background-jobs: 12
    server.grpc-concurrency: 8
    storage.block-cache.capacity: 2GB
    storage.scheduler-worker-pool-size: 15
  arch: amd64
  os: linux
- host: 172.23.8.190
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /tidb/deploy/tikv-08
  data_dir: /tidb/data/tikv-08
  log_dir: /tidb/log/tikv-08
  config:
    raftstore.sync-log: false
  arch: amd64
  os: linux
tiflash_servers:
- host: 172.23.18.56
  ssh_port: 22
  tcp_port: 9000
  http_port: 8123
  flash_service_port: 3930
  flash_proxy_port: 20170
  flash_proxy_status_port: 20292
  metrics_port: 8234
  deploy_dir: /tidb-deploy/tiflash-9000
  data_dir: /tidb/data/tiflash-9000
  log_dir: /tidb-deploy/tiflash-9000/log
  arch: amd64
  os: linux
pd_servers:
- host: 172.23.16.120
  ssh_port: 22
  name: pd-03
  client_port: 2379
  peer_port: 2380
  deploy_dir: /tidb/deploy/pd-03
  data_dir: /tidb/data/pd-03
  log_dir: /tidb/deploy/pd-03/log
  arch: amd64
  os: linux
- host: 172.23.18.56
  ssh_port: 22
  name: pd-04
  client_port: 2379
  peer_port: 2380
  deploy_dir: /tidb/deploy/pd-04
  data_dir: /tidb/data/pd-04
  log_dir: /tidb/deploy/pd-04/log
  arch: amd64
  os: linux
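A quick way to double-check that per-instance overrides in this topology (for example storage.block-cache.capacity on tikv-07) really took effect is to query the running configuration through any TiDB server; the user and empty password below are assumptions:

# Show the effective TiKV value of one setting via the TiDB server from the topology above
mysql -h 172.23.16.120 -P 4000 -u root -e \
  "SHOW CONFIG WHERE type='tikv' AND name='storage.block-cache.capacity'"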
TiDB Overview monitoring

Check TiKV-Details -> RocksDB KV -> Block Cache hit and look at the block cache hit rate. Your resource configuration is not large — the nodes have 8 GB or 32 GB of memory — so the block cache cannot be very big, and that adds a certain amount of physical IO.
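If you would rather pull that number straight from Prometheus than from Grafana, a minimal sketch (the Prometheus host is a placeholder; use the monitoring host of this cluster):

# RocksDB KV block cache hit rate per TiKV instance over the last 5 minutes
curl -s 'http://<prometheus-host>:9090/api/v1/query' --data-urlencode \
  'query=sum(rate(tikv_engine_cache_efficiency{db="kv",type="block_cache_hit"}[5m])) by (instance) / (sum(rate(tikv_engine_cache_efficiency{db="kv",type="block_cache_hit"}[5m])) by (instance) + sum(rate(tikv_engine_cache_efficiency{db="kv",type="block_cache_miss"}[5m])) by (instance))'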

I upgraded the instance spec to 8 cores / 64 GB; the TiKV node configuration is now as follows:

- host: 172.23.8.190
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /tidb/deploy/tikv-08
  data_dir: /tidb/data/tikv-08
  log_dir: /tidb/log/tikv-08
  config:
    raftdb.max-background-jobs: 12
    raftstore.apply-pool-size: 9
    raftstore.store-pool-size: 9
    raftstore.sync-log: false
    readpool.storage.normal-concurrency: 9
    readpool.unified.max-thread-count: 16
    readpool.unified.min-thread-count: 9
    rocksdb.defaultcf.max-write-buffer-number: 12
    rocksdb.defaultcf.write-buffer-size: 512MB
    rocksdb.max-background-jobs: 8
    rocksdb.writecf.max-write-buffer-number: 12
    rocksdb.writecf.write-buffer-size: 512MB
    server.grpc-concurrency: 8
    storage.block-cache.capacity: 36GB
  arch: amd64
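For completeness, the usual way to roll out a change like this (the cluster name is a placeholder; tiup cluster list shows the real one):

tiup cluster edit-config <cluster-name>      # edit the topology shown above
tiup cluster reload <cluster-name> -R tikv   # rolling reload so the new TiKV settings take effect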

Under a single-threaded write workload from the application, the TiKV nodes behave as follows:
disk IOPS is still stuck at 1500, write latency stays around 1 s with no change, and node memory usage is not high.
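A quick way to confirm those numbers on the node itself (the device name is a placeholder; use lsblk to see which device backs /tidb/data):

iostat -x 1 nvme1n1   # watch r/s, w/s, w_await (write latency in ms) and %util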
Node monitoring:

TiKV block cache monitoring

This looks like a disk problem.
To quote @h5n1:

Cloud disk performance is poor, and TiDB generates a lot of reads and writes internally: every operation and every lock has to write a Raft log, and compaction adds a large amount of extra reads and writes.
IOPS is split into reads and writes, and the high IOPS figures advertised for cloud disks are mostly read IOPS boosted by caching. Disk performance also involves bandwidth and fdatasync: when TiKV writes data it has to sync the disk to make sure the data has actually been flushed from the buffer to the hardware before acknowledging the write, which concretely is the fdatasync system call.
The recommendation for a TiKV disk is at least 2 GB/s of write bandwidth and more than 20K fdatasync/s, with P99.99 below 3 ms in a high-concurrency 4 KB direct-write test. You can measure this with a recent fio or with the pg_test_fsync tool. Add the -fdatasync=1 option, for example high concurrency, 4 KB per write, with a sync on every write:
fio -direct=0 -fdatasync=1 -iodepth=4 -thread=4 -rw=write -ioengine=libaio -bs=4k -filename=./fio_test -size=20G -runtime=60 -group_reporting -name=write_test
Reference values for fdatasync performance:
Reference 1: non-NVMe SSDs manage roughly 5-8K fdatasync/s
Reference 2: early/mid-generation NVMe manages roughly 20-50K fdatasync/s
Reference 3: current mature PCIe 3 NVMe manages roughly 200-500K fdatasync/s
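pg_test_fsync, mentioned above, ships with the PostgreSQL server packages and prints operations per second for the various sync methods; the test file path below is a placeholder and should point at the TiKV data disk:

pg_test_fsync -f /tidb/data/fsync_test -s 5   # 5 seconds per test, file placed on the disk under test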