Day 15
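Key monitoring metrics for operations
System
CPU
CPU usage: investigate when it exceeds 80%
CPU load: should stay below the total number of cores
Memory
TiKV: memory usage should preferably stay below 60%
TiDB nodes: keep 20% free memory
Network
Network traffic should not saturate the NIC
IO
IO Util: should not exceed 80%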
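The system-level thresholds above can be sketched as a simple check. This is a minimal illustration, not a TiDB tool: the HostSample type and its field names are hypothetical, and only the threshold values come from the notes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HostSample:
    """One snapshot of host metrics (hypothetical field names)."""
    role: str               # "tikv" or "tidb"
    cpu_usage_pct: float    # CPU usage, 0-100
    cpu_load: float         # CPU load average
    cores: int              # total number of CPU cores
    mem_used_pct: float     # memory in use, 0-100
    net_bytes_per_s: float  # current network throughput
    nic_bytes_per_s: float  # NIC capacity
    io_util_pct: float      # disk IO Util, 0-100


def check_host(s: HostSample) -> List[str]:
    """Return one warning per threshold from the notes that is violated."""
    warnings = []
    if s.cpu_usage_pct > 80:
        warnings.append("CPU usage above 80%")
    if s.cpu_load >= s.cores:
        warnings.append("CPU load not below core count")
    if s.role == "tikv" and s.mem_used_pct > 60:
        warnings.append("TiKV memory usage above 60%")
    if s.role == "tidb" and 100 - s.mem_used_pct < 20:
        warnings.append("TiDB free memory below 20%")
    if s.net_bytes_per_s >= s.nic_bytes_per_s:
        warnings.append("network traffic saturates the NIC")
    if s.io_util_pct > 80:
        warnings.append("IO Util above 80%")
    return warnings


# Illustrative numbers only.
healthy = HostSample("tikv", 45, 6, 16, 50, 3e8, 1.25e9, 40)
busy = HostSample("tidb", 92, 20, 16, 85, 1.25e9, 1.25e9, 95)
print(check_host(healthy))
print(check_host(busy))
```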
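TiDB
Query Summary
Duration: for OLTP workloads, the 99% latency should be below 100ms
Slow query: should not appear under normal conditions
Ideal CPS: helps determine whether the latency arises on the database side or the client side
Server
Get token duration: better < 1ms; otherwise check whether the token-limit setting is larger than the total number of connections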
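The Get token duration rule can be read as a two-step check. The sketch below assumes the duration, the token-limit value and the connection count have already been read from Grafana and the TiDB configuration; the function name and return strings are hypothetical.

```python
def review_get_token(duration_ms: float, token_limit: int,
                     total_connections: int) -> str:
    """Apply the rule from the notes: < 1ms is fine, else inspect token-limit."""
    if duration_ms < 1:
        return "ok"
    if token_limit <= total_connections:
        # Sessions are queuing for tokens because there are too few of them.
        return "token-limit is not larger than the total connections"
    return "token-limit is large enough; look for another cause"


# Illustrative numbers only.
print(review_get_token(0.2, 1000, 400))
print(review_get_token(8.0, 1000, 1500))
```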
Executor
Parse duration: better < 10ms
Compile duration: better < 30ms
KV Error
Lock Resolve OPS: better < 500 for expired and not expired; higher values suggest too many lock conflicts, in which case pessimistic locking is recommended
KV Backoff OPS: better < 500 for txnLockFast and txnLock
PD Client
PD TSO .99 wait duration: better < 5ms
TiKV
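Cluster
Region: fewer than 50,000 Regions per TiKV instance is recommended; beyond that, the overhead of Region heartbeats and the Raft state machine becomes significant
gRPC
.99 gRPC message duration: preferably below 100ms
Thread CPU
Raft store CPU: preferably below 75% * raftstore.store-pool-size
Async apply CPU: preferably below 75% * raftstore.apply-pool-size
Scheduler worker CPU: preferably below 80% * storage.scheduler-worker-pool-size
gRPC poll CPU: preferably below 80% * server.grpc-concurrency
Unified read pool CPU: preferably below 80% * readpool.unified.max-thread-count
Storage ReadPool CPU: preferably below 80% * readpool.storage.normal-concurrency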
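Each Thread CPU limit above scales with the size of its thread pool. The sketch below works through that arithmetic, assuming the panels report CPU as a percentage where 100% equals one core. The panel labels and helper names are illustrative; the configuration items are spelled as in the TiKV configuration file.

```python
# panel -> (percent of one core allowed per thread, config item sizing the pool)
THREAD_CPU_RULES = {
    "Raft store CPU": (75, "raftstore.store-pool-size"),
    "Async apply CPU": (75, "raftstore.apply-pool-size"),
    "Scheduler worker CPU": (80, "storage.scheduler-worker-pool-size"),
    "gRPC poll CPU": (80, "server.grpc-concurrency"),
    "Unified read pool CPU": (80, "readpool.unified.max-thread-count"),
    "Storage ReadPool CPU": (80, "readpool.storage.normal-concurrency"),
}


def cpu_limit_pct(panel: str, pool_size: int) -> int:
    """Recommended ceiling for a panel, in percent of one core."""
    pct_per_thread, _config_item = THREAD_CPU_RULES[panel]
    return pct_per_thread * pool_size


def is_saturated(panel: str, pool_size: int, observed_cpu_pct: float) -> bool:
    """True when the observed thread-pool CPU reaches the recommended ceiling."""
    return observed_cpu_pct >= cpu_limit_pct(panel, pool_size)


# With raftstore.store-pool-size = 2 the ceiling is 75% * 2 = 150%.
print(cpu_limit_pct("Raft store CPU", 2))
print(is_saturated("gRPC poll CPU", 4, 330))
```

When a pool stays at or above its ceiling, the matching configuration item in the table is the one to review.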
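Raft IO
Append log duration: .99 latency < 10ms
Apply log duration: .99 latency < 30ms. This is the time taken to write the operations recorded in the Raft log into the database
Commit log duration: .99 latency < 30ms. This covers the Raft replication process from start to finish; high network latency may show up in this metric
Also watch the .999 latency of the metrics above
Raft propose
Propose wait duration: .99 latency < 20ms. This is how long a request waits before the Raft module starts processing it; a long wait means the Raft module is busy, for example because of slow disk flushes or a CPU bottleneck
Apply wait duration: .99 latency < 50ms. This is how long a committed log waits, after being handed to the apply module, before it is applied
Also watch the .999 latency of the metrics above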
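To see why the .999 latency is worth watching next to the .99 latency, the sketch below computes both from raw samples with the nearest-rank method. Grafana derives these values from histograms, so this is only an approximation of the idea; the sample data is made up and the helper name is hypothetical.

```python
from typing import Sequence

# .99 limits from the notes, in milliseconds.
P99_LIMITS_MS = {
    "append log": 10,
    "apply log": 30,
    "commit log": 30,
    "propose wait": 20,
    "apply wait": 50,
}


def percentile(samples: Sequence[float], per_mille: int) -> float:
    """Nearest-rank percentile; per_mille=990 means .99, 999 means .999."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of per_mille * n / 1000, at least rank 1.
    rank = max((per_mille * len(ordered) + 999) // 1000, 1)
    return ordered[rank - 1]


# Made-up append log durations: mostly fast, with a small slow tail.
durations_ms = [5.0] * 990 + [40.0] * 9 + [500.0]
p99 = percentile(durations_ms, 990)
p999 = percentile(durations_ms, 999)
print(p99, p999)
print(p99 < P99_LIMITS_MS["append log"])
```

In this sample the .99 latency is within the 10ms limit while the .999 latency is not, which is the kind of tail the second metric exposes.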
Errors
Server is busy: ideally this should not occur; the monitoring panel indicates the reason
PD
etcd
99% WAL fsync duration: preferably below 5ms
Heartbeat
99% Region heartbeat latency: preferably below 5ms; a very long latency indicates high PD load
Dashboard
Login
http://pdaddr:pdport/dashboard
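The Dashboard address is built from the PD address and port. A small sketch, assuming PD listens on its default client port 2379; the host used in the example is hypothetical.

```python
def dashboard_url(pd_addr: str, pd_port: int = 2379) -> str:
    """Build the TiDB Dashboard URL served by a PD instance."""
    return f"http://{pd_addr}:{pd_port}/dashboard"


# Hypothetical PD host.
print(dashboard_url("10.0.1.5"))
```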
Summary
QPS/Latency
Top SQL
Latest Slow Query
Keyviz
Pay attention to hotspot issues
SQL Statements
SQL lifecycle in TiDB and key monitoring metrics
SQL flow in TiDB
[Figure: SQL processing flow in TiDB]
Enabling Prepared Statement skips Parse and Preprocess
Enabling Prepared Plan Cache skips Parse, Preprocess, Logical Optimizer and Physical Optimizer
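The two shortcuts above can be modelled as a filter over the compile pipeline. This is a simplified sketch of the notes, not TiDB's implementation: the stage list omits token and TSO acquisition, and the function name is hypothetical.

```python
from typing import List

STAGES = ["Parse", "Preprocess", "Logical Optimizer",
          "Physical Optimizer", "Execute"]


def stages_run(prepared: bool, plan_cache_hit: bool) -> List[str]:
    """Return the stages that still run for one statement execution."""
    skipped = set()
    if prepared:
        # A prepared statement was parsed and preprocessed ahead of time.
        skipped |= {"Parse", "Preprocess"}
    if plan_cache_hit:
        # A cached plan also avoids both optimizer stages.
        skipped |= {"Parse", "Preprocess",
                    "Logical Optimizer", "Physical Optimizer"}
    return [s for s in STAGES if s not in skipped]


print(stages_run(prepared=False, plan_cache_hit=False))
print(stages_run(prepared=True, plan_cache_hit=False))
print(stages_run(prepared=True, plan_cache_hit=True))
```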
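Get Token
Tokens are used to limit SQL concurrency
Configuration: token-limit
Grafana: Get Token Duration
Get TSO
Timestamps are fetched asynchronously from PD (both when a transaction starts and when it ends)
Dashboard (SQL Statements & Slow Query)
Grafana: PD TSO Wait Duration (reflects TiDB load)
Grafana: PD TSO RPC Duration (reflects the network between TiDB and PD, or PD load)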
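Reading the two TSO panels together narrows down where the time goes. The sketch below encodes that reasoning; the 5ms wait limit comes from the PD Client guideline earlier in these notes, while rpc_limit_ms is a hypothetical threshold the caller must choose, since the notes give none.

```python
def diagnose_tso(wait_ms: float, rpc_ms: float, rpc_limit_ms: float,
                 wait_limit_ms: float = 5.0) -> str:
    """Suggest where to look when PD TSO Wait Duration is high."""
    if wait_ms < wait_limit_ms:
        return "ok"
    if rpc_ms >= rpc_limit_ms:
        # The round trip itself is slow: network between TiDB and PD, or PD load.
        return "check the TiDB-PD network or PD load"
    # The RPC is fast, so the time is spent waiting inside TiDB.
    return "check TiDB load"


# Illustrative numbers only; rpc_limit_ms=3 is an arbitrary choice.
print(diagnose_tso(2.0, 1.0, rpc_limit_ms=3.0))
print(diagnose_tso(12.0, 9.0, rpc_limit_ms=3.0))
print(diagnose_tso(12.0, 1.0, rpc_limit_ms=3.0))
```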
Parse
Dashboard (SQL Statements & Slow Query)
Grafana: Parse Duration
Usually high for batch inserts
Compile
Preprocess (validator & type infer) + Optimize
Dashboard (SQL Statements & Slow Query)
Grafana: Compile Duration
Usually high for complex queries
Prepared Statements
Saves the cost of parsing and preprocessing
Grafana: Prepare Statement Count
Prepared Plan Cache
Saves the cost of optimization
Grafana: Plan Cache Hits
Execution phase
Execution Duration
Expensive Executors OPS
HashAgg / Sort / IndexLookUp…
Related system variables: tidb_{operator}_concurrency
KV Request
KV Request Duration 99
TiClient Region Error OPS
Lock Resolve OPS
DistSQL
Sends KV requests and receives KV results
DistSQL Duration
Requests are sent concurrently
Concurrency control: tidb_distsql_scan_concurrency
Scan Keys
Dashboard (SQL Statements & Slow Query)
Grafana: Scan Keys
Coprocessor & Get & Batch Get
KV Request OPS
Transaction
KV Transaction Duration
Local Latch
Suited to high-conflict scenarios; disabled by default
Dashboard (SQL Statements & Slow Query)
Grafana: Local Latch Wait Time
Transaction Retry
Errors such as write conflicts can be retried
Dashboard (SQL Statements & Slow Query)
Grafana: Transaction Retry Num
SQL flow in TiKV
[Figure: SQL processing flow in TiKV]
KV Request
gRPC Message Duration (reflects the time spent inside TiKV)
kv_get / kv_batch_get / coprocessor
KV Duration in TiDB ~= gRPC Message Duration + network RTT
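The approximation above can be rearranged to estimate the network share of a KV request. A minimal sketch, assuming both durations were read from Grafana for the same request type and time window; the function name is hypothetical.

```python
def estimate_network_rtt_ms(kv_duration_ms: float,
                            grpc_message_duration_ms: float) -> float:
    """Network RTT ~= KV Duration in TiDB - gRPC Message Duration in TiKV."""
    # Clamp at zero: the two panels are sampled independently, so the
    # difference can come out slightly negative.
    return max(kv_duration_ms - grpc_message_duration_ms, 0.0)


# Illustrative numbers: 12ms seen by TiDB, 9.5ms spent inside TiKV.
print(estimate_network_rtt_ms(12.0, 9.5))
```

A large gap points at the network between TiDB and TiKV; a small gap means most of the time is spent inside TiKV.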
Transaction
Prewrite & Commit
Dashboard (SQL Statements & Slow Query)
Resolve Lock
Dashboard (SQL Statements & Slow Query)
Lock Resolve OPS
Raft Store
Uses the Raft protocol to keep replicas consistent
Raft propose
Propose wait duration
Apply wait duration
Raft IO
Append log duration
Apply log duration
Commit log duration
Coprocessor
Coprocessor Execution Time
Dashboard (SQL Statements & Slow Query)
Request Duration
Coprocessor Wait Time
Dashboard (SQL Statements & Slow Query)
Wait Duration
Can be optimized with the coprocessor cache
RocksDB
Each TiKV instance has two RocksDB instances
Raft: stores Raft logs
KV: stores user data
Read
Get/Seek duration
Memtable hit
Block cache hit
SST read duration
Write
Write duration
Compaction
Compaction operations
Compaction duration