[TiDB 4.0 PCTA Study Notes] Key Monitoring in Operations & the Life Cycle of SQL in TiDB and Its Key Monitoring Metrics @Class 3 + 高龙

Day 15

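Key monitoring in operations

System

CPU

CPU usage: a concern once it exceeds 80%

CPU load: should stay below the total number of cores

Memory

TiKV: memory usage should preferably not exceed 60%

TiDB nodes: keep 20% of memory free

Network

Network traffic should not saturate the NIC

IO

IO Util should not exceed 80%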
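
The system thresholds above can be turned into a simple check. The following Python sketch is only an illustration: the function name, the metric keys, and the sample readings are hypothetical, and only the threshold values come from these notes.

```python
def check_system(metrics, cores, role):
    """Return warnings for one host.

    metrics: dict with hypothetical keys cpu_usage, cpu_load, mem_used, io_util
             (cpu_usage, mem_used and io_util are fractions between 0 and 1).
    cores:   total number of CPU cores on the host.
    role:    "tikv" or "tidb".
    """
    warnings = []
    if metrics["cpu_usage"] > 0.80:
        warnings.append("CPU usage above 80%")
    if metrics["cpu_load"] >= cores:
        warnings.append("CPU load not below core count")
    if role == "tikv" and metrics["mem_used"] > 0.60:
        warnings.append("TiKV memory usage above 60%")
    if role == "tidb" and 1.0 - metrics["mem_used"] < 0.20:
        warnings.append("TiDB node has less than 20% free memory")
    if metrics["io_util"] > 0.80:
        warnings.append("IO Util above 80%")
    return warnings


# Made-up readings for a 16-core TiKV host.
sample = {"cpu_usage": 0.85, "cpu_load": 6.0, "mem_used": 0.65, "io_util": 0.40}
print(check_system(sample, cores=16, role="tikv"))
```

In practice these readings would come from the monitoring system rather than a hand-written dict.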

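TiDB

Query Summary

Duration: for OLTP workloads, the 99% latency should be below 100ms

Slow query: should not normally appear

Ideal CPS: helps determine whether the latency arises on the database side or on the client side

Server

Get token duration: better < 1ms; alternatively, check whether the configured token-limit is greater than the total number of connections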

Executor
Parse duration: better < 10ms
Compile duration: better < 30ms

KV Error
Lock Resolve OPS: better < 500 for expired and not expired; higher values point to too many conflicts. With many lock conflicts, pessimistic locking is recommended
KV Backoff OPS: better < 500 for txnLockFast and txnLock

PD Client
PD TSO .99 wait duration: better < 5ms

Cluster

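region: fewer than 50,000 per TiKV is recommended; beyond that, heartbeats between regions and the Raft state machine both become costly

gRPC

.99 gRPC message duration: preferably less than 100ms

Thread CPU

raft store cpu: preferably less than 75% * raftstore.store-pool-size

async apply cpu: preferably less than 75% * raftstore.apply-pool-size

scheduler worker cpu: preferably less than 80% * storage.scheduler-worker-pool-size

gRPC poll CPU: preferably less than 80% * server.grpc-concurrency

Unified read pool CPU: preferably less than 80% * readpool.unified.max-thread-count

Storage ReadPool CPU: preferably less than 80% * readpool.storage.normal-concurrency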
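
Each Thread CPU limit above is a ratio multiplied by the size of the matching thread pool, so the limit grows with the pool. A minimal Python sketch of that arithmetic follows; the pool sizes in the example are made-up values, not recommendations, and the helper name is hypothetical.

```python
# Ratio (in percent of one core) per thread pool, taken from the notes.
THREAD_CPU_RATIO_PERCENT = {
    "raftstore.store-pool-size": 75,
    "raftstore.apply-pool-size": 75,
    "storage.scheduler-worker-pool-size": 80,
    "server.grpc-concurrency": 80,
    "readpool.unified.max-thread-count": 80,
    "readpool.storage.normal-concurrency": 80,
}


def cpu_limits(pool_sizes):
    """Map each configured pool to its suggested CPU ceiling in percent."""
    return {
        name: THREAD_CPU_RATIO_PERCENT[name] * size
        for name, size in pool_sizes.items()
    }


# Example values only: 2 raftstore threads and 4 gRPC threads.
example = {"raftstore.store-pool-size": 2, "server.grpc-concurrency": 4}
print(cpu_limits(example))
```

With 2 raftstore threads the raft store CPU panel should preferably stay under 150%, and with 4 gRPC threads the gRPC poll CPU panel under 320%.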

Raft IO

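append log duration: .99 latency < 10ms

apply log duration: .99 latency < 30ms; the time to write the operations recorded in the Raft log into the database

commit log duration: .99 latency < 30ms; the span of a Raft replication from start to finish. High network latency may show up in this metric

Also check the .999 latency of the metrics above

Raft propose

Propose wait duration: .99 latency < 20ms; the time a request waits after arriving until the Raft module handles it. If this wait is very long, the Raft module is busy; check for slow disk flushes, CPU bottlenecks, and so on

Apply wait duration: .99 latency < 50ms; the time a committed log waits after being handed to the apply module until the apply actually starts

Also check the .999 latency of the metrics above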
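
The .99 and .999 figures are percentiles: the latency that 99% or 99.9% of requests stay at or below. The Python sketch below uses the nearest-rank method on raw samples purely as an illustration; it is not how the Grafana panels compute their values, and the sample data is made up.

```python
def percentile(samples, per_mille):
    """Nearest-rank percentile; per_mille=990 means .99, 999 means .999."""
    ordered = sorted(samples)
    rank = (per_mille * len(ordered) + 999) // 1000  # integer ceiling, 1-based
    return ordered[max(rank, 1) - 1]


# Made-up latency samples in milliseconds: mostly fast, with a small slow tail.
latencies_ms = [3] * 990 + [8] * 8 + [80] * 2
print(percentile(latencies_ms, 990), percentile(latencies_ms, 999))
```

In this made-up data the .99 latency looks healthy while the .999 latency exposes the slow tail, which is one reason to look at both.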

Errors

Server is busy: ideally this never appears; when it does, the monitoring panel states the reason

PD

etcd

99% wal fsync duration: preferably less than 5ms

Heartbeat

99% region heartbeat latency: preferably less than 5ms; a very long latency means PD is under heavy load

Dashboard

Login

http://pdaddr:pdport/dashboard

summary

QPS/Latency

Top sql

Latest Slow Query

Keyviz

Pay attention to hot spot issues

SQL statements

The life cycle of SQL in TiDB and key monitoring metrics

SQL flow in TiDB

[Figure: SQL processing flow in TiDB]

With Prepared Statement enabled, Parse and Preprocess are skipped
With Prepared Plan Cache enabled, Parse, Preprocess, Logic Optimizer, and Physical Optimizer are skipped
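
The two rules above can be written down as a small Python sketch. It is a simplification for illustration only: the stage list is limited to the stages named in these notes plus a final Execute step, and the function name is hypothetical.

```python
# Stages named in the notes, in order, followed by execution.
STAGES = ["Parse", "Preprocess", "Logic Optimizer", "Physical Optimizer", "Execute"]


def stages_to_run(prepared_statement=False, plan_cache=False):
    """Return the stages that still run for a statement."""
    skipped = set()
    if prepared_statement:
        skipped |= {"Parse", "Preprocess"}
    if plan_cache:
        skipped |= {"Parse", "Preprocess", "Logic Optimizer", "Physical Optimizer"}
    return [stage for stage in STAGES if stage not in skipped]


print(stages_to_run())
print(stages_to_run(prepared_statement=True))
print(stages_to_run(plan_cache=True))
```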

Get Token

Tokens are used to limit SQL concurrency

Configuration: token-limit

Grafana: Get Token Duration
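
Limiting concurrency with tokens behaves much like a counting semaphore: a session must hold a token while its statement runs, and the time spent waiting for one is what a get-token duration measures. The Python sketch below shows that idea only; the class and function names are hypothetical and this is not TiDB's implementation.

```python
import threading
import time


class TokenLimiter:
    """Hypothetical stand-in for a token-limit style concurrency gate."""

    def __init__(self, token_limit):
        self._tokens = threading.BoundedSemaphore(token_limit)

    def run(self, fn):
        start = time.monotonic()
        self._tokens.acquire()  # blocks while all tokens are in use
        wait_seconds = time.monotonic() - start  # analogous to get token duration
        try:
            return fn(), wait_seconds
        finally:
            self._tokens.release()


state = {"running": 0, "peak": 0}
state_lock = threading.Lock()
waits = []


def fake_sql():
    with state_lock:
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
    time.sleep(0.05)  # pretend to execute a statement
    with state_lock:
        state["running"] -= 1
    return "ok"


def session(limiter):
    _, wait_seconds = limiter.run(fake_sql)
    with state_lock:
        waits.append(wait_seconds)


limiter = TokenLimiter(token_limit=2)
threads = [threading.Thread(target=session, args=(limiter,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("peak concurrency:", state["peak"], "sessions:", len(waits))
```

With 4 sessions and 2 tokens, no more than 2 statements run at once; the sessions that arrive late spend their time waiting for a token, which is why a long get-token duration suggests the limit is too low for the number of connections.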

Get TSO

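Timestamps are fetched asynchronously from PD (one is needed both when a transaction starts and when it ends)

Dashboard (SQL Statements & Slow Query)

Grafana: PD TSO Wait Duration (reflects TiDB load)

Grafana: PD TSO RPC Duration (reflects the network between TiDB and PD, or PD load)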
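
Read together, the two TSO panels hint at where to look first. The Python sketch below encodes that reading under loud assumptions: the function name is hypothetical, the 5ms wait limit is borrowed from the PD TSO .99 wait duration guideline in these notes, and the RPC limit is an arbitrary placeholder.

```python
def diagnose_tso(wait_ms, rpc_ms, wait_limit_ms=5.0, rpc_limit_ms=5.0):
    """Suggest where to look based on the two TSO durations.

    rpc_limit_ms is an assumed placeholder, not a value from the notes.
    """
    if rpc_ms >= rpc_limit_ms:
        # A slow RPC points at the TiDB-PD network or at PD itself.
        return "check the network between TiDB and PD, or PD load"
    if wait_ms >= wait_limit_ms:
        # The RPC is fast but TiDB still waits, so TiDB is likely busy.
        return "check TiDB load"
    return "ok"


print(diagnose_tso(wait_ms=12.0, rpc_ms=1.0))
print(diagnose_tso(wait_ms=12.0, rpc_ms=9.0))
```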

Parse
Dashboard (SQL Statements & Slow Query)
Grafana: Parse Duration
Usually takes long for batch inserts

Compile

Preprocess (validator & type infer) + Optimize

Dashboard (SQL Statements & Slow Query)

Grafana: Compile Duration

Usually takes long for complex queries

Prepared Statements

Saves the cost of parsing and preprocessing

Grafana: Prepare Statement Count

Prepared Plan Cache

Saves the cost of optimization

Grafana: Plan Cache Hits

Execution phase

Execution Duration

Expensive Executors OPS

HashAgg / Sort / IndexLookUp…

Related system variables: tidb_{operator}_concurrency

KV Request

KV Request Duration 99

TiClient Region Error OPS

Lock Resolve OPS

DistSQL

Sends KV requests and receives KV results

DistSQL Duration

Requests are sent concurrently

Concurrency control: tidb_distsql_scan_concurrency
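
Sending requests concurrently under a concurrency cap can be pictured with a bounded worker pool. The Python sketch below is an analogy only: `scan_regions`, `fake_kv_request`, and the `concurrency` parameter are hypothetical names, with `concurrency` standing in for the role that tidb_distsql_scan_concurrency plays.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_regions(regions, send_request, concurrency):
    """Send one request per region with at most `concurrency` in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # map() returns results in the same order as the input regions.
        return list(pool.map(send_request, regions))


def fake_kv_request(region_id):
    # Stand-in for a real KV request; returns a label instead of rows.
    return f"rows from region {region_id}"


print(scan_regions([1, 2, 3, 4], fake_kv_request, concurrency=2))
```

A larger cap lets more requests run in parallel at the cost of more load on the storage side.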

Scan Keys

Dashboard (SQL Statements & Slow Query)

Grafana: Scan Keys

Coprocessor & Get & Batch Get

KV Request OPS

Transaction

KV Transaction Duration

Local Latch

Suited to high-conflict scenarios; disabled by default

Dashboard (SQL Statements & Slow Query)

Grafana: Local Latch Wait Time

Transaction Retry

Errors such as write conflicts can be retried

Dashboard (SQL Statements & Slow Query)

Grafana: Transaction Retry Num
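
Retrying on a write conflict amounts to re-running the transaction until it succeeds or a retry budget runs out. The Python sketch below illustrates the loop and the retry count it produces; `WriteConflict`, `run_with_retry`, and `retry_limit` are hypothetical names for illustration, not TiDB internals.

```python
class WriteConflict(Exception):
    """Hypothetical error type standing in for a retryable write conflict."""


def run_with_retry(txn, retry_limit):
    """Run txn, retrying on WriteConflict; return (result, number of retries)."""
    retries = 0
    while True:
        try:
            return txn(), retries
        except WriteConflict:
            if retries >= retry_limit:
                raise  # give up and surface the conflict to the caller
            retries += 1


attempts = {"count": 0}


def flaky_txn():
    # Conflicts on the first two attempts, then commits.
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise WriteConflict("key was modified by another transaction")
    return "committed"


result, retry_num = run_with_retry(flaky_txn, retry_limit=5)
print(result, retry_num)
```

Every retry repeats the whole transaction, so a rising retry count means repeated work and higher latency.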

SQL flow in TiKV

[Figure: SQL processing flow in TiKV]

KV Request

gRPC Message Duration (reflects the time spent inside TiKV)

kv_get / kv_batch_get / coprocessor

KV Duration in TiDB ~= gRPC Message Duration + network RTT
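
Rearranging the approximation above gives a rough estimate of the network round trip from two numbers read off the dashboards. A minimal Python sketch, with a hypothetical function name and made-up sample values:

```python
def estimated_network_rtt_ms(kv_duration_ms, grpc_message_duration_ms):
    """Estimate network RTT as KV duration minus gRPC message duration.

    The relation is only approximate, so negative differences are clamped to 0.
    """
    return max(kv_duration_ms - grpc_message_duration_ms, 0.0)


# Made-up readings: 12.0 ms seen by TiDB, 9.5 ms spent inside TiKV.
print(estimated_network_rtt_ms(12.0, 9.5))
```

If the estimate is large while the gRPC message duration is small, the time is going to the network rather than to TiKV.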

Transaction

Prewrite & Commit

Dashboard (SQL Statements & Slow Query)

Resolve Lock

Dashboard (SQL Statements & Slow Query)

Lock Resolve OPS

Raft Store

Uses the Raft protocol to keep replicas consistent

Raft propose

Propose wait duration

Apply wait duration

Raft IO

Append log duration

Apply log duration

Commit log duration

Coprocessor

Coprocessor Execution Time

Dashboard (SQL Statements & Slow Query)

Request Duration

Coprocessor Wait Time

Dashboard (SQL Statements & Slow Query)

Wait Duration

Can be optimized with the coprocessor cache

RocksDB

Each TiKV instance has two RocksDB instances

Raft: stores the Raft logs

KV: stores user data

Read

Get/Seek duration

Memtable hit

Block cache hit

SST read duration

Write

Write duration

Compaction

Compaction operations

Compaction duration