Course name:
3.7.1 Metrics-that-DBAs-should-notice
3.7.2 The-lifecycle-of-a-SQL-and-relevant-metrics
Study time:
40min
Takeaways:
Understand the main monitoring metrics
Course content:
Before we begin
Goal: get to know the metrics a DBA should watch
Outline:
- system info
- TiDB
- TiKV
- PD
- dashboard
Part I: system info
- cpu
- network
- io
System Info
Overview page, System Info
- CPU usage
  - at 80% or above, the CPU may become the bottleneck of the whole system
- CPU load
  - should be less than the number of CPU vcores, or it may become the bottleneck
- Memory Available
  - TiKV nodes: memory usage < 60%
  - TiDB nodes: at least 20% free memory
- Network Traffic
  - should not exceed the bandwidth of the network card
- IO Util
  - at 80% or above, IO may become the bottleneck
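The System Info thresholds above can be written down as a small check. This is only a sketch: the `SystemReading` fields and the `system_warnings` helper are hypothetical names, and the readings are assumed to be collected elsewhere (for example, read off the Grafana panels by hand).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SystemReading:
    """One sample of the System Info panels (all field names are hypothetical)."""
    cpu_usage_pct: float       # CPU usage, 0-100
    cpu_load: float            # CPU load average
    vcores: int                # number of CPU vcores on the node
    mem_used_pct: float        # memory usage, 0-100
    net_traffic_mbps: float    # current network traffic
    nic_bandwidth_mbps: float  # bandwidth of the network card
    io_util_pct: float         # IO util, 0-100


def system_warnings(r: SystemReading, role: str) -> List[str]:
    """Return the names of the checks that fail; role is "tikv" or "tidb"."""
    failed = []
    if r.cpu_usage_pct >= 80:
        failed.append("cpu usage")
    if r.cpu_load >= r.vcores:
        failed.append("cpu load")
    # TiKV nodes: memory usage should stay below 60%.
    if role == "tikv" and r.mem_used_pct >= 60:
        failed.append("memory")
    # TiDB nodes: at least 20% of memory should stay free.
    if role == "tidb" and 100 - r.mem_used_pct < 20:
        failed.append("memory")
    if r.net_traffic_mbps >= r.nic_bandwidth_mbps:
        failed.append("network")
    if r.io_util_pct >= 80:
        failed.append("io util")
    return failed
```

For example, a TiKV node at 85% CPU usage with a load of 9 on 8 vcores fails both CPU checks, while a TiDB node at 70% memory usage still passes the memory check because 30% is free.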
Part II: TiDB
- latency
- errors
TiDB
Query Summary
- duration: .99 latency should be less than 100ms for OLTP workloads
- slow query: there should not be too many slow queries
- ideal cps: hidden metric; it can be made visible by editing the Grafana dashboard
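To make the ".99 latency" wording concrete, here is a minimal sketch that computes a nearest-rank percentile from raw query durations and compares it with the 100ms OLTP target. The helper names are hypothetical, and the nearest-rank method is an assumption chosen for illustration; the Grafana panels derive their percentiles from histogram data, so the numbers will not match exactly.

```python
import math
from typing import Sequence


def percentile(samples_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample that covers pct% of the data."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_oltp_target(samples_ms: Sequence[float], limit_ms: float = 100.0) -> bool:
    """True if the .99 latency is below the limit (100ms for OLTP by default)."""
    return percentile(samples_ms, 99) < limit_ms
```

With 100 samples, a single 500ms query does not move the .99 latency, but two of them do. This is also why it is worth checking the .999 latency as well.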
Server
- get token duration: better < 1ms; otherwise check that the 'token-limit' configuration is larger than the total number of connections
Executor
- parse duration: better < 10ms
- compile duration: better < 30ms
KV errors
- lock resolve OPS: better < 500 for both 'expired' and 'not_expired'
- kv backoff OPS: better < 500 for both 'txnlockfast' and 'txnlock'
PD client
- PD TSO .99 wait duration: better < 5ms
Part III: TiKV
- latency
- thread cpu
- errors
TiKV
Cluster
- region: better < 50K
gRPC
- .99 gRPC message duration: better < 100ms
Thread CPU:
- raft store CPU: better < 75%
- async apply CPU: better < 75%
- scheduler worker CPU: better < 80%
- gRPC poll CPU: better < 80%
- unified read pool CPU: better < 80%
- storage readpool CPU: better < 80%
Raft IO:
- append log duration: .99 latency better < 10ms
- apply log duration: .99 latency better < 30ms
- commit log duration: .99 latency better < 30ms
- also watch the .999 latency for the metrics above
Raft propose:
- propose wait duration: .99 latency better < 20ms
- apply wait duration: .99 latency better < 50ms
- also watch the .999 latency for the metrics above
Errors:
- Server is busy: ideally there are no busy errors at all
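The TiKV limits above lend themselves to a table-driven check. The sketch below copies the numbers from the list; the dictionary keys and the `over_limit` helper are hypothetical names, not Grafana or Prometheus identifiers, and the observed values are assumed to be gathered by some other means.

```python
from typing import Dict, List

# Limits copied from the TiKV list above: CPU in percent, durations in ms (.99).
TIKV_LIMITS: Dict[str, float] = {
    "raft store cpu": 75,
    "async apply cpu": 75,
    "scheduler worker cpu": 80,
    "grpc poll cpu": 80,
    "unified read pool cpu": 80,
    "storage readpool cpu": 80,
    "grpc message duration": 100,
    "append log duration": 10,
    "apply log duration": 30,
    "commit log duration": 30,
    "propose wait duration": 20,
    "apply wait duration": 50,
}


def over_limit(observed: Dict[str, float]) -> List[str]:
    """Return, sorted by name, the known metrics that reach or exceed their limit."""
    return sorted(
        name
        for name, value in observed.items()
        if name in TIKV_LIMITS and value >= TIKV_LIMITS[name]
    )
```

For instance, `over_limit({"raft store cpu": 90, "append log duration": 4})` reports only the raft store CPU. Metrics that are not in the table are ignored rather than flagged.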
Part IV: PD
- ETCD
- Heartbeat
ETCD
- 99% WAL fsync duration: better < 5ms
Heartbeat
- 99% region heartbeat latency: better < 5ms
Part V: Dashboard
- slow log
- sql statements
- key viz
Summary
- qps/latency
- top sql
- latest slow query
key viz
- pay attention to hotspot issues
sql statements
- execute count/avg latency
slow query
- execution plan
Before we begin
Outline:
- procedures in TiDB
- procedures in TiKV
- relevant metrics of each phase in dashboard and grafana
Part I: procedures in TiDB
- an overview of procedures in TiDB
- before execution
- parse & compile
- execution
- distsql
- transaction
Part II: procedures in TiKV
- an overview of procedures in TiKV
- kv request
- transaction
- raft store
- coprocessor
- rocksdb
Problems encountered or further thoughts during study:
- Question 1:
- Question 2:
- Further thought 1:
- Further thought 2: