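Course title: 3.7.1 Key Monitoring Metrics in Operations
Duration: 40 min
Course takeaways: the performance-related monitoring metrics to watch during operations
Course content: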
-
System information
-
Grafana Overview
-
CPU usage
-
If > 80%, the CPU may become a system bottleneck
-
-
CPU load
- Should be lower than the number of cores; otherwise it may become a system bottleneck
-
Memory
- TiKV: usage < 60%
- TiDB: at least 20% free
-
Network
- Traffic should not exceed the NIC bandwidth
-
IO Util
- If > 80%, it may become a system bottleneck
-
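The system-level thresholds above can be sketched as a small checker. This is only an illustration: the `HostSample` fields and the `host_warnings` function are hypothetical names, the CPU usage figure is read as an upper bound of 80%, and "20% free" for TiDB is read as memory usage of at most 80%.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HostSample:
    """One reading per host; all field names are hypothetical."""
    cpu_usage_pct: float     # CPU usage, 0-100
    load_avg: float          # CPU load average
    cpu_cores: int           # number of cores on the host
    mem_usage_pct: float     # memory usage, 0-100
    net_bytes_per_s: float   # observed network throughput
    nic_bytes_per_s: float   # NIC bandwidth
    io_util_pct: float       # IO Util, 0-100


def host_warnings(sample: HostSample, role: str) -> List[str]:
    """Return the checks that fail; role is "tikv" or "tidb"."""
    warnings = []
    if sample.cpu_usage_pct > 80:
        warnings.append("cpu_usage")
    if sample.load_avg >= sample.cpu_cores:
        warnings.append("cpu_load")
    if role == "tikv":
        # TiKV: usage should stay below 60%.
        memory_bad = sample.mem_usage_pct >= 60
    else:
        # TiDB: 20% free, assumed here to mean usage of at most 80%.
        memory_bad = sample.mem_usage_pct > 80
    if memory_bad:
        warnings.append("memory")
    if sample.net_bytes_per_s > sample.nic_bytes_per_s:
        warnings.append("network")
    if sample.io_util_pct > 80:
        warnings.append("io_util")
    return warnings
```

The same sample can pass for one role and fail for another, because only the memory rule depends on the component running on the host.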
-
TiDB
-
Query
-
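Duration: 99th-percentile latency should be under 100 ms (OLTP)
-
Slow query: there should not be many slow queries
-
Ideal CPS (Command Per Second): a panel hidden by default; edit the Grafana dashboard to make it visible
- This metric shows whether latency originates on the TiDB side or the client side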
-
-
Server
-
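Get token duration: ideally under 1 ms; a high value suggests that token-limit (default 1000) is set inappropriately
- Make sure token-limit is greater than the number of connections
- Tokens are acquired in order to rate-limit connections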
-
-
Executor
- Parse duration: ideally < 10 ms
- Compile duration: ideally < 30 ms
-
PD Client
- PD TSO .99 Wait Duration: ideally < 5 ms
-
Errors
-
KV Errors
- Lock Resolve OPS: for expired and not-expired, ideally under 500; higher values indicate frequent conflicts
- KV Backoff OPS: for txnLockFast and txnLock, ideally under 500
-
-
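The TiDB limits listed above lend themselves to a table-driven check. The sketch below is a minimal illustration under stated assumptions: the dictionary keys are hypothetical labels (not Grafana panel or Prometheus metric names), and the readings are assumed to have been collected elsewhere.

```python
from typing import Dict, List

# Upper limits from the TiDB items above (milliseconds or operations per second).
# Keys are hypothetical labels chosen for this example.
TIDB_LIMITS: Dict[str, float] = {
    "query_duration_p99_ms": 100.0,  # OLTP workloads
    "get_token_duration_ms": 1.0,
    "parse_duration_ms": 10.0,
    "compile_duration_ms": 30.0,
    "pd_tso_wait_p99_ms": 5.0,
    "lock_resolve_ops": 500.0,
    "kv_backoff_ops": 500.0,
}


def over_limit(readings: Dict[str, float]) -> List[str]:
    """Return, in sorted order, the known labels whose reading reaches its limit."""
    return sorted(
        name
        for name, value in readings.items()
        if name in TIDB_LIMITS and value >= TIDB_LIMITS[name]
    )


def token_limit_ok(token_limit: int, connections: int) -> bool:
    """token-limit should be greater than the number of connections."""
    return token_limit > connections
```

If `get_token_duration_ms` appears in the result, `token_limit_ok` is a natural follow-up check before looking elsewhere.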
-
TiKV
-
Cluster
- Region: ideally < 50K; otherwise region merge and hibernate region may be needed
-
gRPC
- .99 gRPC message duration: the lower the better, ideally under 100 ms (except for complex coprocessor requests)
-
Thread CPU
- Raft store CPU: better < 75% * raftstore.store-pool-size
- Async apply CPU: better < 75% * raftstore.apply-pool-size
- Scheduler worker CPU: better < 80% * storage.scheduler-worker-pool-size
- gRPC poll CPU: better < 80% * server.grpc-concurrency
- Unified read pool CPU: better < 80% * readpool.unified.max-thread-count
- Storage ReadPool CPU: better < 80% * readpool.storage.normal-concurrency
-
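Each Thread CPU limit above is a per-thread percentage multiplied by the size of the corresponding thread pool, so the absolute limit depends on the TiKV configuration. A minimal sketch of that arithmetic follows; `THREAD_POOLS` and `cpu_limit_pct` are hypothetical names, and the pool sizes in the usage note are example values, not defaults.

```python
from typing import Dict, Tuple

# Panel name -> (per-thread limit in percent, TiKV config item that sizes the pool).
THREAD_POOLS: Dict[str, Tuple[int, str]] = {
    "Raft store CPU": (75, "raftstore.store-pool-size"),
    "Async apply CPU": (75, "raftstore.apply-pool-size"),
    "Scheduler worker CPU": (80, "storage.scheduler-worker-pool-size"),
    "gRPC poll CPU": (80, "server.grpc-concurrency"),
    "Unified read pool CPU": (80, "readpool.unified.max-thread-count"),
    "Storage ReadPool CPU": (80, "readpool.storage.normal-concurrency"),
}


def cpu_limit_pct(panel: str, config: Dict[str, int]) -> int:
    """Return the recommended upper bound for a panel, in percent of one core."""
    per_thread_pct, config_key = THREAD_POOLS[panel]
    return per_thread_pct * config[config_key]


def is_saturated(panel: str, observed_pct: float, config: Dict[str, int]) -> bool:
    """True when the observed thread CPU reaches the recommended bound."""
    return observed_pct >= cpu_limit_pct(panel, config)
```

For example, with `raftstore.store-pool-size` set to 2, the Raft store CPU panel should stay below 150%; a reading that keeps reaching the bound suggests the pool is a bottleneck and the matching config item is the one to review.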
-
PD
-
Dashboard
- http://PdAddr:PdPort/dashboard