TiDB 4.0 Course: 21-Day Customized Study Plan (3.7.1, 3.7.2)

Course name:

3.7.1 Metrics-that-DBAs-should-notice
3.7.2 The-lifecycle-of-a-SQL-and-relevant-metrics

Study time:

40 min

Course takeaways:

Understand the main monitoring metrics

Course content:

Before we begin

Goal: get to know the metrics that DBAs should notice

Outline:

  • system info
  • TiDB
  • TiKV
  • PD
  • dashboard

Part I: system info

  • CPU
  • network
  • IO

System Info

Overview page, System Info

  • CPU usage
  1. if it reaches 80%, the CPU may become the bottleneck of the whole system
  • CPU load
  1. should be less than the number of CPU vCores; otherwise it may become the bottleneck
  • Memory Available
  1. TiKV nodes: memory usage < 60%
  2. TiDB nodes: at least 20% free memory
  • Network Traffic
  1. should not exceed the bandwidth of the network card
  • IO Util
  1. if it reaches 80% or more, IO may become the bottleneck
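
The rules of thumb above can be collected into one small check. This is a minimal Python sketch, not part of the course material: the function name `check_system_info` and the keys of `sample` are hypothetical, and the numbers are the thresholds listed above.

```python
def check_system_info(sample, vcores, nic_bandwidth_mbps, role="tikv"):
    """Return warnings for one node, using the rules of thumb listed above.

    `sample` is a hypothetical dict of readings; its key names are
    assumptions of this sketch, not Grafana or Prometheus metric names.
    """
    warnings = []
    if sample["cpu_usage_pct"] >= 80:
        warnings.append("CPU usage >= 80%")
    if sample["cpu_load"] >= vcores:
        warnings.append("CPU load >= number of vCores")
    if role == "tikv" and sample["mem_used_pct"] >= 60:
        warnings.append("TiKV memory usage >= 60%")
    if role == "tidb" and 100 - sample["mem_used_pct"] < 20:
        warnings.append("TiDB free memory < 20%")
    if sample["net_traffic_mbps"] >= nic_bandwidth_mbps:
        warnings.append("network traffic has reached the NIC bandwidth")
    if sample["io_util_pct"] >= 80:
        warnings.append("IO util >= 80%")
    return warnings


# Made-up readings for a 16-vCore TiKV node with a 10 Gbit/s network card.
sample = {
    "cpu_usage_pct": 85,
    "cpu_load": 10,
    "mem_used_pct": 65,
    "net_traffic_mbps": 500,
    "io_util_pct": 40,
}
for warning in check_system_info(sample, vcores=16, nic_bandwidth_mbps=10_000):
    print(warning)
```

An empty list means none of the listed thresholds was crossed. It does not prove the node is healthy.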

Part II: TiDB

  • latency
  • errors

TiDB

Query Summary

  • duration: .99 latency should be less than 100 ms for OLTP workloads
  • slow query: there should not be too many slow queries
  • ideal cps: a hidden metric; it can be made visible by editing the Grafana dashboard
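
The ".99 latency" figures in these notes are quantiles, which Prometheus-backed Grafana panels typically estimate from histogram buckets with `histogram_quantile`. The sketch below shows the interpolation idea in simplified form. The bucket counts are made up, and the Python function is an illustration, not the code Prometheus runs.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) pairs
    sorted by upper bound, with the last upper bound equal to math.inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0] if i > 0 else math.nan
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up query-duration buckets (seconds): 1000 queries in total.
buckets = [(0.01, 900), (0.05, 980), (0.1, 995), (0.5, 1000), (math.inf, 1000)]
p99 = histogram_quantile(0.99, buckets)
print(f"p99 = {p99 * 1000:.1f} ms, within the 100 ms target: {p99 < 0.1}")
```

Because the estimate is interpolated inside a bucket, its precision depends on how fine the bucket boundaries are around the target value.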

Server

  • get token duration: better < 1 ms; otherwise check that the ‘token-limit’ configuration is larger than the total number of connections

Executor

  • parse duration: better < 10 ms
  • compile duration: better < 30 ms

KV errors

  • lock resolve OPS: better < 500 for both ‘expired’ and ‘not_expired’
  • KV backoff OPS: better < 500 for both ‘txnLockFast’ and ‘txnLock’

PD client

  • PD TSO .99 wait duration: better < 5 ms

Part III: TiKV

  • latency
  • thread cpu
  • errors

TiKV

Cluster

  • region: better < 50K

gRPC

  • .99 gRPC message duration: better < 100 ms

Thread CPU:

  • raft store CPU: better < 75%
  • async apply CPU: better < 75%
  • scheduler worker CPU: better < 80%
  • gRPC poll CPU: better < 80%
  • unified read pool CPU: better < 80%
  • storage readpool CPU: better < 80%
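
Thread CPU is usually shown as a sum over all threads of a pool, so a pool with several threads can legitimately read above 100%. The sketch below assumes the percentages above are per-thread limits and scales each one by the pool's thread count. That reading, the helper name `busy_pools`, and the sample pool sizes are all assumptions of this sketch, not statements from the course.

```python
# Per-thread ceilings from the list above, in percent of one core.
THRESHOLD_PCT = {
    "raft store": 75,
    "async apply": 75,
    "scheduler worker": 80,
    "gRPC poll": 80,
    "unified read pool": 80,
    "storage readpool": 80,
}


def busy_pools(cpu_pct_by_pool, threads_by_pool):
    """Return (pool, cpu_pct, limit_pct) for pools at or above their limit."""
    busy = []
    for pool, cpu_pct in cpu_pct_by_pool.items():
        # Scale the per-thread ceiling by the number of threads in the pool.
        limit_pct = THRESHOLD_PCT[pool] * threads_by_pool[pool]
        if cpu_pct >= limit_pct:
            busy.append((pool, cpu_pct, limit_pct))
    return busy


# Made-up readings: raft store has 2 threads, gRPC poll has 4.
readings = {"raft store": 170, "gRPC poll": 150}
pool_sizes = {"raft store": 2, "gRPC poll": 4}
for pool, cpu_pct, limit_pct in busy_pools(readings, pool_sizes):
    print(f"{pool}: {cpu_pct}% CPU, limit {limit_pct}%")
```

With these numbers only the raft store pool is flagged: 170% exceeds 2 × 75%, while 150% on four gRPC threads stays well under 4 × 80%.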

Raft IO:

  • append log duration: .99 latency better < 10 ms
  • apply log duration: .99 latency better < 30 ms
  • commit log duration: .99 latency better < 30 ms
  • also watch the .999 latency of the metrics above
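
The reason to watch .999 as well as .99 is that rare stalls can hide entirely below the .99 line. A small Python illustration follows, using made-up latency samples and a nearest-rank percentile. The helper `percentile` is written for this sketch and is not how Grafana computes its panels.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the value at rank ceil(q * n) in sorted order."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Made-up append log latencies in ms: 9,980 fast writes and 20 stalls (0.2%).
latencies_ms = [2] * 9980 + [200] * 20

p99 = percentile(latencies_ms, 0.99)
p999 = percentile(latencies_ms, 0.999)
print(f".99 latency = {p99} ms, .999 latency = {p999} ms")
```

Here the .99 latency looks healthy while the .999 latency exposes the stalls, which is the case the last bullet above warns about.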

Raft propose:

  • propose wait duration: .99 latency better < 20 ms
  • apply wait duration: .99 latency better < 50 ms
  • also watch the .999 latency of the metrics above

Errors:

  • server is busy: ideally there are no busy errors

Part IV: PD

  • ETCD
  • Heartbeat

ETCD

  • 99% WAL fsync duration: better < 5 ms

Heartbeat

  • 99% region heartbeat latency: better < 5 ms

Part V: Dashboard

  • slow log
  • SQL statements
  • Key Viz

Summary

  • QPS/latency
  • top SQL
  • latest slow query

Key Viz

  • pay attention to hotspot issues

SQL statements

  • execute count/avg latency
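
One way to read execute count and average latency together is to multiply them into a total time per statement, so that cheap but very frequent statements are not overlooked. The sketch below uses made-up statement data and hypothetical field names (`exec_count`, `avg_latency_ms`). It does not reflect the Dashboard's own data model.

```python
def rank_statements(statements, top_n=3):
    """Sort statements by total time = execute count * average latency."""
    return sorted(
        statements,
        key=lambda s: s["exec_count"] * s["avg_latency_ms"],
        reverse=True,
    )[:top_n]


# Made-up statement summaries.
statements = [
    {"sql": "SELECT ... WHERE id = ?", "exec_count": 1_000_000, "avg_latency_ms": 2},
    {"sql": "SELECT ... GROUP BY ...", "exec_count": 10, "avg_latency_ms": 5000},
    {"sql": "UPDATE ... WHERE id = ?", "exec_count": 50_000, "avg_latency_ms": 20},
]
for s in rank_statements(statements):
    total_s = s["exec_count"] * s["avg_latency_ms"] / 1000
    print(f"{total_s:>8.0f} s total  {s['sql']}")
```

In this example the 2 ms point query tops the list because it runs a million times, while the 5-second aggregation ranks last.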

Slow query

  • execution plan

Before we begin

Outline:

  • procedures in TiDB
  • procedures in TiKV
  • relevant metrics of each phase in dashboard and grafana

Part I: procedures in TiDB

  • an overview of procedures in TiDB
  • before execution
  • parse & compile
  • execution
  • distsql
  • transaction

Part II: procedures in TiKV

  • an overview of procedures in TiKV
  • kv request
  • transaction
  • raft store
  • coprocessor
  • rocksdb

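Questions encountered or further thoughts during study:

  • Question 1:
  • Question 2:
  • Further thought 1:
  • Further thought 2:

Other materials referenced during study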