Course name:
3.1.1 TiDB-Cluster-Monitoring-Local
3.1.2 TiDB-Cluster-Operation-Local
Study time:
50 min
Key takeaways:
The TiDB monitoring system and managing a cluster with TiUP
Course content:
Background:
The monitoring system is an important component of a TiDB cluster. It shows the real-time running status of the TiDB cluster and helps us troubleshoot cluster issues.
Goal:
Understand the architecture of the monitoring system and be able to do a TiDB cluster health check.
Outline:
- TiDB monitoring system
- Grafana: a collection of monitoring graphs
- TiDB alert system
- Summary
Part I: TiDB monitoring system
- Prometheus and Grafana
- Components
- Architecture of the TiDB monitoring system
Prometheus and Grafana
The TiDB monitoring framework adopts two open source projects: Prometheus and Grafana.
Prometheus: TiDB uses Prometheus to collect and store the monitoring and performance metrics.
Grafana: TiDB uses Grafana to display these metrics. Grafana is an open source project for analyzing and visualizing metrics.
Components
- Prometheus: open source monitoring system and time series database
- Grafana: visualization tool for monitoring data
- Alert_Manager: alerting component; sends alerts by email, Slack, or SMS
- Pushgateway: collects metrics data and holds it until Prometheus pulls it
- node_exporter: collects hardware metrics and exposes them for Prometheus to scrape
- blackbox_exporter: collects network metrics and exposes them for Prometheus to scrape
Architecture of the TiDB monitoring system
- node_exporter, blackbox_exporter, and Pushgateway collect metrics data
- Prometheus receives and saves the data
- Alert_Manager sends alert messages
- Grafana visualizes the metrics data
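To make the data flow above more concrete, here is a minimal sketch of how a script could ask Prometheus for stored metrics through its HTTP query endpoint (`/api/v1/query`) and read the result. The address `http://127.0.0.1:9090`, the instance addresses, and the job name in the sample response are assumptions made up for illustration; the sketch parses a canned response instead of contacting a real server.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build the URL for a Prometheus instant query (GET /api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body: str) -> dict:
    """Map each series' 'instance' label to its value as a float."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    values = {}
    for series in payload["data"]["result"]:
        instance = series["metric"].get("instance", "<no instance>")
        _timestamp, value = series["value"]  # value arrives as a string
        values[instance] = float(value)
    return values


# Hypothetical response body for the query `up` (1 = target reachable, 0 = down).
SAMPLE_RESPONSE = """
{"status": "success",
 "data": {"resultType": "vector",
          "result": [
            {"metric": {"__name__": "up", "instance": "10.0.1.1:9100", "job": "node_exporter"},
             "value": [1700000000, "1"]},
            {"metric": {"__name__": "up", "instance": "10.0.1.2:9100", "job": "node_exporter"},
             "value": [1700000000, "0"]}
          ]}}
"""

if __name__ == "__main__":
    print(build_query_url("http://127.0.0.1:9090", "up"))
    for instance, value in parse_instant_vector(SAMPLE_RESPONSE).items():
        print(instance, "up" if value == 1.0 else "DOWN")
```

Grafana dashboards issue the same kind of query against Prometheus; the script only shows the shape of the request and response.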
Overview Dashboard – PD
This dashboard shows the Region management information and the PD request duration.
Check this dashboard to see the speed of PD requests and the Region health.
Important metrics:
- Current storage size: the occupied storage capacity of the TiDB cluster, including the space occupied by TiKV replicas.
- Number of Regions: the total number of Regions in the current cluster.
- 99% completed_cmds_duration_seconds: the 99th percentile duration to complete a pd-server request. It should be less than 5 ms.
- Region health: the state of each Region.
- Hot write/read Region’s leader distribution: the total number of leaders that are hotspots.
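As a rough illustration of the 5 ms rule for the 99th percentile PD request duration, the sketch below computes a nearest-rank 99th percentile over a list of request durations and compares it with the threshold. This is a simplification: Prometheus estimates percentiles from histogram buckets rather than raw samples, and the sample durations here are invented.

```python
import math

PD_P99_LIMIT_SECONDS = 0.005  # the 5 ms threshold from the notes


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def pd_latency_ok(samples):
    """True if the 99th percentile duration is below the 5 ms threshold."""
    return percentile_nearest_rank(samples, 99) < PD_P99_LIMIT_SECONDS


if __name__ == "__main__":
    # Hypothetical durations in seconds: mostly 1 ms, with a few slow requests.
    healthy = [0.001] * 99 + [0.020]
    degraded = [0.001] * 98 + [0.020] * 2
    print("healthy:", pd_latency_ok(healthy))
    print("degraded:", pd_latency_ok(degraded))
```

A single slow request out of 100 does not move the 99th percentile, while two of them do, which is why the dashboard tracks the percentile rather than the maximum.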
Overview Dashboard – TiDB
Important metrics:
- Statement OPS: the number of SQL statements of each type executed per second.
- Duration: the time from when the client’s network request is sent to TiDB until the response is returned to the client after TiDB has executed the request.
- Connection count: the number of connections on each TiDB instance.
- PD TSO wait duration: the time TiDB waits for PD to return a TSO.
- Lock resolve OPS: the number of TiDB operations that resolve locks.
Overview Dashboard – TiKV
Important metrics:
- Leader & Region: the number of leaders/Regions on each TiKV node. This is used to check whether the distribution is balanced.
- CPU: the CPU usage ratio on each TiKV node. This is used to check whether CPU usage is balanced; unbalanced usage suggests a hotspot.
- Server report failures: the number of error messages reported by each TiKV instance.
- Scheduler pending commands: the number of pending commands on each TiKV instance.
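The balance checks on leader/Region counts and CPU usage can be sketched as a simple spread calculation across TiKV nodes. The node names, the counts, and the 20% tolerance below are assumptions chosen for illustration, not values from the course.

```python
def imbalance_ratio(per_node: dict) -> float:
    """Return (max - min) / mean of the per-node values; 0.0 means perfectly even."""
    values = list(per_node.values())
    if not values:
        raise ValueError("per_node must not be empty")
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    return (max(values) - min(values)) / mean


def is_balanced(per_node: dict, tolerance: float = 0.2) -> bool:
    """True if the spread between nodes stays within the (hypothetical) tolerance."""
    return imbalance_ratio(per_node) <= tolerance


if __name__ == "__main__":
    # Hypothetical leader counts per TiKV node.
    even = {"tikv-1": 100, "tikv-2": 100, "tikv-3": 100}
    skewed = {"tikv-1": 150, "tikv-2": 100, "tikv-3": 50}
    print("even:", imbalance_ratio(even), is_balanced(even))
    print("skewed:", imbalance_ratio(skewed), is_balanced(skewed))
```

The same function works for CPU usage ratios; a node far above the others is a candidate hotspot worth checking on the dashboard.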
Before we begin
Context: common operations in a bare metal environment
Goal: use TiUP to maintain a TiDB cluster
Outline:
- Check the cluster status
- Start/stop the cluster
- Modify the configuration
- Scale the TiDB cluster
- Cluster controllers
- Fix pack installation
- Upgrade from TiDB 3.0
- Others
What’s TiUP
- A new deployment and component management tool introduced with TiDB platform 4.0
- Supports local deployment, cluster deployment, component versioning, and distribution
- A single-binary command line tool
- Think of TiUP as the ‘apt’ of the TiDB ecosystem
- The tiup cluster component
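For reference, the first few operations in the outline map onto `tiup cluster` subcommands roughly as follows. The sketch only builds and prints the command lines; it does not run them. The cluster name `tidb-test`, the topology file `scale-out.yaml`, and the node address `10.0.1.5:20160` are hypothetical placeholders.

```python
import shlex


def tiup_cluster_cmd(action: str, cluster: str, *extra: str) -> list:
    """Build the argv list for `tiup cluster <action> <cluster> [extra...]`."""
    return ["tiup", "cluster", action, cluster, *extra]


CLUSTER = "tidb-test"  # hypothetical cluster name

commands = {
    "check status": tiup_cluster_cmd("display", CLUSTER),
    "start": tiup_cluster_cmd("start", CLUSTER),
    "stop": tiup_cluster_cmd("stop", CLUSTER),
    "edit config": tiup_cluster_cmd("edit-config", CLUSTER),
    "apply config": tiup_cluster_cmd("reload", CLUSTER),
    # scale-out reads the new nodes from a topology file (hypothetical file name)
    "scale out": tiup_cluster_cmd("scale-out", CLUSTER, "scale-out.yaml"),
    # scale-in removes the node given by --node (hypothetical address)
    "scale in": tiup_cluster_cmd("scale-in", CLUSTER, "--node", "10.0.1.5:20160"),
}

if __name__ == "__main__":
    for label, argv in commands.items():
        print(f"{label:>12}: {shlex.join(argv)}")
```

To actually execute one of these, the argv list could be passed to `subprocess.run(argv, check=True)` on a machine where TiUP is installed.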
Questions encountered or further thoughts during study:
- Question 1:
- Question 2:
- Further thought 1:
- Further thought 2: