TiDB 4.0 Course: 21-Day Custom Study Plan (3.1.1, 3.1.2)

hindleyzeng · December 24, 2020, 15:15

Course name:

3.1.1 TiDB-Cluster-Monitoring-Local
3.1.2 TiDB-Cluster-Operation-Local

Study time:

50 min

Key takeaways:

The TiDB monitoring system, and managing a cluster with TiUP

Course content:

Background:

The monitoring system is an important component of a TiDB cluster. It shows the real-time running status of the TiDB cluster and helps us troubleshoot cluster issues.

Goal:

Understand the architecture of the monitoring system, and be able to do a TiDB cluster health check.

Outline:

TiDB monitoring system
Grafana: a collection of monitoring graphs
TiDB alert system
Summary

Part I: TiDB Monitoring System

Prometheus and Grafana
Components
Architecture of the TiDB Monitoring System

Prometheus and Grafana

The TiDB monitoring framework adopts two open source projects: Prometheus and Grafana.

Prometheus: TiDB uses Prometheus to store the monitoring and performance metrics.

Grafana: TiDB uses Grafana to display the performance metrics. Grafana is an open source project for analyzing and visualizing metrics.

Components

Prometheus: open source monitoring system plus time series database
Grafana: visualization tool for monitoring data
Alert_Manager: alert component; sends alerts by email, Slack, or SMS
Pushgateway: receives pushed metrics data and holds it until Prometheus pulls it
node_exporter: collects hardware metrics and exposes them for Prometheus to scrape
blackbox_exporter: collects network metrics and exposes them for Prometheus to scrape

Architecture of the TiDB Monitoring System

Node_exporter, Blackbox_exporter, and Pushgateway collect metrics data
Prometheus pulls and stores the data
Alert_Manager sends alert messages
Grafana visualizes the metrics data
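
To make the "Prometheus stores, Grafana reads" step concrete, here is a minimal sketch of reading a metric back through the Prometheus HTTP API (instant query at /api/v1/query). The addresses and the sample response are hypothetical and hand-written so that the sketch runs without a live cluster; `up` is the built-in Prometheus metric that reports whether a scrape target is reachable.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build the URL for a Prometheus instant query (GET /api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(body: str) -> dict:
    """Map each series' 'instance' label to its value as a float."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    values = {}
    for series in payload["data"]["result"]:
        instance = series["metric"].get("instance", "<no instance>")
        _timestamp, value = series["value"]  # value arrives as a string
        values[instance] = float(value)
    return values


# Hypothetical response in the instant-vector shape (not captured from a real cluster).
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "instance": "10.0.1.1:10080"},
             "value": [1608794100, "1"]},
            {"metric": {"__name__": "up", "instance": "10.0.1.2:10080"},
             "value": [1608794100, "0"]},
        ],
    },
})

print(build_query_url("http://127.0.0.1:9090", "up"))
print(parse_vector(SAMPLE))
```

Against a real deployment, the URL would be fetched with any HTTP client and the response body passed to `parse_vector`; an instance reporting 0 is one that Prometheus could not scrape.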

Overview Dashboard – PD

This dashboard shows Region management information and PD request duration.

Check this dashboard to see the speed of PD requests and the Region health.

Important metrics:

Current storage size: the occupied storage capacity of the TiDB cluster, including the space occupied by TiKV replicas.
Number of Regions: the total number of Regions in the current cluster.
99% completed_cmds_duration_seconds: the 99th percentile duration to complete a pd-server request. It should be less than 5 ms.
Region health: the state of each Region.
Hot write/read Region's leader distribution: the total number of leaders that are write or read hotspots.
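
The 5 ms guideline for the 99th percentile can be turned into a simple health check. The sketch below is a simplification: it computes a nearest-rank percentile over raw latency samples, whereas Prometheus estimates quantiles from histogram buckets. The function names and the sample data are hypothetical.

```python
PD_P99_LIMIT_SECONDS = 0.005  # the 5 ms guideline from the notes above


def percentile(samples, pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def pd_p99_ok(samples) -> bool:
    """True when the 99th percentile request duration is under the limit."""
    return percentile(samples, 99) < PD_P99_LIMIT_SECONDS


# Hypothetical request durations in seconds.
healthy = [0.001] * 98 + [0.004, 0.020]
slow = [0.001] * 90 + [0.020] * 10

print(percentile(healthy, 99), pd_p99_ok(healthy))
print(percentile(slow, 99), pd_p99_ok(slow))
```

Note that a single slow outlier does not fail the check in the first data set, which is the reason for watching a high percentile rather than the maximum.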

Overview Dashboard – TiDB

Important metrics:

Statement OPS: the number of SQL statements of each type executed per second.
Duration: the time from when the client's network request reaches TiDB to when TiDB returns the result to the client after executing the request.
Connection count: the number of connections on each TiDB instance.
PD TSO wait duration: the time that TiDB waits for PD to return a TSO.
Lock resolve OPS: the number of TiDB operations that resolve locks.
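
A "per second" metric such as Statement OPS is derived from counters that only ever increase. As a rough sketch of the idea (not how Prometheus implements its rate calculation), the per-type rate can be computed from two counter readings taken a known interval apart. The statement types and counts below are hypothetical.

```python
def per_second_rates(earlier: dict, later: dict, interval_seconds: float) -> dict:
    """Per-type rate from two cumulative counter readings."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    rates = {}
    for stmt_type, count in later.items():
        delta = count - earlier.get(stmt_type, 0)
        # A negative delta would mean the counter was reset (for example, a
        # restart); this sketch simply reports 0 for that interval.
        rates[stmt_type] = max(delta, 0) / interval_seconds
    return rates


# Hypothetical readings taken 15 seconds apart.
first = {"Select": 1000, "Insert": 200}
second = {"Select": 1600, "Insert": 230, "Update": 15}

print(per_second_rates(first, second, 15))
```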

Overview Dashboard – TiKV

Important metrics:

Leader & Region: the number of leaders and Regions on each TiKV node. This is used to check whether the distribution is balanced.
CPU: the CPU usage ratio on each TiKV node. This is used to check whether CPU usage is balanced; unbalanced usage may indicate a hot spot.
Server report failures: the number of error messages reported by each TiKV instance.
Scheduler pending commands: the number of pending commands on each TiKV instance.
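
One way to put a number on "is the distribution balanced" is to compare the spread of leader counts with their mean. This is only an illustrative sketch: the store names, the counts, and the 0.2 threshold are hypothetical and are not values taken from the course or from PD's scheduler.

```python
def leader_imbalance(leaders_per_store: dict) -> float:
    """(max - min) / mean of the leader counts; 0.0 means perfectly even."""
    counts = list(leaders_per_store.values())
    if not counts:
        raise ValueError("no stores")
    mean = sum(counts) / len(counts)
    if mean == 0:
        return 0.0
    return (max(counts) - min(counts)) / mean


def is_balanced(leaders_per_store: dict, threshold: float = 0.2) -> bool:
    """Hypothetical rule of thumb: a spread within 20% of the mean is fine."""
    return leader_imbalance(leaders_per_store) <= threshold


even = {"tikv-1": 100, "tikv-2": 100, "tikv-3": 100}
skewed = {"tikv-1": 150, "tikv-2": 100, "tikv-3": 50}

print(leader_imbalance(even), is_balanced(even))
print(leader_imbalance(skewed), is_balanced(skewed))
```

The same function works for Region counts or CPU usage per node, since all three checks ask whether one node carries noticeably more than the others.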

Before we begin

Context: common operations in a bare metal environment

Goal: use TiUP to maintain a TiDB cluster

Outline:

Check the cluster status
Start/stop the cluster
Modify the configuration
Scale the TiDB cluster
Cluster controllers
Fix pack installation
Upgrade from TiDB 3.0
Others

What's TiUP

A new deployment and component management tool introduced with TiDB platform 4.0
Supports local deployment, cluster deployment, and component versioning and distribution
A single-binary command line tool
Think of TiUP as the "apt" of the TiDB ecosystem
The TiUP cluster component
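
As a quick reference for the operations in the outline, the sketch below assembles the corresponding `tiup cluster` command lines and prints them without running anything. The cluster name, topology file name, and node address are hypothetical placeholders; the subcommands shown (display, start, stop, edit-config, reload, scale-out, scale-in) are those of the TiUP cluster component.

```python
import shlex


def tiup_cluster(subcommand: str, cluster_name: str, *extra: str) -> list:
    """Build the argv for a `tiup cluster` subcommand (does not execute it)."""
    return ["tiup", "cluster", subcommand, cluster_name, *extra]


CLUSTER = "tidb-test"  # hypothetical cluster name

commands = {
    "check status": tiup_cluster("display", CLUSTER),
    "start": tiup_cluster("start", CLUSTER),
    "stop": tiup_cluster("stop", CLUSTER),
    "edit config": tiup_cluster("edit-config", CLUSTER),
    "apply config": tiup_cluster("reload", CLUSTER),
    # scale-out takes a topology file describing the new nodes (hypothetical name)
    "scale out": tiup_cluster("scale-out", CLUSTER, "scale-out.yaml"),
    # scale-in takes the address of the node to remove (hypothetical address)
    "scale in": tiup_cluster("scale-in", CLUSTER, "--node", "10.0.1.5:20160"),
}

for label, argv in commands.items():
    print(f"{label:>12}: {shlex.join(argv)}")
```

To actually run one of these, the argv list could be passed to `subprocess.run(argv, check=True)` on a machine where TiUP is installed.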

Questions encountered during study, or further thoughts:

Question 1:
Question 2:
Further thought 1:
Further thought 2: