TiDB 3.0.12 + DM 1.0.4 hotfix: DM alerting problem

The monitoring server for TiDB and the monitoring server for DM are separate. For the TiDB cluster, WeChat or email alerts can be configured through Alertmanager.

But DM, configured the same way, never sends alerts, even though alertmanager.log and prometheus.log both look normal.

[tidb@pd-tikv03 conf]$ cat prometheus.yml


    global:
      scrape_interval: 15s # By default, scrape targets every 15 seconds.
      evaluation_interval: 15s # How frequently to evaluate rules.
      # scrape_timeout is set to the global default (10s).
      external_labels:
        cluster: 'hcloud-test-cluster'
        monitor: "prometheus"

    # Load and evaluate rules in this file every 'evaluation_interval' seconds.
    rule_files:
      - 'dm_worker.rules.yml'

    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - '10.200.25.83:9093'

    scrape_configs:
      - job_name: "dm_worker"
        honor_labels: true # don't overwrite job & instance labels
        static_configs:
        - targets:
          - '10.200.25.83:8262'
          - '10.200.25.83:8263'
          - '10.200.25.82:8262'
          - '10.200.25.82:8263'

[tidb@pd-tikv03 log]$ tail -f alertmanager.log

    level=info ts=2020-04-24T02:21:25.266047945Z caller=main.go:275 msg="Loading configuration file" file=conf/alertmanager.yml
    level=info ts=2020-04-24T02:21:25.274735349Z caller=main.go:350 msg=Listening address=:9093
    level=info ts=2020-04-24T02:36:25.266644124Z caller=nflog.go:293 component=nflog msg="Running maintenance"
    level=info ts=2020-04-24T02:36:25.266762038Z caller=silence.go:269 component=silences msg="Running maintenance"
    level=info ts=2020-04-24T02:36:25.268486797Z caller=silence.go:271 component=silences msg="Maintenance done" duration=1.834747ms size=0
    level=info ts=2020-04-24T02:36:25.268797692Z caller=nflog.go:295 component=nflog msg="Maintenance done" duration=2.182355ms size=1756
    level=info ts=2020-04-24T02:48:17.669480674Z caller=main.go:136 msg="Starting Alertmanager" version="(version=0.14.0, branch=HEAD, revision=30af4d051b37ce817ea7e35b56c57a0e2ec9dbb0)"
    level=info ts=2020-04-24T02:48:17.669634111Z caller=main.go:137 build_context="(go=go1.9.2, user=root@37b6a49ebba9, date=20180213-08:16:42)"
    level=info ts=2020-04-24T02:48:17.670981714Z caller=main.go:275 msg="Loading configuration file" file=conf/alertmanager.yml
    level=info ts=2020-04-24T02:48:17.683878104Z caller=main.go:350 msg=Listening address=:9093
    ^C

[tidb@pd-tikv03 log]$ tail -f prometheus.log

    level=error ts=2020-04-24T02:04:58.371952703Z caller=notifier.go:473 component=notifier alertmanager=http://10.200.25.83:9093/api/v1/alerts count=0 msg="Error sending alert" err="Post http://10.200.25.83:9093/api/v1/alerts: dial tcp 10.200.25.83:9093: connect: connection refused"
    level=info ts=2020-04-24T02:48:11.444826815Z caller=main.go:220 msg="Starting Prometheus" version="(version=2.2.1, branch=HEAD, revision=bc6058c81272a8d938c05e75607371284236aadc)"
    level=info ts=2020-04-24T02:48:11.445014157Z caller=main.go:221 build_context="(go=go1.10, user=root@149e5b3f0829, date=20180314-14:15:45)"
    level=info ts=2020-04-24T02:48:11.44510546Z caller=main.go:222 host_details="(Linux 3.10.0-1062.el7.x86_64 #1 SMP Wed Aug 7 18:08:02 UTC 2019 x86_64 pd-tikv03 (none))"
    level=info ts=2020-04-24T02:48:11.445140388Z caller=main.go:223 fd_limits="(soft=1000000, hard=1000000)"
    level=info ts=2020-04-24T02:48:11.452491503Z caller=main.go:504 msg="Starting TSDB ..."
    level=info ts=2020-04-24T02:48:11.452681391Z caller=web.go:382 component=web msg="Start listening for connections" address=:9090
    level=info ts=2020-04-24T02:48:11.923424153Z caller=main.go:514 msg="TSDB started"
    level=info ts=2020-04-24T02:48:11.923565384Z caller=main.go:588 msg="Loading configuration file" filename=/data/dm-master/conf/prometheus.yml
    level=info ts=2020-04-24T02:48:11.927782969Z caller=main.go:491 msg="Server is ready to receive web requests."

Test steps: 1. Stopping the task triggers no alert. 2. Stopping the DM cluster triggers no alert either.

Hello,

Your question has been received and is being analyzed; please wait.

Hello,

See whether the post below helps. AskTUG has many similar posts, so try searching there first.

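Thanks for your support. I did not change any configuration. I once again simulated an error by replicating a DDL that shrinks a column's length, and got the following alert:

    Alert instance: 10.200.25.83:8263
    Message: dm worker paused exceed 1 min
    Details: cluster: test-cluster, instance: 10.200.25.83:8263, task: test2, values: 3
    Threshold:
    Time: 2020-04-24 13:08:03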

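I have a few questions I'd appreciate answers to:

1. The alert rules in dm_worker.rules.yml contain nothing that alerts when dm-master or dm_worker is stopped. Where should alerting for that be configured?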

2. For DM_task_state, the replication task state: what do 3 and the other values mean, and where is this documented?

    alert: DM_task_state
    expr: dm_worker_task_state == 3
    for: 1m
    labels:
      env: test-cluster
      level: critical
      expr: dm_worker_task_state == 3

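3. If I configured a task earlier and no longer want it, I would like to delete it, but I see no command for deleting a task. Also, a directory named dumped_data.<task name> is still left under the dm_worker directory. Can it be deleted manually, or is it removed automatically?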

Hello, check whether the Grafana monitoring can help you.

Stopping the task is enough: stop-task task-name.yaml

dumped_data.taskname holds the SQL files that the task saved during the dump phase. If you no longer need them, you can delete them to save space.

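For DM_task_state, the replication task state: what do 3 and the other values mean, and where is this documented?

    alert: DM_task_state
    expr: dm_worker_task_state == 3
    for: 1m
    labels:
      env: test-cluster
      level: critical
      expr: dm_worker_task_state == 3

I still don't see what the 3 means. Also, the monitoring has no alert rule that fires when the dm-master or dm_worker process stops.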
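On the missing process-down alert: one common Prometheus approach is a rule on the built-in `up` metric, which Prometheus sets to 0 for any scrape target it cannot reach. The following is only a sketch, not a rule shipped with DM. The group name, alert name, and annotation text are made up for illustration, and the rule would need to be placed in a file listed under `rule_files`. It covers only the `dm_worker` job, because that is the only job in the prometheus.yml shown above; alerting on dm-master the same way would first require a scrape job for it.

```yaml
groups:
- name: dm_process_alert            # hypothetical group name
  rules:
  - alert: DM_worker_down           # hypothetical alert name
    # `up` is generated by Prometheus itself: 1 if the last scrape of the
    # target succeeded, 0 if it failed (for example, the process is stopped).
    expr: up{job="dm_worker"} == 0
    for: 1m
    labels:
      env: test-cluster
      level: critical
    annotations:
      summary: 'dm_worker {{ $labels.instance }} has been unreachable for 1 min'
```

Because the rule depends only on scrape success, it should fire whether the process was killed, crashed, or the host became unreachable.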

One more question: when I manually run stop-task taskname, no alert fires at all!!!!!! :hot_face:

Analyzing; please wait.

Meaning of the task state values:

0 - invalidStage,

1 - New,

2 - Running,

3 - Paused,

4 - Stopped,

5 - Finished
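Read against this list, the DM_task_state rule quoted earlier matches only the Paused state. The annotated copy below is a sketch for illustration: the group name and the comments are added here, and the rest repeats the rule as quoted in the thread.

```yaml
groups:
- name: dm_task_state_example       # hypothetical group name
  rules:
  - alert: DM_task_state
    # State values, per the list above:
    # 0 invalidStage, 1 New, 2 Running, 3 Paused, 4 Stopped, 5 Finished
    expr: dm_worker_task_state == 3 # matches Paused only
    for: 1m                         # must stay Paused for 1 min before firing
    labels:
      env: test-cluster
      level: critical
```

A task that reports any value other than 3 does not match this expression, which is consistent with the alert above showing "values: 3" after the DDL error paused the task.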

If a task is stopped deliberately, I think it is reasonable not to alert. If an error prevents the task from proceeding, an alert is needed.
