TiDB 3.0.12 + DM 1.0.4 hotfix: DM alerting problem

aiaix1211 · April 24, 2020, 02:56

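The TiDB monitoring server and the DM monitoring server are separate machines. For the TiDB cluster, WeChat or email alerts can be configured through Alertmanager.

DM is configured the same way but sends no alerts, even though alertmanager.log and prometheus.log both look normal.

[tidb@pd-tikv03 conf]$ cat prometheus.yml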

global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # How frequently to evaluate rules.
  # scrape_timeout is set to the global default (10s).
  external_labels:
    cluster: 'hcloud-test-cluster'
    monitor: "prometheus"

# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
  - 'dm_worker.rules.yml'

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - '10.200.25.83:9093'

scrape_configs:
  - job_name: "dm_worker"
    honor_labels: true # don't overwrite job & instance labels
    static_configs:
    - targets:
      - '10.200.25.83:8262'
      - '10.200.25.83:8263'
      - '10.200.25.82:8262'
      - '10.200.25.82:8263'
[tidb@pd-tikv03 log]$ tail -f alertmanager.log

level=info ts=2020-04-24T02:21:25.266047945Z caller=main.go:275 msg="Loading configuration file" file=conf/alertmanager.yml
level=info ts=2020-04-24T02:21:25.274735349Z caller=main.go:350 msg=Listening address=:9093
level=info ts=2020-04-24T02:36:25.266644124Z caller=nflog.go:293 component=nflog msg="Running maintenance"
level=info ts=2020-04-24T02:36:25.266762038Z caller=silence.go:269 component=silences msg="Running maintenance"
level=info ts=2020-04-24T02:36:25.268486797Z caller=silence.go:271 component=silences msg="Maintenance done" duration=1.834747ms size=0
level=info ts=2020-04-24T02:36:25.268797692Z caller=nflog.go:295 component=nflog msg="Maintenance done" duration=2.182355ms size=1756
level=info ts=2020-04-24T02:48:17.669480674Z caller=main.go:136 msg="Starting Alertmanager" version="(version=0.14.0, branch=HEAD, revision=30af4d051b37ce817ea7e35b56c57a0e2ec9dbb0)"
level=info ts=2020-04-24T02:48:17.669634111Z caller=main.go:137 build_context="(go=go1.9.2, user=root@37b6a49ebba9, date=20180213-08:16:42)"
level=info ts=2020-04-24T02:48:17.670981714Z caller=main.go:275 msg="Loading configuration file" file=conf/alertmanager.yml
level=info ts=2020-04-24T02:48:17.683878104Z caller=main.go:350 msg=Listening address=:9093
^C

[tidb@pd-tikv03 log]$ tail -f prometheus.log

level=error ts=2020-04-24T02:04:58.371952703Z caller=notifier.go:473 component=notifier alertmanager=http://10.200.25.83:9093/api/v1/alerts count=0 msg="Error sending alert" err="Post http://10.200.25.83:9093/api/v1/alerts: dial tcp 10.200.25.83:9093: connect: connection refused"
level=info ts=2020-04-24T02:48:11.444826815Z caller=main.go:220 msg="Starting Prometheus" version="(version=2.2.1, branch=HEAD, revision=bc6058c81272a8d938c05e75607371284236aadc)"
level=info ts=2020-04-24T02:48:11.445014157Z caller=main.go:221 build_context="(go=go1.10, user=root@149e5b3f0829, date=20180314-14:15:45)"
level=info ts=2020-04-24T02:48:11.44510546Z caller=main.go:222 host_details="(Linux 3.10.0-1062.el7.x86_64 #1 SMP Wed Aug 7 18:08:02 UTC 2019 x86_64 pd-tikv03 (none))"
level=info ts=2020-04-24T02:48:11.445140388Z caller=main.go:223 fd_limits="(soft=1000000, hard=1000000)"
level=info ts=2020-04-24T02:48:11.452491503Z caller=main.go:504 msg="Starting TSDB …"
level=info ts=2020-04-24T02:48:11.452681391Z caller=web.go:382 component=web msg="Start listening for connections" address=:9090
level=info ts=2020-04-24T02:48:11.923424153Z caller=main.go:514 msg="TSDB started"
level=info ts=2020-04-24T02:48:11.923565384Z caller=main.go:588 msg="Loading configuration file" filename=/data/dm-master/conf/prometheus.yml
level=info ts=2020-04-24T02:48:11.927782969Z caller=main.go:491 msg="Server is ready to receive web requests."

Test steps:
1. Stopping the task triggers no alert.
2. Stopping the DM cluster triggers no alert either.

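来了老弟 · April 24, 2020, 02:59

Hello,

Your question has been received and is being analyzed. Please wait.

来了老弟 · April 24, 2020, 03:52

Hello,

See whether the post below helps.
There are many similar posts on AskTUG, so you could try searching there first.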

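aiaix1211 · April 24, 2020, 05:30

Thanks for your support. Without changing any configuration, I again simulated a replication error by running a DDL that shrinks a column length, and received the following alert:

Alert instance: 10.200.25.83:8263
Message: dm worker paused exceed 1 min
Details: cluster: test-cluster, instance: 10.200.25.83:8263, task: test2, values: 3
Threshold:
Time: 2020-04-24 13:08:03

I have a few questions I would appreciate answers to:
1. dm_worker.rules.yml contains no alert rules for the case where dm-master or dm-worker stops. Where can alerting for that be configured?

2. For DM_task_state, the replication task state, what do the value 3 and the other values mean, and where is this documented?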

alert: DM_task_state
expr: dm_worker_task_state == 3
for: 1m
labels:
  env: test-cluster
  level: critical
  expr: dm_worker_task_state == 3

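3. I configured a task earlier that I no longer want and would like to delete, but I could not find a command for deleting a task.
In addition, a directory named dumped_data.<task name> still exists under the dm_worker directory. Can this directory be deleted manually, or is it removed automatically?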

来了老弟 · April 24, 2020, 05:49

Hello,
See whether the Grafana monitoring documentation helps:
https://pingcap.com/docs-cn/tidb-data-migration/stable/monitor-a-dm-cluster/

To get rid of a task, simply stop it: stop-task task-name.yaml
https://pingcap.com/docs-cn/tidb-data-migration/stable/manage-replication-tasks/

dumped_data.taskname holds the SQL files saved during the task's dump phase. If you no longer need them, you can delete them to free up space.

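aiaix1211 · April 24, 2020, 05:57

For DM_task_state, the replication task state, what do the value 3 and the other values mean, and where is this documented?

alert: DM_task_state
expr: dm_worker_task_state == 3
for: 1m
labels:
  env: test-cluster
  level: critical
  expr: dm_worker_task_state == 3

I still cannot find what the value 3 means. Also, the monitoring has no alert rule that fires when the dm-master or dm-worker process stops.

One more problem: when I manually run stop-task taskname, no alert is sent at all!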

来了老弟 · April 24, 2020, 06:19

We are analyzing this. Please wait.

luancheng-PingCAP · April 24, 2020, 11:58

Meaning of the task state values:

0 - invalidStage,

1 - New,

2 - Running,

3 - Paused,

4 - Stopped,

5 - Finished

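If a task is stopped deliberately, I think it is reasonable not to alert. An alert is needed when an error prevents the task from making progress.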
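
On the earlier question about alerting when a DM-worker process stops, one common Prometheus approach is to alert on the built-in `up` metric, which Prometheus sets to 0 for a target whenever a scrape fails. The following is only a minimal sketch, not a rule taken from the shipped dm_worker.rules.yml: the group name and alert name are hypothetical, and the `job="dm_worker"` selector and the `env` and `level` labels are copied from the configuration posted above. It would need to live in a file listed under `rule_files`.

```yaml
# Sketch only: group and alert names are hypothetical, not part of the shipped rules.
groups:
- name: dm_process_alive
  rules:
  - alert: DM_worker_process_down
    # `up` is generated by Prometheus itself: 1 if the last scrape of the
    # target succeeded, 0 if it failed (for example, the process is stopped).
    expr: up{job="dm_worker"} == 0
    for: 1m
    labels:
      env: test-cluster
      level: critical
    annotations:
      summary: 'DM-worker target {{ $labels.instance }} has been unreachable for more than 1 minute'
```

The posted prometheus.yml scrapes only the `dm_worker` job, so this rule covers DM-worker instances only. Covering dm-master the same way would presumably require adding a scrape job for it first, which is an assumption and not something shown in this thread.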

system · October 31, 2022, 19:06

This topic was automatically closed 1 minute after the last reply. New replies are no longer allowed.