DM status display is abnormal

To help us work efficiently, please provide the following information. A clearly described problem gets resolved faster:

【Symptoms】 Business and database symptoms, with command output
Running tiup dmctl --master-addr 11.260.17.166:266 query-status takes a long time,
and the output is also abnormal.

Starting component dmctl: /home/tidb/.tiup/components/dmctl/v2.0.6/dmctl/dmctl --master-addr 16.232.29.166:8266 query-status
{
    "result": true,
    "msg": "",
    "sources": [
        {
            "result": true,
            "msg": "",
            "sourceStatus": {
                "source": "allocation_sys_fee_s_20210702",
                "worker": "dm-1xxxx-10020",
                "result": null,
                "relayStatus": null
            },
            "subTaskStatus": [
                {
                    "name": "allocation-sys-s20210702",
                    "stage": "Running",
                    "unit": "Sync",
                    "result": null,
                    "unresolvedDDLLockID": "",
                    "sync": {
                        "totalEvents": "8562829",
                        "totalTps": "48",
                        "recentTps": "0",
                        "masterBinlog": "(mysql-bin.013872, 514791003)",
                        "masterBinlogGtid": "cabf8a02-9ce7-11eb-847e-00163e02c813:1-2714654319,cbe3f0d3-9ce7-11eb-b11e-00163e0235a5:1-55",
                        "syncerBinlog": "(mysql-bin.013872, 514517403)",
                        "syncerBinlogGtid": "",
                        "blockingDDLs": [
                        ],
                        "unresolvedGroups": [
                        ],
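
As a side note on reading the output above: the masterBinlog and syncerBinlog fields can be compared to see how far the syncer is behind the upstream. The following is only a minimal sketch, assuming both positions use the "(file, position)" string form shown above. The helper names parse_binlog_pos and binlog_lag_bytes are made up for illustration and are not part of dmctl.

```python
import re

# Matches the "(file, position)" form printed by query-status above.
BINLOG_POS = re.compile(r"\(\s*([^,\s]+)\s*,\s*(\d+)\s*\)")


def parse_binlog_pos(text):
    """Return (file_name, position) from a string like "(mysql-bin.013872, 514791003)"."""
    match = BINLOG_POS.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized binlog position: {text!r}")
    return match.group(1), int(match.group(2))


def binlog_lag_bytes(master_binlog, syncer_binlog):
    """Byte gap between master and syncer, or None if they are on different files."""
    master_file, master_pos = parse_binlog_pos(master_binlog)
    syncer_file, syncer_pos = parse_binlog_pos(syncer_binlog)
    if master_file != syncer_file:
        return None  # offsets in different binlog files are not directly comparable
    return master_pos - syncer_pos


# Values copied from the query-status output above.
lag = binlog_lag_bytes("(mysql-bin.013872, 514791003)", "(mysql-bin.013872, 514517403)")
print(lag)  # 273600
```

Here both positions are in mysql-bin.013872 and the gap is 273600 bytes, which suggests this particular subtask is replicating and only slightly behind.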

【Problem】 The problem currently encountered

【Business impact】

【TiDB version】
dm 2.0.3

【Attachments】

dm-master.log
[2021/09/25 12:03:14.983 +08:00] [INFO] [server.go:2190] [payload=] [request=QueryStatus]
[2021/09/25 12:03:14.983 +08:00] [INFO] [server.go:660] ["get sources"]
[2021/09/25 12:09:59.764 +08:00] [INFO] [server.go:2190] [payload=] [request=QueryStatus]
[2021/09/25 12:09:59.764 +08:00] [INFO] [server.go:660] ["get sources"]
[2021/09/25 12:11:05.255 +08:00] [INFO] [server.go:2190] [payload=] [request=QueryStatus]
[2021/09/25 12:11:05.255 +08:00] [INFO] [server.go:660] ["get sources"]
[2021/09/25 12:11:07.578 +08:00] [WARN] [util.go:121] ["failed to apply request"] [component="embed etcd"] [took=6.639µs] [request="header:<ID:11341888889580941467 > compaction:<revision:434 > "] [response=] [error="mvcc: required revision has been compacted"]
[2021/09/25 12:11:44.476 +08:00] [INFO] [server.go:2190] [payload=] [request=QueryStatus]
[2021/09/25 12:11:44.476 +08:00] [INFO] [server.go:660] ["get sources"]
[2021/09/25 12:14:39.413 +08:00] [INFO] [server.go:2190] [payload=] [request=QueryStatus]
[2021/09/25 12:14:39.413 +08:00] [INFO] [server.go:660] ["get sources"]
[2021/09/25 12:20:21.823 +08:00] [INFO] [server.go:2190] [payload="leader:true master:true names:\"master2\" "] [request=ListMember]
[2021/09/25 12:20:21.926 +08:00] [INFO] [server.go:2190] [payload="leader:true master:true names:\"master3\" "] [request=ListMember]
[2021/09/25 12:20:22.029 +08:00] [INFO] [server.go:2190] [payload="leader:true master:true names:\"master4\" "] [request=ListMember]
[2021/09/25 12:20:22.033 +08:00] [INFO] [server.go:2190] [payload="worker:true names:\"dm-10.240.14.169-10010\" "] [request=ListMember]
[2021/09/25 12:20:22.034 +08:00] [INFO] [server.go:2190] [payload="worker:true names:\"dm-10.240.14.169-10011\" "] [request=ListMember]
[2021/09/25 12:20:22.034 +08:00] [INFO] [server.go:2190] [payload="worker:true names:\"dm-10.240.14.169-10012\" "] [request=ListMember]
[2021/09/25 12:20:22.035 +08:00] [INFO] [server.go:2190] [payload="worker:true names:\"dm-10.240.14.169-10013\" "] [request=ListMember]
[2021/09/25 12:20:22.036 +08:00] [INFO] [server.go:2190] [payload="worker:true names:\"dm-10.240.14.169-10014\" "] [request=ListMember]
[2021/09/25 12:20:22.037 +08:00] [INFO] [server.go:2190] [payload="worker:true names:\"dm-10.240.14.169-10015\" "] [request=ListMember]
[2021/09/25 12:20:22.038 +08:00] [INFO] [server.go:2190] [payload="worker:true names:\"dm-10.240.14.169-10016\" "] [request=ListMember]
[2021/09/25 12:20:22.039 +08:00] [INFO] [server.go:2190] [payload="worker:true names:\"dm-10.240.14.169-10017\" "] [request=ListMember]
[2021/09/25 12:20:22.039 +08:00] [INFO] [server.go:2190] [payload="worker:true names:\"dm-10.240.14.169-10018\" "] [request=ListMember]
[2021/09/25 12:20:22.040 +08:00] [INFO] [server.go:2190] [payload="worker:true names:\"dm-10.240.14.169-10019\" "] [request=ListMember]
[2021/09/25 12:20:22.041 +08:00] [INFO] [server.go:2190] [payload="worker:true names:\"dm-10.240.14.169-10020\" "] [request=ListMember]
[2021/09/25 12:20:22.042 +08:00] [INFO] [server.go:2190] [payload="worker:true names:\"dm-10.240.14.169-10021\" "] [request=ListMember]
[2021/09/25 12:20:22.043 +08:00] [INFO] [server.go:2190] [payload="worker:true names:\"dm-10.240.14.169-10022\" "] [request=ListMember]
[2021/09/25 12:20:22.043 +08:00] [INFO] [server.go:2190] [payload="worker:true names:\"dm-10.240.14.170-10023\" "] [request=ListMember]
[2021/09/25 12:20:22.045 +08:00] [INFO] [server.go:2190] [payload="worker:true names:\"dm-10.240.14.170-10024\" "] [request=ListMember]
[2021/09/25 12:20:22.046 +08:00] [INFO] [server.go:2190] [payload="worker:true names:\"dm-10.240.14.170-10025\" "] [request=ListMember]
[2021/09/25 12:20:22.046 +08:00] [INFO] [server.go:2190] [payload="worker:true names:\"dm-10.240.14.170-10026\" "] [request=ListMember]
[2021/09/25 12:20:22.047 +08:00] [INFO] [server.go:2190] [payload="worker:true names:\"dm-10.240.14.170-10027\" "] [request=ListMember]
[2021/09/25 12:20:22.048 +08:00] [INFO] [server.go:2190] [payload="worker:true names:\"dm-10.240.14.170-10028\" "] [request=ListMember]
[2021/09/25 12:21:23.061 +08:00] [INFO] [server.go:2190] [payload=] [request=QueryStatus]
[2021/09/25 12:21:23.061 +08:00] [INFO] [server.go:660] ["get sources"]
[2021/09/25 12:25:23.578 +08:00] [INFO] [server.go:2190] [payload=] [request=QueryStatus]
[2021/09/25 12:25:23.578 +08:00] [INFO] [server.go:660] ["get sources"]


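If your question is about performance tuning or troubleshooting, please download the script and run it. Be sure to select all of the terminal output, then copy, paste, and upload it.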

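1. About the long execution time mentioned in question 1: we need one more piece of information. How many tasks do you currently have? (The query-status latency issue is already being optimized.)
2. What exactly do you mean by the display being abnormal?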

There are 12 tasks. The DM-related components prometheus, alertmanager, and grafana were all migrated, and the display has been abnormal ever since. The prometheus log shows:
level=error ts=2021-09-26T06:14:01.630186829Z caller=notifier.go:481 component=notifier alertmanager=http://11.158.17.63:9066/api/v1/alerts count=1 msg="Error sending alert" err="Post http://11.158.17.63:9066/api/v1/alerts: context deadline exceeded"
11.158.17.63:9066 is a node that has already been taken offline.
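
To see at a glance which alertmanager addresses prometheus is failing to reach, the notifier log lines can be tallied per target. This is a rough sketch that assumes the key=value log layout shown above. The function name failing_alertmanagers is invented for this example.

```python
import re
from collections import Counter

# The value of alertmanager= runs until the next whitespace in these log lines.
ALERTMANAGER_FIELD = re.compile(r"\balertmanager=(\S+)")


def failing_alertmanagers(log_lines):
    """Count "Error sending alert" lines per alertmanager URL."""
    counts = Counter()
    for line in log_lines:
        if "Error sending alert" not in line:
            continue
        match = ALERTMANAGER_FIELD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# One line copied from the prometheus log quoted above.
sample_lines = [
    'level=error ts=2021-09-26T06:14:01.630186829Z caller=notifier.go:481 '
    'component=notifier alertmanager=http://11.158.17.63:9066/api/v1/alerts count=1 '
    'msg="Error sending alert" err="Post http://11.158.17.63:9066/api/v1/alerts: '
    'context deadline exceeded"',
]
counts = failing_alertmanagers(sample_lines)
print(counts.most_common())
```

If every failing line points at the decommissioned address, the running prometheus process is still using the old target list, whatever the file on disk says.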

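1. So by abnormal you mean that the monitoring display has a problem, not that the tasks themselves are running abnormally, right?
2. If it is a monitoring problem, check whether the contents of the prometheus configuration file are correct, then check whether the data source configured in the grafana page is OK. For alertmanager, check its configuration file as well.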
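
One way to do the prometheus part of that check is to list every alertmanager target in the configuration file and compare it with the addresses known to be offline. The sketch below is a simplified line scanner, not a real YAML parser. It assumes the targets are written as host:port under a top-level alerting: block. The sample file contents, the address 10.0.0.1:9093, and the name alertmanager_targets are all made up for illustration.

```python
import re

HOST_PORT = re.compile(r"[A-Za-z0-9_.\-]+:\d+")


def alertmanager_targets(config_text):
    """Collect host:port strings found inside the top-level `alerting:` block."""
    targets = []
    in_alerting = False
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop comments and trailing spaces
        if not line.strip():
            continue
        if not line[0].isspace() and not line.startswith("-"):
            # A new top-level key either opens or closes the alerting block.
            in_alerting = line.startswith("alerting:")
            continue
        if in_alerting:
            targets.extend(HOST_PORT.findall(line))
    return targets


# Hypothetical prometheus.yml fragment; only 11.158.17.63:9066 comes from this thread.
sample_config = """
global:
  scrape_interval: 15s
rule_files:
- 'example.rules.yml'
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - '11.158.17.63:9066'
      - '10.0.0.1:9093'
scrape_configs:
- job_name: 'example_job'
  static_configs:
  - targets:
    - '10.0.0.2:8261'
"""

targets = alertmanager_targets(sample_config)
offline = {"11.158.17.63:9066"}
stale = [t for t in targets if t in offline]
print(targets, stale)
```

Scrape targets outside the alerting: block are ignored on purpose, since only the alertmanager list matters for the "Error sending alert" messages.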

Yes, the display has a problem.

1. alertmanager, prometheus, and grafana were scaled out and scaled in following the official documentation.
2. Running tiup dmctl --master-addr=ip:port query-status shows the following for every task:
"msg": "[code=38008:class=dm-master:scope=internal:level=high], Message: grpc request error, RawCause: rpc error: code = DeadlineExceeded desc = context deadline exceeded",

3. The prometheus log shows:
level=error ts=2021-09-26T06:36:01.63142327Z caller=notifier.go:481 component=notifier alertmanager=http://11.158.17.63:9066/api/v1/alerts count=1 msg="Error sending alert" err="Post http://11.158.17.63:9066/api/v1/alerts: context deadline exceeded"
level=error ts=2021-09-26T06:36:31.630211384Z caller=notifier.go:481 component=notifier alertmanager=http://11.158.17.63:9066/api/v1/alerts count=1 msg="Error sending alert" err="Post http://11.158.17.63:9066/api/v1/alerts: context deadline exceeded"

11.158.17.63:9066 is the old alertmanager address, which has already been taken offline. Why does it still show up?
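
For the query-status part (item 2 above), it can help to list exactly which sources and workers return the DeadlineExceeded message rather than reading the whole output by eye. The following is a sketch only, assuming the JSON layout shown in the first post (a top-level sources array whose entries carry result, msg, and sourceStatus). The second sample entry, with example_source and example_worker, is invented. The helper names are not part of dmctl.

```python
import json


def load_status(dmctl_output):
    """Parse query-status output, skipping anything printed before the first "{"."""
    return json.loads(dmctl_output[dmctl_output.index("{"):])


def failed_sources(status):
    """List (source, worker, msg) for every source entry that reports a problem."""
    failures = []
    for entry in status.get("sources", []):
        msg = entry.get("msg", "")
        if entry.get("result") and not msg:
            continue  # healthy entry: result is true and there is no message
        source_status = entry.get("sourceStatus") or {}
        failures.append((source_status.get("source"), source_status.get("worker"), msg))
    return failures


# Hypothetical output: the first entry mirrors the first post, the second is made up.
sample_output = """Starting component dmctl: dmctl --master-addr ip:port query-status
{
    "result": true,
    "msg": "",
    "sources": [
        {"result": true, "msg": "",
         "sourceStatus": {"source": "allocation_sys_fee_s_20210702", "worker": "dm-1xxxx-10020"}},
        {"result": false,
         "msg": "rpc error: code = DeadlineExceeded desc = context deadline exceeded",
         "sourceStatus": {"source": "example_source", "worker": "example_worker"}}
    ]
}"""

failures = failed_sources(load_status(sample_output))
for source, worker, msg in failures:
    print(source, worker, msg)
```

Knowing which workers time out makes it easier to tell a DM worker problem apart from the separate prometheus and alertmanager issue.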

For the alertmanager problem, I suggest looking at the prometheus configuration file.

Does this configuration need to be updated manually? Isn't it maintained automatically by scale-out and scale-in?

It is automatic, but since something is clearly abnormal here, it needs to be checked.

Everything in it is the latest configuration, and it looks normal.

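So you mean the alertmanager IP and port recorded in the prometheus configuration file are correct? That is strange. Please check the prometheus log and see whether the alertmanager address recorded there is correct. (I am worried that prometheus was never restarted and only the configuration file was changed.)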
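
Besides the log, a running prometheus can be asked directly which alertmanagers it has loaded, through its HTTP API endpoint /api/v1/alertmanagers. If the address returned there differs from the file on disk, the process has not picked up the new configuration. The snippet below is a minimal sketch. The function names, the prometheus address passed to fetch_alertmanagers, and the sample payload are assumptions for illustration, and the exact response may vary between prometheus versions.

```python
import json
from urllib.request import urlopen


def fetch_alertmanagers(prometheus_url, timeout=5):
    """GET /api/v1/alertmanagers from a running prometheus and return the parsed JSON."""
    url = f"{prometheus_url.rstrip('/')}/api/v1/alertmanagers"
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def active_alertmanager_urls(payload):
    """Extract the URLs of the alertmanagers that prometheus is currently sending to."""
    active = payload.get("data", {}).get("activeAlertmanagers", [])
    return [item["url"] for item in active]


def stale_alertmanagers(payload, offline_hosts):
    """Return active URLs that still point at a host known to be offline."""
    return [
        url for url in active_alertmanager_urls(payload)
        if any(host in url for host in offline_hosts)
    ]


# Hypothetical response body; 10.0.0.1:9093 is a made-up replacement address.
sample_payload = {
    "status": "success",
    "data": {
        "activeAlertmanagers": [
            {"url": "http://11.158.17.63:9066/api/v1/alerts"},
            {"url": "http://10.0.0.1:9093/api/v1/alerts"},
        ],
        "droppedAlertmanagers": [],
    },
}

stale = stale_alertmanagers(sample_payload, ["11.158.17.63:9066"])
print(stale)
```

Against a live instance the call would look like stale_alertmanagers(fetch_alertmanagers("http://ip:port"), ["11.158.17.63:9066"]), with ip:port replaced by the real prometheus address.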

The alertmanager.yml file has not been modified. That address is not in it. It only contains the defaults.

I mean the prometheus configuration file, not the alertmanager one.

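The prometheus configuration is all correct, and the prometheus component has been restarted on its own. I also redeployed a new prometheus instance.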


Why is this error still shown?

The configuration file has a problem.