Does DM's mydumper lack a retry policy?

[2020/07/23 14:31:37.949 +08:00] [INFO] [relay.go:429] ["ignore event by transformer"] [component="relay log"] [header="{\"Timestamp\":0,\"EventType\":27,\"ServerID\":2,\"EventSize\":39,\"LogPos\":154,\"Flags\":0}"] [reason="heartbeat event"]
[2020/07/23 14:31:45.038 +08:00] [ERROR] [mydumper.go:170] ["Could not read data from DB_NAME.TABLE_NAME: Lost connection to MySQL server during query"] [task=prod-s1] [unit=dump]
[2020/07/23 14:31:45.039 +08:00] [INFO] [mydumper.go:164] ["Thread 2 shutting down"] [task=prod-s1] [unit=dump]
[2020/07/23 14:32:07.949 +08:00] [INFO] [relay.go:429] ["ignore event by transformer"] [component="relay log"] [header="{\"Timestamp\":0,\"EventType\":27,\"ServerID\":2,\"EventSize\":39,\"LogPos\":154,\"Flags\":0}"] [reason="heartbeat event"]

After the disconnection is reported, query-status still shows a normal state, but the size of the dump directory stops growing.

  1. Please provide the full version information, the detailed operation steps, the complete DM logs, and the query-status output for analysis (a sketch of how these are typically collected follows below).
  2. Have you tried restarting dm-worker to see whether the Dump stage can then complete?
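A minimal sketch of how that information is typically collected, assuming a default DM 1.0 deployment; the master address and task name below are placeholders:

# print the dm-worker version information
./dm-worker -V

# open the interactive dmctl shell, then run: query-status prod-s1
./dmctl --master-addr 127.0.0.1:8261
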
Release Version: v1.0.6
Git Commit Hash: eaf2683c05ab44143bfb286bfbbc3ba157c555cc
Git Branch: release-1.0
UTC Build Time: 2020-06-17 10:22:01
Go Version: go version go1.13 linux/amd64

Running stop-task and then start-task restarts the dump from the beginning.

It is not convenient for us to upload the full log files; in the worker log, the only notable entry is this disconnection message.

Is there an automatic retry policy? If there is no such policy, we will start troubleshooting on the mydumper / upstream database side.

Currently Mydumper does not retry on the kind of error you mentioned.

Actually there is not much useful information in the worker log itself; it just records which tables are being dumped.

Before the task starts there is a series of preliminary output lines:

[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.host_summary"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.host_summary_by_file_io"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.host_summary_by_file_io_type"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.host_summary_by_stages"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.host_summary_by_statement_latency"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.host_summary_by_statement_type"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.innodb_buffer_stats_by_schema"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.innodb_buffer_stats_by_table"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.innodb_lock_waits"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.io_by_thread_by_latency"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.io_global_by_file_by_bytes"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.io_global_by_file_by_latency"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.io_global_by_wait_by_bytes"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.io_global_by_wait_by_latency"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.latest_file_io"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.memory_by_host_by_current_bytes"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.memory_by_thread_by_current_bytes"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.memory_by_user_by_current_bytes"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.memory_global_by_current_bytes"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.memory_global_total"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.processlist"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.schema_index_statistics"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.schema_table_lock_waits"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.schema_table_statistics"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.schema_table_statistics_with_buffer"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.schema_tables_with_full_table_scans"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.session"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.statement_analysis"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.statements_with_errors_or_warnings"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.statements_with_full_table_scans"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.statements_with_runtimes_in_95th_percentile"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.statements_with_sorting"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.statements_with_temp_tables"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.user_summary"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.user_summary_by_file_io"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.user_summary_by_file_io_type"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.user_summary_by_stages"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.737 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.user_summary_by_statement_latency"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.738 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.user_summary_by_statement_type"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.738 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.wait_classes_global_by_avg_latency"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.738 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.wait_classes_global_by_latency"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.738 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.waits_by_host_by_latency"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.738 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.waits_by_user_by_latency"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.738 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.waits_global_by_latency"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.738 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.x$ps_schema_table_statistics_io"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.738 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.x$schema_table_lock_waits"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.738 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.x$schema_table_statistics"] [task=prod-s1] [unit=dump]
[2020/07/23 15:46:22.738 +08:00] [WARN] [mydumper.go:167] ["Broken table detected, please review: sys.x$schema_table_statistics_with_buffer"] [task=prod-s1] [unit=dump]

It is not yet clear whether this is related.

Could it also be caused by the upstream being a replica? I have tried multiple times, and the interruption always occurs when the dump reaches roughly 6-10 GB.

  1. This should be unrelated. These warnings only mean that Mydumper found the state of these views abnormal. If convenient, please provide the output of: use sys; SHOW TABLE STATUS where name = 'host_summary';
  2. We suggest tuning Mydumper's concurrency and checking whether the network between the upstream and DM has problems (see the config sketch below).
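For reference, a hedged sketch of how dump concurrency is usually tuned in the DM 1.0 task file; the field names follow the task-config documentation, and the values are placeholders rather than recommendations:

mydumpers:
  global:
    threads: 4        # number of dump threads; lower this if the upstream or the network is unstable
    extra-args: ""    # additional mydumper command-line flags, if any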

use sys; SHOW TABLE STATUS where name = 'host_summary';

host_summary,,,,,,,,,,,,,,,,,VIEW

It looks like there is no content.

It should not be a concurrency issue; a few of the databases are over 10 GB, and the error usually occurs on those databases.

We suggest searching asktug for the error reported when the problem occurs, "Lost connection to MySQL server during query", and referring to the solutions there (a sketch of the MySQL-side settings commonly involved follows below).
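As a starting point, a hedged sketch of the upstream MySQL variables commonly implicated in "Lost connection to MySQL server during query" on long-running dump queries; the values shown are illustrative only:

-- check the current values on the upstream instance
SHOW GLOBAL VARIABLES WHERE Variable_name IN
    ('net_read_timeout', 'net_write_timeout', 'wait_timeout', 'max_allowed_packet');

-- raising the network write timeout is a common mitigation for long dump queries
SET GLOBAL net_write_timeout = 600;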

Hello, has the problem been resolved yet?

We still need you to share the task configuration file.
Also run dmesg | grep mydumper to check whether the process was killed by the OOM killer (see the sketch after this list).
If a single table is very large, you can first follow the best-answer solution in that thread and see whether it resolves your problem.
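A minimal shell sketch of the OOM check, broadened to also cover the dm-worker process (assuming the kernel log is accessible on the dm-worker host):

# look for OOM-killer activity involving the dump process or dm-worker
dmesg | grep -i -E 'mydumper|dm-worker|out of memory'
grep -i 'out of memory' /var/log/messages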

Running mydumper manually completes the dump successfully, but it does not succeed when run inside DM.

extra-args: "-r 50000 -s 500000 -x '^(abc-db1|abc_db2)' --no-locks"
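For readers following along, a reading of those flags based on the upstream mydumper 0.9.x help text (the mydumper fork bundled with DM may differ slightly):

-r, --rows            try to split tables into chunks of this many rows (turns off --chunk-filesize)
-s, --statement-size  attempted size of each INSERT statement, in bytes (default 1000000)
-x, --regex           regular expression for 'db.table' matching
    --no-locks        do not take the temporary shared read lock (risks an inconsistent dump)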

When syncing through DM and watching the dump directory, the output is no longer a database split into multiple SQL files as before, but a single huge SQL file, and the dump stage stays stuck.

This is mentioned in the description of the --rows parameter.

You can lower the -r setting further and test again. What exactly does the stuck dump state look like? Please provide the query-status output so we can take a look.

14G dumped_data.prod-s1
Total data size is 14 GB (about the same as when I run mydumper manually).

Currently query-status prod-s1 reports an error:

query-status prod-s1
{
    "result": true,
    "msg": "",
    "workers": [
        {
            "result": false,
            "worker": "127.0.0.1:8263",
            "msg": "[code=38008:class=dm-master:scope=internal:level=high] grpc request error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:8263: connect: connection refused\"\ngithub.com/pingcap/dm/pkg/terror.(*Error).Delegate\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/pkg/terror/terror.go:271\ngithub.com/pingcap/dm/dm/master/workerrpc.callRPC\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/master/workerrpc/rawgrpc.go:124\ngithub.com/pingcap/dm/dm/master/workerrpc.(*GRPCClient).SendRequest\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/master/workerrpc/rawgrpc.go:64\ngithub.com/pingcap/dm/dm/master.(*Server).getStatusFromWorkers.func2\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/master/server.go:1135\ngithub.com/pingcap/dm/dm/master.(*AgentPool).Emit\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/master/agent_pool.go:117\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357",
            "subTaskStatus": [
            ],
            "relayStatus": null,
            "sourceID": ""
        }
    ]
}

Presumably the exported table.sql was not split, the single table is large, and the process got killed:

[2921278.186787] Out of memory: Kill process 15552 (dm-worker2) score 437 or sacrifice child
[2921278.188057] Killed process 15552 (dm-worker2) total-vm:3773704kB, anon-rss:3676800kB, file-rss:0kB

Is there a way to keep -r configured and still have the table split into small files (i.e. have -r and -s take effect at the same time)?
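For context, a hedged note: in upstream mydumper 0.9.x, splitting output files by size is controlled by -F/--chunk-filesize (value in MB), and the help text for --rows says it turns off --chunk-filesize, so row-based and size-based splitting generally cannot be combined; -s only limits the size of individual INSERT statements, not of the output files. A sketch of a size-based alternative in the DM task file (placeholder value):

mydumpers:
  global:
    chunk-filesize: 64    # split output files at roughly this size, in MB, instead of by row count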

Changed the parameters to: extra-args: "-s 200000 -x '^(abc-db1|abc_db2)' --no-locks"

After about 11 GB had been dumped, watching the SQL files of the large table: the file count grew past 300, then the last file stopped growing; a while later the whole directory was wiped and the dump started over.

So it looks like there is a retry after all:

[2020/07/27 12:59:56.317 +08:00] [INFO] [relay.go:429] ["ignore event by transformer"] [component="relay log"] [header="{\"Timestamp\":0,\"EventType\":27,\"ServerID\":2,\"EventSize\":39,\"LogPos\":154,\"Flags\":0}"] [reason="heartbeat event"]
[2020/07/27 13:00:26.317 +08:00] [INFO] [relay.go:429] ["ignore event by transformer"] [component="relay log"] [header="{\"Timestamp\":0,\"EventType\":27,\"ServerID\":2,\"EventSize\":39,\"LogPos\":154,\"Flags\":0}"] [reason="heartbeat event"]
[2020/07/27 13:00:28.776 +08:00] [ERROR] [mydumper.go:170] ["Could not read data from abc-db1.big_table: Lost connection to MySQL server during query"] [task=prod-s1] [unit=dump]
[2020/07/27 13:00:28.776 +08:00] [INFO] [mydumper.go:164] ["Thread 3 shutting down"] [task=prod-s1] [unit=dump]
[2020/07/27 13:00:42.756 +08:00] [INFO] [server.go:252] [request=QueryStatus] [payload="name:\"prod-s1\" "]
[2020/07/27 13:00:56.317 +08:00] [INFO] [relay.go:429] ["ignore event by transformer"] [component="relay log"] [header="{\"Timestamp\":0,\"EventType\":27,\"ServerID\":2,\"EventSize\":39,\"LogPos\":154,\"Flags\":0}"] [reason="heartbeat event"]
[2020/07/27 13:01:15.338 +08:00] [INFO] [server.go:252] [request=QueryStatus] [payload="name:\"prod-s1\" "]
[2020/07/27 13:01:26.317 +08:00] [INFO] [relay.go:429] ["ignore event by transformer"] [component="relay log"] [header="{\"Timestamp\":0,\"EventType\":27,\"ServerID\":2,\"EventSize\":39,\"LogPos\":154,\"Flags\":0}"] [reason="heartbeat event"]
[2020/07/27 13:01:56.317 +08:00] [INFO] [relay.go:429] ["ignore event by transformer"] [component="relay log"] [header="{\"Timestamp\":0,\"EventType\":27,\"ServerID\":2,\"EventSize\":39,\"LogPos\":154,\"Flags\":0}"] [reason="heartbeat event"]
[2020/07/27 13:02:26.144 +08:00] [INFO] [server.go:252] [request=QueryStatus] [payload="name:\"prod-s1\" "]
[2020/07/27 13:02:26.317 +08:00] [INFO] [relay.go:429] ["ignore event by transformer"] [component="relay log"] [header="{\"Timestamp\":0,\"EventType\":27,\"ServerID\":2,\"EventSize\":39,\"LogPos\":154,\"Flags\":0}"] [reason="heartbeat event"]
[2020/07/27 13:02:37.817 +08:00] [INFO] [mydumper.go:164] ["Thread 4 shutting down"] [task=prod-s1] [unit=dump]
[2020/07/27 13:02:37.817 +08:00] [INFO] [mydumper.go:164] ["Finished dump at: 2020-07-27 13:02:37"] [task=prod-s1] [unit=dump]
[2020/07/27 13:02:37.852 +08:00] [ERROR] [mydumper.go:118] ["dump data exits with error"] [task=prod-s1] [unit=dump] ["cost time"=7m5.385729902s] [error="msg:\"[code=32001:class=dump-unit:scope=internal:level=high] mydumper runs with error: exit status 1. \\n\\n\" "]
[2020/07/27 13:02:37.852 +08:00] [INFO] [subtask.go:266] ["unit process returned"] [subtask=prod-s1] [unit=Dump] [stage=Paused] [status={}]
[2020/07/27 13:02:37.852 +08:00] [ERROR] [subtask.go:285] ["unit process error"] [subtask=prod-s1] [unit=Dump] ["error information"="{\"msg\":\"[code=32001:class=dump-unit:scope=internal:level=high] mydumper runs with error: exit status 1. \\n\\n\"}"]
[2020/07/27 13:02:41.305 +08:00] [INFO] [worker.go:320] ["operate subtask"] [component="worker controller"] [operation=AutoResume] [task=prod-s1]
[2020/07/27 13:02:41.305 +08:00] [INFO] [task_checker.go:369] ["dispatch auto resume task"] [component="task checker"] [task=prod-s1] [opLogID=2]
[2020/07/27 13:02:42.305 +08:00] [INFO] [worker.go:802] ["start to execute operation"] [component="worker controller"] [oplog="{\"id\":2,\"task\":{\"op\":6,\"name\":\"prod-s1\"},\"ts\":1595826161305526325}"]
[2020/07/27 13:02:42.305 +08:00] [INFO] [worker.go:889] ["auto_resume sub task"] [component="worker controller"] [task=prod-s1]
[2020/07/27 13:02:42.305 +08:00] [INFO] [subtask.go:467] ["resume with unit"] [subtask=prod-s1] [unit=Dump]
[2020/07/27 13:02:42.305 +08:00] [INFO] [worker.go:893] ["end to execute operation"] [component="worker controller"] ["oplog ID"=2] []
[2020/07/27 13:02:43.100 +08:00] [INFO] [mydumper.go:164] ["Server version reported as: 5.7.28-log"] [task=prod-s1] [unit=dump]
[2020/07/27 13:02:43.104 +08:00] [INFO] [mydumper.go:164] ["Connected to a MySQL server"] [task=prod-s1] [unit=dump]

cat /var/log/messages | grep "Out of memory"
There is no related output, so it does not look like a case of the client being killed.

Please check the disk usage on the dm-worker host.
Also share the result of dmesg | grep mydumper.

It is not caused by insufficient disk space on the worker.

dmesg | grep mydumper shows nothing abnormal.

From testing so far, when running mydumper manually, the dump from the primary succeeds most of the time, but the dump from the replica never succeeds.

Since the primary's binlog format is currently MIXED while the replica's is ROW, we need to find a window to change the primary's binlog format and then try DM again.

Lower the -r setting; you can leave out -s for now.

Among DM's prechecks, the requirements on the MySQL binlog configuration are (see the check sketch after this list):

  • Whether binlog is enabled (DM requires binlog to be enabled)
  • Whether binlog_format=ROW (DM only supports replicating ROW-format binlog)
  • Whether binlog_row_image=FULL (DM only supports binlog_row_image=FULL)
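A minimal check sketch to run on the upstream MySQL:

SHOW GLOBAL VARIABLES WHERE Variable_name IN ('log_bin', 'binlog_format', 'binlog_row_image');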

That does not seem feasible: using only -r still dumps a single file larger than 10 GB.

When DM imports the data into TiDB, does it automatically split it into small transactions committed in multiple batches?

The loader parallelizes by file, so with a single huge file the load stage would effectively be single-threaded.
But isn't the problem here that the dump itself is unstable? Let's first try reducing some of the parameters and see whether that helps.

If the current MySQL instance satisfies the constraints above (ROW format, FULL row image), please try the dump through DM again. Set -r to a small value first; you can gradually increase it later to see how high the upper limit is.

-r 5000 does succeed, but then dm-worker gets OOM-killed:

Jul 29 17:09:11 iZkdjwfb4sz2clZ kernel: [<ffffffff9d9baff6>] out_of_memory+0x4b6/0x4f0
Jul 29 17:09:11 iZkdjwfb4sz2clZ kernel: Out of memory: Kill process 3163 (dm-worker2) score 505 or sacrifice child
Jul 29 17:09:11 iZkdjwfb4sz2clZ kernel: [<ffffffff9d9baff6>] out_of_memory+0x4b6/0x4f0
Jul 29 17:09:11 iZkdjwfb4sz2clZ kernel: Out of memory: Kill process 3167 (dm-worker2) score 505 or sacrifice child

Are the standalone mydumper run and dm-worker on the same server?