DM cluster fails to dump a table (Lost connection to MySQL server during query)

DM version: V1.0.6
Records of table oms.so_attachment:
(screenshot)

Table schema:

After the task starts, dumping this table fails with the error below; DM then clears the dump directory and restarts the export from scratch.
[2020/07/07 09:53:40.077 +08:00] [ERROR] [mydumper.go:170] ["Could not read data from oms.so_attachment: Lost connection to MySQL server during query"] [task=oms] [unit=dump]

Please help confirm the cause of this error, thanks.

Hi, please run show grants for ''@''; and share the privileges of the replication user.

It shouldn't be a privilege issue: the same database has 12 tables, and all the others have replicated normally; only this one keeps failing.

Please upload it so we can take a look.

Check the MySQL error log around the time of the failure to see whether the server restarted. If not, try setting a longer connection timeout before running the dump again.

No restart happened. A preliminary finding: this table has a sign_str longtext column, and one sampled row holds more than 300,000 characters in that column.
(screenshot)
The dump runs from 8:35 and fails around 9:30. Could the long export time be causing the error?
Here are the relevant timeout settings on the upstream MySQL:

mysql> set global wait_timeout=xx;

mysql> set global interactive_timeout=xx;

Raise them to a larger value and try the sync again.
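For reference, a sketch of the statements involved (the concrete values here are illustrative, not from this thread; net_write_timeout is an additional guess worth checking, since it governs how long the server waits while writing a large result set back to the client):

```sql
-- Inspect the current timeout-related settings on the upstream MySQL
SHOW GLOBAL VARIABLES LIKE '%timeout%';

-- Raise the idle-connection timeouts (values are illustrative)
SET GLOBAL wait_timeout = 28800;
SET GLOBAL interactive_timeout = 28800;

-- Worth checking when rows contain very large LONGTEXT values:
-- net_write_timeout limits how long the server waits while sending
-- a result set to the client before dropping the connection
SET GLOBAL net_write_timeout = 600;
```

Note that SET GLOBAL only affects new connections, so the task should be restarted after changing these.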

Today I raised both upstream MySQL parameters to 18000 (the original value was 1800), but the dump still fails after running for nearly 2 hours. Please help investigate further, thanks.

[2020/07/13 12:23:54.642 +08:00] [INFO] [syncer.go:887] ["flushed checkpoint"] [task=oms] [unit="binlog replication"] [checkpoint="(mysql-bin|000001.000028, 425025711)(flushed (mysql-bin|000001.000028, 425025711))"]
[2020/07/13 12:23:54.642 +08:00] [INFO] [relay.go:113] ["current earliest active relay log"] [task=oms] [unit="binlog replication"] ["active relay log"=433f9876-b90c-11ea-ade9-fac753c33500.000001/mysql-bin.000028]
[2020/07/13 12:24:11.474 +08:00] [INFO] [relay.go:568] ["flush meta finished"] [component="relay log"] [meta="master-uuid = 433f9876-b90c-11ea-ade9-fac753c33500.000001, relay-binlog = (mysql-bin.000028, 425045108), relay-binlog-gtid = "]
[2020/07/13 12:24:14.134 +08:00] [ERROR] [mydumper.go:170] ["Could not read data from oms.so_attachment: Lost connection to MySQL server during query"] [task=so_attachment] [unit=dump]
[2020/07/13 12:24:14.147 +08:00] [INFO] [mydumper.go:164] ["Thread 1 shutting down"] [task=so_attachment] [unit=dump]
[2020/07/13 12:24:14.147 +08:00] [INFO] [mydumper.go:164] ["Finished dump at: 2020-07-13 12:24:14"] [task=so_attachment] [unit=dump]
[2020/07/13 12:24:14.642 +08:00] [ERROR] [mydumper.go:118] ["dump data exits with error"] [task=so_attachment] [unit=dump] ["cost time"=1h56m9.983838484s] [error="msg:\"[code=32001:class=dump-unit:scope=internal:level=high] mydumper runs with error: exit status 1.\""]
[2020/07/13 12:24:14.642 +08:00] [INFO] [subtask.go:266] ["unit process returned"] [subtask=so_attachment] [unit=Dump] [stage=Paused] [status={}]
[2020/07/13 12:24:14.642 +08:00] [ERROR] [subtask.go:285] ["unit process error"] [subtask=so_attachment] [unit=Dump] ["error information"="{\"msg\":\"[code=32001:class=dump-unit:scope=internal:level=high] mydumper runs with error: exit status 1.\"}"]
[2020/07/13 12:24:18.658 +08:00] [INFO] [worker.go:320] ["operate subtask"] [component="worker controller"] [operation=AutoResume] [task=so_attachment]

Hi,

Please share the output of dmesg | grep -i oom and the exact mydumper command line. If this is DM, please also paste the mydumpers section of the task config.

No relevant output:
(screenshot)

The task config is as follows:
mydumpers:
  mysql-replica-04.dump:
    mydumper-path: bin/mydumper
    threads: 4
    chunk-filesize: 64
    skip-tz-utc: true
    extra-args: -T oms.so_attachment

Hi,

For a single table with a very large amount of data, you can try this: remove chunk-filesize: 64 and add --rows to extra-args.

It limits how many rows each chunk contains: -r, --rows — "Try to split tables into chunks of this many rows. This option turns off --chunk-filesize".
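Based on that suggestion, the adjusted mydumpers section might look like this (a sketch; the -r value is only a starting point and may need tuning, as the rest of this thread shows):

```yaml
mydumpers:
  mysql-replica-04.dump:
    mydumper-path: bin/mydumper
    threads: 4
    skip-tz-utc: true
    # chunk-filesize removed; -r/--rows disables it anyway
    extra-args: "-T oms.so_attachment -r 10000"
```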

How should the value of -r be set? Is there any documentation on this I could read? I searched but couldn't find anything.

I just tried -r 10000 and the export is much faster now. I'll report back once the dump finishes, thanks.

OK.

With 10000 rows the exported files were fine, but loading them into the downstream failed. After changing it to 3000 and re-running the task, the sync completed perfectly. Thanks for the expert support!

The error occurred during the load stage? What was the error exactly?

The errors were "table already exists" and primary key conflicts. This table had never been created downstream, so normally neither error should occur. This afternoon I dropped the table and re-ran the task, and it completed without issues, so it probably has little to do with setting -r to 10000.

If you have an error screenshot or the exact messages, please paste them; it's hard to say for now.

Here are the key log entries.
The dump starts:
[2020/07/13 17:08:44.715 +08:00] [INFO] [relay.go:89] ["current earliest active relay log"] [task=so_attachment] [unit="binlog replication"] ["active relay log"=433f9876-b90c-11ea-ade9-fac753c33500.000001/mysql-bin.000029]
[2020/07/13 17:08:44.719 +08:00] [INFO] [subtask.go:204] ["start to run"] [subtask=so_attachment] [unit=Dump]
[2020/07/13 17:08:44.719 +08:00] [INFO] [worker.go:893] ["end to execute operation"] [component="worker controller"] ["oplog ID"=302] []
[2020/07/13 17:08:45.012 +08:00] [WARN] [mydumper.go:167] ["--chunk-filesize disabled by --rows option"] [task=so_attachment] [unit=dump]
[2020/07/13 17:08:45.029 +08:00] [INFO] [mydumper.go:164] ["Server version reported as: 5.7.30-33-log"] [task=so_attachment] [unit=dump]
[2020/07/13 17:08:45.030 +08:00] [INFO] [mydumper.go:164] ["Connected to a MySQL server"] [task=so_attachment] [unit=dump]

The dump finishes 5 minutes later; this shows that after the parameter change, the dump itself is fast:
[2020/07/13 17:12:56.568 +08:00] [INFO] [mydumper.go:164] ["Finished dump at: 2020-07-13 17:12:56"] [task=so_attachment] [unit=dump]
[2020/07/13 17:12:56.578 +08:00] [INFO] [mydumper.go:116] ["dump data finished"] [task=so_attachment] [unit=dump] ["cost time"=4m11.858990186s]
[2020/07/13 17:12:56.580 +08:00] [INFO] [subtask.go:266] ["unit process returned"] [subtask=so_attachment] [unit=Dump] [stage=Finished] [status={}]

The oms database already exists, which is fine; table creation follows:
[2020/07/13 17:12:56.648 +08:00] [ERROR] [baseconn.go:178] ["execute statement failed"] [task=so_attachment] [unit=load] [query="CREATE DATABASE oms;"] [argument="[]"] [error="Error 1007: Can't create database 'oms'; database exists"]
[2020/07/13 17:12:56.649 +08:00] [ERROR] [db.go:176] ["execute statements failed after retry"] [task=so_attachment] [unit=load] [queries="[CREATE DATABASE oms;]"] [arguments="[]"] [error="[code=10006:class=database:scope=not-set:level=high] execute statement failed: CREATE DATABASE oms;: Error 1007: Can't create database 'oms'; database exists"]
[2020/07/13 17:12:56.649 +08:00] [INFO] [loader.go:973] ["database already exists, skip it"] [task=so_attachment] [unit=load] ["db schema file"=./dumped_data.so_attachment/oms-schema-create.sql]
[2020/07/13 17:12:56.649 +08:00] [INFO] [loader.go:1117] ["finish to create schema"] [task=so_attachment] [unit=load] ["schema file"=./dumped_data.so_attachment/oms-schema-create.sql]
[2020/07/13 17:12:56.667 +08:00] [INFO] [loader.go:1139] ["start to create table"] [task=so_attachment] [unit=load] ["table file"=./dumped_data.so_attachment/oms.so_attachment-schema.sql]
[2020/07/13 17:12:58.211 +08:00] [WARN] [db.go:166] ["execute transaction"] [task=so_attachment] [unit=load] [query="[USE oms; CREATE TABLE so_attachment ((omitted)] [argument="[]"] ["cost time"=1.543604873s]

At 17:13, the dump inexplicably restarted. From this it seems clear that a new dump started while the load was still in progress, which caused the later primary key conflicts (I won't paste those logs). It's possible that while I meant to run query-status in dmctl, I pressed the up arrow and picked the wrong command from history, running start-task instead ……
[2020/07/13 17:13:39.986 +08:00] [INFO] [mydumper.go:284] ["create mydumper"] [task=so_attachment] [unit=dump] [argument="[--host,10.71.64.247,--port,3306,--user,bigdata_ro,--outputdir,./dumped_data.so_attachment,--logfile,/dev/stderr,--verbose,3,--threads,4,--chunk-filesize,64,--skip-tz-utc,-T,oms.so_attachment,-r,10000]"]
[2020/07/13 17:13:39.987 +08:00] [INFO] [syncer.go:887] ["flushed checkpoint"] [task=oms] [unit="binlog replication"] [checkpoint="(mysql-bin|000001.000029, 160686568)(flushed (mysql-bin|000001.000029, 160686568))"]
[2020/07/13 17:13:39.987 +08:00] [INFO] [relay.go:113] ["current earliest active relay log"] [task=oms] [unit="binlog replication"] ["active relay log"=433f9876-b90c-11ea-ade9-fac753c33500.000001/mysql-bin.000029]
[2020/07/13 17:13:40.022 +08:00] [INFO] [loader.go:432] ["all previous checkpoints cleared"] [task=so_attachment] [unit=load]
[2020/07/13 17:13:40.061 +08:00] [INFO] [checkpoint.go:485] ["create checkpoint schema"] [task=so_attachment] [unit="binlog replication"] [component="remote checkpoint"] [statement="CREATE SCHEMA IF NOT EXISTS dm_meta"]
[2020/07/13 17:13:40.062 +08:00] [INFO] [checkpoint.go:504] ["create checkpoint table"] [task=so_attachment] [unit="binlog replication"] [component="remote checkpoint"] [statement="CREATE TABLE IF NOT EXISTS dm_meta.so_attachment_syncer_checkpoint (id VARCHAR(32) NOT NULL, cp_schema VARCHAR(128) NOT NULL, cp_table VARCHAR(128) NOT NULL, binlog_name VARCHAR(128), binlog_pos INT UNSIGNED, is_global BOOLEAN, create_time timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP, update_time timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, UNIQUE KEY uk_id_schema_table (id, cp_schema, cp_table))"]
[2020/07/13 17:13:40.065 +08:00] [INFO] [syncer.go:384] ["all previous meta cleared"] [task=so_attachment] [unit="binlog replication"]
[2020/07/13 17:13:40.070 +08:00] [INFO] [relay.go:89] ["current earliest active relay log"] [task=so_attachment] [unit="binlog replication"] ["active relay log"=433f9876-b90c-11ea-ade9-fac753c33500.000001/mysql-bin.000029]
[2020/07/13 17:13:40.082 +08:00] [INFO] [subtask.go:204] ["start to run"] [subtask=so_attachment] [unit=Dump]
[2020/07/13 17:13:40.082 +08:00] [INFO] [worker.go:146] [initialized] [component="worker controller"]
[2020/07/13 17:13:40.083 +08:00] [INFO] [purger.go:115] ["starting relay log purger"] [component="relay purger"] [config="{\"interval\":3600,\"expires\":0,\"remain-space\":15}"]
[2020/07/13 17:13:40.083 +08:00] [INFO] [worker.go:182] ["start running"] [component="worker controller"]
[2020/07/13 17:13:40.083 +08:00] [INFO] [server.go:112] ["start gRPC API"] ["listened address"=:8262]
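As an aside, if a task is ever restarted mid-load by mistake, a cautious recovery sequence (a sketch assuming DM v1 dmctl commands, and assuming the downstream table contains only the partially loaded data and can be safely dropped) is roughly:

```
# in dmctl: stop the task first
stop-task so_attachment

# in a MySQL client connected to the downstream: remove the partial data
DROP TABLE IF EXISTS oms.so_attachment;

# back in dmctl: start the task fresh
start-task task.yaml
```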

The highlighted part shows that dm_meta did not exist, so the loader ran again. Since manual operation was involved, let's set this aside; what matters is that the sync completed. The data conflicts in the load stage mainly came from leftover data in the downstream.

This also solved my problem perfectly. Logged in just to say thanks! :grinning: