Solution: DM relay processing unit reports "handle a potential duplicate event", or the sync task aborts with binlog fetch/parse errors such as "get binlog error ERROR 1236 (HY000)" and "binlog checksum mismatch, data may be corrupted"

The DM problem here involves both relay and syncer. When an upstream MySQL binlog file exceeds 4 GB, both the relay and the syncer units are affected. The problem originates upstream and blocks relay replication, so fix relay replication first, then fix the syncer.

Symptoms:

(1) Received an alert email.

Logged in to check the DM error:

query-status mba_app_repl_online1

The syncer reports an error:

{
    "Type": "UnknownError",
    "msg": "[code=10001:class=database:scope=downstream:level=high] database driver error: bad connection\ngithub.com/pingcap/dm/pkg/terror.(*Error).Delegate\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/pkg/terror/terror.go:267\ngithub.com/pingcap/dm/pkg/terror.DBErrorAdaptArgs\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/pkg/terror/adapter.go:39\ngithub.com/pingcap/dm/pkg/terror.DBErrorAdapt\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/pkg/terror/adapter.go:46\ngithub.com/pingcap/dm/pkg/conn.(*BaseDB).GetBaseConn\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/pkg/conn/basedb.go:92\ngithub.com/pingcap/dm/syncer.createConns.func1\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/syncer/db.go:299\ngithub.com/pingcap/dm/syncer.(*DBConn).resetConn\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/syncer/db.go:157\ngithub.com/pingcap/dm/syncer.(*Syncer).resetDBs\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/syncer/syncer.go:513\ngithub.com/pingcap/dm/syncer.(*Syncer).Resume\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/syncer/syncer.go:2277\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1337",
    "error": null
}

The relay unit reports a binlog replication error: handle a potential duplicate event

{
    "Type": "UnknownError",
    "msg": "[code=30029:class=relay-unit:scope=internal:level=high] handle a potential duplicate event \u0026{Timestamp:1583906704 EventType:DeleteRowsEventV2 ServerID:2084430 EventSize:8132 LogPos:11943 Flags:0} in 4723-binlog.002270: check event \u0026{Timestamp:1583906704 EventType:DeleteRowsEventV2 ServerID:2084430 EventSize:8132 LogPos:11943 Flags:0} whether duplicate in /data/dm/relay_log/3793c524-c319-11e8-95ba-6c92bf2f9975.000001/4723-binlog.002270: event from 3811 in /data/dm/relay_log/3793c524-c319-11e8-95ba-6c92bf2f9975.000001/4723-binlog.002270 diff from passed-in event \u0026{Timestamp:1583906704 EventType:DeleteRowsEventV2 ServerID:2084430 EventSize:8132 LogPos:11943 Flags:0}\ngithub.com/pingcap/dm/pkg/terror.(*Error).Generatef\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/pkg/terror/terror.go:237\ngithub.com/pingcap/dm/relay/writer.checkIsDuplicateEvent\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/writer/file_util.go:157\ngithub.com/pingcap/dm/relay/writer.(*FileWriter).handleDuplicateEventsExist\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/writer/file.go:369\ngithub.com/pingcap/dm/relay/writer.(*FileWriter).handlePotentialHoleOrDuplicate\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/writer/file.go:312\ngithub.com/pingcap/dm/relay/writer.(*FileWriter).handleEventDefault\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/writer/file.go:287\ngithub.com/pingcap/dm/relay/writer.(*FileWriter).WriteEvent\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/writer/file.go:130\ngithub.com/pingcap/dm/relay.(*Relay).handleEvents\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/relay.go:434\ngithub.com/pingcap/dm/relay.(*Relay).process\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/relay.go:292\ngithub.com/pingcap/dm/relay.(*Relay).Process\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/relay.go:191\ngithub.com/pingcap/dm/dm/worker.(*realRelayHolder).run\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/relay.go:164\ngithub.com/pingcap/dm/dm/worker.(*realRelayHolder).Start.func1\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/relay.go:140\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1337",
    "error": null
}

(2) Investigation showed that the upstream MySQL has a large binlog file exceeding 4 GB. In DM's relay-log directory, the largest fetched binlog also stops at 4 GB; the latest binlog is no longer being pulled, and both SQL execution and pulling have stopped. The upstream binlog file is in fact over 6 GB.

Using the file 4723-binlog.002270 named in the error, check its size on the upstream MySQL:

show binary logs;

+--------------------+------------+
| Log_name           | File_size  |
+--------------------+------------+
| 4723-binlog.002268 | 1074943453 |
| 4723-binlog.002269 | 1074017204 |
| 4723-binlog.002270 | 6977972415 |
| 4723-binlog.002271 | 1081537286 |
| 4723-binlog.002272 | 1073806137 |
+--------------------+------------+
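As a quick sanity check, the oversized file can be picked out programmatically. A minimal Python sketch; the (name, size) pairs are hard-coded from the `show binary logs` output above, whereas in practice you would read them from the server:

```python
# Flag binlog files whose size reaches the uint32 limit (4 GiB): these are
# the files that trigger the relay/syncer position-overflow errors.
UINT32_LIMIT = 2**32  # 4294967296 bytes

# Sizes copied from the `show binary logs` output above.
binlogs = [
    ("4723-binlog.002268", 1074943453),
    ("4723-binlog.002269", 1074017204),
    ("4723-binlog.002270", 6977972415),
    ("4723-binlog.002271", 1081537286),
    ("4723-binlog.002272", 1073806137),
]

oversized = [name for name, size in binlogs if size >= UINT32_LIMIT]
print(oversized)  # ['4723-binlog.002270']
```

Only 4723-binlog.002270 exceeds the limit, which matches the file named in the relay error.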

(3) Fix:

Found the official solution:

https://pingcap.com/docs-cn/stable/reference/tools/data-migration/troubleshoot/error-handling/

First, fix relay log replication.

During relay log pulling and incremental replication, DM can hit these two errors when an upstream binlog file exceeds 4 GB.

The reason is that when writing the relay log, DM validates each event against its binlog position and the file size, and it also persists the replicated binlog position as a checkpoint. However, MySQL officially stores binlog positions as uint32, so for the part of the file beyond 4 GB the position offset overflows, which produces the errors above.
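The overflow is easy to demonstrate. A minimal Python sketch of how a byte offset past the 4 GB (2^32) boundary wraps when truncated to uint32; the large offset is illustrative, chosen so the wrapped value matches the LogPos:11943 in the error above:

```python
# binlog positions are stored as uint32, so any offset past 2**32 - 1 wraps
UINT32_MAX = 2**32 - 1  # 4294967295, the largest representable position

def stored_position(real_offset: int) -> int:
    """Simulate how a real byte offset appears when truncated to uint32."""
    return real_offset & 0xFFFFFFFF

# offsets inside the first 4 GB are unaffected
print(stored_position(11943))       # 11943
# an event ~4 GB + 11943 bytes into a 6.9 GB binlog wraps to a small position
print(stored_position(4294979239))  # 11943 -- collides with the earlier event
```

Two distinct events now report the same position, which is exactly why the relay writer flags "a potential duplicate event" and the checkpoint can no longer be trusted.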

For the relay processing unit, recover manually with the following steps:

1. Confirm on the upstream that the binlog file involved in the error exceeds 4 GB.

2. Stop the DM-worker.

ansible-playbook stop.yml --tags=dm-worker -l dm-worker1

Note: do not assume that stop-task stops the worker. The worker process itself must be stopped; it only counts as stopped once `ps -ef | grep 'dm-worker'` no longer shows the process.

3. Copy the corresponding upstream binlog file into the relay log directory as the relay log file (back up the original before replacing it).

4. Update the relay.meta file in the relay log directory so that pulling starts from the next binlog (back up the file before editing).

For example: if the error shows binlog-name = "mysql-bin.004451" and binlog-pos = 2453, update them to binlog-name = "mysql-bin.004452" and binlog-pos = 4.

Note: if DM is configured with GTID, then in addition to binlog-name and binlog-pos you must also update binlog-gtid. Obtain the upstream MySQL GTID as follows:

MySQL [(none)]> show binlog events in '4723-binlog.002271' from 4 limit 4;

+--------------------+-----+----------------+-----------+-------------+---------------------------------------------------------------------------+
| Log_name           | Pos | Event_type     | Server_id | End_log_pos | Info                                                                      |
+--------------------+-----+----------------+-----------+-------------+---------------------------------------------------------------------------+
| 4723-binlog.002271 |   4 | Format_desc    |  20849203 |         120 | Server ver: 5.6.36-log, Binlog ver: 4                                     |
| 4723-binlog.002271 | 120 | Previous_gtids |  20849203 |         191 | f0e914ef-54cf-11e7-813d-6c92bf2fa791:1-138218058                          |
| 4723-binlog.002271 | 191 | Gtid           |   2084430 |         239 | SET @@SESSION.GTID_NEXT= 'f0e914ef-54cf-11e7-813d-6c92bf2fa791:138218059' |
| 4723-binlog.002271 | 239 | Query          |   2084430 |         302 | BEGIN                                                                     |
+--------------------+-----+----------------+-----------+-------------+---------------------------------------------------------------------------+
4 rows in set (0.03 sec)
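Steps 3-4 above can be sketched as a small script. This is a hypothetical helper, assuming relay.meta is a plain `key = value` text file containing the binlog-name / binlog-pos / binlog-gtid fields shown above; always keep a backup of relay.meta before touching it:

```python
# Sketch: rewrite relay.meta so relay pulling restarts from the next binlog.
# Assumes a simple "key = value" file format with the field names used in the
# steps above; verify against your actual relay.meta before relying on this.
import shutil

def advance_relay_meta(path, next_binlog, gtid=None):
    shutil.copyfile(path, path + ".bak")  # always keep a backup copy
    lines = []
    with open(path) as f:
        for line in f:
            if line.startswith("binlog-name"):
                line = 'binlog-name = "%s"\n' % next_binlog
            elif line.startswith("binlog-pos"):
                line = "binlog-pos = 4\n"  # 4 = first event after the magic header
            elif gtid and line.startswith("binlog-gtid"):
                line = 'binlog-gtid = "%s"\n' % gtid
            lines.append(line)
    with open(path, "w") as f:
        f.writelines(lines)
```

For this incident the call would be something like `advance_relay_meta(".../relay.meta", "4723-binlog.002271", "f0e914ef-54cf-11e7-813d-6c92bf2fa791:1-138218058")`, using the Previous_gtids value from the query above.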

5. Restart the DM-worker.

ansible-playbook start.yml --tags=dm-worker -l dm-worker1

(4) With the relay replication problem solved, the syncer may still fail.

For the binlog replication (syncer) processing unit, recover manually with the following steps:

1. Confirm on the upstream that the binlog file involved in the error exceeds 4 GB.

2. Stop the sync task with stop-task.

3. In the downstream dm_meta database, update binlog_name in the global checkpoint and in each table's checkpoint to the binlog file where the error occurred, and update binlog_pos to a legal, already-replicated position, for example 4.

For example: if the failing task is dm_test, its source-id is replica-1, and the binlog file at the time of the error is mysql-bin|000001.004451, then execute: UPDATE dm_test_syncer_checkpoint SET binlog_name='mysql-bin|000001.004451', binlog_pos = 4 WHERE id='replica-1';

Note: the exact values can be found by logging in to the dm_meta database and selecting from the corresponding syncer checkpoint table.

update mba_app_repl_online1_syncer_checkpoint set binlog_name='4723-binlog|000001.002270', binlog_pos=978592437;
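When a task has many checkpoint rows to fix, the UPDATE from step 3 can be generated programmatically. A minimal sketch; the `checkpoint_fix_sql` helper and its arguments are illustrative, and only the `{task}_syncer_checkpoint` table and binlog_name / binlog_pos / id columns come from the steps above:

```python
# Sketch: build the checkpoint-fix UPDATE statement described in step 3.
# DM names the checkpoint table "<task-name>_syncer_checkpoint" and keys
# rows by source-id; the default position 4 is the first legal position.

def checkpoint_fix_sql(task, source_id, binlog_name, binlog_pos=4):
    return (
        "UPDATE %s_syncer_checkpoint "
        "SET binlog_name='%s', binlog_pos=%d "
        "WHERE id='%s';" % (task, binlog_name, binlog_pos, source_id)
    )

print(checkpoint_fix_sql("dm_test", "replica-1", "mysql-bin|000001.004451"))
# UPDATE dm_test_syncer_checkpoint SET binlog_name='mysql-bin|000001.004451', binlog_pos=4 WHERE id='replica-1';
```

Review the generated SQL against a SELECT on the checkpoint table before executing it, as done for the mba_app_repl_online1 task above.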

4. In the sync task configuration, set safe-mode: true in the syncers section to guarantee that replication can be safely re-executed.

5. Start the sync task with start-task.

Watch the task status with query-status; once the relay log file that originally caused the error has been fully replicated, restore safe-mode to its original value and restart the sync task.
