When pulling binlogs, DM automatically fills in some content, which causes the task to hang forever

Version: 6.1.1
Symptom:
As shown in the screenshot below, the task keeps reporting "runtime error: invalid memory address or nil pointer dereference". The error is retried automatically, but every retry hits the same problem.

Cause:
Looking at the dm-worker log file, you can see the error. Although the task had already synced to "syncerBinlog": "(mysql-bin|000001.001224, 905172217)", DM failed while parsing the binlog file because it could not recognize the content it had filled in itself.

[2022/11/03 16:06:15.948 +08:00] [WARN] [syncer.go:2776] ["found error when get sql_mode from binlog status_vars"] [task=fund_task] [unit="binlog replication"] [error="Q_SQL_MODE_CODE not found in status_vars []"]
[2022/11/03 16:06:15.949 +08:00] [INFO] [ddl.go:37] ["parse ddl"] [task=fund_task] [unit="binlog replication"] [event=query] ["query event context"="{schema: , originSQL: # dummy query generated by DM, often used to fill a hole in a binlog file, startLocation: position: (mysql-bin|000001.001224, 394), gtid-set: 1e787e9a-ea02-11eb-8239-00163e0eb750:1-136252793,1fd7fe20-ea02-11eb-9db6-00163e02d592:1-20853306,3e773ad8-46c1-11ec-992b-0242a9fe7b02:1-139,777234c5-87b7-11ea-9279-00163e1054e6:1-1494434753,784b7fc9-87b7-11ea-a0b7-00163e068fb5:1-1724587847,7e04612e-1c6a-11ed-ac0c-52540048a565:1, currentLocation: position: (mysql-bin|000001.001224, 905172217), gtid-set: , lastLocation: position: (mysql-bin|000001.001224, 905172217), gtid-set: 1e787e9a-ea02-11eb-8239-00163e0eb750:1-136252793,1fd7fe20-ea02-11eb-9db6-00163e02d592:1-20853306,3e773ad8-46c1-11ec-992b-0242a9fe7b02:1-139,777234c5-87b7-11ea-9279-00163e1054e6:1-1494434753,784b7fc9-87b7-11ea-a0b7-00163e068fb5:1-1724587847,7e04612e-1c6a-11ed-ac0c-52540048a565:1, re-sync: , needHandleDDLs: , trackInfos: }"]
[2022/11/03 16:06:15.951 +08:00] [ERROR] [syncer.go:1769] ["panic log"] [task=fund_task] [unit="binlog replication"] ["error message"="\"invalid memory address or nil pointer dereference\""] [stack="github.com/pingcap/tiflow/dm/syncer.(*Syncer).Run.func5\n\tgithub.com/pingcap/tiflow/dm/syncer/syncer.go:1769\nruntime.gopanic\n\truntime/panic.go:844\nruntime.panicmem\n\truntime/panic.go:220\nruntime.sigpanic\n\truntime/signal_unix.go:818\ngithub.com/pingcap/tiflow/dm/syncer.(*Syncer).Run.func8\n\tgithub.com/pingcap/tiflow/dm/syncer/syncer.go:1900\ngithub.com/pingcap/tiflow/dm/syncer.(*Syncer).Run\n\tgithub.com/pingcap/tiflow/dm/syncer/syncer.go:2338\ngithub.com/pingcap/tiflow/dm/syncer.(*Syncer).Process\n\tgithub.com/pingcap/tiflow/dm/syncer/syncer.go:702\ngithub.com/pingcap/tiflow/dm/syncer.(*Syncer).Resume\n\tgithub.com/pingcap/tiflow/dm/syncer/syncer.go:3742"]
[2022/11/03 16:06:15.951 +08:00] [INFO] [syncer.go:3619] ["flush all jobs"] [task=fund_task] [unit="binlog replication"] ["global checkpoint"="{{{mysql-bin|000001.001224 905172217} 0xc05764a908 0} <nil>}(flushed {{{mysql-bin|000001.001224 905172217} 0xc05764a908 0} <nil>})"] ["flush job seq"=1]
[2022/11/03 16:06:15.952 +08:00] [INFO] [syncer.go:1531] ["received ungraceful exit ctx, exit now"] [task=fund_task] [unit="binlog replication"]
[2022/11/03 16:06:15.953 +08:00] [INFO] [subtask.go:356] ["unit process returned"] [subtask=fund_task] [unit=Sync] [stage=Paused] [status="{\"totalEvents\":6627,\"totalTps\":1656,\"syncerBinlog\":\"(mysql-bin|000001.001224, 905172217)\",\"syncerBinlogGtid\":\"1e787e9a-ea02-11eb-8239-00163e0eb750:1-136252793,1fd7fe20-ea02-11eb-9db6-00163e02d592:1-20853306,3e773ad8-46c1-11ec-992b-0242a9fe7b02:1-139,777234c5-87b7-11ea-9279-00163e1054e6:1-1494434753,784b7fc9-87b7-11ea-a0b7-00163e068fb5:1-1724587847,7e04612e-1c6a-11ed-ac0c-52540048a565:1\",\"binlogType\":\"local\"}"]
[2022/11/03 16:06:15.953 +08:00] [INFO] [local_reader.go:782] ["binlog reader closing"] [task=fund_task] [unit="binlog replication"] [component="binlog reader"]
[2022/11/03 16:06:15.953 +08:00] [WARN] [local_reader.go:288] ["parse relay finished"] [task=fund_task] [unit="binlog replication"] [component="binlog reader"] [error="parse relay log file mysql-bin.001224 from offset 4 in dir relay_log/dm-172.30.241.143-18262/7e04612e-1c6a-11ed-ac0c-52540048a565.000001: context canceled"]
[2022/11/03 16:06:15.953 +08:00] [INFO] [local_reader.go:788] ["binlog reader closed"] [task=fund_task] [unit="binlog replication"] [component="binlog reader"]
[2022/11/03 16:06:16.031 +08:00] [ERROR] [subtask.go:377] ["unit process error"] [subtask=fund_task] [unit=Sync] ["error information"="ErrCode:36001 ErrClass:\"sync-unit\" ErrScope:\"internal\" ErrLevel:\"high\" Message:\"panic error: runtime error: invalid memory address or nil pointer dereference\" "]
[2022/11/03 16:06:16.031 +08:00] [INFO] [subtask.go:379] [paused] [subtask=fund_task] [unit=Sync]

This seems related to the binlog file format: parsing the binlog file does show that DM padded some zeros into it.
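To check for such padding yourself, a minimal sketch could scan a relay-log file for long runs of zero bytes. This is only a rough diagnostic, not a DM tool: the 4-byte 0xfe 'b' 'i' 'n' header is the standard binlog file magic, while the file path and the 1024-byte run threshold are assumptions.

```python
BINLOG_MAGIC = b"\xfebin"  # standard 4-byte binlog file header


def find_zero_runs(path, min_run=1024):
    """Scan a binlog/relay-log file and report runs of zero bytes at
    least min_run long -- a rough way to spot padded "holes".
    Returns a list of (start_offset, end_offset) file offsets."""
    with open(path, "rb") as f:
        if f.read(4) != BINLOG_MAGIC:
            raise ValueError("not a binlog file: %s" % path)
        data = f.read()
    runs = []
    start = None
    for i, b in enumerate(data):
        if b == 0:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                # +4 converts from data offsets to file offsets
                runs.append((start + 4, i + 4))
            start = None
    if start is not None and len(data) - start >= min_run:
        runs.append((start + 4, len(data) + 4))
    return runs
```

Running it against the padded relay-log file should report the offset range of the hole; a genuine binlog from upstream should report none.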

Could you upload the complete log of this worker? If there are multiple log files, please include the one containing "hole exist from pos1 to pos2".

For now I worked around the problem by finding the original MySQL binlog in a backup and using it to replace the binlog file that DM had padded.

The whole log file is too large, so I extracted the logs around the "hole exist from pos1 to pos2" error. Is that enough for you?
masked.log (429.1 KB)

By the way, does DM have a parameter similar to TiDB's for automatically cleaning up logs? Right now I delete the log files manually every so often... :thinking:
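No such built-in parameter is confirmed in this thread, but the periodic manual deletion can at least be scripted. A minimal sketch, where the log directory, the 7-day retention, and the ".log" suffix are all assumptions rather than DM settings:

```python
import os
import time


def prune_logs(log_dir, keep_days=7, suffix=".log"):
    """Delete files in log_dir ending with suffix whose modification
    time is older than keep_days. Returns the removed paths.
    A stand-in for the manual periodic cleanup described above."""
    cutoff = time.time() - keep_days * 86400
    removed = []
    for name in os.listdir(log_dir):
        path = os.path.join(log_dir, name)
        if (name.endswith(suffix)
                and os.path.isfile(path)
                and os.path.getmtime(path) < cutoff):
            os.remove(path)
            removed.append(path)
    return removed
```

Run from cron (or a systemd timer) this approximates log rotation; take care not to delete the file the dm-worker process is currently writing to.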

That should be enough. Thanks for the feedback; we'll analyze it.

Did you manually stop and restart relay at 8:56 in the morning?

There is no such parameter at the moment. Roughly how fast are the logs generated? Do they consume a lot of disk space?

Yes. Occasionally a task fails to recognize the relay log and I have to run resume-task frequently to recover it, so I executed stop-task, stop-relay, start-relay, start-task in that order.


The log level is info, but COMMIT statements are being recognized as DDL, so 80% or more of the log consists of entries like this:

[2022/10/31 23:56:30.713 +08:00] [INFO] [ddl.go:37] ["parse ddl"] [task=fund_task] [unit="binlog replication"] [event=query] ["query event context"="{schema: , originSQL: COMMIT, startLocation: position: (mysql-bin|000001.001212, 235036967), gtid-set: 1e787e9a-ea02-11eb-8239-00163e0eb750:1-136252793,1fd7fe20-ea02-11eb-9db6-00163e02d592:1-20525508,3e773ad8-46c1-11ec-992b-0242a9fe7b02:1-139,777234c5-87b7-11ea-9279-00163e1054e6:1-1489357604,784b7fc9-87b7-11ea-a0b7-00163e068fb5:1-1724587847,7e04612e-1c6a-11ed-ac0c-52540048a565:1, currentLocation: position: (mysql-bin|000001.001212, 235037031), gtid-set: 1e787e9a-ea02-11eb-8239-00163e0eb750:1-136252793,1fd7fe20-ea02-11eb-9db6-00163e02d592:1-20525508,3e773ad8-46c1-11ec-992b-0242a9fe7b02:1-139,777234c5-87b7-11ea-9279-00163e1054e6:1-1489357604,784b7fc9-87b7-11ea-a0b7-00163e068fb5:1-1724587847,7e04612e-1c6a-11ed-ac0c-52540048a565:1, lastLocation: position: (mysql-bin|000001.001212, 235036967), gtid-set: 1e787e9a-ea02-11eb-8239-00163e0eb750:1-136252793,1fd7fe20-ea02-11eb-9db6-00163e02d592:1-20525508,3e773ad8-46c1-11ec-992b-0242a9fe7b02:1-139,777234c5-87b7-11ea-9279-00163e1054e6:1-1489357604,784b7fc9-87b7-11ea-a0b7-00163e068fb5:1-1724587847,7e04612e-1c6a-11ed-ac0c-52540048a565:1, re-sync: , needHandleDDLs: , trackInfos: }"]

There are more than ten GB of logs every day.
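To quantify or filter out this noise before uploading a log, a small triage helper could split out the COMMIT "parse ddl" entries. The match strings are taken from the sample line above; this is an illustration, not a DM feature:

```python
def split_commit_noise(lines):
    """Separate the noisy COMMIT "parse ddl" entries from the rest of
    a dm-worker log; purely a triage helper for inspecting the log."""
    noise, rest = [], []
    for line in lines:
        if '["parse ddl"]' in line and "originSQL: COMMIT" in line:
            noise.append(line)
        else:
            rest.append(line)
    return noise, rest
```

Feeding it the log file line by line gives a quick ratio of COMMIT noise to useful entries, and `rest` is a much smaller file to share.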

What is your upstream database? Is it some RDS? And is the storage engine InnoDB?

Normally COMMIT does not appear in the binlog in the form of a DDL. We can optimize this for your scenario.

DM pulls binlogs from a Community Edition MySQL, but the data is originally written to RDS; the self-hosted MySQL is a replica of the RDS instance.

Strange. What version is this Community Edition MySQL? Do any tables use the MyISAM storage engine?

It's 5.7, and there are no MyISAM tables.


There are also similar warnings, such as:

[2022/11/04 10:35:58.448 +08:00] [WARN] [checkpoint_flush_worker.go:112] ["sync flush post-process(afterFlushFn) failed"] [task=qieman_task] [unit="binlog replication"] [error="[code=28005:class=relay-event-lib:scope=internal:level=high], Message: there aren't any data under relay log directory relay-dir., Workaround: Please check relay log using query-status."] [errorVerbose="[code=28005:class=relay-event-lib:scope=internal:level=high], Message: there aren't any data under relay log directory relay-dir., Workaround: Please check relay log using query-status.\ngithub.com/pingcap/tiflow/dm/pkg/terror.(*Error).Generate\n\tgithub.com/pingcap/tiflow/dm/pkg/terror/terror.go:271\ngithub.com/pingcap/tiflow/dm/syncer.(*Syncer).updateActiveRelayLog\n\tgithub.com/pingcap/tiflow/dm/syncer/relay.go:116\ngithub.com/pingcap/tiflow/dm/syncer.(*Syncer).afterFlushCheckpoint\n\tgithub.com/pingcap/tiflow/dm/syncer/syncer.go:1260\ngithub.com/pingcap/tiflow/dm/syncer.(*checkpointFlushWorker).Run\n\tgithub.com/pingcap/tiflow/dm/syncer/checkpoint_flush_worker.go:111\ngithub.com/pingcap/tiflow/dm/syncer.(*Syncer).Run.func3\n\tgithub.com/pingcap/tiflow/dm/syncer/syncer.go:1655\nruntime.goexit\n\truntime/asm_amd64.s:1571"]

This leaves the task in a Paused state. It happens irregularly, and a manual resume-task usually fixes it. When this error occurs, the task status is:

But looking at replication lag, there is essentially none, so my suspicion is: occasionally the task obtains the latest position before the binlog has actually become readable in the relay log.


Or could it be related to the 6.1 change in which component is responsible for pulling the relay log?

Problems like this can only be diagnosed with logs.

When pulling MySQL binlogs, how do you know which binlog to fetch? Do you have to parse the binlog files manually? That would be a lot of work.

Also, is this a bug introduced in 6.1? Does 5.4 have the same problem?

The DM task error log reports which binlog is problematic. The binlog file name is the same as on MySQL, so you can go to the upstream MySQL and copy the file with the same name into the dm-worker's corresponding relay-log directory.
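A sketch of automating that lookup, based on the "parse relay log file ... from offset ... in dir ..." error format that appears in the worker log above. The regex is an assumption generalized from that one message, not a stable DM log format:

```python
import re

# Matches messages like the one in the worker log above:
#   parse relay log file mysql-bin.001224 from offset 4 in dir relay_log/...
RELAY_ERR = re.compile(
    r"parse relay log file (\S+) from offset (\d+) in dir ([^:\s]+)"
)


def binlog_to_refetch(log_line):
    """Pull the binlog file name, offset, and relay dir out of a
    dm-worker error line, identifying which file to copy over from
    the upstream MySQL."""
    m = RELAY_ERR.search(log_line)
    if not m:
        return None
    return m.group(1), int(m.group(2)), m.group(3)
```

Given the name (e.g. mysql-bin.001224) and the relay directory, the fix described above is to fetch the same-named file from the upstream MySQL and overwrite the padded copy in that directory while the task is stopped.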