[FAQ] dm 同步报错之 waiting for relay binlog pos to catch up with loader end binlog pos is timeout

【问题澄清】

  • dm 出现报错,重启task后无法解决,删除 loader_checkpoint
    与syncer_checkpoint 表,库、dump.data后 重新启动task,恢复成功 但syncer还是有问题,报错依旧。从dm服务器上使用同步指定账户登录源Mysql库没问题。权限与文档中一致为:
    GRANT SELECT, RELOAD, REPLICATION SLAVE, REPLICATION CLIENT ON . TO ‘syncer’@’%';
  • 并且 relay log 中的日志并不是最新的,为早些日志的日志。

query-status task_vccx
{
    "result": true,
    "msg": "",
    "workers": [
        {
            "result": true,
            "worker": "10.10.24.84:8264",
            "msg": "",
            "subTaskStatus": [
                {
                    "name": "task_vccx",
                    "stage": "Running",
                    "unit": "Sync",
                    "result": {
                        "isCanceled": false,
                        "errors": [
                            {
                                "Type": "UnknownError",
                                "msg": "[code=40067:class=dm-worker:scope=internal:level=high] waiting for relay binlog pos to catch up with loader end binlog pos is timeout (exceeding 5m0s), loader end binlog pos: (mysql-bin.000020, 159702946), relay binlog pos: (mysql-bin.000002, 280333226)\ngithub.com/pingcap/dm/pkg/terror.(*Error).Generate\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/pkg/terror/terror.go:232\ngithub.com/pingcap/dm/dm/worker.(*SubTask).unitTransWaitCondition\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/subtask.go:684\ngithub.com/pingcap/dm/dm/worker.(*SubTask).run\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/subtask.go:193\ngithub.com/pingcap/dm/dm/worker.(*SubTask).fetchResult\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/subtask.go:280\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357",
                                "error": null
                            }
                        ],
                        "detail": null
                    },
                    "unresolvedDDLLockID": "",
                    "sync": {
                        "totalEvents": "0",
                        "totalTps": "0",
                        "recentTps": "0",
                        "masterBinlog": "(mysql-bin.000020, 180373475)",
                        "masterBinlogGtid": "55b4bea9-902e-11ea-85ca-0242ac120003:1-53472472",
                        "syncerBinlog": "(, 4)",
                        "syncerBinlogGtid": "",
                        "blockingDDLs": [
                        ],
                        "unresolvedGroups": [
                        ],
                        "synced": false
                    }
                }
            ],
            "relayStatus": {
                "masterBinlog": "(mysql-bin.000020, 180373475)",
                "masterBinlogGtid": "55b4bea9-902e-11ea-85ca-0242ac120003:1-53472472",
                "relaySubDir": "55b4bea9-902e-11ea-85ca-0242ac120003.000001",
                "relayBinlog": "(mysql-bin.000002, 280333226)",
                "relayBinlogGtid": "55b4bea9-902e-11ea-85ca-0242ac120003:1-3883867",
                "relayCatchUpMaster": false,
                "stage": "Paused",
                "result": {
                    "isCanceled": false,
                    "errors": [
                        {
                            "Type": "UnknownError",
                            "msg": "[code=30014:class=relay-unit:scope=upstream:level=high] start reader for UUID 55b4bea9-902e-11ea-85ca-0242ac120003.000001: start sync from GTID set 55b4bea9-902e-11ea-85ca-0242ac120003:1-3883867: dial tcp 10.139.60.82:3306: connect: connection refused\ngithub.com/pingcap/dm/pkg/terror.(*Error).Delegate\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/pkg/terror/terror.go:267\ngithub.com/pingcap/dm/pkg/binlog/reader.(*TCPReader).StartSyncByGTID\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/pkg/binlog/reader/tcp.go:102\ngithub.com/pingcap/dm/relay/reader.(*reader).setUpReaderByGTID\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/reader/reader.go:160\ngithub.com/pingcap/dm/relay/reader.(*reader).Start\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/reader/reader.go:109\ngithub.com/pingcap/dm/relay.(*Relay).setUpReader\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/relay.go:610\ngithub.com/pingcap/dm/relay.(*Relay).process\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/relay.go:304\ngithub.com/pingcap/dm/relay.(*Relay).Process\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/relay.go:191\ngithub.com/pingcap/dm/dm/worker.(*realRelayHolder).run\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/relay.go:167\ngithub.com/pingcap/dm/dm/worker.(*realRelayHolder).Start.func1\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/relay.go:143\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357",
                            "error": null
                        }
                    ],
                    "detail": null
                }
            },
            "sourceID": "mysql-replica-82-vccx_test"
        }
    ]
}

【排查思路】

  • 此报错在 msg 中一般会被定义为 internal,通过报错信息,可以判断时 relay log 这部分的问题

【解决方案】

  • stop.yml -l dm-worker02 - 避免 relay log 被重写
  • 删除 dm-worker02 对应的 relay log
  • start.yml -l dm-worker02 来恢复

【经典案例】