DM error: [code=30014:class=relay-unit:scope=upstream:level=high] start reader for UUID

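To speed things up, please provide as much background as possible when asking a question; clearly described problems are answered first. Please try to include the following: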

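  • OS version & kernel version: CentOS Linux release 7.6.1810 (Core) & Linux mt-18-34 3.10.0-957.27.2.el7.x86_64
  • TiDB version: 5.7.25-TiDB-v3.0.3
  • Disk type: Alibaba Cloud ECS cloud SSD
  • Cluster topology: 3 nodes <pd+tikv> and 1 tidb node
  • Data volume & region count & replica count: test environment, very small
  • Cluster QPS, .999-Duration, read/write ratio: test environment, very low
  • Problem description (what I did): I checked the official documentation and identified the error as the published error code ErrRelayTCPReaderStartSyncGTID, [code=30014:class=relay-unit:scope=upstream:level=high], "start sync from GTID set %s"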

(https://github.com/pingcap/dm/blob/master/_utils/terror_gen/errors_release.txt)
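When looking an error up in that list, it can help to split the bracketed header into its fields first. Below is a minimal sketch in Python, assuming the header always has the `[code=…:class=…:scope=…:level=…]` shape shown in this thread; `parse_dm_error` and `HEADER_RE` are hypothetical names, not part of DM.

```python
import re

# Matches the bracketed header seen in this thread, e.g.
# [code=30014:class=relay-unit:scope=upstream:level=high]
# Assumption: the four fields always appear in this order.
HEADER_RE = re.compile(
    r"\[code=(?P<code>\d+)"
    r":class=(?P<cls>[\w-]+)"
    r":scope=(?P<scope>[\w-]+)"
    r":level=(?P<level>[\w-]+)\]"
)


def parse_dm_error(msg):
    """Return the header fields plus the remaining text, or None if no header is found."""
    m = HEADER_RE.search(msg)
    if m is None:
        return None
    fields = m.groupdict()
    fields["code"] = int(fields["code"])
    # Everything after the header is the human-readable detail.
    fields["detail"] = msg[m.end():].strip()
    return fields


if __name__ == "__main__":
    sample = (
        "[code=30014:class=relay-unit:scope=upstream:level=high] "
        "start reader for UUID"
    )
    print(parse_dm_error(sample))
```

The numeric code (30014 here) is the value to search for in errors_release.txt.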

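Has anyone run into the same problem? How should it be handled?

I tried restarting dm-master and dm-worker, with no effect. Any guidance would be appreciated.

DM version: v1.0.1

Error details: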

» query-status
{
    "result": true,
    "msg": "",
    "workers": [
        {
            "result": true,
            "worker": "192.168.18.31:8262",
            "msg": "",
            "subTaskStatus": [
                {
                    "name": "test_to_tidb",
                    "stage": "Running",
                    "unit": "Sync",
                    "result": null,
                    "unresolvedDDLLockID": "",
                    "sync": {
                        "totalEvents": "41583861",
                        "totalTps": "20088",
                        "recentTps": "0",
                        "masterBinlog": "(bin_log.000691, 6043578)",
                        "masterBinlogGtid": "417c9ef6-f039-11e7-b6a5-00163e04da89:1-14899,75ae8004-3f21-11e9-be85-00163e042ba9:1-111937783",
                        "syncerBinlog": "(bin_log|000001.000686, 26224351)",
                        "syncerBinlogGtid": "",
                        "blockingDDLs": [
                        ],
                        "unresolvedGroups": [
                        ],
                        "synced": false
                    }
                }
            ],
            "relayStatus": {
                "masterBinlog": "(bin_log.000691, 6043578)",
                "masterBinlogGtid": "417c9ef6-f039-11e7-b6a5-00163e04da89:1-14899,75ae8004-3f21-11e9-be85-00163e042ba9:1-111937783",
                "relaySubDir": "75ae8004-3f21-11e9-be85-00163e042ba9.000001",
                "relayBinlog": "(bin_log.000686, 26224351)",
                "relayBinlogGtid": "75ae8004-3f21-11e9-be85-00163e042ba9:1-111777756,417c9ef6-f039-11e7-b6a5-00163e04da89:1-14899",
                "relayCatchUpMaster": false,
                "stage": "Paused",
                "result": {
                    "isCanceled": false,
                    "errors": [
                        {
                            "Type": "UnknownError",
                            "msg": "[code=30014:class=relay-unit:scope=upstream:level=high] start reader for UUID 75ae8004-3f21-11e9-be85-00163e042ba9.000001: start sync from GTID set 75ae8004-3f21-11e9-be85-00163e042ba9:1-111777756,417c9ef6-f039-11e7-b6a5-00163e04da89:1-14899: dial tcp 192.168.18.30:3306: connect: connection refused\ngithub.com/pingcap/dm/pkg/terror.(*Error).Delegate /home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/pkg/terror/terror.go:267\ngithub.com/pingcap/dm/pkg/binlog/reader.(*TCPReader).StartSyncByGTID /home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/pkg/binlog/reader/tcp.go:102\ngithub.com/pingcap/dm/relay/reader.(*reader).setUpReaderByGTID /home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/reader/reader.go:160\ngithub.com/pingcap/dm/relay/reader.(*reader).Start /home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/reader/reader.go:109\ngithub.com/pingcap/dm/relay.(*Relay).setUpReader /home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/relay.go:586\ngithub.com/pingcap/dm/relay.(*Relay).process /home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/relay.go:304\ngithub.com/pingcap/dm/relay.(*Relay).Process /home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/relay.go:191\ngithub.com/pingcap/dm/dm/worker.(*realRelayHolder).run /home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/relay.go:165\ngithub.com/pingcap/dm/dm/worker.(*realRelayHolder).resumeRelay.func1 /home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/relay.go:257 runtime.goexit /usr/local/go/src/runtime/asm_amd64.s:1337"
                        }
                    ],
                    "detail": null
                }
            },
            "sourceID": "mysql-replica-01"
        }
    ]
}
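In the output above the sync unit reports "Running" while the relay unit is "Paused", which is easy to miss in a long dump. The following is a rough sketch in Python for picking out paused relay units, assuming the query-status output has been saved as valid JSON with the field names shown in this thread; `paused_relays` is a hypothetical helper, and the sample is trimmed from the output above.

```python
import json


def paused_relays(status):
    """Yield (worker, source_id, first_error_line) for each worker whose relay unit is paused."""
    for w in status.get("workers", []):
        relay = w.get("relayStatus") or {}
        if relay.get("stage") != "Paused":
            continue
        errors = (relay.get("result") or {}).get("errors") or []
        msg = errors[0].get("msg", "") if errors else ""
        # Keep only the first line; the rest of the message is the Go stack trace.
        yield w.get("worker"), w.get("sourceID"), msg.split("\n", 1)[0]


# Trimmed sample using the field names from the query-status output in this thread.
SAMPLE = r"""
{
  "result": true,
  "workers": [
    {
      "worker": "192.168.18.31:8262",
      "relayStatus": {
        "stage": "Paused",
        "result": {
          "isCanceled": false,
          "errors": [
            {
              "Type": "UnknownError",
              "msg": "[code=30014:class=relay-unit:scope=upstream:level=high] start reader for UUID: dial tcp 192.168.18.30:3306: connect: connection refused\ngithub.com/pingcap/dm/pkg/terror.(*Error).Delegate"
            }
          ]
        }
      },
      "sourceID": "mysql-replica-01"
    }
  ]
}
"""

if __name__ == "__main__":
    for worker, source_id, err in paused_relays(json.loads(SAMPLE)):
        print(worker, source_id, err)
```

The first line of the message already contains the underlying cause, here the refused TCP connection to 192.168.18.30:3306.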

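  • It looks like the connection to the database was refused. Is the database connection information correct?
  • Or is there a firewall in between? Both master and worker need to be able to connect to the upstream database.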
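One quick way to check both points is to test the TCP connection from the DM-master and DM-worker hosts. Here is a minimal sketch in Python, using the upstream address from the error message; `can_connect` is a hypothetical helper name. It only checks that the port accepts TCP connections, not the account or its privileges.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Upstream MySQL address taken from the error message in this thread.
    print(can_connect("192.168.18.30", 3306))
```

A result of False on the worker host would match the "connection refused" in the error and point at the MySQL instance being down, a wrong address or port, or a firewall rule.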

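The database account and its privileges are all in place, and this cluster has been running for almost two months. So far I have not found a good solution. Without changing the task configuration, I shut down DM, then shut down TiDB, started TiDB and DM again, re-ran the dump/load of the data, and rebuilt the cluster; after that it recovered.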

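In the official documentation I found that the error code for this problem is ErrRelayTCPReaderStartSyncGTID, [code=30014:class=relay-unit:scope=upstream:level=high], "start sync from GTID set %s". However, I could not find a corresponding solution. Do all currently defined error codes have a documented solution, and where can I find them? Thanks.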

OK.

https://pingcap.com/docs-cn/stable/reference/tools/data-migration/troubleshoot/error-handling/

Thanks a lot :slight_smile:

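So it seems the workaround for now is:
If DM did not retry automatically (for example because of its version), or the automatic retry did not succeed, try stopping the task with stop-task and then restarting it with start-task.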
