DM同步过程中遇到奇怪的错误SQL,且手动处理后任务无法继续

为提高效率,提问时请提供以下信息,问题描述清晰可优先响应。

  • 【TiDB 版本】:4.0.8 w/ DM 2.0.0
  • 【问题描述】:

当前我们正在线下环境测试用DM从阿里云RDS MySQL实时导入数据(共50多张表,原库表直接迁移,没有分库分表情况)。任务正常运行一段时间之后出错停止了,但未及时处理,出错2个小时之后,用dmctl query-status命令查看:

{
    "result": true,
    "msg": "",
    "sources": [
        {
            "result": true,
            "msg": "",
            "sourceStatus": {
                "source": "dm_mysql_src_test",
                "worker": "dm-10.0.46.191-8262",
                "result": null,
                "relayStatus": null
            },
            "subTaskStatus": [
                {
                    "name": "dm_mysql_task_test",
                    "stage": "Paused",
                    "unit": "Sync",
                    "result": {
                        "isCanceled": false,
                        "errors": [
                            {
                                "ErrCode": 10006,
                                "ErrClass": "database",
                                "ErrScope": "not-set",
                                "ErrLevel": "high",
                                "Message": "startLocation: [position: (mysql-bin.000011, 214170385), gtid-set: 4e104f70-2cdb-11eb-9206-ec0d9acf00fe:1-1688134,d2c18b61-da9f-11e9-b478-7cd30a518102:1-828785122], endLocation: [position: (mysql-bin.000011, 214170532), gtid-set: 4e104f70-2cdb-11eb-9206-ec0d9acf00fe:1-1688134,d2c18b61-da9f-11e9-b478-7cd30a518102:1-828785122]: execute statement failed: commit",
                                "RawCause": "Error 1062: Duplicate entry 'oujYRxHPp_kmBMNCmu' for key 'openid'",
                                "Workaround": ""
                            }
                        ],
                        "detail": null
                    },
                    "unresolvedDDLLockID": "",
                    "sync": {
                        "totalEvents": "1225708",
                        "totalTps": "69",
                        "recentTps": "48",
                        "masterBinlog": "(mysql-bin.000011, 421297478)",
                        "masterBinlogGtid": "d2c18b61-da9f-11e9-b478-7cd30a518102:1-828785122,4e104f70-2cdb-11eb-9206-ec0d9acf00fe:1-1951208",
                        "syncerBinlog": "(mysql-bin.000011, 213984816)",
                        "syncerBinlogGtid": "d2c18b61-da9f-11e9-b478-7cd30a518102:1-828785122,4e104f70-2cdb-11eb-9206-ec0d9acf00fe:1-1687811",
                        "blockingDDLs": [
                        ],
                        "unresolvedGroups": [
                        ],
                        "synced": false,
                        "binlogType": "remote"
                    }
                }
            ]
        }
    ]
}

在执行该任务的dm-worker.log中发现:

[2020/11/23 20:03:02.701 +08:00] [ERROR] [db.go:273] ["execute statements failed after retry"] [task=dm_mysql_task_test] [unit="binlog replication"] [queries="[INSERT INTO `test_agb`.`userwacc` (`userid`,`unionid`,`openid`,`avatar`,`nickname`,`gender`,`pinyin`,`subscribed`,`appopenid`,`weappopenid`,`publicopenid`,`joinweappopenid`,`rbtweappopenid`,`third_id`) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?)]"] [arguments="[[110034603 oRo5hu5-mkmg oujYRxHPp_kmBMNCmu  测试1 0  0 oujYRxHPp_kmBMNCmu <nil> <nil> <nil> <nil> ]]"] [error="[code=10006:class=database:scope=not-set:level=high], Message: execute statement failed: commit, RawCause: Error 1062: Duplicate entry 'oujYRxHPp_kmBMNCmu' for key 'openid'"]
[2020/11/23 20:03:02.712 +08:00] [ERROR] [subtask.go:300] ["unit process error"] [subtask=dm_mysql_task_test] [unit=Sync] ["error information"="{\"ErrCode\":10006,\"ErrClass\":\"database\",\"ErrScope\":\"not-set\",\"ErrLevel\":\"high\",\"Message\":\"startLocation: [position: (mysql-bin.000011, 214170385), gtid-set: 4e104f70-2cdb-11eb-9206-ec0d9acf00fe:1-1688134,d2c18b61-da9f-11e9-b478-7cd30a518102:1-828785122], endLocation: [position: (mysql-bin.000011, 214170532), gtid-set: 4e104f70-2cdb-11eb-9206-ec0d9acf00fe:1-1688134,d2c18b61-da9f-11e9-b478-7cd30a518102:1-828785122]: execute statement failed: commit\",\"RawCause\":\"Error 1062: Duplicate entry 'oujYRxHPp_kmBMNCmu' for key 'openid'\"}"]

该表的openid一列具有UNIQUE KEY约束,该INSERT语句插入了相同的openid,故无法同步。但是RDS MySQL库竟然执行成功了(i.e. 与TiDB中相比,该条数据的PRIMARY KEY列userid已经变成了110034603),按道理这是完全不应该发生的,怀疑是RDS本身的问题。

接下来试图用mysqlbinlog工具查看源库对应位点的binlog,结果无法解析,输出Row event for unknown table

/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 214170385
#201123 17:31:45 server id 1819313190  end_log_pos 0 CRC32 0x76a5d501 	Start: binlog v 4, server v 5.6.16-log created 201123 17:31:45
# at 214170385
#201123 20:03:02 server id 1819313190  end_log_pos 214170532 CRC32 0xaad43c0f 	Write_rows: table id 260 flags: STMT_END_F
### Row event for unknown table #260DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

于是我们决定暂时不去查找原因,而是直接在TiDB中执行了DELETE+INSERT语句,手动处理了出错的那一行。然后手动修改任务配置文件,在mysql-instances一节加入meta,指定出错语句之后的消费位点:

mysql-instances:
  - source-id: "dm_mysql_src_test"
    meta:
      binlog-name: "mysql-bin.000011"
      binlog-pos: 214170532
      binlog-gtid: "4e104f70-2cdb-11eb-9206-ec0d9acf00fe:1-1688134,d2c18b61-da9f-11e9-b478-7cd30a518102:1-828785122"
    ......

执行dmctl start-task命令,很久没有反应,然后输出:

{
    "result": true,
    "msg": "",
    "sources": [
        {
            "result": false,
            "msg": "[code=38032:class=dm-master:scope=internal:level=high], Message: some error occurs in dm-worker: ErrCode:10001 ErrClass:\"database\" ErrScope:\"downstream\" ErrLevel:\"high\" Message:\"fail to initial unit Sync of subtask dm_mysql_task_test : database driver error\" RawCause:\"context deadline exceeded\" Workaround:\"Please check the database connection and the database config in configuration file.\" , Workaround: Please execute `query-status` to check status.",
            "source": "dm_mysql_src_test",
            "worker": "dm-10.0.46.191-8262"
        }
    ]
}

dmctl query-status显示(反应时间也很长,大概几十秒):

{
    "result": true,
    "msg": "",
    "sources": [
        {
            "result": false,
            "msg": "[code=38008:class=dm-master:scope=internal:level=high], Message: grpc request error, RawCause: rpc error: code = DeadlineExceeded desc = context deadline exceeded",
            "sourceStatus": {
                "source": "dm_mysql_src_test",
                "worker": "",
                "result": null,
                "relayStatus": null
            },
            "subTaskStatus": [
            ]
        }
    ]
}

将任务配置中加入的meta信息删掉,重启任务,错误仍然如旧。期间多次检查MySQL数据源的配置,均是正常状态。

请问如何解决上述问题?Many thanks~

Hi,是2.0已知bug,在表多的时候会触发。可以按照文档升级到nightly

https://docs.pingcap.com/zh/tidb-data-migration/stable/maintain-dm-using-tiup#升级操作

下面是asktug其他用户遇到的相似问题

1 个赞

Hi,感谢回复~

我执行下面的命令试图patch nightly版本,发生错误:

# wget http://download.pingcap.org/dm-nightly-linux-amd64.tar.gz

# tiup dm patch sht-tidm-cluster-pro dm-nightly-linux-amd64.tar.gz -N 10.0.46.191:8262,10.0.46.192:8262,10.0.46.193:8262
Starting component `dm`: /root/.tiup/components/dm/v1.2.4/tiup-dm patch sht-tidm-cluster-pro dm-nightly-linux-amd64.tar.gz -N 10.0.46.191:8262,10.0.46.192:8262,10.0.46.193:8262

Error: entry dm-worker/dm-worker not found in package dm-nightly-linux-amd64.tar.gz

Verbose debug logs has been written to /root/logs/tiup-cluster-debug-2020-11-24-11-34-54.log.
Error: run `/root/.tiup/components/dm/v1.2.4/tiup-dm` (wd:/root/.tiup/data/SHE4trf) failed: exit status 1

查看产生的报错日志,同样包含Error: entry dm-worker/dm-worker not found in package dm-nightly-linux-amd64.tar.gz信息。请问应该如何正确使用patch?感谢~

按照文档直接升级即可,patch包是要单独获取的,并不是官网那个

OK。但是在官网和GitHub上都没有找到patch包在哪里,是官方单独发送的吗,还是我没找到?(另外“点击提供有效信息”那里已经填写完了)

patch需要等待社区专家。另外可以不依赖patch,使用

tiup dm upgrade <cluster-name> nightly

进行更新

1 个赞

好的,感谢,现在正常了~

:+1:

我也遇到这个bug了

此话题已在最后回复的 1 分钟后被自动关闭。不再允许新回复。