[collector.go:203] ["backup failed summary"] [task=dm-rds_master] [unit=dump]

【环境说明】
TiDB: v5.4.1
DM: v5.4.1

同步任务配置:

【问题说明】
DM在全量dump数据时异常停止,执行resume-task后已经导出的本地数据被清理,但是通过query-status查看导出情况,发现进度仍然是之前的进度。
具体如下:
异常退出

[2022/05/26 13:49:48.124 +08:00] [INFO] [status.go:31] [progress] [task=dm-rds_master] [unit=dump] [tables=“46/168 (27.4%)”] [“finished rows”=1426395870] [“estimate total rows”=1387484471] [“finish ed size”=666GB] [“average speed(MiB/s)”=72.45181706066245]
[2022/05/26 13:51:48.124 +08:00] [INFO] [status.go:31] [progress] [task=dm-rds_master] [unit=dump] [tables=“46/168 (27.4%)”] [“finished rows”=1449285519] [“estimate total rows”=1387484471] [“finish ed size”=675GB] [“average speed(MiB/s)”=71.97121670267033]
[2022/05/26 13:52:50.136 +08:00] [INFO] [collector.go:203] [“backup failed summary”] [task=dm-rds_master] [unit=dump] [total-ranges=0] [ranges-succeed=0] [ranges-failed=0]
[2022/05/26 13:52:50.155 +08:00] [INFO] [writer.go:77] [“context has been done, the writer will exit”] [task=dm-rds_master] [unit=dump] [“writer ID”=0]
[2022/05/26 13:52:50.162 +08:00] [ERROR] [dumpling.go:152] [“dump data exits with error”] [task=dm-rds_master] [unit=dump] [“cost time”=1h25m26.411221266s] [error=“ErrCode:32001 ErrClass:"dump-unit" ErrScope:"internal" ErrLevel:"high" Message:"mydumper/dumpling runs with error, with output (may empty): " RawCause:"sql: SHOW COLUMNS FROM marketing_center.mc_activity_record_32: driver: bad connection" “]
[2022/05/26 13:52:50.163 +08:00] [INFO] [subtask.go:310] [“unit process returned”] [subtask=dm-rds_master] [unit=Dump] [stage=Paused] [status=”{"totalTables":168,"completedTables":46,"finishedBytes":679914782035,"finishedRows":1461745697,"estimateTotalRows":1387484471}”]
[2022/05/26 13:52:50.163 +08:00] [ERROR] [subtask.go:331] [“unit process error”] [subtask=dm-rds_master] [unit=Dump] [“error information”="ErrCode:32001 ErrClass:"dump-unit" ErrScope:"internal" ErrLevel:"high" Message:"mydumper/dumpling runs with error, with output (may empty): " RawCause:"sql: SHOW COLUMNS FROM marketing_center.mc_activity_record_32: driver: bad connection" "]
[2022/05/26 13:52:50.163 +08:00] [INFO] [subtask.go:333] [paused] [subtask=dm-rds_master] [unit=Dump]
[2022/05/26 13:52:50.714 +08:00] [WARN] [task_checker.go:383] [“task can’t auto resume”] [component=“task checker”] [task=dm-rds_master] [“paused duration”=1h55m54.999266586s]
[2022/05/26 13:52:55.709 +08:00] [WARN] [task_checker.go:383] [“task can’t auto resume”] [component=“task checker”] [task=dm-rds_master] [“paused duration”=1h55m59.994170723s]
[2022/05/26 13:53:00.710 +08:00] [WARN] [task_checker.go:383] [“task can’t auto resume”] [component=“task checker”] [task=dm-rds_master] [“paused duration”=1h56m4.994740739s]

查看状态:query-status dm-rds_master

tiup is checking updates for component dmctl …
Starting component dmctl: /home/admin/.tiup/components/dmctl/v6.0.0/dmctl/dmctl /home/admin/.tiup/components/dmctl/v6.0.0/dmctl/dmctl --master-addr=172.19.58.205:8261 query-status dm-rds_master
{
“result”: true,
“msg”: “”,
“sources”: [
{
“result”: true,
“msg”: “”,
“sourceStatus”: {
“source”: “ds-rds_master”,
“worker”: “dm-xxxx-9262”,
“result”: null,
“relayStatus”: null
},
“subTaskStatus”: [
{
“name”: “dm-rds_master”,
“stage”: “Paused”,
“unit”: “Dump”,
“result”: {
“isCanceled”: false,
“errors”: [
{
“ErrCode”: 32001,
“ErrClass”: “dump-unit”,
“ErrScope”: “internal”,
“ErrLevel”: “high”,
“Message”: "mydumper/dumpling runs with error, with output (may empty): ",
“RawCause”: “sql: SHOW COLUMNS FROM marketing_center.mc_activity_record_32: driver: bad connection”,
“Workaround”: “”
}
],
“detail”: null
},
“unresolvedDDLLockID”: “”,
“dump”: {
“totalTables”: “168”,
“completedTables”: 76,
“finishedBytes”: 780307764962,
“finishedRows”: 1587939832,
“estimateTotalRows”: 2104326425
}
}
]
}
]
}

执行resume-task dm-rds_master,并查看状态

tiup is checking updates for component dmctl …
Starting component dmctl: /home/admin/.tiup/components/dmctl/v6.0.0/dmctl/dmctl /home/admin/.tiup/components/dmctl/v6.0.0/dmctl/dmctl --master-addr=172.19.58.205:8261 query-status dm-rds_master
{
“result”: true,
“msg”: “”,
“sources”: [
{
“result”: true,
“msg”: “”,
“sourceStatus”: {
“source”: “ds-rds_master”,
“worker”: “dm-xxx-9262”,
“result”: null,
“relayStatus”: null
},
“subTaskStatus”: [
{
“name”: “dm-rds_master”,
“stage”: “Running”,
“unit”: “Dump”,
“result”: null,
“unresolvedDDLLockID”: “”,
“dump”: {
“totalTables”: “168”,
“completedTables”: 76,
“finishedBytes”: 780438448955,
“finishedRows”: 1589111472,
“estimateTotalRows”: 2106467643
}
}
]
}
]
}

查看磁盘使用

[admin@dm-cluster01 dm-worker-9262]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/vdb1 1.5T 5.5G 1.4T 1% /ccdata

从上面的报错日志上可以知道,已经导出675G,但是磁盘只使用了5.5G,被清空了;重新恢复后,进度也是6/7百G:

“dump”: {
“totalTables”: “168”,
“completedTables”: 76,
“finishedBytes”: 780307764962,
“finishedRows”: 1587939832,
“estimateTotalRows”: 2104326425
}

恢复dump一段时间后又出现异常退出,并且paused duration明显感觉不对

[2022/05/26 15:52:13.076 +08:00] [INFO] [collector.go:203] [“backup failed summary”] [task=dm-rds_master] [unit=dump] [total-ranges=0] [ranges-succeed=0] [ranges-failed=0]
[2022/05/26 15:52:13.095 +08:00] [INFO] [writer.go:77] [“context has been done, the writer will exit”] [task=dm-rds_master] [unit=dump] [“writer ID”=0]
[2022/05/26 15:52:13.102 +08:00] [ERROR] [dumpling.go:152] [“dump data exits with error”] [task=dm-rds_master] [unit=dump] [“cost time”=1h24m16.249526321s] [error=“ErrCode:32001 ErrClass:"dump-unit" ErrScope:"internal" ErrLevel:"high" Message:"mydumper/dumpling runs with error, with output (may empty): " RawCause:"sql: SHOW COLUMNS FROM marketing_center.mc_activity_record_32: driver: bad connection" “]
[2022/05/26 15:52:13.103 +08:00] [INFO] [subtask.go:310] [“unit process returned”] [subtask=dm-rds_master] [unit=Dump] [stage=Paused] [status=”{"totalTables":168,"completedTables":99,"finishedBytes":1121792198364,"finishedRows":2322327027,"estimateTotalRows":2801300683}”]
[2022/05/26 15:52:13.103 +08:00] [ERROR] [subtask.go:331] [“unit process error”] [subtask=dm-rds_master] [unit=Dump] [“error information”="ErrCode:32001 ErrClass:"dump-unit" ErrScope:"internal" ErrLevel:"high" Message:"mydumper/dumpling runs with error, with output (may empty): " RawCause:"sql: SHOW COLUMNS FROM marketing_center.mc_activity_record_32: driver: bad connection" "]
[2022/05/26 15:52:13.103 +08:00] [INFO] [subtask.go:333] [paused] [subtask=dm-rds_master] [unit=Dump]
[2022/05/26 15:52:15.715 +08:00] [WARN] [task_checker.go:383] [“task can’t auto resume”] [component=“task checker”] [task=dm-rds_master] [“paused duration”=3h55m20.000207423s]
[2022/05/26 15:52:20.710 +08:00] [WARN] [task_checker.go:383] [“task can’t auto resume”] [component=“task checker”] [task=dm-rds_master] [“paused duration”=3h55m24.994350679s]
[2022/05/26 15:52:25.709 +08:00] [WARN] [task_checker.go:383] [“task can’t auto resume”] [component=“task checker”] [task=dm-rds_master] [“paused duration”=3h55m29.994009346s]
[2022/05/26 15:52:30.710 +08:00] [WARN] [task_checker.go:383] [“task can’t auto resume”] [component=“task checker”] [task=dm-rds_master] [“paused duration”=3h55m34.994646301s]
[2022/05/26 15:52:35.709 +08:00] [WARN] [task_checker.go:383] [“task can’t auto resume”] [component=“task checker”] [task=dm-rds_master] [“paused duration”=3h55m39.994240072s]
[2022/05/26 15:52:40.710 +08:00] [WARN] [task_checker.go:383] [“task can’t auto resume”] [component=“task checker”] [task=dm-rds_master] [“paused duration”=3h55m44.994714045s]
[2022/05/26 15:52:45.710 +08:00] [WARN] [task_checker.go:383] [“task can’t auto resume”] [component=“task checker”] [task=dm-rds_master] [“paused duration”=3h55m49.994378108s]

并且无法stop-task

{
“op”: “Delete”,
“result”: false,
“msg”: “[code=38041:class=dm-master:scope=internal:level=medium], Message: invalid op 7 on task”,
“sources”: [
]
}

dump unit 重新启动后会清理上次残留的数据。

您 DM-worker 部署位置是在 /ccdata 下吗?dump 单元会在 DM-worker 的部署位置写入数据

此外 invalid op 7 on task 的问题,请使用对应版本的 dmctl 操作 DM 集群

日志中 pause 时间异常的问题,请您提供完整日志我们看一下

2 个赞

目前已经解决了,把上游wait_timeout(默认3600s)改大了就好了;是连接超时断开了导致,建议加个重连机制,要不然拉取超大表数据的时候,会因为时间太久(超过3600s),再获取下一张表结构的时候就会有异常了,重拉的时间成本太大了。

异常日志被我清掉了,拿不到了,不好意思:rofl:

2 个赞

此话题已在最后回复的 1 分钟后被自动关闭。不再允许新回复。