数据迁移异常

为提高效率,提问时请提供以下信息,问题描述清晰可优先响应。

  • 【TiDB 版本】:tidb4.0.8 dm2.0
  • 【问题描述】:一个source源上的所有task都异常终止了,task错误信息如下:
    “result”: null,
    “relayStatus”: null
    },
    “subTaskStatus”: [
    {
    “name”: “tasksupplier”,
    “stage”: “Paused”,
    “unit”: “InvalidUnit”,
    “result”: {
    “isCanceled”: false,
    “errors”: [
    {
    “ErrCode”: 10001,
    “ErrClass”: “database”,
    “ErrScope”: “downstream”,
    “ErrLevel”: “high”,
    “Message”: “fail to initial unit Sync of subtask tasksupplier : database driver error”,
    “RawCause”: “context deadline exceeded”,
    “Workaround”: “Please check the database connection and the database config in configuration file.”
    }
    ],
    “detail”: null
    },
    “unresolvedDDLLockID”: “”
    }
    ]
    dm worker日志内容如下:
    worker日志:[2020/11/19 16:32:23.395 +08:00] [ERROR] [server.go:1391] [“fail to query operation”] [retryNum=0] [task=tasksupplier ] [source=mysql-7491] [expect=Running] [error="[code=38008:class=dm-master:scope=internal:level=high], Message: grpc request error, RawCause: rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
    若提问为性能优化、故障排查类问题,请下载脚本运行。终端输出的打印结果,请务必全选并复制粘贴上传。

你好,麻烦检查下下面几项:
1.请问该 source 源是否正常运行,是否出现重启的情况?
2.当前 dm-master 状态是否正常?是否可以提供下 dm-master 和 dm-worker 的日志?

source源状态:

(源端的mysql 一直正常运行,未曾重启过)
tiup dmctl --master-addr 10.10.74.232:8261 operate-source show source7491.yaml
“result”: true,
“msg”: “”,
“source”: “mysql-7491”,
“worker”: “dm-10.10.74.231-8262”
},
dm-master(leader日志)
[2020/11/19 15:51:07.213 +08:00] [INFO] [server.go:1890] [payload="name:“voucher_header” "] [request=QueryStatus]
[2020/11/19 15:51:07.996 +08:00] [INFO] [server.go:1890] [payload="name:“voucher_header” "] [request=QueryStatus]
[2020/11/19 15:51:08.925 +08:00] [INFO] [server.go:1890] [payload="name:“voucher_header” "] [request=QueryStatus]
[2020/11/19 15:52:13.792 +08:00] [INFO] [server.go:1890] [payload="name:“voucher_header” "] [request=QueryStatus]
[2020/11/19 15:52:15.598 +08:00] [INFO] [server.go:1890] [payload="name:“voucher_header” "] [request=QueryStatus]
[2020/11/19 15:52:17.556 +08:00] [INFO] [server.go:1890] [payload="name:“voucher_header” "] [request=QueryStatus]
[2020/11/19 15:53:08.201 +08:00] [INFO] [server.go:1890] [payload="name:“voucher_header” "] [request=QueryStatus]
[2020/11/19 15:53:09.522 +08:00] [INFO] [server.go:1890] [payload="name:“voucher_header” "] [request=QueryStatus]
[2020/11/19 15:56:52.771 +08:00] [INFO] [server.go:1890] [payload="op:Resume name:“voucher_header” "] [request=OperateTask]
[2020/11/19 15:56:52.776 +08:00] [WARN] [grpclog.go:60] [“transport: http2Server.HandleStreams failed to read frame: read tcp 10.0.74.232:8261->10.0.74.229:8327: read: connection reset by peer”] [component=“embed etcd”]
[2020/11/19 15:56:58.912 +08:00] [INFO] [server.go:1890] [payload="name:“voucher_header” "] [request=QueryStatus]
[2020/11/19 15:57:01.113 +08:00] [INFO] [server.go:1890] [payload="name:“voucher_header” "] [request=QueryStatus]
[2020/11/19 15:58:11.136 +08:00] [INFO] [server.go:1890] [payload="name:“voucher_header” "] [request=QueryStatus]
[2020/11/19 15:58:12.333 +08:00] [INFO] [server.go:1890] [payload="name:“voucher_header” "] [request=QueryStatus]
[2020/11/19 15:58:14.482 +08:00] [INFO] [server.go:1890] [payload="name:“voucher_header” "] [request=QueryStatus]
[2020/11/19 15:58:15.322 +08:00] [INFO] [server.go:1890] [payload="name:“voucher_header” "] [request=QueryStatus]
[2020/11/19 16:06:45.454 +08:00] [INFO] [server.go:1890] [payload=] [request=QueryStatus]
[2020/11/19 16:06:45.454 +08:00] [INFO] [server.go:668] [“get sources”]
[2020/11/19 16:30:15.217 +08:00] [INFO] [server.go:1890] [payload=] [request=QueryStatus]
[2020/11/19 16:30:15.218 +08:00] [INFO] [server.go:668] [“get sources”]
[2020/11/19 16:30:34.916 +08:00] [INFO] [server.go:1890] [payload="name:“taskfactoryscmstmssss” "] [request=QueryStatus]
[2020/11/19 16:31:53.390 +08:00] [INFO] [server.go:1890] [payload=“op:Resume name:“taskfactoryscmstmssss” “] [request=OperateTask]
[2020/11/19 16:32:23.395 +08:00] [ERROR] [server.go:1391] [“fail to query operation”] [retryNum=0] [task=taskfactoryscmstmssss] [source=mysql-7491] [expect=Running] [error=”[code=38008:class=dm-master:scope=internal:level=high], Message: grpc request error, RawCause: rpc error: code = DeadlineExceeded desc = context deadline exceeded”]
[2020/11/19 16:32:54.171 +08:00] [INFO] [server.go:1890] [payload="leader:true master:true names:“dm-10.0.74.230-8261” "] [request=ListMember]
[2020/11/19 16:32:54.277 +08:00] [INFO] [server.go:1890] [payload="leader:true master:true names:“dm-10.0.74.231-8261” "] [request=ListMember]
[2020/11/19 16:32:54.386 +08:00] [INFO] [server.go:1890] [payload="leader:true master:true names:“dm-10.0.74.232-8261” "] [request=ListMember]
[2020/11/19 16:32:54.491 +08:00] [INFO] [server.go:1890] [payload="worker:true names:“dm-10.0.74.230-8262” "] [request=ListMember]
[2020/11/19 16:32:54.593 +08:00] [INFO] [server.go:1890] [payload="worker:true names:“dm-10.0.74.231-8262” "] [request=ListMember]
[2020/11/19 16:32:54.695 +08:00] [INFO] [server.go:1890] [payload="worker:true names:“dm-10.0.74.232-8262” "] [request=ListMember]
[2020/11/19 16:35:10.166 +08:00] [INFO] [server.go:1890] [payload="name:“taskfactoryscmstmssss” "] [request=QueryStatus]
[2020/11/19 16:35:51.632 +08:00] [INFO] [server.go:1890] [payload=] [request=QueryStatus]
[2020/11/19 16:35:51.632 +08:00] [INFO] [server.go:668] [“get sources”]
[2020/11/19 16:36:28.736 +08:00] [INFO] [server.go:1890] [payload="name:“tasksupplier” "] [request=QueryStatus]
[2020/11/19 16:40:15.742 +08:00] [INFO] [server.go:1890] [payload=] [request=QueryStatus]
[2020/11/19 16:40:15.743 +08:00] [INFO] [server.go:668] [“get sources”]
[2020/11/19 16:40:35.595 +08:00] [INFO] [server.go:1890] [payload="leader:true master:true names:“dm-10.0.74.230-8261” "] [request=ListMember]
[2020/11/19 16:40:35.701 +08:00] [INFO] [server.go:1890] [payload="leader:true master:true names:“dm-10.0.74.231-8261” "] [request=ListMember]

DM-WORKER日志:
[2020/11/19 16:32:23.395 +08:00] [ERROR] [server.go:1391] [“fail to query operation”] [retryNum=0] [task=tasksupplier ] [source=mysql-7491] [expect=Running] [error="[code=38008:class=dm-master:scope=internal:level=high], Message: grpc request error, RawCause: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
今天早上有重启了一下task ,相关dm-worker日志
my-binlog.001038, 1570996), gtid-set: )"]
[2020/11/20 10:04:15.388 +08:00] [INFO] [syncer.go:2278] [“binlog replication progress”] [task=voucher_line] [unit=“binlog replication”] [“total binlog size”=170076597] [“last binlog size”=169331064] [“cost time”=30] [bytes/Second=24851] [“unsynced binlog size”=0] [“estimate time to catch up”=0]
[2020/11/20 10:04:15.388 +08:00] [INFO] [syncer.go:2303] [“binlog replication status”] [task=voucher_line] [unit=“binlog replication”] [total_events=205] [total_tps=0] [tps=0] [master_position="(my-binlog.001038, 2287184)"] [master_gtid=] [checkpoint=“position: (my-binlog.001038, 2287184), gtid-set: (flushed position: (my-binlog.001038, 1570996), gtid-set: )”]
[2020/11/20 10:04:18.441 +08:00] [INFO] [syncer.go:870] [“flush checkpoints except for these tables”] [task=voucher_line] [unit=“binlog replication”] [tables="[]"]
[2020/11/20 10:04:18.441 +08:00] [INFO] [syncer.go:873] [“prepare flush sqls”] [task=voucher_line] [unit=“binlog replication”] [“shard meta sqls”="[“DELETE FROM meta_finance.voucher_line_syncer_sharding_meta where source_id=? and target_table_id=?”]"] [“shard meta arguments”="[[“mysql-7491”,"csx_b2b_finance.voucher_line"]]"]
[2020/11/20 10:04:18.457 +08:00] [INFO] [syncer.go:880] [“flushed checkpoint”] [task=voucher_line] [unit=“binlog replication”] [checkpoint=“position: (my-binlog.001038, 2376175), gtid-set: (flushed position: (my-binlog.001038, 2376175), gtid-set: )”]
[2020/11/20 10:04:42.868 +08:00] [INFO] [rollback.go:59] [“rolling back”] [functon=close-checkpoint] [onwer=syncer]
[2020/11/20 10:04:42.868 +08:00] [INFO] [rollback.go:59] [“rolling back”] [functon=close-DBs] [onwer=syncer]
[2020/11/20 10:04:42.868 +08:00] [INFO] [loader.go:716] [“stop importing data process”] [task=taskfactoryscmstmssss] [unit=load]
[2020/11/20 10:04:42.869 +08:00] [INFO] [loader.go:1313] [“clean dump files”] [task=taskfactoryscmstmssss] [unit=load]
[2020/11/20 10:04:42.869 +08:00] [ERROR] [subtask.go:187] [“fail to initial subtask”] [subtask=taskfactoryscmstmssss] [error="[code=10001:class=database:scope=downstream:level=high], Message: fail to initial unit Sync of subtask taskfactoryscmstmssss : database driver error, RawCause: context deadline exceeded, Workaround: Please check the database connection and the database config in configuration file."]
[2020/11/20 10:04:42.869 +08:00] [INFO] [task_checker.go:328] [“dispatch auto resume task”] [component=“task checker”] [task=taskfactoryscmstmssss]
[2020/11/20 10:04:42.869 +08:00] [WARN] [task_checker.go:320] [“backoff skip auto resume task”] [component=“task checker”] [task=taskfactoryscmstmssss] [latestResumeTime=2020/11/20 10:04:42.869 +08:00] [duration=5m0s]
[2020/11/20 10:04:42.869 +08:00] [WARN] [task_checker.go:320] [“backoff skip auto resume task”] [component=“task checker”] [task=tasksupplier] [latestResumeTime=2020/11/20 10:03:27.871 +08:00] [duration=5m0s]
[2020/11/20 10:04:45.388 +08:00] [INFO] [syncer.go:2278] [“binlog replication progress”] [task=voucher_line] [unit=“binlog replication”] [“total binlog size”=173375052] [“last binlog size”=170076597] [“cost time”=30] [bytes/Second=109948] [“unsynced binlog size”=0] [“estimate time to catch up”=0]
[2020/11/20 10:04:45.389 +08:00] [INFO] [syncer.go:2303] [“binlog replication status”] [task=voucher_line] [unit=“binlog replication”] [total_events=205] [total_tps=0] [tps=0] [master_position="(my-binlog.001038, 5585639)"] [master_gtid=] [checkpoint=“position: (my-binlog.001038, 5585639), gtid-set: (flushed position: (my-binlog.001038, 2376175), gtid-set: )”]
[2020/11/20 10:04:47.367 +08:00] [WARN] [task_checker.go:320] [“backoff skip auto resume task”] [component=“task checker”] [task=taskfactoryscmstmssss] [latestResumeTime=2020/11/20 10:04:42.869 +08:00] [duration=5m0s]
[2020/11/20 10:04:47.367 +08:00] [WARN] [task_checker.go:320] [“backoff skip auto resume task”] [component=“task checker”] [task=tasksupplier] [latestResumeTime=2020/11/20 10:03:27.871 +08:00] [duration=5m0s]
[2020/11/20 10:04:49.163 +08:00] [INFO] [syncer.go:870] [“flush checkpoints except for these tables”] [task=voucher_line] [unit=“binlog replication”] [tables="[]"]
[2020/11/20 10:04:49.163 +08:00] [INFO] [syncer.go:873] [“prepare flush sqls”] [task=voucher_line] [unit=“binlog replication”] [“shard meta sqls”="[“DELETE FROM meta_finance.voucher_line_syncer_sharding_meta where source_id=? and target_table_id=?”]"] [“shard meta arguments”="[[“mysql-7491”,"csx_b2b_finance.voucher_line"]]"]
[2020/11/20 10:04:49.178 +08:00] [INFO] [syncer.go:880] ["flushed checkpoin

确认下 dm-worker 的机器,使用 source 中的配置的账号密码,能登陆到上游数据库吗?

现在出问题的worker

可以登录,其他的source也连同一个MySQL源,账号密码同一个,其他的source没问题

是否有尝试重启 dm-worker? 另外可以提供完整的 dm-worker 日志文件么?

日志有点大 ,百度云盘https://pan.baidu.com/s/18CapEz5MIQaWLwtD7WZXUA

提取码:jqvb

你好,这边跟研发同学确认了下是 2.0 的已知 bug,这块超时处理有些问题,目前下一个版本发版时间还不确定,可以使用 nightly 版本(已修复该问题)

注意:需要把 TiUP self 先更到最新,旧版有些问题。然后临时 tiup dm patch 一下 nightly 的版本。

生产环境能使用nightly吗 有以下几个问题:
1、超时时间是多久
2、怎么能避免(我的场景是用了多个source,但是多个source都是指定的同一个mysql源,出问题的是已经正常运行增量的source,在增量运行期间,有task 做全量同步,全量同步拿数据是有在mysql 源端是不是执行(flush table with read lock)(不止这一个source上的task有增量,其他source的task也有增量),
3、我现在出问题的worker怎么恢复正常使用呢

1min
表多的情况下会遇到,如果你这边不使用 nightly,麻烦填写一下联系我们相关信息,我们看下是否可以提供 hotfix 版本

在哪里填写

已经提交

:+1:

今天又遇到这个问题了,想问一下这个是因为一个task中表太多吗,还是因为一个worker中太多task,怎么能避免这种场景,

那就使用 nightly 版本吧。

想问一下我怎么能避免这种问题呢,还有hotfix版本是否能够避免这个问题

tiup dm upgrade nightly 升级吧。
填完表单需要其他同事评估,已知 bug 还是升级下。

主要原因是:2.0 里内部会为每个表建 track 它的 schema 信息。之前那里的超时处理有问题,如果表稍微多一些,就造成在指定时间内处理不完,超时失败

有两个问题:1、多建几个worker,task 分布在几个worker上 是否能避免,还是说这个跟worker和task个数没关系。
2、使用nightly会不会有其他风险,你们评估大概啥时候能评估好

这边建议你使用 nightly 吧,已经有用户升级过没问题了。