6.5版本在DM同步DDL时候被卡死,没办法取消

DM同步了一个小表的DDL到tidb-server,tidb-server一直被卡着,使用ADMIN CANCEL DDL JOBS取消DDL也一直失败,最终等待了近四个小时,才CANCEL成功。期间有大量的[2023/05/24 19:09:28.673 +08:00] [INFO] [syncer.go:333] ["[ddl] syncer check all versions, someone is not synced"] [info="instance ip {tidb-server-ip} port 4000, id 831b35bd-9d31-4ec4-a005-08ac7c6ead64"] ["ddl id"=47401] [ver=33590]报错。

过滤后的日志为:

[2023/05/24 19:09:28.569 +08:00] [INFO] [ddl_worker.go:314] ["[ddl] add DDL jobs"] ["batch count"=1] [jobs="ID:47401, Type:add column, State:queueing, SchemaState:none, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:4, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0; "] [table=true]
[2023/05/24 19:09:28.569 +08:00] [INFO] [ddl.go:1011] ["[ddl] start DDL job"] [job="ID:47401, Type:add column, State:queueing, SchemaState:none, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:4, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"] [query="ALTER TABLE `db1`.`tb1` ADD COLUMN `col1` VARCHAR(30) NULL COMMENT '收益止盈类型, FIXED:固定值, RANGE:范围值'"]
[2023/05/24 19:09:28.589 +08:00] [INFO] [ddl_worker.go:1168] ["[ddl] run DDL job"] [worker="worker 3, tp general"] [job="ID:47401, Type:add column, State:queueing, SchemaState:none, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:0, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2023/05/24 19:09:28.592 +08:00] [INFO] [column.go:133] ["[ddl] run add column job"] [job="ID:47401, Type:add column, State:running, SchemaState:none, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:4, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"] [columnInfo="{\"id\":45,\"name\":{\"O\":\"col1\",\"L\":\"col1\"},\"offset\":44,\"origin_default\":null,\"origin_default_bit\":null,\"default\":null,\"default_bit\":null,\"default_is_expr\":false,\"generated_expr_string\":\"\",\"generated_stored\":false,\"dependences\":null,\"type\":{},\"state\":0,\"comment\":\"收益止盈类型, FIXED:固定值, RANGE:范围值\",\"hidden\":false,\"change_state_info\":null,\"version\":2}"]
[2023/05/24 19:09:28.634 +08:00] [INFO] [domain.go:726] ["mdl gets lock, update to owner"] [jobID=47401] [version=33590]
[2023/05/24 19:09:28.673 +08:00] [INFO] [syncer.go:333] ["[ddl] syncer check all versions, someone is not synced"] [info="instance ip 192.168.1.221, port 4000, id 65867468-6375-4732-b2e7-1501df354479"] ["ddl id"=47401] [ver=33590]
[2023/05/24 19:09:28.673 +08:00] [INFO] [syncer.go:333] ["[ddl] syncer check all versions, someone is not synced"] [info="instance ip 192.168.1.245, port 4000, id 831b35bd-9d31-4ec4-a005-08ac7c6ead64"] ["ddl id"=47401] [ver=33590]

# 中间过滤了很多日志,基本上都是`syncer check all versions, someone is not synced`的报错

[2023/05/25 01:14:22.975 +08:00] [INFO] [ddl_worker.go:1377] ["[ddl] wait latest schema version changed(get the metadata lock if tidb_enable_metadata_lock is true)"] [ver=33590] ["take time"=6h4m54.355850116s] [job="ID:47401, Type:add column, State:running, SchemaState:delete only, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:4, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2023/05/25 01:14:22.989 +08:00] [INFO] [ddl_worker.go:1168] ["[ddl] run DDL job"] [worker="worker 3, tp general"] [job="ID:47401, Type:add column, State:cancelling, SchemaState:delete only, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:0, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2023/05/25 01:14:22.992 +08:00] [INFO] [rollingback.go:433] ["[ddl] the DDL job is cancelled normally"] [worker="worker 3, tp general"] [job="ID:47401, Type:add column, State:rollingback, SchemaState:delete only, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:1, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:[ddl:8214]Cancelled DDL job, ErrCount:1, SnapshotVersion:0"] [error="[ddl:8214]Cancelled DDL job"]
[2023/05/25 01:14:23.004 +08:00] [INFO] [ddl_worker.go:1168] ["[ddl] run DDL job"] [worker="worker 3, tp general"] [job="ID:47401, Type:add column, State:rollingback, SchemaState:delete only, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:0, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:[ddl:8214]Cancelled DDL job, ErrCount:1, SnapshotVersion:0"]
[2023/05/25 01:14:23.084 +08:00] [INFO] [domain.go:726] ["mdl gets lock, update to owner"] [jobID=47401] [version=33626]
[2023/05/25 01:15:21.269 +08:00] [INFO] [ddl_worker.go:1377] ["[ddl] wait latest schema version changed(get the metadata lock if tidb_enable_metadata_lock is true)"] [ver=33626] ["take time"=58.239228943s] [job="ID:47401, Type:add column, State:rollingback, SchemaState:delete reorganization, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:1, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:[ddl:8214]Cancelled DDL job, ErrCount:1, SnapshotVersion:0"]
[2023/05/25 01:15:21.280 +08:00] [INFO] [ddl_worker.go:1168] ["[ddl] run DDL job"] [worker="worker 3, tp general"] [job="ID:47401, Type:add column, State:rollingback, SchemaState:delete reorganization, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:0, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:[ddl:8214]Cancelled DDL job, ErrCount:1, SnapshotVersion:0"]
[2023/05/25 01:15:21.334 +08:00] [INFO] [domain.go:726] ["mdl gets lock, update to owner"] [jobID=47401] [version=33627]
[2023/05/25 01:15:21.379 +08:00] [INFO] [ddl_worker.go:1377] ["[ddl] wait latest schema version changed(get the metadata lock if tidb_enable_metadata_lock is true)"] [ver=33627] ["take time"=74.139844ms] [job="ID:47401, Type:add column, State:rollback done, SchemaState:none, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:1, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:[ddl:8214]Cancelled DDL job, ErrCount:1, SnapshotVersion:0"]
[2023/05/25 01:15:21.393 +08:00] [INFO] [ddl_worker.go:643] ["[ddl] finish DDL job"] [worker="worker 3, tp general"] [job="ID:47401, Type:add column, State:rollback done, SchemaState:none, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:0, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:[ddl:8214]Cancelled DDL job, ErrCount:1, SnapshotVersion:0"]
[2023/05/25 01:15:21.440 +08:00] [INFO] [ddl.go:1114] ["[ddl] DDL job is failed"] [jobID=47401]

你好,我想了解下,你们 dm 的 task 是怎么样的?上游几个 instance ,ddl 是啥,业务压力有多大?

ddl不兼容报错有很多,但像你说的卡住倒没见过,怕是你这被锁了吧?

mdl gets lock

有没有可能目标库在跑其他ddl,ddl是不能并发的

是单MySQL实例的同步,不过这个问题其实是跟DM没有关系。应该是这个表的metadata_lock有问题,对这个表的任何DDL操作都会卡很长时间(基本上都是3、4个小时,不过最终还是能成功),其他表没有这个问题。

天师牛逼
很可能是 metadata lock 的问题,见这里的 issue: https://github.com/pingcap/tidb/issues/43755

大概率就是这个问题,今天看7.1的release note,看到已经合并到7.1正式分支了。

此话题已在最后回复的 60 天后被自动关闭。不再允许新回复。