重启 线上 dm master & workers,导致dm集群不可用

为提高效率,提问时请提供以下信息,问题描述清晰可优先响应。

  • 【TiDB 版本】:tidb v2.1.15 / dm v1.0.0-alpha-83-gf6f0566
  • 【问题描述】: 在下游重新部署了一个dm worker,使用的serverid和之前的 worker重复,导致老的worker无法读取binlog。知道id不能重复后,关闭新的worker。

重新启动老的worker,发现:


rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.xx.xx.xx:8262: connect: connection refused\"

老的worker 一直报错:

panic: out of sorted order from level db for task log key 21 44 4D 21 54 61 73 6B 4C 6F 67 87 00 00 00 00 00 00 00 (log ID 135)

goroutine 1 [running]:
github.com/pingcap/dm/dm/worker.(*Logger).Initial(0xc00020e420, 0xc0003984e0, 0xc00020ad40, 0x16, 0x1, 0xc00020ad40, 0x16)
	/home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/log.go:166 +0x814
github.com/pingcap/dm/dm/worker.(*Metadata).loadFromDB(0xc00033c100, 0xc00020ad40, 0x16)
	/home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/meta.go:255 +0x3c
github.com/pingcap/dm/dm/worker.NewMetadata(0xc0001e7580, 0x11, 0xc0003984e0, 0xc0003984e0, 0x0, 0x0)
	/home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/meta.go:117 +0x247
github.com/pingcap/dm/dm/worker.NewWorker(0xc0001c0300, 0x15c7c00, 0xc00003e068, 0x13d3ad8)
	/home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/worker.go:99 +0x374
github.com/pingcap/dm/dm/worker.(*Server).Start(0xc0001e8a40, 0x144d150, 0xc0001ec8a0)
	/home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/server.go:70 +0xe0
main.main()
	/home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/cmd/dm-worker/main.go:65 +0x349

若提问为性能优化、故障排查类问题,请下载脚本运行。终端输出的打印结果,请务必全选并复制粘贴上传。

看 Panic 信息,是一个 DM 的 bug 问题,在后面已经修复了,另外看目前使用的版本是 v1.0.0-alpha ,可能不太稳定,推荐使用 1.0 GA 或者之后的版本。

请问现在 要怎么修复这个worker 呢?

因为这个worker比较重要,数据量也大,重新做,会很费时间,同时影响的地方比较多。

如果要在已有的环境去修复,麻烦描述下具体的操作问题以及报错。

经过如下:

1.在下游重新部署了一个dm worker,使用的serverid和之前的 worker重复,导致老的worker无法读取binlog。

  1. 知道id不能重复后,关闭新的worker。此时的报错log:
2020/02/24 10:56:04.507 relay.go:176: [error] [relay] process exit with error ERROR 1236 (HY000): A slave with the same server_uuid/server_id as this slave has connected to the master; the first event '' at 4, the last event read from '/var/lib/mysql-binlog/mysql-bin.008871' at 107854788, the last byte read from '/var/lib/mysql-binlog/mysql-bin.008871' at 107854788.
github.com/pingcap/errors.AddStack
        /go/pkg/mod/github.com/pingcap/errors@v0.11.0/errors.go:174
github.com/pingcap/errors.Trace
        /go/pkg/mod/github.com/pingcap/errors@v0.11.0/juju_adaptor.go:12
github.com/pingcap/dm/relay.(*Relay).process
        /home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/relay.go:357
github.com/pingcap/dm/relay.(*Relay).Process
        /home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/relay.go:173
github.com/pingcap/dm/dm/worker.(*RelayHolder).run
        /home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/relay.go:107
github.com/pingcap/dm/dm/worker.(*RelayHolder).Start.func1
        /home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/relay.go:83
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1337
  1. 重新启动老的worker,执行滚动重启命令:
ansible-playbook rolling_update.yml --tags=dm-worker
ansible-playbook rolling_update.yml --tags=dm-master

在dmctl中,发现该worker的信息:

        {
            "result": false,
            "worker": "172.xx.xx.xx:8262",
            "msg": "rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.xx.xx.xx:8262: connect: connection refused\"",
            "subTaskStatus": [
            ],
            "relayStatus": null

查看老的worker的dm-worker-stderr.log, 重复刷该报错:

panic: out of sorted order from level db for task log key 21 44 4D 21 54 61 73 6B 4C 6F 67 87 00 00 00 00 00 00 00 (log ID 135)

goroutine 1 [running]:
github.com/pingcap/dm/dm/worker.(*Logger).Initial(0xc00020e420, 0xc0003984e0, 0xc00020ad40, 0x16, 0x1, 0xc00020ad40, 0x16)
	/home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/log.go:166 +0x814
github.com/pingcap/dm/dm/worker.(*Metadata).loadFromDB(0xc00033c100, 0xc00020ad40, 0x16)
	/home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/meta.go:255 +0x3c
github.com/pingcap/dm/dm/worker.NewMetadata(0xc0001e7580, 0x11, 0xc0003984e0, 0xc0003984e0, 0x0, 0x0)
	/home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/meta.go:117 +0x247
github.com/pingcap/dm/dm/worker.NewWorker(0xc0001c0300, 0x15c7c00, 0xc00003e068, 0x13d3ad8)
	/home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/worker.go:99 +0x374
github.com/pingcap/dm/dm/worker.(*Server).Start(0xc0001e8a40, 0x144d150, 0xc0001ec8a0)
	/home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/server.go:70 +0xe0
main.main()
	/home/jenkins/workspace/build_dm_master/go/src/github.com/pingcap/dm/cmd/dm-worker/main.go:65 +0x349

但是,dm-worker.log没有报错,一直卡在最后一行:

2020/02/24 12:49:17.382 printer.go:52: [info] Welcome to worker
2020/02/24 12:49:17.382 printer.go:53: [info] Release Version: v1.0.0-alpha-83-gf6f0566
2020/02/24 12:49:17.382 printer.go:54: [info] Git Commit Hash: f6f056642417f09e4c7f304e9f4fb4c5c58d2d5a
2020/02/24 12:49:17.382 printer.go:55: [info] Git Branch: master
2020/02/24 12:49:17.382 printer.go:56: [info] UTC Build Time: 2019-05-22 02:46:15
2020/02/24 12:49:17.382 printer.go:57: [info] Go Version: go version go1.12 linux/amd64
2020/02/24 12:49:17.382 main.go:47: [info] config: {"log-level":"info","log-file":"/data1/deploy/dm/dm_worker1/log/dm-worker.log","log-rotate":"","worker-addr":":8262","enable-gtid":true,"auto-fix-gtid":false,"relay-dir":"/data1/deploy/dm/dm_worker1/relay_log","meta-dir":"./dm_worker_meta","server-id":101,"flavor":"mysql","charset":"","relay-binlog-name":"","relay-binlog-gtid":"a06ce942-d0ec-11e7-8959-00163e0c3232:1-1071835021","source-id":"mysql-replica-01","from":{"host":"stg-db-13.int.frontnode.net","port":3306,"user":"tidb-syncer","max-allowed-packet":67108864},"purge":{"interval":3600,"expires":0,"remain-space":15},"tracer":{"enable":false,"source":"mysql-replica-01","tracer-addr":"","batch-size":20,"checksum":false},"config-file":"conf/dm-worker.toml"}
2020/02/24 12:49:17.383 db.go:51: [info] kv db config: &{BlockCacheCapacity:8388608 BlockRestartInterval:16 BlockSize:4096 CompactionL0Trigger:8 CompactionTableSize:67108864 CompactionTotalSize:536870912 CompactionTotalSizeMultiplier:8 WriteBuffer:67108864 WriteL0PauseTrigger:24 WriteL0SlowdownTrigger:17}

上面的报错是在这里:

panic(fmt.Sprintf("out of sorted order from level db for task log key % X (log ID %d), start location %d, end location %d",

附链接:https://github.com/pingcap/dm/blob/master/dm/worker/log.go#L182

请问 ,为什么会是初始化日志的时候有问题?这个导致的worker启动不了嘛,是不是可以从这块着手恢复呢?

多谢,求教

上面回复过,这个报错“out of sorted order from level db for task log key” 是代码 bug,最好是做下升级,现在用的版本是不太稳定的。

关键即使升级了,我的kv 内容还是有问题,是不是我的存储的数据出错了?

指这里: kv db config: &{BlockCacheCapacity:8388608 BlockRestartInterval:16 BlockSize:4096 CompactionL0Trigger:8 CompactionTableSize:67108864 CompactionTotalSize:536870912 CompactionTotalSizeMultiplier:8 WriteBuffer:67108864 WriteL0PauseTrigger:24 WriteL0SlowdownTrigger:17}

我有同事看了下master分支的代码,那段代码逻辑也是一样的

可能是操作不正确,先把新的 dm-worker 下掉:https://pingcap.com/docs-cn/stable/reference/tools/data-migration/cluster-operations/ 然后再看下旧的 dm-worker 是否还报错。

这个问题的产生:

新worker使用了MySQL 主库 已经注册的server_id,把老的worker给挤掉了。

若是MySQL的从库,只需重启,便可以了。

在dm里,这个情况没有被考虑到。导致:执行重启,dm_worker_meta/KV目录数据损坏。

解决方案: 执行新老worker实例的替换,重新拉取binlog,在resume相关的任务即可。

:+1: