Drainer missed data while replicating to the downstream RDS

To help us respond efficiently, please provide the following information when asking a question; clearly described problems can be prioritized.

  • [TiDB version]: v3.0.1
  • [Problem description]:

The TiDB cluster runs two Drainers that replicate data to downstream RDS instances. Adding a new column to a large table in the upstream database handled by one of the Drainers triggered the following error:

[2019/12/10 14:56:56.090 +08:00] [ERROR] [load.go:557] ["exec failed"] [sql="ALTER TABLE `tc_payment_sync_order` ADD COLUMN `payment_status` tinyint(4) NOT NULL DEFAULT '0' COMMENT '原始订单状态:0未成功,1成功' after `type`"] [error="invalid connection"] [errorVerbose="invalid connection
github.com/pingcap/errors.AddStack
	/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/github.com/pingcap/errors@v0.11.4/errors.go:174
github.com/pingcap/errors.Trace
	/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/github.com/pingcap/errors@v0.11.4/juju_adaptor.go:15
github.com/pingcap/tidb-binlog/pkg/util.RetryOnError
	/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/pkg/util/util.go:157
github.com/pingcap/tidb-binlog/pkg/loader.(*loaderImpl).execDDL
	/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/pkg/loader/load.go:278
github.com/pingcap/tidb-binlog/pkg/loader.(*batchManager).execDDL
	/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/pkg/loader/load.go:555
github.com/pingcap/tidb-binlog/pkg/loader.(*batchManager).put
	/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/pkg/loader/load.go:578
github.com/pingcap/tidb-binlog/pkg/loader.(*loaderImpl).Run
	/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/pkg/loader/load.go:441
github.com/pingcap/tidb-binlog/drainer/sync.(*MysqlSyncer).run
	/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/sync/mysql.go:117
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1337"]

After that, no more "write save checkpoint" log entries were produced and this Drainer stopped replicating, while the other Drainer kept running normally. Later, after receiving a replication-lag alert, I restarted the stalled Drainer. After the restart, the faulty Drainer skipped the data it had never replicated and jumped straight to the point the other, healthy Drainer had reached, so a range of data in between was lost. Is there any way to recover this data?

Are you using the two Drainers to replicate the upstream data to two different downstream RDS instances? If convenient, please share the Drainer configuration files and the commands you used to restart the Drainer.

When restarting, I first tried stop_drainer.sh under scripts/. When that failed to stop it, I ran kill -9 directly, then start_drainer.sh.

Please check the status of Pump and Drainer with the following commands:

bin/binlogctl -pd-urls=http://127.0.0.1:2379 -cmd pumps
bin/binlogctl -pd-urls=http://127.0.0.1:2379 -cmd drainers

pumps:

[2019/12/11 10:44:31.942 +08:00] [INFO] [nodes.go:47] ["query node"] [type=pump] [node="{NodeID: TIDB006:8250, Addr: 11.12.13.147:8250, State: online, MaxCommitTS: 413147403803951149, UpdateTime: 2019-12-11 10:44:31 +0800 CST}"]
[2019/12/11 10:44:31.942 +08:00] [INFO] [nodes.go:47] ["query node"] [type=pump] [node="{NodeID: TIDB007:8250, Addr: 11.12.13.148:8250, State: online, MaxCommitTS: 413147403397627941, UpdateTime: 2019-12-11 10:44:29 +0800 CST}"]

drainers:

[2019/12/11 10:45:41.159 +08:00] [INFO] [nodes.go:47] ["query node"] [type=drainer] [node="{NodeID: TIDB006:8249, Addr: 11.12.13.147:8249, State: online, MaxCommitTS: 413147401706799155, UpdateTime: 2019-12-11 10:45:39 +0800 CST}"]
[2019/12/11 10:45:41.159 +08:00] [INFO] [nodes.go:47] ["query node"] [type=drainer] [node="{NodeID: TIDB007:8249, Addr: 11.12.13.148:8249, State: online, MaxCommitTS: 413147421302063179, UpdateTime: 2019-12-11 10:45:39 +0800 CST}"]

After the restart, both nodes show a normal status; the problem is simply that a range of data was never replicated to the downstream RDS.

Has the DDL itself been replicated to the downstream?

Yes, the DDL has already been replicated downstream.

You can use the sync-diff-inspector tool to compare the upstream and downstream data, locate the differences, and generate the SQL statements to repair them.
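
A minimal sync-diff-inspector configuration sketch for this kind of comparison. The hosts, credentials, schema, and file names below are placeholders, not values from this thread, and field names have changed between sync-diff-inspector releases, so verify against the documentation for your version:

```toml
# Hypothetical sync-diff-inspector config: compare one upstream TiDB
# schema against the downstream RDS and write repair SQL to a file.
log-level = "info"
chunk-size = 1000          # rows per comparison chunk
check-thread-count = 4
use-checksum = true        # compare chunk checksums first, rows only on mismatch
fix-sql-file = "fix.sql"   # generated repair statements land here

# Tables to compare (schema/table names are placeholders)
[[check-tables]]
schema = "payment"
tables = ["tc_payment_sync_order"]

# Upstream TiDB
[[source-db]]
host = "127.0.0.1"
port = 4000
user = "diff"
password = ""

# Downstream RDS (MySQL)
[target-db]
host = "10.1.66.238"
port = 3306
user = "diff"
password = ""
```

After the run, review fix.sql before applying it to the downstream.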

Also, please provide the complete logs of the Pump and Drainer nodes so I can trace what happened.

OK. Going forward, how can I avoid this problem when a DDL runs on a large table?

DDL on large tables tends to make the Drainer crash or hang. Last time, adding an index to a large table caused the same problem; a restart fixed it. The error was the same as the one above, and there were no other abnormal log entries.

OK. Please still provide the Pump and Drainer logs so I can follow up. Thanks.

I checked: the Pump logs contain no errors. The Drainer error log contains:

[mysql] 2019/10/24 12:02:41 packets.go:36: read tcp 11.12.13.148:41368->10.1.66.238:3306: i/o timeout
[mysql] 2019/10/24 12:03:42 packets.go:36: read tcp 11.12.13.148:38886->10.1.66.238:3306: i/o timeout
[mysql] 2019/10/24 12:04:43 packets.go:36: read tcp 11.12.13.148:38896->10.1.66.238:3306: i/o timeout
[mysql] 2019/10/24 12:05:44 packets.go:36: read tcp 11.12.13.148:38940->10.1.66.238:3306: i/o timeout
[mysql] 2019/10/24 12:06:45 packets.go:36: read tcp 11.12.13.148:38950->10.1.66.238:3306: i/o timeout
[mysql] 2019/12/10 14:52:51 packets.go:36: read tcp 11.12.13.148:42670->10.1.66.238:3306: i/o timeout
[mysql] 2019/12/10 14:53:52 packets.go:36: read tcp 11.12.13.148:45318->10.1.66.238:3306: i/o timeout
[mysql] 2019/12/10 14:54:53 packets.go:36: read tcp 11.12.13.148:45320->10.1.66.238:3306: i/o timeout
[mysql] 2019/12/10 14:55:54 packets.go:36: read tcp 11.12.13.148:45322->10.1.66.238:3306: i/o timeout
[mysql] 2019/12/10 14:56:55 packets.go:36: read tcp 11.12.13.148:45356->10.1.66.238:3306: i/o timeout

And the error in drainer.log:

[2019/12/10 14:56:56.090 +08:00] [ERROR] [load.go:557] ["exec failed"] [sql="ALTER TABLE `tc_payment_sync_order` ADD COLUMN `payment_status` tinyint(4) NOT NULL DEFAULT '0' COMMENT '原始订单状态:0未成功,1成功' after `type`"] [error="invalid connection"] [errorVerbose="invalid connection
github.com/pingcap/errors.AddStack
	/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/github.com/pingcap/errors@v0.11.4/errors.go:174
github.com/pingcap/errors.Trace
	/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/github.com/pingcap/errors@v0.11.4/juju_adaptor.go:15
github.com/pingcap/tidb-binlog/pkg/util.RetryOnError
	/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/pkg/util/util.go:157
github.com/pingcap/tidb-binlog/pkg/loader.(*loaderImpl).execDDL
	/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/pkg/loader/load.go:278
github.com/pingcap/tidb-binlog/pkg/loader.(*batchManager).execDDL
	/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/pkg/loader/load.go:555
github.com/pingcap/tidb-binlog/pkg/loader.(*batchManager).put
	/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/pkg/loader/load.go:578
github.com/pingcap/tidb-binlog/pkg/loader.(*loaderImpl).Run
	/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/pkg/loader/load.go:441
github.com/pingcap/tidb-binlog/drainer/sync.(*MysqlSyncer).run
	/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/sync/mysql.go:117
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1337"]

Could I use the Reparo tool to extract the data for the affected time range and restore it to the downstream RDS?

Reparo lets you specify -start-datetime and -stop-datetime; see the official documentation for details:
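
A minimal Reparo configuration sketch with a bounded replay window. The paths, hosts, credentials, and timestamps are placeholders, and field names may differ between releases, so check the documentation for your version:

```toml
# Hypothetical Reparo config: replay drainer output files for a bounded
# time window into a downstream MySQL/RDS instance.
# Note: Reparo reads files already written by a drainer running with
# dest-type = "file", so it can only replay data the drainer actually pulled.
data-dir = "/data/drainer/binlog"   # placeholder: the drainer's file output dir
log-level = "info"

# Bounded replay window (placeholder timestamps)
start-datetime = "2019-12-10 14:50:00"
stop-datetime  = "2019-12-11 10:40:00"

dest-type = "mysql"   # use "print" first to dry-run what would be replayed

[dest-db]
host = "10.1.66.238"
port = 3306
user = "repl"
password = ""
```

Running with dest-type = "print" first is a cheap way to confirm the window actually contains the expected transactions before touching the downstream.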

But the prerequisite for that is that the Drainer has already pulled the data from Pump. In my case the Drainer missed part of the data, and that gap is exactly what I want to fill back in.

In that situation, a full re-sync is recommended.