大表ddl导致drainer假死无法继续同步数据

为提高效率,提问时请提供以下信息,问题描述清晰可优先响应。

  • 【TiDB 版本】:v3.0.1
  • 【问题描述】:

大表的ddl导致drainer假死(./binlogctl -pd-urls=http://10.10.10.145:2379 -cmd drainers依然显示在线,但是同步数据停止了,下游数据库中的ddl已经执行了),这个时候我需要怎么恢复drainer?

drainer报错日志:

[2019/12/10 14:56:56.090 +08:00] [ERROR] [load.go:557] [“exec failed”] [sql="ALTER TABLE tc_payment_sync_order ADD COLUMN payment_status tinyint(4) NOT NULL DEFAULT ‘0’ COMMENT ‘原始订单状态:0未成功,1成功’ after type"] [error=“invalid connection”] [errorVerbose=“invalid connectionngithub.com/pingcap/errors.AddStack /home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/github.com/pingcap/errors@v0.11.4/errors.go:174 github.com/pingcap/errors.Trace /home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/github.com/pingcap/errors@v0.11.4/juju_adaptor.go:15 github.com/pingcap/tidb-binlog/pkg/util.RetryOnError /home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/pkg/util/util.go:157 github.com/pingcap/tidb-binlog/pkg/loader.(*loaderImpl).execDDL /home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/pkg/loader/load.go:278ngithub.com/pingcap/tidb-binlog/pkg/loader.(*batchManager).execDDL /home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/pkg/loader/load.go:555ngithub.com/pingcap/tidb-binlog/pkg/loader.(*batchManager).put /home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/pkg/loader/load.go:578ngithub.com/pingcap/tidb-binlog/pkg/loader.(*loaderImpl).Run /home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/pkg/loader/load.go:441ngithub.com/pingcap/tidb-binlog/drainer/sync.(*MysqlSyncer).run /home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/sync/mysql.go:117 runtime.goexit /usr/local/go/src/runtime/asm_amd64.s:1337”] [root@TIDB007 log]# grep “2019/12/10 14:56:56” drainer.log [2019/12/10 14:56:56.090 +08:00] [ERROR] [load.go:557] [“exec failed”] [sql="ALTER TABLE tc_payment_sync_order ADD COLUMN payment_status tinyint(4) NOT NULL DEFAULT ‘0’ COMMENT ‘原始订单状态:0未成功,1成功’ after type"] [error=“invalid connection”] [errorVerbose=“invalid connectionngithub.com/pingcap/errors.AddStack /home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/github.com/pingcap/errors@v0.11.4/errors.go:174 github.com/pingcap/errors.Trace /home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/github.com/pingcap/errors@v0.11.4/juju_adaptor.go:15 github.com/pingcap/tidb-binlog/pkg/util.RetryOnError /home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/pkg/util/util.go:157 github.com/pingcap/tidb-binlog/pkg/loader.(*loaderImpl).execDDL /home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/pkg/loader/load.go:278ngithub.com/pingcap/tidb-binlog/pkg/loader.(*batchManager).execDDL /home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/pkg/loader/load.go:555ngithub.com/pingcap/tidb-binlog/pkg/loader.(*batchManager).put /home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/pkg/loader/load.go:578ngithub.com/pingcap/tidb-binlog/pkg/loader.(*loaderImpl).Run /home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/pkg/loader/load.go:441ngithub.com/pingcap/tidb-binlog/drainer/sync.(*MysqlSyncer).run /home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/sync/mysql.go:117 runtime.goexit /usr/local/go/src/runtime/asm_amd64.s:1337”] [2019/12/10 14:56:56.090 +08:00] [INFO] [load.go:402] [“Run()… in Loader quit”] [2019/12/10 14:56:56.090 +08:00] [INFO] [mysql.go:113] [“Successes chan quit”] [2019/12/10 14:56:56.090 +08:00] [INFO] [syncer.go:249] [“write save point”] [ts=413128644333666350]

看报错有无效链接,请问下 drainer 下游的数据库是 mysql 还是 tidb ?

rds

嗯,看日志在 同步该 ddl 的时候报错无效链接,确定下游已经执行该 ddl sql 后,可以跳过这个 binlog,继续往下执行。

嗯嗯,是的,我准备跳过那个ddl,然后我执行了./stop_drainer.sh,执行了好久,然后我./start_drainer.sh,发现报地址已经占用了,我stop了好几次,进程依然在,然后我迫不得已直接使用kill -9 ,然后./start_drainer.sh发现它直接跳过了中间一段tso,直接从另一个正常的drainer同步的时间点开始同步,这样中间就有一段数据没有同步了,我有两个drainer,同步到下游的同一个rds的两个不同的数据库。

ps: 这个是生产环境的问题,已经出现过两次大表ddl导致的问题,第一次直接重新启动就好了,第二次出现这种问题,现在准备把数据重新再按照全量 + 增量重新备份一遍。

现在比较担心后续再出现这种问题

1.日志报错 “invalid connection”,看下 rds 是不是有配置超时时间,达到超时时间后,仍然没有执行完该 SQL

2.kill -9 的操作是错误的,可以看下这段时间的 drainer 日志。

如果遇到这种情况,正确的操作方式应该是什么样的呢,请大佬明示

报错日志只有这么多了,我怀疑是不是直接kill -9 后,然后重启后drainer直接去下游rds的tidb_binlog库中的checkpoint表的中commitTs开始同步(ps:有两个drainer,另外一个drainer正常)

首先如果上游有大表ddl同步到下游 mysql 时,建议把超时时间跳大一些。如果有跳过 binlog 的操作,可以试下以下步骤:

  • 停掉 drainer
  • 修改 checkpoint 中的 commitTS
  • 启动 drainer