Error when restoring a TiDB cluster with BR

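To help us work efficiently, please provide the following information; a clearly described problem gets resolved faster:
【TiDB environment】
TiDB v4.0.9
BR v4.0.9

【Overview】 Scenario + problem summary

I backed up the cluster data with the BR command line, then restored it into a different cluster environment.

【Backup and data migration strategy】

【Background】 Operations performed

【Symptoms】 Application and database behavior

【Problem】 The issue currently encountered
Error: failed to validate checksum: [BR:Restore:ErrRestoreChecksumMismatch]restore checksum mismatch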
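【Business impact】

【TiDB version】

【Attachments】

  • Relevant logs, configuration files, Grafana monitoring (https://metricstool.pingcap.com/)
  • TiUP Cluster Display output
  • TiUP Cluster Edit Config output
  • TiDB-Overview monitoring
  • Grafana monitoring for the relevant module (if any: BR, TiDB-binlog, TiCDC, etc.)
  • Logs for the relevant module (covering 1 hour before and after the problem)

Error log: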
[2021/12/29 22:20:23.856 +00:00] [INFO] [domain.go:622] ["domain closed"] ["take time"=7.982087774s]
[2021/12/29 22:20:23.863 +00:00] [INFO] [collector.go:188] ["Database restore Failed summary : total restore files: 4803, total success: 4803, total failed: 0"] ["split region"=2h46m46.681634392s] ["restore checksum"=42h48m33.103512213s] ["restore ranges"=4044] [Size=24012833189]
[2021/12/29 22:20:23.864 +00:00] [ERROR] [restore.go:24] ["failed to restore"] [error="failed to validate checksum: [BR:Restore:ErrRestoreChecksumMismatch]restore checksum mismatch"] [errorVerbose="[BR:Restore:ErrRestoreChecksumMismatch]restore checksum mismatch
failed to validate checksum
github.com/pingcap/br/pkg/restore.(*Client).execChecksum
\tgithub.com/pingcap/br@/pkg/restore/client.go:796
github.com/pingcap/br/pkg/restore.(*Client).GoValidateChecksum.func1.2
\tgithub.com/pingcap/br@/pkg/restore/client.go:742
github.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1
\tgithub.com/pingcap/br@/pkg/utils/worker.go:63
golang.org/x/sync/errgroup.(*Group).Go.func1
\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57
runtime.goexit
\truntime/asm_amd64.s:1357"] [stack="github.com/pingcap/br/cmd.runRestoreCommand
\tgithub.com/pingcap/br@/cmd/restore.go:24
github.com/pingcap/br/cmd.newDBRestoreCommand.func1
\tgithub.com/pingcap/br@/cmd/restore.go:106
github.com/spf13/cobra.(*Command).execute
\tgithub.com/spf13/cobra@v1.0.0/command.go:842
github.com/spf13/cobra.(*Command).ExecuteC
\tgithub.com/spf13/cobra@v1.0.0/command.go:950
github.com/spf13/cobra.(*Command).Execute
\tgithub.com/spf13/cobra@v1.0.0/command.go:887
main.main
\tgithub.com/pingcap/br@/main.go:58
runtime.main
\truntime/proc.go:203"]
[2021/12/29 22:20:23.864 +00:00] [ERROR] [main.go:59] ["br failed"] [error="failed to validate checksum: [BR:Restore:ErrRestoreChecksumMismatch]restore checksum mismatch"] [errorVerbose="[BR:Restore:ErrRestoreChecksumMismatch]restore checksum mismatch
failed to validate checksum
github.com/pingcap/br/pkg/restore.(*Client).execChecksum
\tgithub.com/pingcap/br@/pkg/restore/client.go:796
github.com/pingcap/br/pkg/restore.(*Client).GoValidateChecksum.func1.2
\tgithub.com/pingcap/br@/pkg/restore/client.go:742
github.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1
\tgithub.com/pingcap/br@/pkg/utils/worker.go:63
golang.org/x/sync/errgroup.(*Group).Go.func1
\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57
runtime.goexit
\truntime/asm_amd64.s:1357"] [stack="main.main
\tgithub.com/pingcap/br@/main.go:59
runtime.main
\truntime/proc.go:203"]



Do both clusters have the same value for new_collations_enabled_on_first_bootstrap?

Yes, both are False.
mysql> SELECT VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME='new_collation_enabled';
+----------------+
| VARIABLE_VALUE |
+----------------+
| False          |
+----------------+
1 row in set (0.00 sec)

Did databases or tables with the same names already exist in the new cluster before the restore?

No. Neither the database nor the tables had been created.

Is the tidb_enable_clustered_index parameter set to the same value on the source and target clusters?

Also, are the TiDB versions of the source and target clusters the same?

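Yes, they are the same. I built this cluster environment specifically for the BR restore, replicating the source as closely as possible.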

Hi, does the problem still occur after a retry?

I retried, and the checksum validation still fails.

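If it reproduces reliably, does the target cluster (especially the databases and tables involved in the restore) contain any data?
Could you grep the logs for checksum mismatch?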

150366:[2022/01/18 08:01:21.579 +00:00] [ERROR] [restore.go:24] ["failed to restore"] [error="failed to validate checksum: [BR:Restore:ErrRestoreChecksumMismatch]restore checksum mismatch"] [errorVerbose="[BR:Restore:ErrRestoreChecksumMismatch]restore checksum mismatch\nfailed to validate checksum\ngithub.com/pingcap/br/pkg/restore.(*Client).execChecksum\n\tgithub.com/pingcap/br@/pkg/restore/client.go:796\ngithub.com/pingcap/br/pkg/restore.(*Client).GoValidateChecksum.func1.2\n\tgithub.com/pingcap/br@/pkg/restore/client.go:742\ngithub.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\tgithub.com/pingcap/br@/pkg/utils/worker.go:63\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357"] [stack="github.com/pingcap/br/cmd.runRestoreCommand\n\tgithub.com/pingcap/br@/cmd/restore.go:24\ngithub.com/pingcap/br/cmd.newDBRestoreCommand.func1\n\tgithub.com/pingcap/br@/cmd/restore.go:106\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/cobra@v1.0.0/command.go:842\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/cobra@v1.0.0/command.go:950\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/cobra@v1.0.0/command.go:887\nmain.main\n\tgithub.com/pingcap/br@/main.go:58\nruntime.main\n\truntime/proc.go:203"]
150367:[2022/01/18 08:01:21.579 +00:00] [ERROR] [main.go:59] ["br failed"] [error="failed to validate checksum: [BR:Restore:ErrRestoreChecksumMismatch]restore checksum mismatch"] [errorVerbose="[BR:Restore:ErrRestoreChecksumMismatch]restore checksum mismatch\nfailed to validate checksum\ngithub.com/pingcap/br/pkg/restore.(*Client).execChecksum\n\tgithub.com/pingcap/br@/pkg/restore/client.go:796\ngithub.com/pingcap/br/pkg/restore.(*Client).GoValidateChecksum.func1.2\n\tgithub.com/pingcap/br@/pkg/restore/client.go:742\ngithub.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\tgithub.com/pingcap/br@/pkg/utils/worker.go:63\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357"] [stack="main.main\n\tgithub.com/pingcap/br@/main.go:59\nruntime.main\n\truntime/proc.go:203"]


Isn't there a log entry like this one?

I don't quite follow your screenshot. The logs I posted above are exactly what grepping for checksum mismatch returned.

What about failed in validate checksum?
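
For anyone who wants to pull both search strings out of a BR log in one pass, here is a minimal sketch that mimics grep -n. The function name grep_lines and the abbreviated sample lines are made up for illustration; they are not part of BR.

```python
# Search strings taken from the replies in this thread.
NEEDLES = ("checksum mismatch", "failed in validate checksum")


def grep_lines(lines, needles):
    """Yield (line_number, line) for every line containing any needle."""
    for number, line in enumerate(lines, start=1):
        if any(needle in line for needle in needles):
            yield number, line.rstrip("\n")


# Abbreviated, hypothetical log lines standing in for a real BR log file.
SAMPLE_LOG = [
    '[INFO] [domain.go:622] ["domain closed"]',
    '[ERROR] [client.go:786] ["failed in validate checksum"] [table=zmd_login_log]',
    '[ERROR] [main.go:59] ["br failed"] [error="restore checksum mismatch"]',
]

if __name__ == "__main__":
    for number, line in grep_lines(SAMPLE_LOG, NEEDLES):
        print(f"{number}:{line}")
```

To scan a real log, pass an open file object instead of SAMPLE_LOG; the per-table "failed in validate checksum" entry is the one that carries the crc64, KV count, and byte count details.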

150201:[2022/01/18 07:51:39.492 +00:00] [ERROR] [client.go:786] ["failed in validate checksum"] [database=zmd] [table=zmd_login_log] ["origin tidb crc64"=8527804401663786955] ["calculated crc64"=1440874610585256742] ["origin tidb total kvs"=9464208] ["calculated total kvs"=9464211] ["origin tidb total bytes"=929551841] ["calculated total bytes"=929552125] [stack="github.com/pingcap/br/pkg/restore.(*Client).execChecksum\n\tgithub.com/pingcap/br@/pkg/restore/client.go:786\ngithub.com/pingcap/br/pkg/restore.(*Client).GoValidateChecksum.func1.2\n\tgithub.com/pingcap/br@/pkg/restore/client.go:742\ngithub.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\tgithub.com/pingcap/br@/pkg/utils/worker.go:63\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57"]

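["origin tidb total kvs"=9464208] ["calculated total kvs"=9464211]
This part of the log shows that the table had only 9464208 KV pairs at backup time but 9464211 KV pairs after the restore. In other words, the restored data is not missing anything; it has 3 extra KV pairs.
This usually happens when the restore target cluster already contains data. However, the original poster says the target cluster had no data, which is odd.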
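
The size of the discrepancy can be read straight off the log entry. The sketch below parses the numeric fields and subtracts the backup-time ("origin") values from the post-restore ("calculated") ones. The field values are copied from the log line quoted above; the regular expression and the function name checksum_deltas are my own and assume the raw log uses plain ASCII double quotes.

```python
import re

# The "failed in validate checksum" entry, trimmed to the numeric fields.
LOG_LINE = (
    '[ERROR] [client.go:786] ["failed in validate checksum"] '
    '[database=zmd] [table=zmd_login_log] '
    '["origin tidb crc64"=8527804401663786955] '
    '["calculated crc64"=1440874610585256742] '
    '["origin tidb total kvs"=9464208] ["calculated total kvs"=9464211] '
    '["origin tidb total bytes"=929551841] '
    '["calculated total bytes"=929552125]'
)

# Matches fields of the form ["some key"=12345].
FIELD_RE = re.compile(r'\["([^"]+)"=(\d+)\]')


def checksum_deltas(line):
    """Return calculated-minus-origin differences for KV count and bytes."""
    fields = {key: int(value) for key, value in FIELD_RE.findall(line)}
    return {
        "extra_kvs": fields["calculated total kvs"]
        - fields["origin tidb total kvs"],
        "extra_bytes": fields["calculated total bytes"]
        - fields["origin tidb total bytes"],
    }


if __name__ == "__main__":
    print(checksum_deltas(LOG_LINE))  # {'extra_kvs': 3, 'extra_bytes': 284}
```

Positive differences mean the restored table holds more data than the backup recorded, which matches the reading above: 3 extra KV pairs, 284 extra bytes.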

Could the restore have gone to the wrong cluster? (For example, was the upstream cluster's PD specified?)

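If you want to find out which records are the extra ones, you could restore with --checksum=false to skip the validation, then use the sync-diff tool (sync-diff-inspector) to compare this table's data between the upstream and downstream clusters.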
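
To illustrate the idea behind such a comparison (this is not how sync-diff-inspector is implemented or configured), here is a toy sketch that reports rows present downstream but absent upstream. It assumes both tables fit in memory and that the first column is the primary key; the function name extra_rows and the sample rows are hypothetical.

```python
def extra_rows(upstream_rows, downstream_rows):
    """Return downstream rows whose primary key does not exist upstream.

    Each row is a tuple whose first element is the primary key.
    """
    upstream_keys = {row[0] for row in upstream_rows}
    return [row for row in downstream_rows if row[0] not in upstream_keys]


if __name__ == "__main__":
    # Hypothetical rows: (id, user) pairs from the same table in each cluster.
    upstream = [(1, "alice"), (2, "bob")]
    downstream = [(1, "alice"), (2, "bob"), (3, "carol")]
    print(extra_rows(upstream, downstream))  # [(3, 'carol')]
```

For a table with millions of rows, a dedicated tool that compares data in chunks is the practical choice; the sketch only shows what "find the extra records" means.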

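My BR backup was a full hot (online) backup; I don't know whether that has anything to do with it. br backup full -db My target cluster is a fresh set of machines, and I deliberately dropped the corresponding database before restoring.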

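An online backup should not cause this problem. BR is designed to support backing up a live cluster (it only has some negative impact on cluster performance and does not affect correctness).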
