BR reports an error when restoring a database

【TiDB environment】
Production

【Overview】 Scenario + problem summary
Restoring a single database with BR fails, although the backup itself completed successfully without reporting any error.
Backup command (reported success):
br backup db --pd "10...:2379" --db wallet --storage "local:///nfs/backup/tidb-dumpbackup/wallet/${TODAY}" --ratelimit 120 --log-file backuptable-wallet_${TODAY}.log
Restore command (fails):
br restore db --pd "10.110..**:2379" --db "wallet" --ratelimit 128 --storage "local:///nfs/backup/wallet/nfs/backup/tidb-dumpbackup/wallet/20220318" --log-file restorefull-wallet.log
Detail BR log in restorefull-wallet.log

【TiDB version】
TiDB v4.0.15
【Attachments】


[2022/03/24 10:22:50.455 +08:00] [INFO] [client.go:559] ["import file done"] [file="{name=1_1294966_7028_832249a2b2156435a0612a79e4acef7213260e98683dd96e99af59137ca29c20_1647554989832_write.sst,CF=write,sha256=4bd3cf1f8fa1d0b333de205899a16b07e0acba0deb9be5dfcf07773eb4a07311,startKey=7480000000000003B55F728000000039B1E8F2,endKey=7480000000000003B55F728000000039B45DA4,startVersion=0,endVersion=431896501412429852,totalKvs=0,totalBytes=0,CRC64Xor=0}"] [take=64.16µs]
[2022/03/24 10:22:50.455 +08:00] [WARN] [backoff.go:79] ["unexcepted error, stop to retry"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled
github.com/tikv/pd/client.(*client).ScanRegions
\tgithub.com/tikv/pd@v0.0.0-20210105112549-e5be7fd38659/client/client.go:598
github.com/pingcap/br/pkg/restore.(*pdClient).ScanRegions
\tgithub.com/pingcap/br@/pkg/restore/split_client.go:385
github.com/pingcap/br/pkg/restore.PaginateScanRegion.func1
\tgithub.com/pingcap/br@/pkg/restore/split.go:328
github.com/pingcap/br/pkg/utils.WithRetry
\tgithub.com/pingcap/br@/pkg/utils/retry.go:47
github.com/pingcap/br/pkg/restore.PaginateScanRegion
\tgithub.com/pingcap/br@/pkg/restore/split.go:324
github.com/pingcap/br/pkg/restore.(*FileImporter).Import.func1
\tgithub.com/pingcap/br@/pkg/restore/import.go:226
github.com/pingcap/br/pkg/utils.WithRetry
\tgithub.com/pingcap/br@/pkg/utils/retry.go:47
github.com/pingcap/br/pkg/restore.(*FileImporter).Import
\tgithub.com/pingcap/br@/pkg/restore/import.go:222
github.com/pingcap/br/pkg/restore.(*Client).RestoreFiles.func2
\tgithub.com/pingcap/br@/pkg/restore/client.go:563
github.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1
\tgithub.com/pingcap/br@/pkg/utils/worker.go:73
golang.org/x/sync/errgroup.(*Group).Go.func1
\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57
runtime.goexit
\truntime/asm_amd64.s:1357"]
[2022/03/24 10:22:50.455 +08:00] [INFO] [client.go:559] ["import file done"] [file="{name=1_2235117_7122_c43723965281a9f7e431b9a17ecf2ee253fb8340e76a272975d7aa7632043146_1647555024804_write.sst,CF=write,sha256=8baa8be36f947ba09532e05558df2a093bb9b1998e6fb668b5c1f3bc3aaab797,startKey=7480000000000003B55F728000000045D5C4A5,endKey=7480000000000003B55F728000000045D7C266,startVersion=0,endVersion=431896501412429852,totalKvs=0,totalBytes=0,CRC64Xor=0}"] [take=58.094µs]
[2022/03/24 10:22:50.455 +08:00] [WARN] [backoff.go:79] ["unexcepted error, stop to retry"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled
github.com/pingcap/errors.AddStack
\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174
github.com/pingcap/errors.Trace
\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/juju_adaptor.go:15
github.com/pingcap/br/pkg/restore.(*FileImporter).ingestSST
\tgithub.com/pingcap/br@/pkg/restore/import.go:479
github.com/pingcap/br/pkg/restore.(*FileImporter).Import.func1
\tgithub.com/pingcap/br@/pkg/restore/import.go:278
github.com/pingcap/br/pkg/utils.WithRetry
\tgithub.com/pingcap/br@/pkg/utils/retry.go:47
github.com/pingcap/br/pkg/restore.(*FileImporter).Import
\tgithub.com/pingcap/br@/pkg/restore/import.go:222
github.com/pingcap/br/pkg/restore.(*Client).RestoreFiles.func2
\tgithub.com/pingcap/br@/pkg/restore/client.go:563
github.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1
\tgithub.com/pingcap/br@/pkg/utils/worker.go:73
golang.org/x/sync/errgroup.(*Group).Go.func1
\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57
runtime.goexit
\truntime/asm_amd64.s:1357"]
[2022/03/24 10:22:50.455 +08:00] [INFO] [client.go:559] ["import file done"] [file="{name=2_1915562_7462_47e0e8a985c5b79f0276749ece3da6835e92a6727e96692ab9022fa13e9c9425_1647555224214_write.sst,CF=write,sha256=be4cf6b04389329acc20365b6deb3695ea1d92d9ecfc50efad1759498b7666e1,startKey=7480000000000003B55F698000000000000009014E00000000000000F80419ABD400000000000380000000000F4240013130303030353532FF0000000000000000F7014555520000000000FA0610048000000000010510013131340000000000FA014E00000000000000F8014E00000000000000F803800000005031B125,endKey=7480000000000003B55F698000000000000009014E00000000000000F80419ABD600000000000380000000000F4240013333303030303030FF3233000000000000F9015553440000000000FA0610048000000000010000013131340000000000FA014E00000000000000F8014E00000000000000F8038000000050403D24,startVersion=0,endVersion=431896501412429852,totalKvs=612417,totalBytes=76778967,CRC64Xor=3650794703154434730}"] [take=210.44µs]
[2022/03/24 10:22:50.442 +08:00] [WARN] [backoff.go:79] ["unexcepted error, stop to retry"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled
github.com/tikv/pd/client.(*client).ScanRegions
\tgithub.com/tikv/pd@v0.0.0-20210105112549-e5be7fd38659/client/client.go:598
github.com/pingcap/br/pkg/restore.(*pdClient).ScanRegions
\tgithub.com/pingcap/br@/pkg/restore/split_client.go:385
github.com/pingcap/br/pkg/restore.PaginateScanRegion.func1
\tgithub.com/pingcap/br@/pkg/restore/split.go:328
github.com/pingcap/br/pkg/utils.WithRetry
\tgithub.com/pingcap/br@/pkg/utils/retry.go:47
github.com/pingcap/br/pkg/restore.PaginateScanRegion
\tgithub.com/pingcap/br@/pkg/restore/split.go:324
github.com/pingcap/br/pkg/restore.(*FileImporter).Import.func1
\tgithub.com/pingcap/br@/pkg/restore/import.go:226
github.com/pingcap/br/pkg/utils.WithRetry
\tgithub.com/pingcap/br@/pkg/utils/retry.go:47
github.com/pingcap/br/pkg/restore.(*FileImporter).Import
\tgithub.com/pingcap/br@/pkg/restore/import.go:222
github.com/pingcap/br/pkg/restore.(*Client).RestoreFiles.func2
\tgithub.com/pingcap/br@/pkg/restore/client.go:563
github.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1
\tgithub.com/pingcap/br@/pkg/utils/worker.go:73
golang.org/x/sync/errgroup.(*Group).Go.func1
\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57
runtime.goexit
\truntime/asm_amd64.s:1357"]
[2022/03/24 10:22:50.442 +08:00] [ERROR] [main.go:58] ["br failed"] [error="scan region return empty result, startKey: 7480000000000001FF6B5F728000000046FFEA41180000000000FA, endkey: 7480000000000001FF6B5F728000000047FFFE10C40000000000FA: [BR:PD:ErrPDBatchScanRegion]batch scan region; scan region return empty result, startKey: 7480000000000001FF6B5F728000000046FFEA41180000000000FA, endkey: 7480000000000001FF6B5F728000000047FFFE10C40000000000FA: [BR:PD:ErrPDBatchScanRegion]batch scan region; scan region return empty result, startKey: 7480000000000001FF6B5F728000000046FFEA41180000000000FA, endkey: 7480000000000001FF6B5F728000000047FFFE10C40000000000FA: [BR:PD:ErrPDBatchScanRegion]batch scan region"] [errorVerbose="the following errors occurred:\



BR restore log
restorefull-wallet.log (16.4 MB)

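It looks like the backup was written to local storage, but the data was not copied to the other nodes before the restore. Can you check whether you did that?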

https://docs.pingcap.com/zh/tidb/v4.0/tidb-lightning-checkpoints

That is not the cause. The backup files are on a shared disk, and every TiKV node can see them.

Sorry, I did not read carefully.
Are you restoring into a new cluster, or into the current cluster?

Into a new cluster.

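Is new_collations_enabled_on_first_bootstrap set to the same value on both clusters? Is CDC enabled on the cluster? Also, did the target database in the new cluster already contain data before the restore, and were there any other connections?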

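The value is the same on both clusters: false.
My production cluster has four databases, and I am testing backup and restore of all four on a newly built cluster. The first three databases restored without problems; the fourth fails with this error. Also, the production cluster has TiCDC enabled, while the newly built cluster does not.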

The only connection is my own login to verify the restored data; there are no other connections.

I took a look. The main problem seems to be this: some of the SST files cannot be found.

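Does this database have tables whose data is replicated to TiFlash? If so, mount the NFS share on the TiFlash nodes as well. If not, check whether any other node cannot access the SST files.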
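
One way to check file visibility on a given node is to collect the SST file names that appear in the BR log and test whether each one exists under the mounted backup directory. The following is a minimal sketch, not part of BR: it assumes the `name=..._write.sst` field format seen in the log excerpt in this thread, and the helper names `sst_names_in_log` and `missing_sst_files` are hypothetical.

```python
import os
import re

# Matches the "name=<file>.sst" field of BR "import file done" log entries
# (format assumed from the log excerpt in this thread).
SST_NAME = re.compile(r"name=([0-9A-Za-z_]+\.sst)")


def sst_names_in_log(log_text):
    """Return the set of SST file names mentioned in a BR log."""
    return set(SST_NAME.findall(log_text))


def missing_sst_files(names, backup_dir):
    """Return the names that are not found anywhere under backup_dir."""
    present = set()
    # Walk recursively so the check does not depend on the directory layout.
    for _root, _dirs, files in os.walk(backup_dir):
        present.update(files)
    return sorted(set(names) - present)


def report(log_path, backup_dir):
    """Print every SST file named in the log that this node cannot see."""
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        names = sst_names_in_log(fh.read())
    for name in missing_sst_files(names, backup_dir):
        print("missing:", name)
```

Running `report("restorefull-wallet.log", "<backup directory>")` on each TiKV (and TiFlash) node would show whether any node sees fewer files than the others. Note that the log only names files BR attempted to import, so an empty result does not prove the backup is complete.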

Doesn't BR back up only TiKV data? Does it also probe TiFlash? I will read up on that when I have time; I have forgotten a lot of it.

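The newly built cluster has no TiFlash. The other three databases also have data replicated to TiFlash, yet they restored without problems. Also, these files can be found in the backup.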

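Try adding the NFS mount on the TiFlash node of the new cluster and run the restore again. It looks like this issue: https://github.com/pingcap/docs-cn/pull/6172
An expert mentioned it in an earlier thread:
4.0.11: SST files not found when restoring to a new cluster after backing up 1.2 TB of data with BR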

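It is probably a bug. I only learned about it from reading forum posts; the official documentation does not seem to mention it.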

Same problem after adding TiFlash.

That is strange. Could you capture the new BR, PD, and TiKV logs covering the period of the restore you just ran?

restorefull-wallet132.log (5.4 MB)

pd.zip (10.8 MB)

tikv.zip (20.1 MB)

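Received. Have you used pd-ctl's region key command to look at the details of the empty region in the error? PD shows no obvious errors either, which is strange. There is no matching post in the community and no matching issue.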

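Here is a guess: how long is your gc_life_time? Although BR backups since 4.0.8 no longer require adjusting tikv_gc_life_time, if the data volume is not large you could lengthen it manually, take a new backup, and then try the restore again.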