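4.0.11: BR restore of a 1.2 TB backup to a new cluster fails with SST files not found

挖掘机呀 · April 25, 2021, 09:57

To help us work efficiently, please provide the following information; a clearly described problem gets resolved faster:

【TiDB version】
Backup taken on 4.0.5 → restored to 4.0.11
BR version: 4.0.11

【Problem description】
new_collation_enabled is False on both clusters: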

SELECT VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME='new_collation_enabled';
False

I changed the GC life time before the backup and set it back after the backup finished:

SELECT * FROM mysql.tidb WHERE VARIABLE_NAME = 'tikv_gc_life_time';
720h

The backup succeeded, but the restore fails because SST files cannot be found:

[2021/04/25 17:08:31.057 +08:00] [ERROR] [import.go:262] ["download file failed"] [file="{name=2614325_86068_68_1f0940410b155d6d44c49bb2cd48c7c374b8c89ec28b4054e47ef1028e80deae_write.sst,CF=write,sha256=cf2b486485172aff171e460e5a8c3696e129d633bc71ce0e6f7de886f1852ed5,startKey=7480000000000000805F720000000000000000,endKey=7480000000000000805F72FFFFFFFFFFFFFFFF00,startVersion=0,endVersion=424485480598601745,totalKvs=1605,totalBytes=108218,CRC64Xor=3778136912300119293}"] [region="{ID=343126,startKey=7480000000000014FF1C5F720000000000FA,endKey=7480000000000014FF1C5F72FFFFFFFFFFFFFFFFFF0000000000FA,epoch=\"conf_ver:33 version:926 \",peers=\"id:343127 store_id:4 ,id:343128 store_id:6 ,id:343129 store_id:8 ,id:343130 store_id:5 ,id:343131 store_id:7 ,id:343132 store_id:161 is_learner:true ,id:343138 store_id:158 is_learner:true ,id:343614 store_id:159 is_learner:true \"}"] [startKey=7480000000000014FF1C5F720000000000FF0000000000000000FA] [endKey=7480000000000014FF1C5F72FFFFFFFFFFFFFFFFFF0000000000FB] [error="entity not found: [BR:KV:ErrKVDownloadFailed]download sst failed; entity not found: [BR:KV:ErrKVDownloadFailed]download sst failed; entity not found: [BR:KV:ErrKVDownloadFailed]download sst failed; entity not found: [BR:KV:ErrKVDownloadFailed]download sst failed; entity not found: [BR:KV:ErrKVDownloadFailed]download sst failed; entity not found: [BR:KV:ErrKVDownloadFailed]download sst failed; entity not found: [BR:KV:ErrKVDownloadFailed]download sst failed; entity not found: [BR:KV:ErrKVDownloadFailed]download sst failed"] [stack="github.com/pingcap/br/pkg/restore.(*FileImporter).Import.func1\
\tgithub.com/pingcap/br@/pkg/restore/import.go:262\
github.com/pingcap/br/pkg/utils.WithRetry\
\tgithub.com/pingcap/br@/pkg/utils/retry.go:35\
github.com/pingcap/br/pkg/restore.(*FileImporter).Import\
\tgithub.com/pingcap/br@/pkg/restore/import.go:222\
github.com/pingcap/br/pkg/restore.(*Client).RestoreFiles.func2\
\tgithub.com/pingcap/br@/pkg/restore/client.go:562\
github.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\
\tgithub.com/pingcap/br@/pkg/utils/worker.go:63\
golang.org/x/sync/errgroup.(*Group).Go.func1\
\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57"]

I can confirm that every SST file named in the "download file failed" log entries really exists:

 ll -sh /backup/br_backup/bistudio_tidb_backup/full_2021-04-25/2614325_86068_68_1f0940410b155d6d44c49bb2cd48c7c374b8c89ec28b4054e47ef1028e80deae_write.sst
20K -rw-r--r-- 1 tidb tidb 20K Apr 25 01:16 /backup/br_backup/bistudio_tidb_backup/full_2021-04-25/2614325_86068_68_1f0940410b155d6d44c49bb2cd48c7c374b8c89ec28b4054e47ef1028e80deae_write.sst
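
The log entry also carries a sha256 value for the file, so one check that goes a step beyond ll is to compare that value with the digest of the file on disk. The sketch below assumes the sha256 field is the SHA-256 of the SST file contents, which this thread does not confirm; sha256_of_file and check_sst are hypothetical helper names, not part of BR.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_sst(backup_dir, name, expected_sha256):
    """Classify one backup file as 'missing', 'unreadable', 'mismatch' or 'ok'.

    Assumption: expected_sha256 (taken from the BR log) is the digest of
    the file contents.
    """
    path = Path(backup_dir) / name
    if not path.is_file():
        return "missing"
    try:
        actual = sha256_of_file(path)
    except PermissionError:
        return "unreadable"
    return "ok" if actual == expected_sha256.lower() else "mismatch"
```

It would be called with the backup directory, the file name, and the sha256 value copied from the log entry. A result of "ok" on a node only shows that this node can read an intact copy of the file; it says nothing about the other stores in the cluster.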

The TiKV nodes use NFS as shared storage, and that is the setup in which this problem occurs.

I then copied the entire backup directory to every TiKV node and ran the restore again, with the same result.

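来了老弟 · April 25, 2021, 10:16

Hello,

Please check whether the NFS mount is readable from every TiKV node. Confirm read permission on the directory itself, not only on the files (or run ll *.sst on every TiKV node to verify that each of them can access the NFS directory).
Please also post the BR "Welcome" log lines and the two arguments lines that follow them.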
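
A minimal sketch of that check in Python, meant to be run on each node as the user the TiKV process runs as (tidb in this thread). It opens every *.sst file instead of only listing it, because listing a directory can succeed while reading a file in it fails. check_backup_dir is a hypothetical helper name, not part of BR.

```python
import os
from pathlib import Path


def check_backup_dir(backup_dir):
    """Check that backup_dir is accessible and that each *.sst file can be read.

    Returns a dict with the number of SST files found and the names of
    those that could not be opened by the current user.
    """
    root = Path(backup_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"{backup_dir} is not a directory on this node")
    # Listing needs read permission on the directory, opening files needs execute.
    if not os.access(root, os.R_OK | os.X_OK):
        raise PermissionError(f"current user cannot list or enter {backup_dir}")

    total = 0
    unreadable = []
    for path in sorted(root.glob("*.sst")):
        total += 1
        try:
            with open(path, "rb") as f:
                f.read(1)  # force an actual read, not just an open
        except OSError:
            unreadable.append(path.name)
    return {"total": total, "unreadable": unreadable}
```

A node passes when the call returns a non-zero total and an empty unreadable list. The total should also be the same on every node that mounts the share.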

挖掘机呀 · April 26, 2021, 05:44

I logged in to every TiKV node and, as the tidb user, ran ls -l on the SST files in the backup directory. All of them are accessible.

挖掘机呀 · April 26, 2021, 05:49

[2021/04/25 10:22:07.400 +08:00] [INFO] [version.go:37] ["Welcome to Backup & Restore (BR)"]
[2021/04/25 10:22:07.400 +08:00] [INFO] [version.go:38] [BR] [release-version=v4.0.11]
[2021/04/25 10:22:07.400 +08:00] [INFO] [version.go:39] [BR] [git-hash=f8b930f124ae070ce9dd0c06d0caa9c4cd35338e]
[2021/04/25 10:22:07.400 +08:00] [INFO] [version.go:40] [BR] [git-branch=heads/refs/tags/v4.0.11]
[2021/04/25 10:22:07.400 +08:00] [INFO] [version.go:41] [BR] [go-version=go1.13]
[2021/04/25 10:22:07.400 +08:00] [INFO] [version.go:42] [BR] [utc-build-time="2021-02-25 04:41:30"]
[2021/04/25 10:22:07.400 +08:00] [INFO] [version.go:43] [BR] [race-enabled=false]
[2021/04/25 10:22:07.400 +08:00] [INFO] [common.go:450] [arguments] [__command="br restore full"] [log-file=/tmp/restore.log] [pd="[xxxx:2379]"] [storage=local:///backup/br_backup/xxxx/full_2021-04-25]
[2021/04/25 10:22:07.400 +08:00] [INFO] [client.go:166] ["[pd] create pd client with endpoints"] [pd-address="[xxxxxx:2379]"]
[2021/04/25 10:22:07.402 +08:00] [INFO] [base_client.go:236] ["[pd] update member urls"] [old-urls="[http://xxxxxx:2379]"] [new-urls="[http://xxxxxx:2379,http://xxxx:2379,http://xxxx:2379,http://xxxx:2379,http://xxxx:2379]"]
[2021/04/25 10:22:07.403 +08:00] [INFO] [base_client.go:252] ["[pd] switch leader"] [new-leader=http://10.10.92.166:2379] [old-leader=]
[2021/04/25 10:22:07.403 +08:00] [INFO] [base_client.go:102] ["[pd] init cluster id"] [cluster-id=6954717817779010993]
[2021/04/25 10:22:07.406 +08:00] [INFO] [client.go:166] ["[pd] create pd client with endpoints"] [pd-address="[xxxx:2379]"]
[2021/04/25 10:22:07.408 +08:00] [INFO] [base_client.go:236] ["[pd] update member urls"] [old-urls="[http://xxxx:2379]"] [new-urls="[http://xxxx:2379,http://xxxx:2379,http://xxxx:2379,http://xxxx:2379,http://xxxx:2379]"]
[2021/04/25 10:22:07.408 +08:00] [INFO] [base_client.go:252] ["[pd] switch leader"] [new-leader=http://xxxx:2379] [old-leader=]
[2021/04/25 10:22:07.408 +08:00] [INFO] [base_client.go:102] ["[pd] init cluster id"] [cluster-id=6954717817779010993]
[2021/04/25 10:22:07.409 +08:00] [INFO] [conn.go:127] ["new mgr"] [pdAddrs=xxxx:2379]    
[2021/04/25 10:22:07.410 +08:00] [INFO] [tidb.go:72] ["new domain"] [store=tikv-6954717817779010993] ["ddl lease"=1s] ["stats lease"=-1ns]
[2021/04/25 10:22:07.413 +08:00] [INFO] [ddl.go:322] ["[ddl] start DDL"] [ID=38bd8784-3dcd-4889-9359-404454665a7b] [runWorker=true]
[2021/04/25 10:22:07.413 +08:00] [INFO] [manager.go:188] ["start campaign owner"] [ownerInfo="[ddl] /tidb/ddl/fg/owner"]
[2021/04/25 10:22:07.417 +08:00] [INFO] [ddl.go:311] ["[ddl] start delRangeManager OK"] ["is a emulator"=false]
[2021/04/25 10:22:07.417 +08:00] [INFO] [ddl_worker.go:131] ["[ddl] start DDL worker"] [worker="worker 1, tp general"]
[2021/04/25 10:22:07.417 +08:00] [INFO] [ddl_worker.go:131] ["[ddl] start DDL worker"] [worker="worker 2, tp add index"]  
[2021/04/25 10:22:07.541 +08:00] [INFO] [domain.go:148] ["full load InfoSchema success"] [usedSchemaVersion=0] [neededSchemaVersion=981] ["start time"=92.94453ms]

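This is the log from the restore. I would appreciate it if the PingCAP experts could take a look.

来了老弟 · April 26, 2021, 05:56

Got it. Please check your private messages.

@挖掘机呀 Could you tell us how many TiFlash nodes the v4.0.11 and v4.0.5 clusters each have?

挖掘机呀 · April 26, 2021, 06:03

The v4.0.5 cluster has 3 TiFlash nodes and the v4.0.11 cluster has 4 TiFlash nodes.

来了老弟 · April 26, 2021, 06:10

Thank you very much for the feedback.

Could you extend this check to the TiFlash nodes? Run ls on the NFS directory on every TiFlash node and see whether it works.

挖掘机呀 · April 26, 2021, 06:42

None of the TiFlash nodes has the NFS shared storage mounted. But as far as I can see, the official documentation does not say that TiFlash nodes need to mount it.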

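来了老弟 · April 26, 2021, 07:00

Right. We are tracking the problem that the BR restore instructions do not cover TiFlash nodes:
https://github.com/pingcap/docs-cn/pull/6172
Todo:
Mount the NFS share on the TiFlash nodes of the v4.0.11 cluster, then run the restore again and see whether it succeeds.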

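挖掘机呀 · April 26, 2021, 07:01

This is a test cluster, so I have scaled in all of the TiFlash nodes. I will try again now that there is no TiFlash.

来了老弟 · April 26, 2021, 07:03

That will certainly work. It would be more valuable to try it with the TiFlash nodes still in the cluster.

挖掘机呀 · April 26, 2021, 07:15

So you mean the earlier restore failed because the TiFlash nodes did not have the NFS shared storage mounted, and the errors in the log are there because the TiFlash nodes really could not download the SST files?

来了老弟 · April 26, 2021, 07:19

Yes~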
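
The peers list in the error log from the first post is consistent with this explanation: the Region has eight peers, three of them marked is_learner:true, and "download sst failed" is repeated eight times. TiFlash replicas join a Region's Raft group as learner peers, so those three stores are presumably TiFlash nodes, although the thread does not show the store list. A small sketch for pulling the learner stores out of such a log line; parse_peers is a hypothetical helper name, and the sample string is copied from the log above.

```python
import re

# One peer looks like: id:343132 store_id:161 is_learner:true
PEER_RE = re.compile(r"\bid:(\d+) store_id:(\d+)( is_learner:true)?")

# peers field copied from the "download file failed" log entry in this thread
SAMPLE_PEERS = (
    "id:343127 store_id:4 ,id:343128 store_id:6 ,id:343129 store_id:8 ,"
    "id:343130 store_id:5 ,id:343131 store_id:7 ,"
    "id:343132 store_id:161 is_learner:true ,"
    "id:343138 store_id:158 is_learner:true ,"
    "id:343614 store_id:159 is_learner:true "
)


def parse_peers(log_text):
    """Extract peer id, store id and learner flag for every peer in log_text."""
    return [
        {"peer_id": int(peer_id), "store_id": int(store_id), "learner": bool(learner)}
        for peer_id, store_id, learner in PEER_RE.findall(log_text)
    ]


peers = parse_peers(SAMPLE_PEERS)
learner_stores = [p["store_id"] for p in peers if p["learner"]]
print(len(peers), learner_stores)
```

The store IDs can then be matched against the cluster's store list to see which hosts need access to the backup directory.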

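来了老弟 · April 26, 2021, 08:15

@挖掘机呀 How did it go?

挖掘机呀 · April 26, 2021, 09:42

The data is still being restored. This is a same-city active-active deployment with 5 replicas, and the restore is also rate-limited, so it is fairly slow. The status looks normal, though, and there are no more SST-not-found errors.

挖掘机呀 · April 26, 2021, 09:43

Thanks to the PingCAP experts for the help.

I hope the official documentation can also be updated to state that TiFlash nodes need to mount the NFS storage as well.

来了老弟 · April 27, 2021, 02:06

@挖掘机呀 Did the restore succeed?

Just follow that PR; the documentation will be updated first.

Kongdom · December 25, 2021, 04:01

Within the same major version, can a backup taken on a lower version be restored to a higher version?

来了老弟 · March 25, 2022, 03:19

In theory, yes. We often use BR to back up a lower-version TiDB cluster and restore it into a higher-version TiDB cluster, to carry over the existing data during an upgrade. If that causes a problem, it is not the expected behavior and you can report it to us.

Kongdom · March 25, 2022, 03:27

Thanks for the answer.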