BR backup fails with a KV I/O error

TiDB v5.1.1

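Every br backup fails immediately, whether it is a full backup, a single database, or a single table.
The TiKV backup directory is mounted from a separate NFS server, not the server where br is run.
The cluster was created with the tidb user. Could running br as root be a problem? The error isn't a permission denied error, though.
The log is as follows: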
[2021/09/22 11:03:30.762 +08:00] [INFO] [collector.go:66] ["Table backup failed summary"] [total-ranges=6] [ranges-succeed=0] [ranges-failed=6] [backup-total-regions=6] [backup-total-ranges=6] [unit-name="range start:748000000000004ceb5f720000000000000000 end:748000000000004ceb5f72ffffffffffffffff00"] [error="[BR:KV:ErrKVStorage]tikv storage occur I/O error"] [errorVerbose="[BR:KV:ErrKVStorage]tikv storage occur I/O error\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/juju_adaptor.go:15\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRange\n\tgithub.com/pingcap/br/pkg/backup/client.go:473\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRanges.func1\n\tgithub.com/pingcap/br/pkg/backup/client.go:424\ngithub.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\tgithub.com/pingcap/br/pkg/utils/worker.go:63\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1371"] [unit-name="range start:748000000000004ceb5f69800000000000000100 end:748000000000004ceb5f698000000000000001fb"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(*client).GetAllStores\n\tgithub.com/tikv/pd@v1.1.0-beta.0.20210323121136-78679e5e209d/client/client.go:1196\ngithub.com/pingcap/br/pkg/conn.GetAllTiKVStores\n\tgithub.com/pingcap/br/pkg/conn/conn.go:139\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRange\n\tgithub.com/pingcap/br/pkg/backup/client.go:459\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRanges.func1\n\tgithub.com/pingcap/br/pkg/backup/client.go:424\ngithub.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\tgithub.com/pingcap/br/pkg/utils/worker.go:63\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1371"] [unit-name="range start:748000000000004ceb5f69800000000000000200 end:748000000000004ceb5f698000000000000002fb"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(*client).GetAllStores\n\tgithub.com/tikv/pd@v1.1.0-beta.0.20210323121136-78679e5e209d/client/client.go:1196\ngithub.com/pingcap/br/pkg/conn.GetAllTiKVStores\n\tgithub.com/pingcap/br/pkg/conn/conn.go:139\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRange\n\tgithub.com/pingcap/br/pkg/backup/client.go:459\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRanges.func1\n\tgithub.com/pingcap/br/pkg/backup/client.go:424\ngithub.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\tgithub.com/pingcap/br/pkg/utils/worker.go:63\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1371"] [unit-name="range start:748000000000004ceb5f69800000000000000300 end:748000000000004ceb5f698000000000000003fb"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(*client).GetAllStores\n\tgithub.com/tikv/pd@v1.1.0-beta.0.20210323121136-78679e5e209d/client/client.go:1196\ngithub.com/pingcap/br/pkg/conn.GetAllTiKVStores\n\tgithub.com/pingcap/br/pkg/conn/conn.go:139\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRange\n\tgithub.com/pingcap/br/pkg/backup/client.go:459\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRanges.func1\n\tgithub.com/pingcap/br/pkg/backup/client.go:424\ngithub.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\tgithub.com/pingcap/br/pkg/utils/worker.go:63\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1371"] [unit-name="range start:748000000000004ceb5f69800000000000000400 end:748000000000004ceb5f698000000000000004fb"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(*client).GetAllStores\n\tgithub.com/tikv/pd@v1.1.0-beta.0.20210323121136-78679e5e209d/client/client.go:1196\ngithub.com/pingcap/br/pkg/conn.GetAllTiKVStores\n\tgithub.com/pingcap/br/pkg/conn/conn.go:139\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRange\n\tgithub.com/pingcap/br/pkg/backup/client.go:459\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRanges.func1\n\tgithub.com/pingcap/br/pkg/backup/client.go:424\ngithub.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\tgithub.com/pingcap/br/pkg/utils/worker.go:63\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1371"] [unit-name="range start:748000000000004ceb5f69800000000000000500 end:748000000000004ceb5f698000000000000005fb"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(*client).GetAllStores\n\tgithub.com/tikv/pd@v1.1.0-beta.0.20210323121136-78679e5e209d/client/client.go:1196\ngithub.com/pingcap/br/pkg/conn.GetAllTiKVStores\n\tgithub.com/pingcap/br/pkg/conn/conn.go:139\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRange\n\tgithub.com/pingcap/br/pkg/backup/client.go:459\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRanges.func1\n\tgithub.com/pingcap/br/pkg/backup/client.go:424\ngithub.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\tgithub.com/pingcap/br/pkg/utils/worker.go:63\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1371"]
Table backup <…> 0.00%
Error: [BR:KV:ErrKVStorage]tikv storage occur I/O error

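There's too little information here. You've only provided the error log; what does your environment look like? Did you follow the steps in the official documentation?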

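The cluster status is entirely normal; it is an ordinary deployment with 3 PD, 3 TiDB, and 12 TiKV nodes.
Memory, CPU, and disk are all plentiful, and br doesn't use many resources.
Command used: ./br backup table --pd "pdleaderhost:port" --db db --table tb --storage "local:///data/2021-9-22-12" --ratelimit 120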

Error: [BR:KV:ErrKVStorage]tikv storage occur I/O error

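In every similar error I found in the community, the error log showed what had gone wrong, but with the log I'm getting now the cause isn't clear.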

1. All of your TiKV nodes have /data/2021-9-22-12 mounted, right? (The permissions need to be correct too.)

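It is mounted in production and not mounted in the test environment, and both report the same error.
So I don't think the shared disk is the problem.
Both of my clusters currently have I/O load above 70%. Could br be aborting the backup because the cluster's I/O load is high?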

The I/O load just dropped to 10% and I ran the backup again; it failed with the same error.

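That's possible too, but when troubleshooting I'd still suggest first checking the permissions and whether the disk is mounted correctly, and only then looking into other causes.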
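
The permission and mount checks suggested here can be scripted. This is only a minimal sketch, assuming a POSIX shell on each TiKV host; the function name `check_backup_dir` is made up for illustration, and it should be run as the user the TiKV process runs as (the tidb user in this thread), otherwise the write test says little:

```shell
#!/bin/sh
# Hypothetical helper (not part of br or TiKV): verify that a backup
# directory exists and is writable by the current user.
# Run it on every TiKV host, as the TiKV service user.
check_backup_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir does not exist or is not a directory" >&2
        return 1
    fi
    # Actually create and remove a file: on NFS the permission bits alone
    # can be misleading (for example when root squashing is enabled).
    probe="$dir/.br_write_probe.$$"
    if ! ( : > "$probe" ) 2>/dev/null; then
        echo "not writable: $dir" >&2
        return 2
    fi
    rm -f "$probe"
    # Show which filesystem backs the directory, to confirm it is the
    # expected NFS mount rather than the local disk.
    df -P "$dir"
}
```

For the path in this thread the call would be `check_backup_dir /data/2021-9-22-12`; a return code of 1 means the directory itself is absent on that host.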


This is from one of them; the other TiKV nodes check out correctly as well.

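I tracked it down myself: manually create the subdirectory under the shared directory and then point -s at it. I only found this after checking the TiKV log.
All I can say is that this error message is really misleading; it doesn't point at the actual problem at all.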
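
The fix described above can be sketched as a small wrapper that creates the subdirectory before br is invoked. This is an illustration under assumptions, not br's own behaviour: `ensure_backup_subdir` is a hypothetical name, the shared directory is assumed to be an absolute path, and on storage that is not shared the directory would have to be created on every TiKV host.

```shell
#!/bin/sh
# Hypothetical helper (not part of br): create the per-backup subdirectory
# under the shared mount ahead of time and print the storage URL for br.
ensure_backup_subdir() {
    root="$1"   # absolute path of the shared directory, e.g. /data
    name="$2"   # per-backup subdirectory, e.g. 2021-9-22-12
    if [ ! -d "$root" ]; then
        echo "shared directory $root is not present" >&2
        return 1
    fi
    mkdir -p "$root/$name" || return 1
    echo "local://$root/$name"
}

# Usage, reusing the command from this thread (placeholders unchanged):
#   storage=$(ensure_backup_subdir /data 2021-9-22-12) &&
#   ./br backup table --pd "pdleaderhost:port" --db db --table tb \
#       -s "$storage" --ratelimit 120
```

Refusing to run when the shared directory itself is missing avoids silently writing the backup onto a local disk when the NFS mount is down.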

:ok_hand::smile::smile:

We'll improve this error message.
