tiup br backup fails with: Error: error happen in store 5 at 10.3.8.199:20160: Io(Os { code: 22, kind: InvalidInput, message: "Invalid argument" }): [BR:KV:ErrKVStorage]tikv storage occur I/O error

【TiDB environment】Production
【TiDB version】v5.4.0
【Problem】tiup br backup fails with: Error: error happen in store 5 at 10.3.8.199:20160: Io(Os { code: 22, kind: InvalidInput, message: "Invalid argument" }): [BR:KV:ErrKVStorage]tikv storage occur I/O error
【Reproduction path】The logs of all 3 TiKV nodes contain a large number of entries like the following:
[2022/05/05 09:59:26.438 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fbffcef4590 for subchannel 0x7fc073210c80"]
[2022/05/05 09:59:26.538 +08:00] [WARN] [kv.rs:1092] ["call CheckLeader failed"] [err=Grpc(RemoteStopped)]
[2022/05/05 09:59:27.849 +08:00] [WARN] [kv.rs:1092] ["call CheckLeader failed"] [err=Grpc(RemoteStopped)]
[2022/05/05 09:59:27.891 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fc041562a30 for subchannel 0x7fc072eee300"]
[2022/05/05 09:59:27.892 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fc03ad381c0 for subchannel 0x7fc072eef9c0"]
[2022/05/05 09:59:29.442 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fbffcdddd10 for subchannel 0x7fc073224540"]
[2022/05/05 09:59:29.850 +08:00] [WARN] [kv.rs:1092] ["call CheckLeader failed"] [err=Grpc(RemoteStopped)]
[2022/05/05 09:59:29.893 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fbfda50b300 for subchannel 0x7fc072eeef40"]
[2022/05/05 09:59:31.895 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fc03adb9060 for subchannel 0x7fc072eef9c0"]
[2022/05/05 09:59:32.545 +08:00] [WARN] [kv.rs:1092] ["call CheckLeader failed"] [err=Grpc(RemoteStopped)]
[2022/05/05 09:59:33.896 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fbf81e36b90 for subchannel 0x7fc072eefb80"]
[2022/05/05 09:59:34.447 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fbfdb85f460 for subchannel 0x7fc073225340"]
[2022/05/05 09:59:34.547 +08:00] [WARN] [kv.rs:1092] ["call CheckLeader failed"] [err=Grpc(RemoteStopped)]
[2022/05/05 09:59:35.195 +08:00] [WARN] [kv.rs:1092] ["call CheckLeader failed"] [err=Grpc(RemoteStopped)]
【Symptoms and impact】I found a similar post on the forum, which seems to say this is a TiDB bug.

In our v5.4.0 cluster, the logs of all 3 TiKV nodes show the errors described in that post. Is this a TiDB bug, or something else?

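So far, backing up with br to a local path works fine, but backing up to a shared storage disk exported via Samba fails:
1. Backup to a local path (/tmp/tikv_car_news_2022-05-04_bk; the owner and group of this directory are set to tidb):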

tiup br backup db --pd "10.3.8.196:2379" --db car_news --storage "local:///tmp/tikv_car_news_2022-05-04_bk" --ratelimit 128 --log-file backuptable.log

tiup is checking updates for component br …
Starting component br: /root/.tiup/components/br/v5.4.0/br /root/.tiup/components/br/v5.4.0/br backup db --pd 10.3.8.196:2379 --db car_news --storage local:///tmp/tikv_car_news_2022-05-04_bk --ratelimit 128 --log-file backuptable.log
Detail BR log in backuptable.log
Database backup <-------------------------------------------------------------------------------------------> 100.00%
Checksum <--------------------------------------------------------------------------------------------------> 100.00%
[2022/05/05 10:06:43.533 +08:00] [INFO] [collector.go:67] ["Database backup success summary"] [total-ranges=31] [ranges-succeed=31] [ranges-failed=0] [backup-checksum=299.112179ms] [backup-fast-checksum=3.672789ms] [backup-total-regions=23] [backup-total-ranges=22] [total-take=5.056491896s] [BackupTS=432987543561568258] [total-kv=1467304] [total-kv-size=449.8MB] [average-speed=88.96MB/s] [backup-data-size(after-compressed)=35.89MB] [Size=35890697]

2. Backup to the shared disk fails:

tiup br backup db --pd "10.3.8.196:2379" --db car_news --storage "local:///tidb_backup_data/nfs/backup/tikv_car_news_2022-05-04_bk" --ratelimit 128 --log-file backuptable.log

[error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled
github.com/tikv/pd/client.(*client).GetAllStores
\t/nfs/cache/mod/github.com/tikv/pd@v1.1.0-beta.0.20211118054146-02848d2660ee/client/client.go:1523
github.com/pingcap/tidb/br/pkg/conn.GetAllTiKVStores
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/conn/conn.go:142
github.com/pingcap/tidb/br/pkg/conn.GetAllTiKVStoresWithRetry.func1
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/conn/conn.go:179
github.com/pingcap/tidb/br/pkg/utils.WithRetry
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/utils/retry.go:58
github.com/pingcap/tidb/br/pkg/conn.GetAllTiKVStoresWithRetry
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/conn/conn.go:176
github.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRange
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:511
github.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRanges.func1
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:471
github.com/pingcap/tidb/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/utils/worker.go:73
golang.org/x/sync/errgroup.(*Group).Go.func1
\t/nfs/cache/mod/golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57
runtime.goexit
\t/usr/local/go/src/runtime/asm_amd64.s:1371"] [unit-name="range start:74800000000000094e5f69800000000000000100 end:74800000000000094e5f698000000000000001fb"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled
github.com/tikv/pd/client.(*client).GetAllStores
\t/nfs/cache/mod/github.com/tikv/pd@v1.1.0-beta.0.20211118054146-02848d2660ee/client/client.go:1523
github.com/pingcap/tidb/br/pkg/conn.GetAllTiKVStores
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/conn/conn.go:142
github.com/pingcap/tidb/br/pkg/conn.GetAllTiKVStoresWithRetry.func1
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/conn/conn.go:179
github.com/pingcap/tidb/br/pkg/utils.WithRetry
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/utils/retry.go:58
github.com/pingcap/tidb/br/pkg/conn.GetAllTiKVStoresWithRetry
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/conn/conn.go:176
github.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRange
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:511
github.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRanges.func1
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:471
github.com/pingcap/tidb/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/utils/worker.go:73
golang.org/x/sync/errgroup.(*Group).Go.func1
\t/nfs/cache/mod/golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57
runtime.goexit
\t/usr/local/go/src/runtime/asm_amd64.s:1371"]
Error: error happen in store 5 at 10.3.8.199:20160: Io(Os { code: 13, kind: PermissionDenied, message: "Permission denied" }): [BR:KV:ErrKVStorage]tikv storage occur I/O error

Could the permissions on the TiKV data directory be wrong?

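I just checked, and it is indeed a permission problem on the backup path of the shared disk. After mounting it, I manually ran the following command: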

chown -R tidb.tidb /tidb_backup_data/nfs/backup/

However, the change never took effect.

Check the directory permissions and the node monitoring.

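The shared disk is provisioned by our ops team through Samba. On each of the 3 TiKV nodes I mounted it as root and then changed the owner of the directory to tidb. The command ran without any error, but when I checked again the change had not taken effect: the directory is still owned by root.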

Run command on 10.3.8.200(sudo:true): ls -l /tidb_backup_data/nfs/backup/enterprise_group
Run command on 10.3.8.199(sudo:true): ls -l /tidb_backup_data/nfs/backup/enterprise_group
Run command on 10.3.8.198(sudo:true): ls -l /tidb_backup_data/nfs/backup/enterprise_group
Outputs of ls -l /tidb_backup_data/nfs/backup/enterprise_group on 10.3.8.198:
stdout:
total 0
drwxrwxrwx. 2 root root 0 May 5 12:16 car_news_0505_bk

Outputs of ls -l /tidb_backup_data/nfs/backup/enterprise_group on 10.3.8.199:
stdout:
total 0
drwxrwxrwx. 2 root root 0 May 5 12:16 car_news_0505_bk

Outputs of ls -l /tidb_backup_data/nfs/backup/enterprise_group on 10.3.8.200:
stdout:
total 0
drwxrwxrwx. 2 root root 0 May 5 12:16 car_news_0505_bk

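When TiDB backs up to shared storage, does the backup directory on the shared storage have to be owned by tidb?
The share here is mounted with 0777 permissions, so in theory the tidb user should also be allowed to write backups to it.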
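
One possible explanation for the chown that reports no error but changes nothing, assuming the share is mounted with the cifs filesystem type: the mount.cifs manual notes that when the server does not provide Unix ownership information, files appear to be owned by the values of the uid= and gid= mount options, and chmod/chown on them return success but have no effect. In that case the owner has to be chosen at mount time. The sketch below only prints a candidate mount command for review and does not run it. The helper name cifs_mount_cmd, the share //fileserver/tidb_backup, and the 0660/0770 modes are illustrative assumptions, and authentication options such as credentials= are left out.

```shell
#!/bin/sh
# cifs_mount_cmd SHARE MOUNTPOINT USER   (hypothetical helper)
# Print, without executing, a mount command that makes USER the apparent
# owner of everything under MOUNTPOINT via the cifs uid=/gid= options.
cifs_mount_cmd() {
  share=$1
  mountpoint=$2
  user=$3
  # Resolve the numeric IDs on this node; fail if the user does not exist.
  uid=$(id -u "$user") || return 1
  gid=$(id -g "$user") || return 1
  printf 'mount -t cifs %s %s -o uid=%s,gid=%s,file_mode=0660,dir_mode=0770\n' \
    "$share" "$mountpoint" "$uid" "$gid"
}

# Example (the share name is an assumption, not taken from the thread):
#   cifs_mount_cmd //fileserver/tidb_backup /tidb_backup_data/nfs/backup tidb
```

The printed command would need to be run as root on every TiKV node after unmounting the existing mount. Whether this resolves the Permission denied error depends on how the Samba server itself is configured, which the thread does not show.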

It seems that during restore, not being root can indeed cause problems (restore requires TiKV to be able to run certain commands successfully).

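BR backup and restore require permissions for two users:

  1. The user running BR needs read and write permission on the backup storage (BR is usually run as root).
    BR needs to manage the target storage, to guard against data being corrupted by multiple BR runs writing to the same location.
  2. The user running TiKV needs read and write permission on the backup storage (usually the tidb user).
    TiKV needs to read and write the backup data.

Check the permissions of both accounts on the target storage.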
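
A minimal sketch of such a check, assuming the backup storage is a directory mounted on the node and that sudo is available: it tries to create, read back, and delete a small probe file as a given user. The helper name check_rw and the probe file name are hypothetical, not part of BR or TiKV.

```shell
#!/bin/sh
# check_rw DIR [USER]   (hypothetical helper)
# Try to create, read back, and delete a probe file in DIR.
# With USER set, the probe runs as that user through non-interactive sudo;
# otherwise it runs as the current user. Returns 0 only if every step works.
check_rw() {
  dir=$1
  user=${2:-}
  probe="$dir/.br_rw_probe.$$"
  script='echo probe > "$1" && cat "$1" > /dev/null && rm -f "$1"'
  if [ -n "$user" ]; then
    sudo -n -u "$user" sh -c "$script" sh "$probe" 2>/dev/null
  else
    sh -c "$script" sh "$probe" 2>/dev/null
  fi
}

# Example, using the path from this thread (run on every TiKV node):
#   check_rw /tidb_backup_data/nfs/backup tidb && echo "TiKV user: ok"
#   check_rw /tidb_backup_data/nfs/backup      && echo "BR user: ok"
```

Probing with a real write, rather than only reading the mode bits from ls -l, matters on network shares, where the server can still refuse a write that the displayed permissions appear to allow.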