BR backup fails with "Disk quota exceeded", but disk space and inodes are sufficient

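【TiDB environment】Production / Test / PoC
Production
【TiDB version】
v4.0.15
【Problem encountered】
【Reproduction steps】What operations led to the problem
【Symptoms and impact】
During a BR backup, "Disk quota exceeded" is reported and the backup fails, even though disk space and inodes are sufficient.
【Attachments】

  • Relevant logs, configuration files, Grafana monitoring (https://metricstool.pingcap.com/)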
    [2022/04/15 02:06:35.489 +08:00] [ERROR] [router.rs:174] [“failed to send significant msg”] [msg=LeaderCallback(Callback::Read(…))]
    [2022/04/15 02:09:16.382 +08:00] [ERROR] [router.rs:174] [“failed to send significant msg”] [msg=“CaptureChange { cmd: RegisterObserver { observe_id: ObserveID(360127), region_id: 2602086, enabled: true }, region_epoch: conf_ver: 18 version: 5692, callback: Callback::Read(…) }”]
    [2022/04/15 02:09:16.911 +08:00] [ERROR] [router.rs:174] [“failed to send significant msg”] [msg=LeaderCallback(Callback::Read(…))]
    [2022/04/15 02:16:21.872 +08:00] [ERROR] [router.rs:174] [“failed to send significant msg”] [msg=LeaderCallback(Callback::Read(…))]
    [2022/04/15 03:28:41.809 +08:00] [ERROR] [service.rs:86] [“backup canceled”] [error=RemoteStopped]
    [2022/04/15 03:28:59.256 +08:00] [ERROR] [endpoint.rs:269] [“backup save file failed”] [err_code=KV:Unknown] [err=“Io(Os { code: 122, kind: Other, message: “Disk quota exceeded” })”]
    [2022/04/15 03:28:59.257 +08:00] [ERROR] [endpoint.rs:689] [“backup region failed”] [err_code=KV:Unknown] [err=“Io(Os { code: 122, kind: Other, message: “Disk quota exceeded” })”] [end_key=7480000000000001875F698000000000000005014152493132383839FF3134353837000000FC03800000008AEECD28] [start_key=7480000000000001875F698000000000000005014152493132383832FF3930353932000000FC03800000008ADBC229] [region=“id: 2532541 start_key: 7480000000000001FF875F698000000000FF0000050141524931FF32383832FF393035FF3932000000FC0380FF0000008ADBC22900FE end_key: 7480000000000001FF875F698000000000FF0000050141524931FF32383839FF313435FF3837000000FC0380FF0000008AEECD2800FE region_epoch { conf_ver: 17 version: 17127 } peers { id: 2532542 store_id: 2 } peers { id: 2532543 store_id: 7 } peers { id: 2532544 store_id: 1 }”]
    [2022/04/15 03:28:59.257 +08:00] [ERROR] [endpoint.rs:718] [“backup failed to send response”] [err_code=KV:Unknown] [err=“TrySendError { kind: Disconnected }”]
    [2022/04/15 03:29:00.334 +08:00] [ERROR] [endpoint.rs:718] [“backup failed to send response”] [err_code=KV:Unknown] [err=“TrySendError { kind: Disconnected }”]
    [2022/04/15 03:29:01.201 +08:00] [ERROR] [endpoint.rs:718] [“backup failed to send response”] [err_code=KV:Unknown] [err=“TrySendError { kind: Disconnected }”]
    [2022/04/15 03:29:01.210 +08:00] [ERROR] [endpoint.rs:718] [“backup failed to send response”] [err_code=KV:Unknown] [err=“TrySendError { kind: Disconnected }”]
    [2022/04/15 03:29:01.228 +08:00] [ERROR] [endpoint.rs:718] [“backup failed to send response”] [err_code=KV:Unknown] [err=“TrySendError { kind: Disconnected }”]
    [2022/04/15 03:29:01.239 +08:00] [ERROR] [endpoint.rs:718] [“backup failed to send response”] [err_code=KV:Unknown] [err=“TrySendError { kind: Disconnected }”]
    [2022/04/15 03:29:01.276 +08:00] [ERROR] [endpoint.rs:718] [“backup failed to send response”] [err_code=KV:Unknown] [err=“TrySendError { kind: Disconnected }”]
    [2022/04/15 03:29:01.356 +08:00] [ERROR] [endpoint.rs:718] [“backup failed to send response”] [err_code=KV:Unknown] [err=“TrySendError { kind: Disconnected }”]
    [2022/04/15 03:29:01.356 +08:00] [ERROR] [endpoint.rs:718] [“backup failed to send response”] [err_code=KV:Unknown] [err=“TrySendError { kind: Disconnected }”]
    [2022/04/15 03:29:01.356 +08:00] [ERROR] [endpoint.rs:718] [“backup failed to send response”] [err_code=KV:Unknown] [err=“TrySendError { kind: Disconnected }”]
    [2022/04/15 03:29:01.356 +08:00] [ERROR] [endpoint.rs:718] [“backup failed to send response”] [err_code=KV:Unknown] [err=“TrySendError { kind: Disconnected }”]
    [2022/04/15 03:29:01.374 +08:00] [ERROR] [endpoint.rs:718] [“backup failed to send response”] [err_code=KV:Unknown] [err=“TrySendError { kind: Disconnected }”]

Are you backing up to local storage? If so, you need to check the storage on every TiKV node.

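Please confirm a few things:
1. Is the backup directory on each TiKV node a shared disk or a local disk?
2. Is there enough space on the cluster's own TiKV data disks?
3. Are all cluster nodes in a normal state? Check the output of tiup cluster display.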
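
The first two checks can be scripted and run on every TiKV node. This is a minimal sketch, assuming GNU df on Linux; `report_usage` is a hypothetical helper name, and the path in the usage comment is a placeholder.

```shell
# Print block and inode usage for the filesystem that holds DIR.
report_usage() {
    dir="$1"
    # -P forces one line per filesystem, so the data row is always line 2
    # and the usage percentage is always column 5.
    space=$(df -P "$dir" | awk 'NR == 2 { print $5 }')
    inodes=$(df -Pi "$dir" | awk 'NR == 2 { print $5 }')
    echo "$dir space_used=$space inodes_used=$inodes"
}

# Example (placeholder path): report_usage /path/to/backup/dir
```

Run it once for the backup directory and once for the TiKV data directory on each node. Some filesystems do not report inode counts, in which case the inode column may show "-".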

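1. It is shared storage with 240 TB of capacity, of which only 60 TB is used so far.
2. The TiKV data disks are only 30% used, with 2.4 TB still free.
3. The cluster status is normal.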

Have you tested whether the TiKV nodes can write files to the backup directory normally?

This is a production environment. Backups have been running for several months without any problems.

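Is there read/write monitoring for the shared storage? And what do the disks on the other TiKV nodes look like?

Could a quota have been set on the disk you are backing up to? It is worth checking from that angle.

Has a quota been configured on the disk?

The storage is from a cloud vendor, and they say there is no limit. We did not see this error at the operating system level either. Other MySQL database backups are also stored in this directory, and they do not report this error.

TiDB has its own built-in monitoring, and the disks on the TiKV nodes show no problems.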

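I suggest running quota manually to confirm that no limit is set for the current user.

Or check /etc/fstab and confirm that the mounted NFS disk has no usrquota or grpquota options. This kind of disk quota is usually set by your own system when the disk is mounted, not by the cloud vendor.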
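
A small sketch of that /etc/fstab check. `find_quota_mounts` is a hypothetical helper; it only looks for the literal usrquota and grpquota options mentioned above, so a limit enforced on the NFS server side may not show up here. It also scans /proc/mounts, which lists the options currently in effect in the same field order.

```shell
# List mount entries whose options include usrquota or grpquota.
# Usage: find_quota_mounts [file]   (defaults to /etc/fstab)
find_quota_mounts() {
    file="${1:-/etc/fstab}"
    # Skip comment lines; field 2 is the mount point and
    # field 4 holds the comma-separated mount options.
    awk '$1 !~ /^#/ && NF >= 4 && $4 ~ /(^|,)(usrquota|grpquota)(=|,|$)/ { print $2 ": " $4 }' "$file"
}

# Check both the configured mounts and the mounts currently in effect.
for f in /etc/fstab /proc/mounts; do
    if [ -r "$f" ]; then
        find_quota_mounts "$f"
    fi
done
```

No output means neither option was found in either file.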

Which version of BR are you using?

tidb-toolkit-v4.0.15-linux-amd64

No such options are specified.
com:/share-40af9d19/cnhw-vm-tikv-am01 /nfs/backup nfs vers=3,timeo=600,nolock 0 0

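Run this as root:

quota -v -f /nfs/backup

and check the output. Also try writing some large files to this directory, for example use dd to write a file of a few tens of GB.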
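
The dd test can be wrapped so that it flushes the data and cleans up after itself. This is a sketch, assuming GNU dd on Linux; `write_test` is a hypothetical helper name. `conv=fsync` is used because on NFS a write error may only be reported once the data is flushed. For reference, OS error code 122 in the TiKV log is EDQUOT on Linux. Quotas are usually per user or group, so it is probably worth running this as the OS user that the TiKV process runs as, not only as root.

```shell
# Write SIZE_MIB MiB of zeros into DIR, flush to storage,
# report the result, and remove the test file afterwards.
write_test() {
    dir="$1"
    size_mib="$2"
    target="$dir/quota_write_test.$$"
    # dd prints its own error (e.g. "Disk quota exceeded") on stderr.
    if dd if=/dev/zero of="$target" bs=1M count="$size_mib" conv=fsync; then
        echo "OK: wrote ${size_mib} MiB to $dir"
        status=0
    else
        echo "FAILED: could not write ${size_mib} MiB to $dir"
        status=1
    fi
    rm -f "$target"
    return "$status"
}

# Example (50 GiB): write_test /nfs/backup 51200
```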

Writing a 500 GB file works without any problem.