BR reports an error when backing up TiDB to Alibaba Cloud OSS

[TiDB environment] Production
[TiDB version] V7.5.0
[Reproduction path] Ran a log backup task with BR V7.5.7
[Problem: symptoms and impact] Detail BR log in brbackuplog-20251106091851.log
[2025/11/06 09:22:19.961 +08:00] [INFO] [collector.go:77] ["log start failed summary"] [total-ranges=1] [ranges-succeed=0] [ranges-failed=1] [unit-name="log start"] [error="RequestError: send request failed\ncaused by: Head \"https://csi-tidb-backup.oss-cn-shanghai.aliyuncs.com/backup-data/log-backup/backup.lock\": read tcp 10.207.38.40:38904->106.14.228.174:443: read: connection reset by peer"] [errorVerbose="RequestError: send request failed\ncaused by: Head \"https://csi-tidb-backup.oss-cn-shanghai.aliyuncs.com/backup-data/log-backup/backup.lock\": read tcp 10.207.38.40:38904->106.14.228.174:443: read: connection reset by peer\ngithub.com/pingcap/errors.Trace\n\t/root/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20241219054535-6b8c588c3122/juju_adaptor.go:15\ngithub.com/pingcap/tidb/br/pkg/storage.(*S3Storage).FileExists\n\t/workspace/source/tidb/br/pkg/storage/s3.go:676\ngithub.com/pingcap/tidb/br/pkg/task.(*streamMgr).checkLock\n\t/workspace/source/tidb/br/pkg/task/stream.go:350\ngithub.com/pingcap/tidb/br/pkg/task.RunStreamStart\n\t/workspace/source/tidb/br/pkg/task/stream.go:590\ngithub.com/pingcap/tidb/br/pkg/task.RunStreamCommand\n\t/workspace/source/tidb/br/pkg/task/stream.go:531\nmain.streamCommand\n\t/workspace/source/tidb/br/cmd/br/stream.go:232\nmain.newStreamStartCommand.func1\n\t/workspace/source/tidb/br/cmd/br/stream.go:70\ngithub.com/spf13/cobra.(*Command).execute\n\t/root/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:940\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/root/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:1068\ngithub.com/spf13/cobra.(*Command).Execute\n\t/root/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:992\nmain.main\n\t/workspace/source/tidb/br/cmd/br/main.go:36\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"]
Error: RequestError: send request failed
caused by: Head "https://csi-tidb-backup.oss-cn-shanghai.aliyuncs.com/backup-data/log-backup/backup.lock": read tcp 10.207.38.40:38904->106.14.228.174:443: read: connection reset by peer
[Resource configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page

[Attachments: screenshots/logs/monitoring] The log errors are as follows:
[2025/11/06 09:18:52.237 +08:00] [INFO] [s3.go:425] ["succeed to get bucket region from s3"] ["bucket region"=]
[2025/11/06 09:18:52.268 +08:00] [INFO] [common.go:180] ["trying to connect to etcd"] [addr="[10.207.38.42:2379]"]
[2025/11/06 09:18:52.476 +08:00] [WARN] [s3.go:1176] ["failed to request s3, retrying"] [error="RequestError: send request failed\ncaused by: Head \"https://csi-tidb-backup.oss-cn-shanghai.aliyuncs.com/backup-data/log-backup/backup.lock\": read tcp 10.207.38.40:38882->106.14.228.174:443: read: connection reset by peer"] [backoff=1.292076205s]
[2025/11/06 09:18:53.788 +08:00] [WARN] [s3.go:1176] ["failed to request s3, retrying"] [error="RequestError: send request failed\ncaused by: Head \"https://csi-tidb-backup.oss-cn-shanghai.aliyuncs.com/backup-data/log-backup/backup.lock\": read tcp 10.207.38.40:38884->106.14.228.174:443: read: connection reset by peer"] [backoff=3.639634402s]
[2025/11/06 09:18:54.236 +08:00] [INFO] [pd.go:430] ["adaptive update ts interval state transition"] [configuredInterval=2s] [prevAdaptiveUpdateInterval=2s] [newAdaptiveUpdateInterval=2s] [requiredStaleness=0s] [prevState=unknown(0)] [newState=normal]
[2025/11/06 09:18:57.448 +08:00] [WARN] [s3.go:1176] ["failed to request s3, retrying"] [error="RequestError: send request failed\ncaused by: Head \"https://csi-tidb-backup.oss-cn-shanghai.aliyuncs.com/backup-data/log-backup/backup.lock\": read tcp 10.207.38.40:38886->106.14.228.174:443: read: connection reset by peer"] [backoff=6.854112628s]
[2025/11/06 09:19:04.322 +08:00] [WARN] [s3.go:1176] ["failed to request s3, retrying"] [error="RequestError: send request failed\ncaused by: Head \"https://csi-tidb-backup.oss-cn-shanghai.aliyuncs.com/backup-data/log-backup/backup.lock\": read tcp 10.207.38.40:38888->106.14.228.174:443: read: connection reset by peer"] [backoff=11.53871172s]
[2025/11/06 09:19:15.880 +08:00] [WARN] [s3.go:1176] ["failed to request s3, retrying"] [error="RequestError: send request failed\ncaused by: Head \"https://csi-tidb-backup.oss-cn-shanghai.aliyuncs.com/backup-data/log-backup/backup.lock\": read tcp 10.207.38.40:38890->106.14.228.174:443: read: connection reset by peer"] [backoff=26.958831968s]
[2025/11/06 09:19:42.860 +08:00] [WARN] [s3.go:1176] ["failed to request s3, retrying"] [error="RequestError: send request failed\ncaused by: Head \"https://csi-tidb-backup.oss-cn-shanghai.aliyuncs.com/backup-data/log-backup/backup.lock\": read tcp 10.207.38.40:38900->106.14.228.174:443: read: connection reset by peer"] [backoff=54.558994944s]
[2025/11/06 09:20:37.441 +08:00] [WARN] [s3.go:1176] ["failed to request s3, retrying"] [error="RequestError: send request failed\ncaused by: Head \"https://csi-tidb-backup.oss-cn-shanghai.aliyuncs.com/backup-data/log-backup/backup.lock\": read tcp 10.207.38.40:38902->106.14.228.174:443: read: connection reset by peer"] [backoff=1m42.46164352s]
[2025/11/06 09:22:19.959 +08:00] [INFO] [pd_service_discovery.go:249] ["[pd] exit member loop due to context canceled"]
...
[2025/11/06 09:22:19.960 +08:00] [INFO] [pd_service_discovery.go:295] ["[pd] close pd service discovery client"]
[2025/11/06 09:22:19.961 +08:00] [ERROR] [stream.go:532] ["failed to stream"] [command="log start"] [error="RequestError: send request failed\ncaused by: Head \"https://csi-tidb-backup.oss-cn-shanghai.aliyuncs.com/backup-data/log-backup/backup.lock\": read tcp 10.207.38.40:38904->106.14.228.174:443: read: connection reset by peer"] [errorVerbose="RequestError: send request failed\ncaused by: Head \"https://csi-tidb-backup.oss-cn-shanghai.aliyuncs.com/backup-data/log-backup/backup.lock\": read tcp 10.207.38.40:38904->106.14.228.174:443: read: connection reset by peer\ngithub.com/pingcap/errors.Trace\n\t/root/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20241219054535-6b8c588c3122/juju_adaptor.go:15\ngithub.com/pingcap/tidb/br/pkg/storage.(*S3Storage).FileExists\n\t/workspace/source/tidb/br/pkg/storage/s3.go:676\ngithub.com/pingcap/tidb/br/pkg/task.(*streamMgr).checkLock\n\t/workspace/source/tidb/br/pkg/task/stream.go:350\ngithub.com/pingcap/tidb/br/pkg/task.RunStreamStart\n\t/workspace/source/tidb/br/pkg/task/stream.go:590\ngithub.com/pingcap/tidb/br/pkg/task.RunStreamCommand\n\t/workspace/source/tidb/br/pkg/task/stream.go:531\nmain.streamCommand\n\t/workspace/source/tidb/br/cmd/br/stream.go:232\nmain.newStreamStartCommand.func1\n\t/workspace/source/tidb/br/cmd/br/stream.go:70\ngithub.com/spf13/cobra.(*Command).execute\n\t/root/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:940\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/root/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:1068\ngithub.com/spf13/cobra.(*Command).Execute\n\t/root/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:992\nmain.main\n\t/workspace/source/tidb/br/cmd/br/main.go:36\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"]
[stack="github.com/pingcap/tidb/br/pkg/task.RunStreamCommand\n\t/workspace/source/tidb/br/pkg/task/stream.go:532\nmain.streamCommand\n\t/workspace/source/tidb/br/cmd/br/stream.go:232\nmain.newStreamStartCommand.func1\n\t/workspace/source/tidb/br/cmd/br/stream.go:70\ngithub.com/spf13/cobra.(*Command).execute\n\t/root/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:940\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/root/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:1068\ngithub.com/spf13/cobra.(*Command).Execute\n\t/root/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:992\nmain.main\n\t/workspace/source/tidb/br/cmd/br/main.go:36\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267"]
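
Every retry in the log above fails against the same remote address. For a longer BR log, a quick way to confirm that is to count the resets per remote peer. The following is a minimal sketch using standard tools; `reset_peers` is a hypothetical helper name, not part of BR, and it assumes the log text arrives on stdin.

```shell
#!/usr/bin/env bash
# reset_peers: hypothetical helper, not part of BR.
# Reads BR log text on stdin and prints "<count> <remote ip:port>" for every
# remote peer that appears in a "connection reset by peer" error.
# Counts are per occurrence, so a line that repeats the error in both
# [error=...] and [errorVerbose=...] is counted twice.
reset_peers() {
  grep -o -- '->[0-9.]*:[0-9]*: read: connection reset by peer' \
    | sed -e 's/^->//' -e 's/: read: connection reset by peer$//' \
    | sort | uniq -c | sed -e 's/^ *//'
}

# Usage (log file name taken from the post above):
#   reset_peers < brbackuplog-20251106091851.log
```

If all resets point at one IP and port, the problem is more likely on the path to that specific endpoint than in BR's retry logic.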

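The TiDB cluster runs on on-premises VMware virtualization, and I want to back it up to Alibaba Cloud OSS. Network connectivity is already in place: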

[root@tidb ~]# telnet 106.14.228.174 443
Trying 106.14.228.174...
Connected to 106.14.228.174.
Escape character is '^]'.

I can't tell where the problem is at this point. Has any expert here run into this before?

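You could test with a single table to see whether the concurrency is too high and the OSS server side is throttling the requests.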

How are you running the backup?
Here is my command for reference:
br backup full --pd "xxx" --storage "s3://tsp-prod-tidb/${date1}?access-key=xxxx&secret-access-key=xxxx" --s3.provider "alibaba" --s3.region "oss-cn-shanghai" --s3.endpoint "https://oss-cn-shanghai-internal.aliyuncs.com"

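The root cause of the BR log backup failure is that the network connection to Alibaba Cloud OSS is being reset (connection reset by peer), specifically when BR tries to access the backup.lock file.
Although "the network is already connected", the connection may still be blocked in practice: a successful telnet only shows that the TCP handshake completes, not that the HTTPS request that follows gets through.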

  1. Test DNS resolution (make sure the OSS domain resolves)
    nslookup csi-tidb-backup.oss-cn-shanghai.aliyuncs.com

  2. Test connectivity on TCP port 443 (with telnet or nc)
    telnet csi-tidb-backup.oss-cn-shanghai.aliyuncs.com 443

  3. Test HTTPS access (use curl to simulate the HEAD request)
    curl -I https://csi-tidb-backup.oss-cn-shanghai.aliyuncs.com/backup-data/log-backup/backup.lock

  4. Run a long ping against the OSS IP to measure packet loss
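
Of the four checks above, only the curl one exercises TLS and HTTP, which is what BR actually uses. As a rough sketch, the curl check can be wrapped so that its exit code is mapped to the layer that most likely failed. The function names `explain_curl_exit` and `probe_url` are hypothetical; the exit codes (6, 7, 28, 35, 56) are curl's documented ones, and the wording of each diagnosis is only an interpretation.

```shell
#!/usr/bin/env bash
# explain_curl_exit: hypothetical helper that maps a curl exit code to the
# network layer that most likely failed.
explain_curl_exit() {
  case "$1" in
    0)  echo "ok: HTTP response received" ;;
    6)  echo "dns: could not resolve host" ;;
    7)  echo "tcp: could not connect" ;;
    28) echo "timeout: no response in time" ;;
    35) echo "tls: handshake failed (reset or rejected during TLS)" ;;
    56) echo "recv: connection dropped while reading the response" ;;
    *)  echo "other: curl exit $1" ;;
  esac
}

# probe_url: send one HEAD request and print the diagnosis.
# Without -f, curl exits 0 for any HTTP status (including 403/404), so "ok"
# only means the request reached the server and a response came back.
probe_url() {
  local url=$1 rc=0
  curl -sS -I --max-time 10 -o /dev/null "$url" 2>/dev/null || rc=$?
  explain_curl_exit "$rc"
}

# Usage:
#   probe_url https://csi-tidb-backup.oss-cn-shanghai.aliyuncs.com/backup-data/log-backup/backup.lock
```

A `tls` result while telnet to port 443 still connects would suggest that something on the path is acting on the TLS handshake (for example on the requested host name) rather than on the IP and port alone.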

Following this thread to learn.

[root@tidb ~]# nslookup csi-tidb-backup.oss-cn-shanghai.aliyuncs.com
Server: 10.206.32.5
Address: 10.206.32.5#53

Non-authoritative answer:
Name: csi-tidb-backup.oss-cn-shanghai.aliyuncs.com
Address: 106.14.228.174

[root@tidb ~]# telnet csi-tidb-backup.oss-cn-shanghai.aliyuncs.com 443
Trying 106.14.228.174...
Connected to csi-tidb-backup.oss-cn-shanghai.aliyuncs.com.
Escape character is '^]'.

[root@tidb ~]# curl -I https://csi-tidb-backup.oss-cn-shanghai.aliyuncs.com/backup-data/log-backup/backup.lock
curl: (35) TCP connection reset by peer

[root@tidb ~]# ping oss-cn-shanghai.aliyuncs.com
PING oss-cn-shanghai.aliyuncs.com (106.14.228.198) 56(84) bytes of data.
64 bytes from 106.14.228.198 (106.14.228.198): icmp_seq=1 ttl=89 time=17.0 ms
64 bytes from 106.14.228.198 (106.14.228.198): icmp_seq=2 ttl=89 time=17.2 ms
64 bytes from 106.14.228.198 (106.14.228.198): icmp_seq=3 ttl=89 time=17.2 ms
64 bytes from 106.14.228.198 (106.14.228.198): icmp_seq=4 ttl=89 time=17.0 ms
64 bytes from 106.14.228.198 (106.14.228.198): icmp_seq=5 ttl=89 time=17.3 ms
64 bytes from 106.14.228.198 (106.14.228.198): icmp_seq=6 ttl=89 time=16.9 ms
64 bytes from 106.14.228.198 (106.14.228.198): icmp_seq=7 ttl=89 time=17.4 ms
64 bytes from 106.14.228.198 (106.14.228.198): icmp_seq=8 ttl=89 time=17.2 ms
64 bytes from 106.14.228.198 (106.14.228.198): icmp_seq=9 ttl=89 time=16.5 ms
64 bytes from 106.14.228.198 (106.14.228.198): icmp_seq=10 ttl=89 time=17.0 ms

Let me try your command.

It may be blocked by a firewall or a security group.

That shouldn't be the case. The network team and the security group settings have both already been checked.

I tested with your command and got the same result:

[2025/11/07 09:47:25.119 +08:00] [INFO] [collector.go:77] ["Full Backup failed summary"] [total-ranges=0] [ranges-succeed=0] [ranges-failed=0]
Error: error occurred when checking backupmeta file: RequestError: send request failed
caused by: Head "https://csi-tidb-backup.oss-cn-shanghai.aliyuncs.com/backup-data/snapshot-20251107094019/backupmeta": read tcp 10.207.38.40:40656->106.14.228.174:443: read: connection reset by peer

The detailed log is:

[2025/11/07 09:40:22.360 +08:00] [INFO] [s3.go:425] ["succeed to get bucket region from s3"] ["bucket region"=oss-cn-shanghai]
[2025/11/07 09:40:22.397 +08:00] [WARN] [s3.go:1176] ["failed to request s3, retrying"] [error="RequestError: send request failed\ncaused by: Get \"https://csi-tidb-backup.oss-cn-shanghai.aliyuncs.com/?object-lock=\": read tcp 10.207.38.40:40614->106.14.228.174:443: read: connection reset by peer"] [backoff=1.931033146s]
[2025/11/07 09:40:23.102 +08:00] [INFO] [pd.go:430] ["adaptive update ts interval state transition"] [configuredInterval=2s] [prevAdaptiveUpdateInterval=2s] [newAdaptiveUpdateInterval=2s] [requiredStaleness=0s] [prevState=unknown(0)] [newState=normal]
[2025/11/07 09:40:23.605 +08:00] [INFO] [manager.go:318] ["revoke session"] ["owner info"="[log-backup] /tidb/br-stream/owner ownerManager c624004a-8606-49d7-b12b-6579bdbb48c4"] [error="rpc error: code = Canceled desc = grpc: the client connection is closing"]
[2025/11/07 09:40:24.686 +08:00] [WARN] [s3.go:1176] ["failed to request s3, retrying"] [error="RequestError: send request failed\ncaused by: Get \"https://csi-tidb-backup.oss-cn-shanghai.aliyuncs.com/?object-lock=\": read tcp 10.207.38.40:40616->106.14.228.174:443: read: connection reset by peer"] [backoff=3.281223942s]
[2025/11/07 09:40:27.988 +08:00] [WARN] [s3.go:1176] ["failed to request s3, retrying"] [error="RequestError: send request failed\ncaused by: Get \"https://csi-tidb-backup.oss-cn-shanghai.aliyuncs.com/?object-lock=\": read tcp 10.207.38.40:40618->106.14.228.174:443: read: connection reset by peer"] [backoff=4.003288736s]
[2025/11/07 09:40:32.013 +08:00] [WARN] [s3.go:1176] ["failed to request s3, retrying"] [error="RequestError: send request failed\ncaused by: Get \"https://csi-tidb-backup.oss-cn-shanghai.aliyuncs.com/?object-lock=\": read tcp 10.207.38.40:40620->106.14.228.174:443: read: connection reset by peer"] [backoff=14.09269068s]

Are you using Archive storage or Standard storage? With Archive storage there is no read permission, so backup.lock cannot be fetched.

Standard storage.

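I also tried AWS S3 earlier, and PITR log backup worked there.
Snapshot backup could write files as well, but it failed and disconnected at around 10% with this message: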

Error: error happen in store 1 at *********: Io(Custom { kind: Other, error: "failed to put object rusoto error timeout after 15mins for upload part in s3 storage" }): [BR:KV:ErrKVStorage]tikv storage occur I/O error

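Even setting --ratelimit to 16 did not help. Latency to AWS is around 48 ms, so I decided to try Alibaba Cloud OSS instead, only to find that even the log backup cannot be written.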

I have always used Alibaba Cloud OSS to back up my databases, with no problems.

Is your TiDB cluster deployed on-premises and backed up to Alibaba Cloud OSS?

Would you mind showing how your authorization policy on Alibaba Cloud OSS is written?

Resolved. The cause was that the network had not allowed the bucket domain; once it was opened, the backup worked.
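
For anyone hitting the same issue: the requests in the logs go to the bucket's own host name, csi-tidb-backup.oss-cn-shanghai.aliyuncs.com, not to the region endpoint oss-cn-shanghai.aliyuncs.com, and in the outputs earlier in this thread the two resolved to different IPs (106.14.228.174 and 106.14.228.198). Below is a minimal sketch for deriving the host name an allowlist needs to cover, assuming virtual-hosted-style addressing (bucket name prefixed to the endpoint) as seen in those logs. `bucket_host` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# bucket_host: hypothetical helper. Given a bucket name and a region endpoint
# (with or without scheme or trailing path), print the virtual-hosted-style
# host name "<bucket>.<endpoint-host>" that requests are sent to.
bucket_host() {
  local bucket endpoint
  bucket=$1
  endpoint=$2
  endpoint=${endpoint#https://}   # drop scheme if present
  endpoint=${endpoint#http://}
  endpoint=${endpoint%%/*}        # drop any path after the host
  printf '%s.%s\n' "$bucket" "$endpoint"
}

# Usage:
#   bucket_host csi-tidb-backup https://oss-cn-shanghai.aliyuncs.com
```

The printed host name is the one to test with curl and to hand to the network team, since a rule that only covers the region endpoint may not match it.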
