BR backup fails: [pd] failed updateMember

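On a cluster running v5.1.1, I am backing up to S3 storage. The backup first reports a [pd] failed updateMember error, and after running for a while it fails with [BR:KV:ErrKVStorage]tikv storage occur I/O error. I have checked the cluster status, and both PD and TiKV are healthy. Has anyone run into this?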

PD errors:

[2022/06/27 12:52:17.438 +08:00] [ERROR] [base_client.go:166] ["[pd] failed updateMember"] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Canceled desc = context canceled target:xxx.xxx.xxx.89:2379 status:READY"] [errorVerbose="[PD:client:ErrClientGetMember]error:rpc error: code = Canceled desc = context canceled target:xxx.xxx.xxx.89:2379 status:READY\
github.com/tikv/pd/client.(*baseClient).updateMember\
\tgithub.com/tikv/pd@v1.1.0-beta.0.20210323121136-78679e5e209d/client/base_client.go:301\
github.com/tikv/pd/client.(*baseClient).memberLoop\
\tgithub.com/tikv/pd@v1.1.0-beta.0.20210323121136-78679e5e209d/client/base_client.go:165\
runtime.goexit\
\truntime/asm_amd64.s:1371"] [stack="github.com/tikv/pd/client.(*baseClient).memberLoop\
\tgithub.com/tikv/pd@v1.1.0-beta.0.20210323121136-78679e5e209d/client/base_client.go:166"]
[2022/06/27 12:52:17.439 +08:00] [INFO] [base_client.go:296] ["[pd] cannot update member from this address"] [address=http://xxx.xxx.xxx.89:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Canceled desc = context canceled target:xxx.xxx.xxx.89:2379 status:READY"]
[2022/06/27 12:52:17.439 +08:00] [ERROR] [base_client.go:166] ["[pd] failed updateMember"] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Canceled desc = context canceled target:xxx.xxx.xxx.89:2379 status:READY"] [errorVerbose="[PD:client:ErrClientGetMember]error:rpc error: code = Canceled desc = context canceled target:xxx.xxx.xxx.89:2379 status:READY\
github.com/tikv/pd/client.(*baseClient).updateMember\
\tgithub.com/tikv/pd@v1.1.0-beta.0.20210323121136-78679e5e209d/client/base_client.go:301\
github.com/tikv/pd/client.(*baseClient).memberLoop\
\tgithub.com/tikv/pd@v1.1.0-beta.0.20210323121136-78679e5e209d/client/base_client.go:165\
runtime.goexit\
\truntime/asm_amd64.s:1371"] [stack="github.com/tikv/pd/client.(*baseClient).memberLoop\
\tgithub.com/tikv/pd@v1.1.0-beta.0.20210323121136-78679e5e209d/client/base_client.go:166"]

TiKV errors:

Error: error happen in store 6 at xxx.xxx.xxx.7:20160: Io(Custom { kind: Other, error: "failed to put object Request ID: None Body: <?xml version=\"1.0\" encoding=\"UTF-8\"?>\
<Error><Code>NoSuchUpload</Code><Message>The specified multipart upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.</Message><Key>BOOL-5.1.4_20220619120002/6_231692646_9153_6907cac0f12da38fe08638edcf05ef221c1408439071c10fcad460bf0c087d69_1656304290496_write.sst</Key><BucketName>s3test</BucketName><Resource>/s3test/BOOL-5.1.4_20220619120002/6_231692646_9153_6907cac0f12da38fe08638edcf05ef221c1408439071c10fcad460bf0c087d69_1656304290496_write.sst</Resource><RequestId>16FC6000C92D863D</RequestId><HostId>c8864a4e-b9ea-4860-b1e7-e4fe536de2e0</HostId></Error>" }): [BR:KV:ErrKVStorage]tikv storage occur I/O error
Error: error happen in store 229754136 at xxx.xxx.xxx.57:20160: Io(Custom { kind: Other, error: "failed to put object Request ID: None Body: <?xml version=\"1.0\" encoding=\"UTF-8\"?>\
<Error><Code>NoSuchUpload</Code><Message>The specified multipart upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.</Message><Key>BOOL-5.1.4_20220619120002/229754136_230097385_9163_72c9048c429f0a7f8174adcbc4f77697fc31c5ec231719bbf49ba6d3f7642858_1656304681695_write.sst</Key><BucketName>s3test</BucketName><Resource>/s3test/BOOL-5.1.4_20220619120002/229754136_230097385_9163_72c9048c429f0a7f8174adcbc4f77697fc31c5ec231719bbf49ba6d3f7642858_1656304681695_write.sst</Resource><RequestId>16FC605DA41EDB9E</RequestId><HostId>c8864a4e-b9ea-4860-b1e7-e4fe536de2e0</HostId></Error>" }): [BR:KV:ErrKVStorage]tikv storage occur I/O error
Error: error happen in store 110331059 at xxx.xxx.xxx.24:20160: Io(Custom { kind: Other, error: "failed to put object Request ID: None Body: <?xml version=\"1.0\" encoding=\"UTF-8\"?>\
<Error><Code>InternalError</Code><Message>We encountered an internal error, please try again.: cause(Read failed. Insufficient number of disks online)</Message><Key>BOOL-5.1.4_20220619120002/110331059_231621883_9170_901fe8da2907c0af190553ae11100206e1b24aeb99b85c4d2ceff317483b17a3_1656305451584_default.sst</Key><BucketName>s3test</BucketName><Resource>/s3test/BOOL-5.1.4_20220619120002/110331059_231621883_9170_901fe8da2907c0af190553ae11100206e1b24aeb99b85c4d2ceff317483b17a3_1656305451584_default.sst</Resource><RequestId>16FC611BB7244917</RequestId><HostId>c8864a4e-b9ea-4860-b1e7-e4fe536de2e0</HostId></Error>" }): [BR:KV:ErrKVStorage]tikv storage occur I/O error

Could you post the configuration you use for the BR backup?

If permissions are not the problem, the most likely cause is an extra trailing '/' on the URL given for the S3 endpoint, so that is a good place to start.

For details, see:

Column: Troubleshooting the trailing directory separator in the endpoint parameter when backing up to S3 with br | TiDB Community
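
One way to rule out the trailing-slash issue is to normalize the endpoint in the wrapper script before it reaches br. A minimal sketch: the helper name normalize_endpoint and the example URL are made up for illustration and are not part of br.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of br): strip every trailing '/' from an
# S3 endpoint URL. The body runs in a subshell so `url` does not leak.
normalize_endpoint() (
  url="$1"
  # Remove one trailing slash per iteration until none remain.
  while [ "${url%/}" != "$url" ]; do
    url="${url%/}"
  done
  printf '%s\n' "$url"
)

# The URL below is a placeholder, not a real endpoint.
S3_ENDPOINT="$(normalize_endpoint 'http://s3.example.internal/')"
echo "$S3_ENDPOINT"
```

The cleaned value can then be passed as --s3.endpoint "$S3_ENDPOINT".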

/home/tidb/tidb-5.1.1/resources/bin/br backup full --pd xxx.xxx.xxx.89:2379 --storage s3://db-backup/xxxx-5.1.1_20220626120002 --s3.endpoint 'http://xxxxx.xxxxx.xxxx.com' --s3.region 'xxxx' --ratelimit 30 --log-file 20220626120002.log --ignore-stats

The strange part is that the same command sometimes succeeds and sometimes fails with the errors above.

--send-credentials-to-tikv=true
Could you try adding this option?

--send-credentials-to-tikv: passes the S3 access credentials to the TiKV nodes

:+1::+1::+1::+1:

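I tested with the option added. One backup failed, and running the same command again succeeded. Very strange. It feels like BR is not very stable, and the errors it reports give no clue about where to start.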

Error: msg:"Io(Custom { kind: Other, error: "failed to put object Error during dispatch: connection error: Connection reset by peer (os error 104)" })"

The backup parameters are as follows:

br backup full --pd xxx.148:2379 --storage s3://cxxx/Wxxxxx628002 --s3.endpoint 'http://xxx' --s3.region 'xxx' --ratelimit 30 --send-credentials-to-tikv=true --log-file /data/deploy/xxxx_20220628002.log
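
Since rerunning the same command succeeded here, a scheduled job could wrap the backup in a small retry loop as a stopgap while the unstable connection is investigated. This is only a sketch: retry is a hypothetical shell helper, not a br feature, and it assumes that a failed attempt leaves the target path in a state br accepts on the next run, which is what was observed above but is not guaranteed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of br): run a command up to max_tries times,
# sleeping `delay` seconds between attempts. The body runs in a subshell,
# so `exit` only ends the helper, not the calling script.
retry() (
  max_tries="$1"; delay="$2"; shift 2
  attempt=1
  while true; do
    if "$@"; then
      exit 0
    fi
    if [ "$attempt" -ge "$max_tries" ]; then
      echo "giving up after $attempt attempts" >&2
      exit 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
)

# Example invocation, left commented out; values are the masked ones from above:
# retry 3 60 br backup full --pd xxx.148:2379 --storage s3://cxxx/Wxxxxx628002 \
#   --s3.endpoint 'http://xxx' --s3.region 'xxx' --ratelimit 30 \
#   --send-credentials-to-tikv=true --log-file /data/deploy/xxxx_20220628002.log
```

A retry only hides the symptom. The 502 Bad Gateway and connection reset responses still point at the S3 gateway or the network path.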

The backup still fails.

[2022/06/29 15:39:57.859 +08:00] [ERROR] [endpoint.rs:284] ["backup save file failed"] [error="Io(Custom { kind: Other, error: \"failed to put object Request ID: None Body: <html>\\r\\
<head><title>502 Bad Gateway</title></head>\\r\\
<body bgcolor=\\\"white\\\">\\r\\
<center><h1>502 Bad Gateway</h1></center>\\r\\
<hr><center>openresty</center>\\r\\
</body>\\r\\
</html>\\r\\
\" })"]
[2022/06/29 15:39:57.859 +08:00] [ERROR] [endpoint.rs:669] ["backup region failed"] [error="Io(Custom { kind: Other, error: \"failed to put object Request ID: None Body: <html>\\r\\
<head><title>502 Bad Gateway</title></head>\\r\\
<body bgcolor=\\\"white\\\">\\r\\
<center><h1>502 Bad Gateway</h1></center>\\r\\
<hr><center>openresty</center>\\r\\
</body>\\r\\
</html>\\r\\
\" })"] [end_key=] [start_key=] [region="id: 481976392 start_key: 7480000000000004FF0E5F72800000117DFFC7A7E20000000000FA end_key: 7480000000000004FF0E5F72800000117DFFD41D1B0000000000FA region_epoch { conf_ver: 169199 version: 41247 } peers { id: 481976393 store_id: 1428951 } peers { id: 517307504 store_id: 5 } peers { id: 538409640 store_id: 1236441 }"]
[2022/06/29 15:39:57.860 +08:00] [ERROR] [service.rs:86] ["backup canceled"] [error=RemoteStopped]

The connection looks unstable. Do you have NFS available? Try backing up to that instead.
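
For the NFS test, BR can write to a mounted directory through its local:// storage scheme. The sketch below only prints the command instead of running it. The mount point /mnt/nfs/br-backup, the log file name, and the helper name build_nfs_backup_cmd are assumptions for illustration, and the share would need to be mounted at the same path on the BR host and on every TiKV node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a br command that targets an NFS mount via
# local:// storage. It does not execute br.
build_nfs_backup_cmd() {
  pd_addr="$1"; backup_dir="$2"; log_file="$3"
  printf 'br backup full --pd %s --storage local://%s --ratelimit 30 --log-file %s\n' \
    "$pd_addr" "$backup_dir" "$log_file"
}

# Assumed mount point and file names; the PD address is the masked one from the thread.
build_nfs_backup_cmd 'xxx.xxx.xxx.89:2379' '/mnt/nfs/br-backup/20220629' 'backup_nfs.log'
```

If the backup to NFS completes reliably, that would suggest the problem lies with the S3 service or the network path to it rather than with BR or TiKV.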