BR backup fails with "backup occur region error"

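TiDB version: 4.0.0. While data was being written continuously, I ran a BR backup in Kubernetes. The pod went into Error status and the job did not complete. Running kubectl describe bk on the backup resource returned the following: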

Message: cluster tidb-1hvpotizkn/tidb-1hvpotizkn-manaul-backup-1594007121632, wait pipe message failed, errMsg [2020/07/06 03:45:41.442 +00:00] [ERROR] [push.go:102] [“backup occur region error”] [error=“{"RegionError":{"message":"EpochNotMatch current epoch of region 2 is conf_ver: 5 version: 38, but you sent conf_ver: 5 version: 37","epoch_not_match":{"current_regions":[{"id":2,"start_key":"dIAAAAAAAAD/MV9ygAAAAAH/DH5HAAAAAAD6","region_epoch":{"conf_ver":5,"version":38},"peers":[{"id":3,"store_id":1},{"id":66,"store_id":4},{"id":87,"store_id":5}]},{"id":188,"start_key":"dIAAAAAAAAD/MV9ygAAAAAD/+dz8AAAAAAD6","end_key":"dIAAAAAAAAD/MV9ygAAAAAH/DH5HAAAAAAD6","region_epoch":{"conf_ver":5,"version":38},"peers":[{"id":189,"store_id":1},{"id":190,"store_id":4},{"id":191,"store_id":5}]}]}}}”] [stack=“github.com/pingcap/log.Error
\t/go/pkg/mod/github.com/pingcap/log@v0.0.0-20200117041106-d28c14d3b1cd/global.go:42
github.com/pingcap/br/pkg/backup.(*pushDown).pushBackup
\t/home/jenkins/agent/workspace/build_br_multi_branch_v4.0.0/go/src/github.com/pingcap/br/pkg/backup/push.go:102
github.com/pingcap/br/pkg/backup.(*Client).BackupRange
\t/home/jenkins/agent/workspace/build_br_multi_branch_v4.0.0/go/src/github.com/pingcap/br/pkg/backup/client.go:475
github.com/pingcap/br/pkg/backup.(*Client).BackupRanges.func2
\t/home/jenkins/agent/workspace/build_br_multi_branch_v4.0.0/go/src/github.com/pingcap/br/pkg/backup/client.go:381”]
[2020/07/06 03:45:56.231 +00:00] [ERROR] [client.go:408] [“update GC safePoint with TTL failed”] [error=“rpc error: code = DeadlineExceeded desc = context deadline exceeded”] [errorVerbose=“rpc error: code = DeadlineExceeded desc = context deadline exceeded
github.com/pingcap/pd/v4/client.(*client).UpdateServiceGCSafePoint
\t/go/pkg/mod/github.com/pingcap/pd/v4@v4.0.0-rc.1.0.20200511074607-3bb650739add/client/client.go:638
github.com/pingcap/br/pkg/backup.UpdateServiceSafePoint
\t/home/jenkins/agent/workspace/build_br_multi_branch_v4.0.0/go/src/github.com/pingcap/br/pkg/backup/safe_point.go:51
github.com/pingcap/br/pkg/backup.(*Client).BackupRanges
\t/home/jenkins/agent/workspace/build_br_multi_branch_v4.0.0/go/src/github.com/pingcap/br/pkg/backup/client.go:406
github.com/pingcap/br/pkg/task.RunBackup
\t/home/jenkins/agent/workspace/build_br_multi_branch_v4.0.0/go/src/github.com/pingcap/br/pkg/task/backup.go:187
github.com/pingcap/br/cmd.runBackupCommand
\t/home/jenkins/agent/workspace/build_br_multi_branch_v4.0.0/go/src/github.com/pingcap/br/cmd/backup.go:22
github.com/pingcap/br/cmd.newFullBackupCommand.func1
\t/home/jenkins/agent/workspace/build_br_multi_branch_v4.0.0/go/src/github.com/pingcap/br/cmd/backup.go:74
github.com/spf13/cobra.(*Command).execute
\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:842
github.com/spf13/cobra.(*Command).ExecuteC
\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:950
github.com/spf13/cobra.(*Command).Execute
\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:887
main.main
\t/home/jenkins/agent/workspace/build_br_multi_branch_v4.0.0/go/src/github.com/pingcap/br/main.go:54
runtime.main
\t/usr/local/go/src/runtime/proc.go:203
runtime.goexit
\t/usr/local/go/src/runtime/asm_amd64.s:1357”] [stack=“github.com/pingcap/log.Error
\t/go/pkg/mod/github.com/pingcap/log@v0.0.0-20200117041106-d28c14d3b1cd/global.go:42
github.com/pingcap/br/pkg/backup.(*Client).BackupRanges
\t/home/jenkins/agent/workspace/build_br_multi_branch_v4.0.0/go/src/github.com/pingcap/br/pkg/backup/client.go:408
github.com/pingcap/br/pkg/task.RunBackup
\t/home/jenkins/agent/workspace/build_br_multi_branch_v4.0.0/go/src/github.com/pingcap/br/pkg/task/backup.go:187
github.com/pingcap/br/cmd.runBackupCommand
\t/home/jenkins/agent/workspace/build_br_multi_branch_v4.0.0/go/src/github.com/pingcap/br/cmd/backup.go:22
github.com/pingcap/br/cmd.newFullBackupCommand.func1
\t/home/jenkins/agent/workspace/build_br_multi_branch_v4.0.0/go/src/github.com/pingcap/br/cmd/backup.go:74
github.com/spf13/cobra.(*Command).execute
\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:842
github.com/spf13/cobra.(*Command).ExecuteC
\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:950
github.com/spf13/cobra.(*Command).Execute
\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:887
main.main
\t/home/jenkins/agent/workspace/build_br_multi_branch_v4.0.0/go/src/github.com/pingcap/br/main.go:54
runtime.main
\t/usr/local/go/src/runtime/proc.go:203”]
[2020/07/06 03:45:56.234 +00:00] [ERROR] [base_client.go:130] [“[pd] failed updateLeader”] [error=“error:rpc error: code = Canceled desc = context canceled target:tidb-1hvpotizkn-pd-0.tidb-1hvpotizkn-pd-peer.tidb-1hvpotizkn.svc:2379 status:READY”] [errorVerbose=“error:rpc error: code = Canceled desc = context canceled target:tidb-1hvpotizkn-pd-0.tidb-1hvpotizkn-pd-peer.tidb-1hvpotizkn.svc:2379 status:READY
github.com/pingcap/pd/v4/client.(*baseClient).getMembers
\t/go/pkg/mod/github.com/pingcap/pd/v4@v4.0.0-rc.1.0.20200511074607-3bb650739add/client/base_client.go:208
github.com/pingcap/pd/v4/client.(*baseClient).updateLeader
\t/go/pkg/mod/github.com/pingcap/pd/v4@v4.0.0-rc.1.0.20200511074607-3bb650739add/client/base_client.go:182
github.com/pingcap/pd/v4/client.(*baseClient).leaderLoop
\t/go/pkg/mod/github.com/pingcap/pd/v4@v4.0.0-rc.1.0.20200511074607-3bb650739add/client/base_client.go:129
runtime.goexit
\t/usr/local/go/src/runtime/asm_amd64.s:1357
github.com/pingcap/pd/v4/client.(*baseClient).getMembers
\t/go/pkg/mod/github.com/pingcap/pd/v4@v4.0.0-rc.1.0.20200511074607-3bb650739add/client/base_client.go:209
github.com/pingcap/pd/v4/client.(*baseClient).updateLeader
\t/go/pkg/mod/github.com/pingcap/pd/v4@v4.0.0-rc.1.0.20200511074607-3bb650739add/client/base_client.go:182
github.com/pingcap/pd/v4/client.(*baseClient).leaderLoop
\t/go/pkg/mod/github.com/pingcap/pd/v4@v4.0.0-rc.1.0.20200511074607-3bb650739add/client/base_client.go:129
runtime.goexit
\t/usr/local/go/src/runtime/asm_amd64.s:1357
github.com/pingcap/pd/v4/client.(*baseClient).updateLeader
\t/go/pkg/mod/github.com/pingcap/pd/v4@v4.0.0-rc.1.0.20200511074607-3bb650739add/client/base_client.go:190
github.com/pingcap/pd/v4/client.(*baseClient).leaderLoop
\t/go/pkg/mod/github.com/pingcap/pd/v4@v4.0.0-rc.1.0.20200511074607-3bb650739add/client/base_client.go:129
runtime.goexit
\t/usr/local/go/src/runtime/asm_amd64.s:1357”] [stack=“github.com/pingcap/log.Error
\t/go/pkg/mod/github.com/pingcap/log@v0.0.0-20200117041106-d28c14d3b1cd/global.go:42
github.com/pingcap/pd/v4/client.(*baseClient).leaderLoop
\t/go/pkg/mod/github.com/pingcap/pd/v4@v4.0.0-rc.1.0.20200511074607-3bb650739add/client/base_client.go:130”]
Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded

The message says: "EpochNotMatch current epoch of region 2 is conf_ver: 5 version: 38, but you sent conf_ver: 5 version: 37"
What does this mean? Could you explain it briefly? Thanks :handshake:

I followed the official documentation: https://docs.pingcap.com/zh/tidb-in-kubernetes/v1.1/backup-to-aws-s3-using-br

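A backup request has to be sent to the leader, so this error may mean that the leader changed. The later log entries show many failed PD requests. Was the PD service healthy when the failure occurred?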


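Judging from the screenshot above, PD does not seem to have switched leaders (is running a single PD node not recommended? :joy:)
Here is my PD log, in case it helps: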
pd.log (2.6 MB)
I searched the ERROR entries and found a large number like the following:

Thanks for your help :handshake:

[2020/07/06 03:45:56.231 +00:00] [ERROR] [client.go:408] [“update GC safePoint with TTL failed”] [error=“rpc error: code = DeadlineExceeded desc = context deadline exceeded”]

I did not mean a PD leader change; I meant a Region leader change. The "EpochNotMatch current epoch of region 2" error can be ignored; it does not cause BR to exit with a failure.
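
To make the epoch message concrete: in the pasted log, region 2 moved from version 37 to 38 while conf_ver stayed at 5, and current_regions lists a second region (188) whose end_key equals region 2's start_key. That pattern is consistent with region 2 having been split while the backup was running, since version changes on split or merge and conf_ver changes when peers are added or removed. The Go snippet below is only an illustrative sketch of that staleness check. The RegionEpoch type and isStale function are hypothetical names for this example and are not BR's or TiKV's actual code.

```go
package main

import "fmt"

// RegionEpoch mirrors the two counters printed in the error message.
// ConfVer changes when peers are added or removed;
// Version changes when the region is split or merged.
// (Hypothetical type for illustration only.)
type RegionEpoch struct {
	ConfVer uint64
	Version uint64
}

// isStale reports whether the epoch the client sent is older than
// the epoch the store currently holds for that region.
func isStale(sent, current RegionEpoch) bool {
	return sent.ConfVer < current.ConfVer || sent.Version < current.Version
}

func main() {
	// Values taken from the log line in this thread.
	sent := RegionEpoch{ConfVer: 5, Version: 37}
	current := RegionEpoch{ConfVer: 5, Version: 38}

	if isStale(sent, current) {
		// A client in this situation would typically reload the
		// region information and resend the request for that range.
		fmt.Println("stale epoch: refresh region info and retry the range")
	}
}
```

Because the store returns the current regions together with the error, a client can refresh its view and retry, which fits the statement above that this error alone does not make the backup exit.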

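The error that caused the exit is the "update GC safePoint with TTL failed" one: the request to update the service GC safepoint in PD failed. Judging from the PD log, there may have been a network problem.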

The "rpc error: code = DeadlineExceeded desc" error generally appears when the PD service is unavailable or the network is abnormal.

When PD updates the service GC safepoint, who does PD need to communicate with? TiKV? This test environment has only one PD.

What gets updated is the GC safepoint that PD records for the br service; the communication is between br and PD.

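Does this problem still reproduce? We have recently updated the logic for the direct interaction between BR and PD, so you could try upgrading and see whether it goes away.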