TiDB v7.5.1: full backup with BR sometimes succeeds and sometimes fails with the error "can not find a valid leader for key"

[TiDB environment] Production
[TiDB version] v7.5.1
[Problem] A full backup with BR sometimes succeeds and sometimes fails
[Cluster configuration]

The BR tool printed the following error log.
Full error log:
err_log_file (3.0 MB)

Excerpts from the error output:
[2024/05/10 03:00:20.789 +08:00] [WARN] [push.go:81] ["skip store"] [range-sn=1] [store-id=5] [error="the store last heartbeat is too far, at 31m13.240519664s: [BR:KV:ErrKVStorage]tikv storage occur I/O error"]

[2024/05/10 03:04:31.571 +08:00] [ERROR] [client.go:1022] ["find region failed"] [range-sn=568] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(*client).respForErr\n\t/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20240210135946-3488a653ddd9/client.go:1602\ngithub.com/tikv/pd/client.(*client).GetRegion\n\t/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20240210135946-3488a653ddd9/client.go:947\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).findTargetPeer\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:1020\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).handleFineGrained\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:1229\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).fineGrainedBackup.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:1108\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"] [region=null] [stack="github.com/pingcap/tidb/br/pkg/backup.(*Client).findTargetPeer\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:1022\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).handleFineGrained\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:1229\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).fineGrainedBackup.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:1108"]

[2024/05/10 03:04:32.575 +08:00] [ERROR] [client.go:1051] ["can not find a valid target peer"] [range-sn=567] [key=7480000000000002FF1F5F698000000000FF0000030135313437FF30393932FF333331FF3834353633FF3200FF000000000000F803FF8000000000000000FF0380000000018C26FFE400000000000000F8] [stack="github.com/pingcap/tidb/br/pkg/backup.(*Client).findTargetPeer\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:1051\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).handleFineGrained\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:1229\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).fineGrainedBackup.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:1108"]

[2024/05/10 03:04:32.608 +08:00] [ERROR] [main.go:60] ["br failed"] [error="can not find a valid leader for key t\ufffd\u0000\u0000\u0000\u0000\u0000\u0002\ufffd\u001f_r\ufffd\u0000\u0000\u0000\u0000\ufffd\ufffd\u001et\u0000\u0000\u0000\u0000\u0000\ufffd: [BR:Backup:ErrBackupNoLeader]backup no leader"] [errorVerbose="[BR:Backup:ErrBackupNoLeader]backup no leader\ncan not find a valid leader for key t\ufffd\u0000\u0000\u0000\u0000\u0000\u0002\ufffd\u001f_r\ufffd\u0000\u0000\u0000\u0000\ufffd\ufffd\u001et\u0000\u0000\u0000\u0000\u0000\ufffd\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).findTargetPeer\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:1053\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).handleFineGrained\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:1229\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).fineGrainedBackup.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:1108\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"] [stack="main.main\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/cmd/br/main.go:60\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267"]

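Your cluster topology looks very odd 🤔. Do you only have two machines? How many replicas are configured? With this many components deployed, are they all in a normal state? I suspect the cluster itself is unhealthy.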
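
One way to check the store side of that question is to look at the store list PD reports, for example the JSON printed by `pd-ctl store`. The sketch below assumes that output has already been saved and parsed, and that each entry carries `store.id`, `store.address` and `store.state_name`, as pd-ctl output typically does. The helper name `unhealthy_stores` and the sample data (addresses included) are made up for illustration and are not taken from this cluster.

```python
import json


def unhealthy_stores(stores_json):
    """Return (id, address, state_name) for every store whose state is not 'Up'."""
    result = []
    for entry in stores_json.get("stores", []):
        store = entry.get("store", {})
        state = store.get("state_name", "Unknown")
        if state != "Up":
            result.append((store.get("id"), store.get("address"), state))
    return result


# Hypothetical sample in the shape assumed above; not real output from this cluster.
sample = json.loads("""
{
  "count": 2,
  "stores": [
    {"store": {"id": 1, "address": "10.0.0.1:20160", "state_name": "Up"}},
    {"store": {"id": 5, "address": "10.0.0.2:20161", "state_name": "Down"}}
  ]
}
""")

for store_id, address, state in unhealthy_stores(sample):
    print(f"store {store_id} at {address} is {state}")
```

Any store that shows up here (for instance store 5 from the `skip store` warning, if PD also considers it unhealthy) would be worth investigating before retrying the backup.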

Are there only two PDs?

Yes, there are currently only two PDs.

Each TiKV node should hold one replica. If we want to reduce the number of instances, roughly how many would be reasonable?

The reason is that the production cluster is deployed on only two physical hosts.

With this topology, a single machine runs TiDB, PD, and TiKV all together. Don't do this in production.

Hello. So the ideal setup is for each machine to run only one TiDB instance, or one PD instance, or one TiKV instance, is that right?

I'd also like to ask: is the backup failing because too many instances are deployed on each machine?

Judging from the error, the cause is roughly that BR could not find the Region/Leader for a certain key through PD.
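
To see which table such a key belongs to, the hex `key=` value from the `can not find a valid target peer` log line can be decoded. This is a minimal sketch, assuming the key is in TiKV's memcomparable form (8-byte groups, each followed by a marker byte equal to `0xFF` minus the padding count) wrapping a TiDB table key (`t` + table ID, then `_r` or `_i`). The function names are made up for illustration and are not part of BR.

```python
def decode_memcomparable(encoded: bytes) -> bytes:
    """Strip the marker byte after each 8-byte group and drop the trailing padding."""
    raw = bytearray()
    for i in range(0, len(encoded), 9):
        group = encoded[i:i + 9]
        if len(group) != 9:
            raise ValueError("encoded key length is not a multiple of 9")
        pad = 0xFF - group[8]  # number of padding bytes in this group
        if pad > 8:
            raise ValueError("invalid marker byte")
        raw += group[:8 - pad]
        if pad:  # a padded group is the last one
            break
    return bytes(raw)


def describe_table_key(hex_key: str) -> dict:
    """Extract the table ID (and index ID, if present) from an encoded table key."""
    raw = decode_memcomparable(bytes.fromhex(hex_key))
    if len(raw) < 11 or raw[:1] != b"t":
        raise ValueError("not a table key")
    # IDs are stored as big-endian integers with the sign bit flipped.
    info = {"table_id": int.from_bytes(raw[1:9], "big") ^ (1 << 63)}
    separator = raw[9:11]
    if separator == b"_i" and len(raw) >= 19:
        info["kind"] = "index"
        info["index_id"] = int.from_bytes(raw[11:19], "big") ^ (1 << 63)
    elif separator == b"_r":
        info["kind"] = "row"
    else:
        info["kind"] = "unknown"
    return info


# The key from the "can not find a valid target peer" log line above.
LOG_KEY = (
    "7480000000000002FF1F5F698000000000FF0000030135313437FF"
    "30393932FF333331FF3834353633FF3200FF000000000000F803FF"
    "8000000000000000FF0380000000018C26FFE400000000000000F8"
)
print(describe_table_key(LOG_KEY))
```

Under these assumptions the key from the log decodes to an index key of table ID 543 (index ID 3), which narrows down which Regions to inspect in PD.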

You can refer to the deployment-related sections of the official documentation:
https://docs.pingcap.com/zh/tidb/v6.5/hardware-and-software-requirements
Read through all of the related pages.


An incorrect deployment in production can lead to unexpected results, such as losing high availability.
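
To make the high-availability point concrete: PD members and the replicas of a TiKV Region both rely on Raft majorities, so how many failures a group survives depends on its size. Below is a small sketch of that arithmetic; the helper name is made up for illustration.

```python
def raft_fault_tolerance(members: int) -> tuple:
    """Return (quorum size, number of member failures the group can survive)."""
    quorum = members // 2 + 1
    return quorum, members - quorum


for n in (2, 3, 5):
    quorum, tolerated = raft_fault_tolerance(n)
    print(f"{n} members: quorum {quorum}, tolerates {tolerated} failure(s)")
```

With two PD members the quorum is two, so losing either one leaves PD without a majority. Similarly, if a Region has three replicas spread over only two hosts, at least one host holds two of them, and losing that host costs the Region its majority.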

OK, thank you very much.

OK, thanks.