BR backup across 33 TiKV nodes (1.9 TB of data per node) fails at 20.56% with "context canceled"

To help us resolve this faster, please provide the following information; a clearly described problem gets answered sooner:
【TiDB Environment】
Docker containers deployed on Kubernetes; TiKV, PD, and TiDB each run across multiple Docker instances, and data is backed up to shared storage.

【Overview】 Scenario + problem summary
[2021/10/29 12:53:14.210 +00:00] [ERROR] [backup.go:41] ["failed to backup"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(*client).GetAllStores\n\tgithub.com/tikv/pd@v1.1.0-beta.0.20210323121136-78679e5e209d/client/client.go:1196\ngithub.com/pingcap/br/pkg/conn.GetAllTiKVStores\n\tgithub.com/pingcap/br/pkg/conn/conn.go:139\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRange\n\tgithub.com/pingcap/br/pkg/backup/client.go:459\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRanges.func1\n\tgithub.com/pingcap/br/pkg/backup/client.go:424\ngithub.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\tgithub.com/pingcap/br/pkg/utils/worker.go:63\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1371"] [stack="main.runBackupCommand\n\tcommand-line-arguments/backup.go:41\nmain.newFullBackupCommand.func1\n\tcommand-line-arguments/backup.go:109\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/cobra@v1.0.0/command.go:842\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/cobra@v1.0.0/command.go:950\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/cobra@v1.0.0/command.go:887\nmain.main\n\tcommand-line-arguments/main.go:56\nruntime.main\n\truntime/proc.go:225"]
[2021/10/29 12:53:14.210 +00:00] [ERROR] [main.go:58] ["br failed"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(*client).GetAllStores\n\tgithub.com/tikv/pd@v1.1.0-beta.0.20210323121136-78679e5e209d/client/client.go:1196\ngithub.com/pingcap/br/pkg/conn.GetAllTiKVStores\n\tgithub.com/pingcap/br/pkg/conn/conn.go:139\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRange\n\tgithub.com/pingcap/br/pkg/backup/client.go:459\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRanges.func1\n\tgithub.com/pingcap/br/pkg/backup/client.go:424\ngithub.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\tgithub.com/pingcap/br/pkg/utils/worker.go:63\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1371"] [stack="main.main\n\tcommand-line-arguments/main.go:58\nruntime.main\n\truntime/proc.go:225"]

【Backup and data migration strategy】
The backup was started with:
nohup /tidb/bin/br backup full --pd "11.45.241.131:2379" --storage local:///mnt/cfs/cbsql2/extern/tikv/tidb-bigdata/whole-tidb-bigdata-time-2021-10-29 --log-file /export/backup-nfs.log >/export/tikv_dobackup_record.log 2>&1 &
【Background】 Operations performed beforehand

【Symptoms】 Application and database symptoms

【Problem】 Current issue
["failed to backup"] [error="rpc error: code = Canceled desc = context canceled"]
【Business impact】

【TiDB version】
TiKV
Release Version: 5.1.1
Edition: Community
Git Commit Hash: 4705d7c6e9c42d129d3309e05911ec6b08a25a38
Git Commit Branch: heads/refs/tags/v5.1.1
UTC Build Time: 2021-07-28 10:59:26
Rust Version: rustc 1.53.0-nightly (16bf626a3 2021-04-14)
Enable Features: jemalloc mem-profiling portable sse protobuf-codec test-engines-rocksdb cloud-aws cloud-gcp
Profile: dist_release
【Attachments】

  • Relevant logs, configuration files, and Grafana monitoring data (https://metricstool.pingcap.com/)
  • TiUP Cluster Display output
  • TiUP Cluster Edit Config output
  • TiDB-Overview dashboard
  • Grafana dashboards of the relevant components (e.g., BR, TiDB Binlog, TiCDC)
  • Logs of the relevant components (covering one hour before and after the problem)

For performance-tuning or troubleshooting questions, please download the diagnostic script and run it. Be sure to select all of the terminal output, copy it, and upload it.

Please check whether the PD leader and the TiKV nodes all stayed healthy during the BR backup, and whether any of them restarted.
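
For reference, one way to spot-check this (a sketch; the Kubernetes namespace and the TiDB Operator label selectors below are assumptions, adjust them to your deployment):

# PD leader and member health, via pd-ctl against the same PD endpoint BR uses:
pd-ctl -u http://11.45.241.131:2379 member leader show
pd-ctl -u http://11.45.241.131:2379 health

# TiKV store states as seen by PD (anything not "Up" is suspicious):
pd-ctl -u http://11.45.241.131:2379 store | grep -E '"state_name"|"address"'

# On Kubernetes, a non-zero RESTARTS count means a pod bounced:
kubectl -n tidb-cluster get pods -l app.kubernetes.io/component=pd
kubectl -n tidb-cluster get pods -l app.kubernetes.io/component=tikv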

Could you upload the complete log, or grep it for "received signal to exit"? This looks a bit like the nohup'ed process got killed...
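
Something like this should do it (a sketch; the log paths are taken from the backup command above):

# Look for the exit-signal message in BR's own log:
grep -n 'received signal to exit' /export/backup-nfs.log

# The nohup redirection target may hold shell-side evidence as well:
grep -n -iE 'signal|killed|hangup' /export/tikv_dobackup_record.log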

Has your problem been resolved?

Yes, that was exactly it.

So far there is no sign of that.

I found a solution.

Solution: put the nohup command inside a shell script and run the script instead; this keeps the nohup'ed process from being killed. Like this:

cat runbackup.sh
#!/bin/bash
# Run BR in the background; shell stdout/stderr go to tikv_dobackup_record.log,
# while BR writes its own log to backup-nfs.log.
nohup /tidb/bin/br backup full --pd "11.45.241.131:2379" --storage local:///mnt/cfs/cbsql2/extern/tikv/tidb-bigdata/whole-tidb-bigdata-time-2021-10-29 --log-file /export/backup-nfs.log >/export/tikv_dobackup_record.log 2>&1 &
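
To verify the workaround, run the script and confirm BR survives the login shell exiting (a sketch; pgrep availability is assumed):

bash runbackup.sh
# After logging out and back in, the BR process should still be running:
pgrep -af '/tidb/bin/br'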

However, to fully fix the kill issue, that is, to make the command safe to run outside a script as well, some work is probably still needed in the code itself, along the lines of: https://stackoverflow.com/questions/64479821/why-hangup-signal-is-caught-even-with-nohup
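
If changing the code is not an option, another shell-level workaround is to start BR in a brand-new session with setsid, so the terminal's SIGHUP is never delivered to it in the first place, even if the Go binary re-registers a SIGHUP handler as discussed in that Stack Overflow thread (a sketch under that assumption, not verified against BR itself):

# setsid detaches the process from the controlling terminal entirely:
setsid /tidb/bin/br backup full --pd "11.45.241.131:2379" \
  --storage local:///mnt/cfs/cbsql2/extern/tikv/tidb-bigdata/whole-tidb-bigdata-time-2021-10-29 \
  --log-file /export/backup-nfs.log >/export/tikv_dobackup_record.log 2>&1 </dev/null &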
