BR full backup across 33 TiKV nodes (1.9 TB of data per node) fails at 20.56% with context canceled

To get help faster, please provide the following information; a clear problem description speeds up resolution:
【TiDB Environment】
Docker environment, deployed via Kubernetes; TiKV, PD, and TiDB each run across multiple Docker containers, and the backup is written to shared storage.

【Overview】 Scenario + problem summary
[2021/10/29 12:53:14.210 +00:00] [ERROR] [backup.go:41] ["failed to backup"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(*client).GetAllStores\n\tgithub.com/tikv/pd@v1.1.0-beta.0.20210323121136-78679e5e209d/client/client.go:1196\ngithub.com/pingcap/br/pkg/conn.GetAllTiKVStores\n\tgithub.com/pingcap/br/pkg/conn/conn.go:139\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRange\n\tgithub.com/pingcap/br/pkg/backup/client.go:459\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRanges.func1\n\tgithub.com/pingcap/br/pkg/backup/client.go:424\ngithub.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\tgithub.com/pingcap/br/pkg/utils/worker.go:63\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1371"] [stack="main.runBackupCommand\n\tcommand-line-arguments/backup.go:41\nmain.newFullBackupCommand.func1\n\tcommand-line-arguments/backup.go:109\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/cobra@v1.0.0/command.go:842\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/cobra@v1.0.0/command.go:950\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/cobra@v1.0.0/command.go:887\nmain.main\n\tcommand-line-arguments/main.go:56\nruntime.main\n\truntime/proc.go:225"]
[2021/10/29 12:53:14.210 +00:00] [ERROR] [main.go:58] ["br failed"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(*client).GetAllStores\n\tgithub.com/tikv/pd@v1.1.0-beta.0.20210323121136-78679e5e209d/client/client.go:1196\ngithub.com/pingcap/br/pkg/conn.GetAllTiKVStores\n\tgithub.com/pingcap/br/pkg/conn/conn.go:139\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRange\n\tgithub.com/pingcap/br/pkg/backup/client.go:459\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRanges.func1\n\tgithub.com/pingcap/br/pkg/backup/client.go:424\ngithub.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\tgithub.com/pingcap/br/pkg/utils/worker.go:63\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1371"] [stack="main.main\n\tcommand-line-arguments/main.go:58\nruntime.main\n\truntime/proc.go:225"]

【Backup and Data Migration Strategy】
Backup command executed: nohup /tidb/bin/br backup full --pd "11.45.241.131:2379" --storage local:///mnt/cfs/cbsql2/extern/tikv/tidb-bigdata/whole-tidb-bigdata-time-2021-10-29 --log-file /export/backup-nfs.log >/export/tikv_dobackup_record.log 2>&1 &
【Background】 Operations performed so far

【Symptoms】 Application and database symptoms

【Problem】 Current issue
["failed to backup"] [error="rpc error: code = Canceled desc = context canceled"]
【Business Impact】

【TiDB Version】
TiKV
Release Version: 5.1.1
Edition: Community
Git Commit Hash: 4705d7c6e9c42d129d3309e05911ec6b08a25a38
Git Commit Branch: heads/refs/tags/v5.1.1
UTC Build Time: 2021-07-28 10:59:26
Rust Version: rustc 1.53.0-nightly (16bf626a3 2021-04-14)
Enable Features: jemalloc mem-profiling portable sse protobuf-codec test-engines-rocksdb cloud-aws cloud-gcp
Profile: dist_release
【Attachments】

  • Relevant logs, configuration files, and Grafana monitoring data (https://metricstool.pingcap.com/)
  • TiUP Cluster Display output
  • TiUP Cluster Edit Config output
  • TiDB-Overview monitoring
  • Grafana monitoring for the relevant component (BR, TiDB-binlog, TiCDC, etc., if applicable)
  • Logs of the relevant component (covering 1 hour before and after the problem)

For performance tuning or troubleshooting questions, please download and run the diagnostic script, then select all of the terminal output and paste it into your post.

Please check whether the PD leader and the TiKV nodes all stayed healthy during the BR backup, and whether any of them restarted.

Could you upload the full log, or grep it for received signal to exit? This looks a bit like the nohup'd process got killed…

Has your problem been resolved?

Yes, that's exactly what happened.

Nothing like that observed so far.

I've found a solution.

Solution: run the nohup command from inside a shell script; this keeps the nohup'd process from being killed. For example:

cat runbackup.sh
#!/bin/bash
nohup /tidb/bin/br backup full --pd "11.45.241.131:2379" --storage local:///mnt/cfs/cbsql2/extern/tikv/tidb-bigdata/whole-tidb-bigdata-time-2021-10-29 --log-file /export/backup-nfs.log >/export/tikv_dobackup_record.log 2>&1 &

But to fully fix the kill issue, i.e. to be able to run the command outside a wrapper script, some changes in the code itself are probably still needed, along the lines of: linux - Why hangup signal is caught even with nohup? - Stack Overflow


This topic was automatically closed 1 minute after the last reply. New replies are no longer allowed.