dumpling reports [error="Error 9005: Region is unavailable"] on a partitioned table with 120 million rows

Bug Report
【Impact of the bug】
The full-table backup export is interrupted abnormally.
【Possible steps to reproduce】
./dumpling -h 127.0.0.1 -P 4000 -u backup -p -B yixintui_operate --filetype sql --threads 1 -o ${Bak_dir}/${Ip}/${Port}/${Time} -F 1024MiB --compress gz --params "tidb_distsql_scan_concurrency=1,tidb_mem_quota_query=28589934592" -f 'yixintui_operate.' -f '!yixintui_operate.Agent_material_report_cost_20210917_bak,!yixintui_operate.synrpt_gdt_advertiser_ad_test,!yixintui_operate.creative_spec_material_participant_1,!yixintui_operate.Material_creative_count_2*'

> $Bak_log 2>&1
【Unexpected behavior observed】
[2022/01/07 15:31:57.787 +08:00] [WARN] [writer_util.go:181] ["fail to dumping table(chunk), will revert some metrics and start a retry if possible"] [database=yixintui_operate] [table=Agent_material_report_cost] ["finished rows"=784941] ["finished size"=339228293] [error="Error 9005: Region is unavailable"]
【Expected behavior】

【Related components and versions】
select * from mysql.tidb where variable_name='tikv_gc_life_time';
| VARIABLE_NAME | VARIABLE_VALUE | COMMENT |
| tikv_gc_life_time | 4h0m0s | |
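TiDB v5.2.2, TiKV v5.2.2
dumpling 5.1.1 fails, and switching to 5.2.3 fails as well. Attachment: bak.log (14.7 KB)
The nightly early-morning job that uses dumpling 5.1.1 to back up the last two months of partition data runs without problems.
【Other background information or screenshots】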


Switch dumpling to the version that matches the cluster.


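As far as I remember, 5.2.2 and 5.2.3 are the same, so there shouldn't be any real difference.

Has the cluster had any maintenance operations recently? Also, please confirm whether this table's regions are healthy.

How do I check whether a table's regions are healthy? The only thing I found in the docs is the region check on the TiKV side: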

tikv-ctl --db /path/to/tikv/db bad-regions

Do I need to run this check on every TiKV node?

/ # ./tikv-ctl --db /data/tikv/db bad-regions
error: Invalid value for '--db ': DEPRECATED!!! Use --data-dir and --config instead

/ # ps -ef|grep tikv
1 root 226d /tikv-server --pd=xxx.xx.17.188:2479,xxx.xx.17.132:2479,xxx.xx.17.135:2479 --advertise-addr=xxx.xx.17.132:20160 --addr=0.0.0.0:20160 --status-addr=0.0.0.0:20180 --advertise-status-addr=xxx.xx.17.132:20180 --data-dir=/data/tikv --capacity=0 --config=/etc/tikv/tikv.toml

/ # ./tikv-ctl --data-dir /data/tikv --config=/etc/tikv/tikv.toml bad-regions > /data/tikv/tikvlog/bad-regions.txt 2>&1
/ # more /data/tikv/tikvlog/bad-regions.txt
[2022/01/10 13:56:55.539 +08:00] [INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[2022/01/10 13:56:55.540 +08:00] [INFO] [mod.rs:479] ["encryption is disabled."]
[2022/01/10 13:56:55.542 +08:00] [WARN] [config.rs:587] ["compaction guard is disabled due to region info provider not available"]
[2022/01/10 13:56:55.542 +08:00] [WARN] [config.rs:682] ["compaction guard is disabled due to region info provider not available"]
[2022/01/10 13:56:55.804 +08:00] [ERROR] [main.rs:80] ["error while open kvdb: Storage Engine IO error: While lock file: /data/tikv/db/LOCK: Resource temporarily unavailable"]
[2022/01/10 13:56:55.804 +08:00] [ERROR] [main.rs:83] ["LOCK file conflict indicates TiKV process is running. Do NOT delete the LOCK file and force the command to run. Doing so could cause data corruption."]

The command above needs the DB lock, so it conflicts with the running TiKV process.

Try this check instead:
tikv-ctl --host 127.0.0.1:20160 consistency-check -r 2

/ # ./tikv-ctl --host 127.0.0.1:20160 consistency-check -r 2 > /data/tikv/tikvlog/bad-regions.txt 2>&1
/ # more /data/tikv/tikvlog/bad-regions.txt
[2022/01/10 14:02:00.739 +08:00] [INFO] [] ["TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter"]
[2022/01/10 14:02:00.755 +08:00] [INFO] [] ["New connected subchannel at 0x7f67fb02e210 for subchannel 0x7f67fe013240"]
DebugClient::check_region_consistency: RpcFailure: 2-UNKNOWN RegionNotFound(2)

The region id is wrong.

Do I have to loop over the region ids myself?

Try a single one first.

/ # ./tikv-ctl --host xxx.xx.17.132:20160 consistency-check -r 32818892
[2022/01/10 14:59:32.476 +08:00] [INFO] [] ["TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter"]
[2022/01/10 14:59:32.477 +08:00] [INFO] [] ["New connected subchannel at 0x7ff43962e150 for subchannel 0x7ff43c613240"]
DebugClient::check_region_consistency: RpcFailure: 2-UNKNOWN "Leader is on store 8"

This is too laborious. There are more than 3,000 regions, and there is no way to check them all in one query.

/ # ./tikv-ctl --host xx.xx.17.135:20160 consistency-check -r 32818892
[2022/01/10 15:00:36.666 +08:00] [INFO] [] ["TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter"]
[2022/01/10 15:00:36.672 +08:00] [INFO] [] ["New connected subchannel at 0x7f23e362e150 for subchannel 0x7f23e6613240"]
success!

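SHOW TABLE Agent_material_report_cost regions;
Which system table does this statement read its data from? Could you post the SQL? I want to work out how to map each region leader's store to its TiKV address, so that I can generate a script and run the checks in batch.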
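
A minimal sketch of the batch script asked about here, under some assumptions. It assumes the (region id, leader address) pairs have already been exported as tab-separated text, for example by joining `information_schema.TIKV_REGION_PEERS` (rows with `IS_LEADER = 1`) to `information_schema.TIKV_STORE_STATUS` on `STORE_ID`; please verify those table and column names against your TiDB version. The function name `gen_consistency_checks` and the file `regions.tsv` are hypothetical. The generated command line mirrors the one used earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: read "region_id<TAB>leader_address" lines on stdin and
# print one tikv-ctl consistency-check command per region.
# Blank lines and lines whose first field is not numeric (such as a header) are skipped.
gen_consistency_checks() {
    awk -F '\t' '
        $1 ~ /^[0-9]+$/ && $2 != "" {
            printf "./tikv-ctl --host %s consistency-check -r %s\n", $2, $1
        }
    '
}

# Example usage (assumes regions.tsv was exported beforehand):
#   gen_consistency_checks < regions.tsv > check_regions.sh
```

Generating the commands into a file first, rather than running them directly, makes it possible to review them before executing anything against the cluster.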

One question: if a region has a problem, does the check result say "bad region"? The output is too large to filter easily.
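
One way to make large outputs easier to scan is to drop the timestamped log lines and keep only the result line. The sketch below relies solely on the outputs shown earlier in this thread: it treats a lone `success!` as healthy and flags everything else (such as an `RpcFailure` line) for manual review. What an actually inconsistent region prints is not shown in this thread, so that case is an assumption. `summarize_check` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: read the combined output of one consistency-check on stdin.
# Prints "OK <label>" when the only non-log line is "success!"; otherwise prints
# "CHECK <label>: <line>" for every line that is not a timestamped log line.
summarize_check() {
    label=$1
    awk -v label="$label" '
        /^\[[0-9][0-9][0-9][0-9]\// { next }   # skip [2022/01/10 ...] log lines
        /^[[:space:]]*$/            { next }   # skip blank lines
        $0 == "success!"            { ok = 1; next }
        { rest[++n] = $0 }
        END {
            if (ok && n == 0) { print "OK " label; exit }
            for (i = 1; i <= n; i++) print "CHECK " label ": " rest[i]
            if (n == 0) print "CHECK " label ": no result line"
        }
    '
}

# Example usage:
#   ./tikv-ctl --host xxx.xx.17.135:20160 consistency-check -r 32818892 2>&1 | summarize_check 32818892
```

Piping the results through `grep '^CHECK'` afterwards leaves only the regions that need attention.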

https://docs.pingcap.com/zh/tidb/v5.1/tikv-control/

See the section there on repairing problematic regions.

tikv-ctl --db /path/to/tikv/db bad-regions
Does TiKV have to be stopped before this command can run?

Yes.

When dumpling fails, can it print which specific region is unavailable?
Can dumpling export without adding _tidb_rowid and ORDER BY to the query?
SELECT * FROM yixintui_operate.Agent_material_report_cost WHERE ( report_date >='2021-01-01' and report_date <'2021-02-01' ) AND (_tidb_rowid>=3452416045 and _tidb_rowid<3452440709) ORDER BY _tidb_rowid