TiKV_approximate_region_size alert triggered; manual split times out

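[TiDB environment] Production / Test / PoC
[TiDB version]
[Reproduction steps] What operations led to the problem
[Problem: symptoms and impact]
The TiKV_approximate_region_size alert fired. Monitoring is shown below: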
PD


TiKV-Details

TiKV-Trouble-Shooting

Checking the Regions

They are all concentrated in a single table.

Manual split:
tiup ctl:v6.5.2 pd -u http://xx.xx.xx.xx:2379 operator add split-region 417438
This is what the PD Dashboard shows:


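The operator shows the corresponding Create, Check, and Timeout events.
Question 1: What prevents the Regions from splitting automatically?
Question 2: Why does the manual split time out?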

Is this a business table? Does it have any special characteristics?

The table has one json column; apart from that there is nothing special about it.

Also take a look at APPROXIMATE_KEYS in INFORMATION_SCHEMA.TIKV_REGION_STATUS.

That is what this screenshot shows.

Consider the shard_row_id_bits and pre_split_regions parameters.

How big is a single json value?

The sizes vary. In a quick sample, the largest was 1455120 and even the smaller ones were around 35233.

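Did you split it with operator add split-region 1 --policy=approximate? That does a rough split and may be faster. You could also try splitting a smaller Region first and see whether it times out.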

At first I added --policy=approximate, later I left it out, and the result was the same. I'll try a smaller Region and see whether that works.

This is the log of the failed automatic split:
[2024/02/05 23:42:35.657 +00:00] [INFO] [size.rs:202] ["Run size checker"] [policy=Approximate] [threshold=150994944] [size=1797417437] [region_id=417290]
[2024/02/05 23:42:35.657 +00:00] [INFO] [range_properties.rs:130] ["range size is too large"] [cf=default] [ssts_size="5916496.sst:270598017, 5931357.sst:208660678, 5930493.sst:12131525, 5931316.sst:12832681, 5932616.sst:12689300, 5932848.sst:11735428, 5933096.sst:11280137, 5933324.sst:11333861, 5933929.sst:10920084, 5931769.sst:11980596, 5934186.sst:11492591, 5930921.sst:11778132, 5934437.sst:11546315, 5934665.sst:11600039, 5934903.sst:12715860, 5935272.sst:12246028, 5935992.sst:11339664, 5936221.sst:11388615, 5936467.sst:12529411, 5936960.sst:12646849, 5937176.sst:12151931, 5937585.sst:12820453, 5947314.sst:12651302, 5933701.sst:11387585, 5955335.sst:11020170, 5940824.sst:13586353, 5947709.sst:12733553, 5947511.sst:13360339, 5956266.sst:12597085, 5948065.sst:12136293, 5946977.sst:13225030, 5948259.sst:11495077, 5948436.sst:10848200, 5956073.sst:11088435, 5946025.sst:12366587, 5939435.sst:13241698, 5949768.sst:11032904, 5950484.sst:11849833, 5945617.sst:12927550, 5952983.sst:12901527, 5950662.sst:11881912, 5950288.sst:11817754, 5951752.sst:12021550, 5951220.sst:11231816, 5946279.sst:9791400, 5930707.sst:12200900, 5953503.sst:11563928, 5951404.sst:10555635, 5931535.sst:12411800, 5951572.sst:10580610, 5942933.sst:12343690, 5932415.sst:13123279, 5951985.sst:12053629, 5950117.sst:11785675, 5941280.sst:13754851, 5935775.sst:12363466, 5956455.sst:12629164, 5938249.sst:10717349, 5937374.sst:12761734, 5955890.sst:11800136, 5956283.sst:18921968, 5944095.sst:13192809, 5938036.sst:10677278, 5954872.sst:11700680, 5942706.sst:12913089, 5938599.sst:11324710, 5947883.sst:12100329, 5953702.sst:12317809, 5949942.sst:11753596, 5953315.sst:13695257, 5940977.sst:11861950, 5939226.sst:11466790, 5955086.sst:10995195, 5953151.sst:12937491, 5947143.sst:13936398, 5942134.sst:13372931, 5952813.sst:12151753, 5946826.sst:12522653, 5941610.sst:13265483, 5935506.sst:12304747, 5945850.sst:13621713, 5945425.sst:13526142, 5945202.sst:12195758, 5943349.sst:11809811, 5943138.sst:12388090, 5945030.sst:13432902, 5954658.sst:11672264, 5944678.sst:13337331, 5944491.sst:13288380, 5937832.sst:11195950, 5952253.sst:9239126, 5942495.sst:12252670, 5943869.sst:12519070, 5936769.sst:12588130, 5943527.sst:12474670, 5944280.sst:12610090, 5942308.sst:13426655, 5939830.sst:13359136, 5939635.sst:13300417, 5931989.sst:12547775, 5946667.sst:13138450, 5941776.sst:13319207, 5941445.sst:13211759, 5941127.sst:13696132, 5940236.sst:12886973, 5944848.sst:12747730, 5940021.sst:12833249, 5939031.sst:12562187, 5938829.sst:12508463"] [memtable=49752] [total_size=1797410892] [end=7A7480000000000001FF7B5F7282CA33BAADFFBAA48E0000000000FA] [start=7A7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA]
[2024/02/05 23:42:35.658 +00:00] [INFO] [peer.rs:5550] ["on split"] [source="split checker"] [split_keys="10 keys range from 7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FAF9CA1257DCD3FFF0 to 7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FAF9CA213E3963FFF9"] [peer_id=417293] [region_id=417290]
[2024/02/05 23:42:35.658 +00:00] [INFO] [pd.rs:1082] ["try to batch split region"] [task=batch_split] [region="id: 417290 start_key: 7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA end_key: 7480000000000001FF7B5F7282CA33BAADFFBAA48E0000000000FA region_epoch { conf_ver: 5 version: 3894 } peers { id: 417291 store_id: 1 } peers { id: 417292 store_id: 2 } peers { id: 417293 store_id: 3 }"] [new_region_ids="[new_region_id: 499054 new_peer_ids: 499055 new_peer_ids: 499056 new_peer_ids: 499057, new_region_id: 499058 new_peer_ids: 499059 new_peer_ids: 499060 new_peer_ids: 499061, new_region_id: 499062 new_peer_ids: 499063 new_peer_ids: 499064 new_peer_ids: 499065, new_region_id: 499066 new_peer_ids: 499067 new_peer_ids: 499068 new_peer_ids: 499069, new_region_id: 499070 new_peer_ids: 499071 new_peer_ids: 499072 new_peer_ids: 499073, new_region_id: 499074 new_peer_ids: 499075 new_peer_ids: 499076 new_peer_ids: 499077, new_region_id: 499078 new_peer_ids: 499079 new_peer_ids: 499080 new_peer_ids: 499081, new_region_id: 499082 new_peer_ids: 499083 new_peer_ids: 499084 new_peer_ids: 499085, new_region_id: 499086 new_peer_ids: 499087 new_peer_ids: 499088 new_peer_ids: 499089, new_region_id: 499090 new_peer_ids: 499091 new_peer_ids: 499092 new_peer_ids: 499093]"] [region_id=417290]
[2024/02/05 23:42:35.659 +00:00] [WARN] [split_observer.rs:38] ["skip invalid split key: key is not in region"] [index=0] [end_key=7480000000000001FF7B5F7282CA33BAADFFBAA48E0000000000FA] [start_key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA] [region_id=417290] [key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA]
[2024/02/05 23:42:35.659 +00:00] [WARN] [split_observer.rs:38] ["skip invalid split key: key is not in region"] [index=1] [end_key=7480000000000001FF7B5F7282CA33BAADFFBAA48E0000000000FA] [start_key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA] [region_id=417290] [key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA]
[2024/02/05 23:42:35.659 +00:00] [WARN] [split_observer.rs:38] ["skip invalid split key: key is not in region"] [index=2] [end_key=7480000000000001FF7B5F7282CA33BAADFFBAA48E0000000000FA] [start_key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA] [region_id=417290] [key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA]
[2024/02/05 23:42:35.659 +00:00] [WARN] [split_observer.rs:38] ["skip invalid split key: key is not in region"] [index=3] [end_key=7480000000000001FF7B5F7282CA33BAADFFBAA48E0000000000FA] [start_key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA] [region_id=417290] [key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA]
[2024/02/05 23:42:35.659 +00:00] [WARN] [split_observer.rs:38] ["skip invalid split key: key is not in region"] [index=4] [end_key=7480000000000001FF7B5F7282CA33BAADFFBAA48E0000000000FA] [start_key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA] [region_id=417290] [key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA]
[2024/02/05 23:42:35.659 +00:00] [WARN] [split_observer.rs:38] ["skip invalid split key: key is not in region"] [index=5] [end_key=7480000000000001FF7B5F7282CA33BAADFFBAA48E0000000000FA] [start_key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA] [region_id=417290] [key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA]
[2024/02/05 23:42:35.659 +00:00] [WARN] [split_observer.rs:38] ["skip invalid split key: key is not in region"] [index=6] [end_key=7480000000000001FF7B5F7282CA33BAADFFBAA48E0000000000FA] [start_key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA] [region_id=417290] [key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA]
[2024/02/05 23:42:35.659 +00:00] [WARN] [split_observer.rs:38] ["skip invalid split key: key is not in region"] [index=7] [end_key=7480000000000001FF7B5F7282CA33BAADFFBAA48E0000000000FA] [start_key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA] [region_id=417290] [key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA]
[2024/02/05 23:42:35.659 +00:00] [WARN] [split_observer.rs:38] ["skip invalid split key: key is not in region"] [index=8] [end_key=7480000000000001FF7B5F7282CA33BAADFFBAA48E0000000000FA] [start_key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA] [region_id=417290] [key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA]
[2024/02/05 23:42:35.659 +00:00] [WARN] [split_observer.rs:38] ["skip invalid split key: key is not in region"] [index=9] [end_key=7480000000000001FF7B5F7282CA33BAADFFBAA48E0000000000FA] [start_key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA] [region_id=417290] [key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA]
[2024/02/05 23:42:35.659 +00:00] [ERROR] [split_observer.rs:142] ["failed to handle split req"] [err=""no valid key found for split.""] [region_id=417290]
[2024/02/05 23:42:35.659 +00:00] [WARN] [peer.rs:4339] ["skip proposal"] [error_code=KV:Raftstore:Coprocessor] [err="Coprocessor(Other("[components/raftstore/src/coprocessor/split_observer.rs:147]: no valid key found for split."))"] [peer_id=417293] [region_id=417290]


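I tried a somewhat larger Region on another table and the split completed quickly. For the Regions of this table, the log shows split tasks running all the time, but they fail, and a manual split also times out.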

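How big is Region 417290? I don't see it in your screenshots. This Region may really have hit a bug that broke the automatic split process, so that none of the Regions can split automatically any more.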

It was 648M when I ran the split, and while I was watching to see whether it would split, it kept growing.
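
For scale, the threshold and size values in the "Run size checker" log line for Region 417290 can be converted to MiB. This is a small sketch that uses only the two numbers copied from that log line; the variable names are illustrative.

```python
MIB = 1024 * 1024

# Values copied from the "Run size checker" log line for Region 417290.
threshold_bytes = 150994944
size_bytes = 1797417437

threshold_mib = threshold_bytes / MIB
size_mib = size_bytes / MIB
ratio = size_bytes / threshold_bytes

print(f"threshold: {threshold_mib:.0f} MiB")
print(f"size:      {size_mib:.0f} MiB")
print(f"size / threshold: {ratio:.1f}x")
```

At the time of that log entry, the Region was already more than ten times over the size checker's threshold.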

[2024/02/05 23:42:35.659 +00:00] [WARN] [split_observer.rs:38] ["skip invalid split key: key is not in region"] [index=8] [end_key=7480000000000001FF7B5F7282CA33BAADFFBAA48E0000000000FA] [start_key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA] [region_id=417290] [key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA]
[2024/02/05 23:42:35.659 +00:00] [WARN] [split_observer.rs:38] ["skip invalid split key: key is not in region"] [index=9] [end_key=7480000000000001FF7B5F7282CA33BAADFFBAA48E0000000000FA] [start_key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA] [region_id=417290] [key=7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA]

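This looks like a bug. The log above shows that the split key is exactly the start key of this range of SST files (key and start_key are identical in the warnings).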

Then, when the code checks whether this key lies outside that range, it uses:

start_key < key && (key < end_key || end_key.is_empty())

https://github.com/tikv/tikv/blob/master/components/tikv_util/src/store/region.rs#L9

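Because the comparison against start_key is strict, a split key equal to start_key is always treated as lying outside the range, so no valid split key is ever found.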
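
The effect of the strict comparison can be seen with a minimal Python sketch of the predicate quoted above, fed with the start_key, end_key, and key values from the warning lines. This is an illustrative re-expression, not the TiKV implementation; key_in_range is a hypothetical name.

```python
def key_in_range(key: bytes, start_key: bytes, end_key: bytes) -> bool:
    """Python rendering of: start_key < key && (key < end_key || end_key.is_empty())."""
    return start_key < key and (key < end_key or len(end_key) == 0)


# Values copied from the "skip invalid split key" warnings for Region 417290.
start_key = bytes.fromhex("7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA")
end_key = bytes.fromhex("7480000000000001FF7B5F7282CA33BAADFFBAA48E0000000000FA")
candidate = bytes.fromhex("7480000000000001FF7B5F7282CA33BAADFFB653B70000000000FA")

# The candidate equals start_key, so the strict "<" rejects it.
print(key_in_range(candidate, start_key, end_key))  # False

# Any key strictly between start_key and end_key would be accepted.
print(key_in_range(start_key + b"\x00", start_key, end_key))  # True
```

Since all ten candidates in the log carry the same key value, every one of them is rejected in the same way.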

My suggestion is to manually split some other Regions first and see whether only this Region is affected. If so, you could try rebuilding the affected table.

None of the Regions of that one table can be split.

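Is that the table with the json column? Create a new table, copy part of the original table's data into it, and see whether the problem reproduces.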

The table structure is as follows:
CREATE TABLE a (
  account_id bigint(20) NOT NULL DEFAULT '0' COMMENT 'User ID',
  response json DEFAULT NULL COMMENT 'Response result',
  create_time bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT 'create_time',
  type tinyint(4) DEFAULT NULL COMMENT 'Type (0 = real-time, 1 = offline)',
  PRIMARY KEY (account_id) /*T![clustered_index] CLUSTERED */,
  KEY idx_create_time (create_time)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin COMMENT='Result table'
I'll test this and see whether I can reproduce it.

From this, it does look like a bug: key is equal to start_key.