An ADD INDEX is stuck — how do I find out the cause?

[TiDB Environment] Production
[TiDB Version] 5.0.2
[Problem: symptoms and impact]
The following DDL has been stuck for more than 3 hours:
alter table tb_xxx add index idx_yyyy(commissioner_id, month_id, city_id);
During this time the table's row count dropped from roughly 20 million to roughly 10 million, and has now slowly climbed back to about 19 million — it turns out the workload first ran DELETEs and then INSERTs.

The row_count in admin show ddl jobs stays at 0, and roughly every 20 minutes logs like the following are emitted:
[2022/12/28 14:26:47.351 +08:00] [INFO] [ddl_worker.go:861] ["[ddl] wait latest schema version changed"] [worker="worker 2, tp add index"] [ver=3484] ["take time"=53.953994ms] [job="ID:4476, Type:add index, State:running, SchemaState:write reorganization, SchemaID:51, TableID:214, RowCount:0, ArgLen:0, start time: 2022-12-28 11:13:41.69 +0800 CST, Err:, ErrCount:0, SnapshotVersion:0"]
[2022/12/28 14:26:47.353 +08:00] [INFO] [ddl_worker.go:678] ["[ddl] run DDL job"] [worker="worker 2, tp add index"] [job="ID:4476, Type:add index, State:running, SchemaState:write reorganization, SchemaID:51, TableID:214, RowCount:0, ArgLen:0, start time: 2022-12-28 11:13:41.69 +0800 CST, Err:, ErrCount:0, SnapshotVersion:0"]
[2022/12/28 14:42:59.101 +08:00] [INFO] [ddl_worker.go:861] ["[ddl] wait latest schema version changed"] [worker="worker 2, tp add index"] [ver=3485] ["take time"=53.108472ms] [job="ID:4476, Type:add index, State:running, SchemaState:write reorganization, SchemaID:51, TableID:214, RowCount:0, ArgLen:0, start time: 2022-12-28 11:13:41.69 +0800 CST, Err:, ErrCount:0, SnapshotVersion:0"]
[2022/12/28 14:42:59.102 +08:00] [INFO] [ddl_worker.go:678] ["[ddl] run DDL job"] [worker="worker 2, tp add index"] [job="ID:4476, Type:add index, State:running, SchemaState:write reorganization, SchemaID:51, TableID:214, RowCount:0, ArgLen:0, start time: 2022-12-28 11:13:41.69 +0800 CST, Err:, ErrCount:0, SnapshotVersion:0"]
[2022/12/28 15:02:49.460 +08:00] [INFO] [ddl_worker.go:861] ["[ddl] wait latest schema version changed"] [worker="worker 2, tp add index"] [ver=3486] ["take time"=53.689792ms] [job="ID:4476, Type:add index, State:running, SchemaState:write reorganization, SchemaID:51, TableID:214, RowCount:0, ArgLen:0, start time: 2022-12-28 11:13:41.69 +0800 CST, Err:, ErrCount:0, SnapshotVersion:0"]
[2022/12/28 15:02:49.461 +08:00] [INFO] [ddl_worker.go:678] ["[ddl] run DDL job"] [worker="worker 2, tp add index"] [job="ID:4476, Type:add index, State:running, SchemaState:write reorganization, SchemaID:51, TableID:214, RowCount:0, ArgLen:0, start time: 2022-12-28 11:13:41.69 +0800 CST, Err:, ErrCount:0, SnapshotVersion:0"]

What is making this DDL so slow? Is it the heavy data churn on the table? And how can this kind of problem be resolved?

With the data changing this much, the rows are gone before their index entries even get built....

Check admin show ddl jobs, and run explain analyze on a full-table-scan SQL and post the result — let's see whether there are a lot of MVCC versions.

Have a look at this thread, which seems quite similar to your problem: too many MVCC versions on the data, on an older TiDB (5.0 should be affected). See whether the approach there helps: "Adding an index to a 9-million-row table still hasn't finished after more than a day — how to troubleshoot?"
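One thing worth checking alongside that thread (a sketch only; the `mysql.tidb` keys below are the standard GC settings in TiDB 5.x) is whether GC is keeping up, since versions older than the GC safe point should normally be cleaned:

```sql
-- Check GC settings and whether the safe point is advancing.
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME IN ('tikv_gc_life_time', 'tikv_gc_run_interval',
                        'tikv_gc_last_run_time', 'tikv_gc_safe_point');
```

If `tikv_gc_safe_point` lags far behind, or `tikv_gc_life_time` is set very long, deleted versions will pile up and every scan has to step over them.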

The row count is now steady at 21,247,099. I re-ran the alter table, but row_count in admin show ddl jobs still doesn't change.

explain analyze select count(*) from tb_xxx;
+----------------------------+-------------+----------+-----------+------------------------------------+-----------------------------------------------------------------------------------------------+-----------------------------------+---------+------+
| id | estRows | actRows | task | access object | execution info | operator info | memory | disk |
+----------------------------+-------------+----------+-----------+------------------------------------+-----------------------------------------------------------------------------------------------+-----------------------------------+---------+------+
| StreamAgg_32 | 1.00 | 1 | root | | time:105.3ms, loops:2 | funcs:count(Column#33)->Column#19 | 9.49 KB | N/A |
| └─IndexReader_33 | 1.00 | 4299 | root | | time:105.1ms, loops:6, cop_task: {num: 4299, max: 3.52ms, min: 190.9µs, avg: 285.1µs, p95: 377.5µs, rpc_num: 4299, rpc_time: 1.19s, copr_cache_hit_ratio: 1.00} | index:StreamAgg_8 | 3.19 KB | N/A |
| └─StreamAgg_8 | 1.00 | 4299 | cop[tikv] | | tikv_task:{proc max:576ms, min:74ms, p80:311ms, p95:421ms, iters:25037, tasks:4299}, scan_detail: {total_process_keys: 0, total_keys: 0, rocksdb: {delete_skipped_count: 0, key_skipped_count: 0, block: {cache_hit_count: 0, read_count: 0, read_byte: 0 Bytes}}} | funcs:count(1)->Column#33 | N/A | N/A |
| └─IndexFullScan_28 | 21247099.00 | 21247099 | cop[tikv] | table:tb_xxx, index:idx_dt(cal_dt) | tikv_task:{proc max:576ms, min:74ms, p80:311ms, p95:421ms, iters:25037, tasks:4299}, scan_detail: {total_process_keys: 0, total_keys: 0, rocksdb: {delete_skipped_count: 0, key_skipped_count: 0, block: {cache_hit_count: 0, read_count: 0, read_byte: 0 Bytes}}} | keep order:false, stats:pseudo | N/A | N/A |
+----------------------------+-------------+----------+-----------+------------------------------------+-----------------------------------------------------------------------------------------------+-----------------------------------+---------+------+

So is there any solution for this scenario?

Run explain analyze select max(<some non-indexed column>) — the point is to check total_process_keys and the related MVCC counts. Historical versions, or data that cannot be GC'ed yet, can make the scan take a long time.

explain analyze select max(sign_cnt) from tb_xxx;
+------------------------------+-------------+----------+-----------+---------------------------------------------+---------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------+-----------+------+
| id | estRows | actRows | task | access object | execution info | operator info | memory | disk |
+------------------------------+-------------+----------+-----------+---------------------------------------------+---------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------+-----------+------+
| StreamAgg_9 | 1.00 | 1 | root | | time:1m33.9s, loops:2 | funcs:max(tdb58_ifang_stats.tb_xxx.sign_cnt)->Column#19 | 1.45 KB | N/A |
| └─TopN_10 | 1.00 | 1 | root | | time:1m33.9s, loops:2 | tdb58_ifang_stats.tb_xxx.sign_cnt:desc, offset:0, count:1 | 200 Bytes | N/A |
| └─TableReader_18 | 1.00 | 46 | root | | time:1m33.9s, loops:3, cop_task: {num: 4931, max: 1.07s, min: 58.8ms, avg: 284.8ms, p95: 494.2ms, max_proc_keys: 511093, p95_proc_keys: 0, tot_proc: 21m23.6s, tot_wait: 1m58.5s, rpc_num: 4932, rpc_time: 23m24.4s, copr_cache_hit_ratio: 0.00} | data:TopN_17 | 463 Bytes | N/A |
| └─TopN_17 | 1.00 | 46 | cop[tikv] | | tikv_task:{proc max:605ms, min:51ms, p80:358ms, p95:422ms, iters:25649, tasks:4931}, scan_detail: {total_process_keys: 21247099, total_keys: 3936959991, rocksdb: {delete_skipped_count: 489891, key_skipped_count: 3939966387, block: {cache_hit_count: 1375902, read_count: 2751371, read_byte: 18.4 GB}}} | tdb58_ifang_stats.tb_xxx.sign_cnt:desc, offset:0, count:1 | N/A | N/A |
| └─TableFullScan_16 | 21247099.00 | 21247099 | cop[tikv] | table:tb_xxx | tikv_task:{proc max:605ms, min:51ms, p80:357ms, p95:421ms, iters:25649, tasks:4931}, scan_detail: {total_process_keys: 0, total_keys: 0, rocksdb: {delete_skipped_count: 0, key_skipped_count: 0, block: {cache_hit_count: 0, read_count: 0, read_byte: 0 Bytes}}} | keep order:false, stats:pseudo | N/A | N/A |
+------------------------------+-------------+----------+-----------+---------------------------------------------+---------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------+-----------+------+
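The `scan_detail` in the TopN step above tells the story: `total_process_keys` is about 21.2 million (the live rows) while `total_keys` is about 3.94 billion, and `key_skipped_count` is nearly as large — so the scan steps over well over a hundred MVCC versions for every live row. A quick back-of-the-envelope check using the numbers from the plan:

```sql
-- Amplification from the plan above: versions scanned per live key.
SELECT 3936959991 / 21247099 AS versions_per_live_key;  -- ≈ 185
```

The same amplification hits the ADD INDEX backfill, which is why it appears to make no progress.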

That's a huge difference. Refer to the thread linked above — which, it seems, you also posted.

Then there's no way out :joy:

You could consider upgrading.

There's no upgrade plan for now. I'll first discuss with the business side whether rebuilding the table and re-importing the data is an acceptable workaround.
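If the business side agrees, the rebuild could look roughly like this (a sketch only; `tb_xxx_new` and `tb_xxx_old` are placeholder names, and writes to the table must be paused during the backfill or the two copies will diverge):

```sql
-- Create an empty copy that already carries the new index,
-- backfill it, then swap the names.
CREATE TABLE tb_xxx_new LIKE tb_xxx;
ALTER TABLE tb_xxx_new ADD INDEX idx_yyyy (commissioner_id, month_id, city_id);
INSERT INTO tb_xxx_new SELECT * FROM tb_xxx;
RENAME TABLE tb_xxx TO tb_xxx_old, tb_xxx_new TO tb_xxx;
```

For a table this size the single INSERT ... SELECT may exceed the transaction size limit, so exporting with Dumpling and re-importing with Lightning would be the safer route.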

A minor-version upgrade should be pretty safe, and it carries a lot of bug fixes.