After upgrading to V5.2.2, a SQL statement went from 15 seconds to 15 hours

【Overview】 A join between two tables with tens of millions of rows each

【Background】 Version upgraded to V5.2.2

【Symptoms】 Observed on both the application side and the database side

【Problem】 SQL execution has become extremely slow

【Business Impact】 Slow, slow, slow

【TiDB Version】 V5.2.2

SELECT
  count(1)
FROM
  t1 ags
  LEFT JOIN t2 ar ON ags.t2_id = ar.t2_id
WHERE
  ags.status IN (0, 1, 2, 4)
  AND ags.merchant_id = 'dwj'
  AND ags.create_time >= date_format('2021-10-22 00:00:00', '%Y-%m-%d')
  AND ags.create_time < date_format('2021-12-22 00:00:00', '%Y-%m-%d');
	id                       	task     	estRows   	operator info                                                                                                                                                                                                                                                             	actRows	execution info                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                	memory   	disk
	HashAgg_8                	root     	1         	funcs:count(1)->Column#46                                                                                                                                                                                                                                                 	1      	time:17h35m53.9s, loops:2, partial_worker:{wall_time:17h35m53.818665882s, concurrency:5, task_num:276506, tot_wait:87h59m25.534435281s, tot_exec:3.365149708s, tot_time:87h59m29.092939118s, max:17h35m53.818623181s, p95:17h35m53.818623181s}, final_worker:{wall_time:17h35m53.91090043s, concurrency:5, task_num:5, tot_wait:87h59m29.093287161s, tot_exec:92.266445ms, tot_time:87h59m29.1855608s, max:17h35m53.910800171s, p95:17h35m53.910800171s}                                                                                      	5.55 KB  	N/A
	└─IndexJoin_13           	root     	8584973.13	left outer join, inner:IndexReader_12, outer key:dbdbdb.t1.t2_id, inner key:dbdbdb.t2.t2_id, equal cond:eq(dbdbdb.t1.t2_id, dbdbdb.t2.t2_id)	8848173	time:17h35m52.3s, loops:276507, inner:{total:87h11m9s, concurrency:5, task:276505, construct:5h18m16.8s, fetch:81h52m46s, build:5.74s}, probe:9.51s                                                                                                                                                                                                                                                                                                                                                                                           	1.59 GB  	N/A
	  ├─IndexLookUp_35       	root     	8567764.50	                                                                                                                                                                                                                                                                          	8848173	time:6.3s, loops:276508, index_task: {total_time: 2h0m4.2s, fetch_handle: 1.76s, build: 1.25ms, wait: 2h0m2.4s}, table_task: {total_time: 10h0m21s, num: 453, concurrency: 5}                                                                                                                                                                                                                                                                                                                                                                 	163.6 MB 	N/A
	  │ ├─IndexRangeScan_32  	cop[tikv]	9154137.14	table:ags, index:idx_t1_3(create_time, phone), range:[2021-10-22 00:00:00,2021-12-22 00:00:00), keep order:false                                                                                                                                        	9094286	time:627.4ms, loops:8928, cop_task: {num: 10, max: 652.8ms, min: 346.8ms, avg: 485.9ms, p95: 652.8ms, max_proc_keys: 1073274, p95_proc_keys: 1073274, tot_proc: 4.6s, tot_wait: 31ms, rpc_num: 10, rpc_time: 4.86s, copr_cache_hit_ratio: 0.00}, tikv_task:{proc max:590ms, min:324ms, p80:479ms, p95:590ms, iters:8926, tasks:10}, scan_detail: {total_process_keys: 9094286, total_keys: 9094296, rocksdb: {delete_skipped_count: 244, key_skipped_count: 9094530, block: {cache_hit_count: 77, read_count: 7896, read_byte: 72.0 MB}}}     	N/A      	N/A
	  │ └─Selection_34       	cop[tikv]	8567764.50	eq(dbdbdb.t1.merchant_id, "dwj"), in(dbdbdb.t1.status, 0, 1, 2, 4)                                                                                                                                                              	8848173	time:22.2s, loops:11998, cop_task: {num: 588, max: 142.3ms, min: 529.8μs, avg: 42.9ms, p95: 75.3ms, max_proc_keys: 20992, p95_proc_keys: 20480, tot_proc: 22.3s, tot_wait: 269ms, rpc_num: 588, rpc_time: 25.2s, copr_cache_hit_ratio: 0.00}, tikv_task:{proc max:88ms, min:0s, p80:50ms, p95:63ms, iters:11557, tasks:588}, scan_detail: {total_process_keys: 9094286, total_keys: 9531493, rocksdb: {delete_skipped_count: 292982, key_skipped_count: 19120129, block: {cache_hit_count: 1961079, read_count: 49014, read_byte: 793.7 MB}}}	N/A      	N/A
	  │   └─TableRowIDScan_33	cop[tikv]	9154137.14	table:ags, keep order:false                                                                                                                                                                                                                                               	9094286	tikv_task:{proc max:83ms, min:0s, p80:47ms, p95:60ms, iters:11557, tasks:588}                                                                                                                                                                                                                                                                                                                                                                                                                                                                 	N/A      	N/A
	  └─IndexReader_12       	root     	1         	index:IndexRangeScan_11                                                                                                                                                                                                                                                   	3563472	time:66h25m17.9s, loops:553011, cop_task: {num: 2997713, max: 1.64s, min: 134.4μs, avg: 812.3μs, p95: 1.41ms, max_proc_keys: 7, p95_proc_keys: 2, tot_proc: 7m30.9s, tot_wait: 7m44.3s, rpc_num: 2997713, rpc_time: 39m30.1s, copr_cache_hit_ratio: 0.00}                                                                                                                                                                                                                                                                                   	841 Bytes	N/A
	    └─IndexRangeScan_11  	cop[tikv]	1         	table:ar, index:PRIMARY(t2_id), range: decided by [eq(dbdbdb.t2.t2_id, dbdbdb.t1.t2_id)], keep order:false                           

Do you have the execution plan from before the upgrade?

Same question here: do you have the execution plan from before the upgrade? Also, the actual execution time of your cop tasks is not high, but the total execution time is far too high, so consider whether cop execution concurrency is the problem. See: https://docs.pingcap.com/zh/tidb/stable/analyze-slow-queries/#%E5%B9%B6%E5%8F%91%E5%A4%AA%E4%BD%8E
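
For reference, here is a sketch of how to check the cop/executor concurrency the linked section discusses (standard TiDB 5.x variable names; the value in the SET statement is only an example):

-- Cop/executor concurrency (TiDB 5.x defaults are 15 and 5)
SHOW VARIABLES LIKE 'tidb_distsql_scan_concurrency';
SHOW VARIABLES LIKE 'tidb_executor_concurrency';
-- If cop parallelism really is the bottleneck, raising it for one session is a low-risk experiment
SET SESSION tidb_distsql_scan_concurrency = 30;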

No variables were changed before or after the upgrade. Also, in a test environment with the same schema and the same data volume, the query runs very efficiently.

explain analyze reports concurrency:5 in both the test and production environments.

In the test environment the result comes back within about 10 seconds; production is extremely unstable, anywhere from tens of minutes to several hours to more than ten hours.

show variables like '%tidb_hash_join_concurrency%';
tidb_hash_join_concurrency	-1
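
(A value of -1 means the variable falls back to tidb_executor_concurrency, which defaults to 5 in 5.x, so hash-join concurrency is effectively 5. For this plan the more relevant knob is the index join's inner concurrency; a quick check, as a sketch:)

-- -1 means "use tidb_executor_concurrency" (default 5)
SHOW VARIABLES LIKE 'tidb_executor_concurrency';
-- Drives the inner workers of IndexJoin_13; it also defaults to -1
SHOW VARIABLES LIKE 'tidb_index_lookup_join_concurrency';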

Could you post the execution plan for this statement from the test environment?

Test environment: same version, V5.2.2; the data volume is roughly equivalent, differing by no more than 5%.

explain analyze from the test environment (columns: id, estRows, actRows, task, access object, execution info, operator info, memory, disk):

HashAgg_8	1.00	1	root		time:4.46s, loops:2, partial_worker:{wall_time:4.461507336s, concurrency:5, task_num:2869, tot_wait:22.042533145s, tot_exec:263.536139ms, tot_time:22.307242786s, max:4.461474374s, p95:4.461474374s}, final_worker:{wall_time:4.461547625s, concurrency:5, task_num:5, tot_wait:22.307437568s, tot_exec:54.987μs, tot_time:22.307499592s, max:4.461517225s, p95:4.461517225s}	funcs:count(1)->Column#46	63.7 KB	N/A
└─IndexJoin_13	2847827.25	2937057	root		time:4.45s, loops:2870, inner:{total:17.2s, concurrency:5, task:2870, construct:5.38s, fetch:11s, build:811ms}, probe:1.66s	left outer join, inner:IndexReader_12, outer key:test_dbdbdb.t1.t2_id, inner key:test_dbdbdb.t2.t2_id, equal cond:eq(test_dbdbdb.t1.t2_id, test_dbdbdb.t2.t2_id)	460.3 MB	N/A
  ├─IndexLookUp_35(Build)	2828591.23	2937057	root		time:1.35s, loops:2872, index_task: {total_time: 3.3s, fetch_handle: 1.03s, build: 322.1μs, wait: 2.27s}, table_task: {total_time: 16.5s, num: 160, concurrency: 5}		105.2 MB	N/A
  │ ├─IndexRangeScan_32(Build)	3046331.56	3111057	cop[tikv]	table:ags, index:idx_t1_3(create_time, phone)	time:720.7ms, loops:3046, cop_task: {num: 3, max: 875ms, min: 709.4ms, avg: 771.4ms, p95: 875ms, max_proc_keys: 1200505, p95_proc_keys: 1200505, tot_proc: 2.12s, rpc_num: 3, rpc_time: 2.31s, copr_cache_hit_ratio: 0.00}, tikv_task:{proc max:773ms, min:622ms, p80:773ms, p95:773ms, iters:3052, tasks:3}, scan_detail: {total_process_keys: 3111057, total_keys: 3111060, rocksdb: {delete_skipped_count: 0, key_skipped_count: 3111057, block: {cache_hit_count: 2726, read_count: 1, read_byte: 64.0 KB}}}	range:[2021-10-22 00:00:00,2021-12-22 00:00:00), keep order:false	N/A	N/A
  │ └─Selection_34(Probe)	2828591.23	2937057	cop[tikv]		time:7.13s, loops:3353, cop_task: {num: 228, max: 137.9ms, min: 314.1μs, avg: 33.8ms, p95: 88.6ms, max_proc_keys: 20480, p95_proc_keys: 20480, tot_proc: 5.94s, tot_wait: 25ms, rpc_num: 228, rpc_time: 7.69s, copr_cache_hit_ratio: 0.26}, tikv_task:{proc max:119ms, min:0s, p80:50ms, p95:69ms, iters:4070, tasks:228}, scan_detail: {total_process_keys: 2250639, total_keys: 2260522, rocksdb: {delete_skipped_count: 0, key_skipped_count: 4420866, block: {cache_hit_count: 562339, read_count: 4, read_byte: 196.5 KB}}}	eq(test_dbdbdb.t1.merchant_id, "dwj"), in(test_dbdbdb.t1.status, 0, 1, 2, 4)	N/A	N/A
  │   └─TableRowIDScan_33	3046331.56	3111057	cop[tikv]	table:ags	tikv_task:{proc max:115ms, min:0s, p80:46ms, p95:67ms, iters:4070, tasks:228}	keep order:false	N/A	N/A
  └─IndexReader_12(Probe)	1.00	1015401	root		time:8.45s, loops:5740, cop_task: {num: 97531, max: 71.9ms, min: 219μs, avg: 838.3μs, p95: 1.39ms, max_proc_keys: 57, p95_proc_keys: 19, tot_proc: 30.8s, tot_wait: 4.72s, rpc_num: 97533, rpc_time: 1m20.6s, copr_cache_hit_ratio: 0.00}	index:IndexRangeScan_11	875 Bytes	N/A
    └─IndexRangeScan_11	1.00	1015401	cop[tikv]	table:ar, index:PRIMARY(t2_id)	tikv_task:{proc max:19ms, min:0s, p80:1ms, p95:1ms, iters:98033, tasks:97531}, scan_detail: {total_process_keys: 1015292, total_keys: 1015295, rocksdb: {delete_skipped_count: 0, key_skipped_count: 0, block: {cache_hit_count: 9203629, read_count: 23, read_byte: 3.96 MB}}}	range: decided by [eq(test_dbdbdb.t2.t2_id, test_dbdbdb.t1.t2_id)], keep order:false	N/A	N/A
Check the Grafana monitoring: cluster-tidb --> KV Request and cluster-tikv-details --> gRPC.
Compare the durations in those two panels to rule out a network problem first.

What network problem? Some of those peaks are from scheduled jobs running in the early morning and from my dumpling backups of the large database.

The problem reported in the original post can show up at any time of day.
However slow things get, this join on tens of millions of rows should not take 15 hours.

It feels like one of the index join's inner loops is going wrong.

I'm worried sick and completely baffled.

The difference between these two metrics is essentially the network. You didn't capture the average for the KV side; from the current monitoring, the max cop durations differ, so I suspect your network is unstable.

1. Generally I don't go looking for faults at the infrastructure level: if the network were unstable, far more than this one SQL statement would be affected.

2. The high-latency points in the screenshot are from the early morning, when I run backup jobs. During dumpling the NIC pushes several hundred megabytes of traffic per second.

3. Look at the chart below, from a morning period with normal business traffic: does this duration look like an unstable underlying network?

Could you post the schemas of the two tables, along with their row counts?

Table ags: ~70 million rows
Table ar: ~30 million rows

Having looked at the table structure and the SQL, you need to add a composite index on (activity_record_id, merchant_id, create_time).
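
For example, something along these lines (a sketch: it just spells out the index suggested above; the index name is made up and t1 is the anonymized table name from the query, so adjust both to the real schema):

-- Composite index suggested above: join key first, then the filtered columns
ALTER TABLE t1 ADD INDEX idx_arid_mid_ct (activity_record_id, merchant_id, create_time);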

Even without that index it shouldn't take 10-plus hours. The index wasn't there before the upgrade either; if it has to scan the table, let it scan, the performance could handle it.

What version were you on before the upgrade? Also take a look at SHOW STATS_HEALTHY.
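
For example (a sketch; t1 and t2 are the anonymized table names from the query):

-- Check statistics health for the two tables in the join
SHOW STATS_HEALTHY WHERE Table_name IN ('t1', 't2');
-- If the health score is low, refresh the statistics and retry the query
ANALYZE TABLE t1;
ANALYZE TABLE t2;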

It shouldn't be a statistics problem, because the cardinality estimates are accurate and the plan itself looks fine. It really does look like the Index Join is getting stuck.
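
One low-risk way to confirm that hypothesis, as a sketch (HASH_JOIN is a standard TiDB optimizer hint; the aliases match the original query), is to steer this one statement away from the index join and compare:

-- Force a hash join instead of the index join for this statement only
SELECT /*+ HASH_JOIN(ags, ar) */
  count(1)
FROM
  t1 ags
  LEFT JOIN t2 ar ON ags.t2_id = ar.t2_id
WHERE
  ags.status IN (0, 1, 2, 4)
  AND ags.merchant_id = 'dwj'
  AND ags.create_time >= date_format('2021-10-22 00:00:00', '%Y-%m-%d')
  AND ags.create_time < date_format('2021-12-22 00:00:00', '%Y-%m-%d');

If this comes back in seconds while the original still takes hours, the regression is narrowed down to the IndexJoin inner side.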

Healthy as can be; every table is either 99 or 100.