【是否原创】是
【首发渠道】TiDB 社区
【正文】
1、环境
TiDB版本:v5.0.4
本文来源于案例:V5.0.4 SPM没有生效 - #26,来自 dockerfile
2、场景
程序运行过程中,TiDB大量慢SQL,查看dashboard的慢SQL执行计划,发现走的错误索引,将该慢SQL复制出来到客户端运行正常,查询确实走了SPM。
总结:使用SPM bingding了SQL,但是出现了大量的SQL,查看dashboard的慢SQL执行计划,发现没有走SPM,但是从dashboard复制出现的SQL到客户端执行,又是可以走SMP的。
3、排查分析
3.1、基础信息
表Schema:
CREATE TABLE `xxxx_extra` (
`id` varchar(50) NOT NULL COMMENT '主键',
`merchant_id` varchar(50) DEFAULT NULL,
`shop_id` varchar(50) DEFAULT NULL,
`order_code` varchar(50) DEFAULT NULL,
`field_name` varchar(50) DEFAULT NULL,
`field_value` varchar(4000) DEFAULT NULL,
`create_time` datetime DEFAULT NULL,
PRIMARY KEY (`id`) /*T![clustered_index] NONCLUSTERED */,
KEY `idx_order_extra_2` (`order_code`),
KEY `idx_order_extra_1` (`shop_id`,`create_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
dashboard慢SQL的执行计划(执行计划1)
SELECT
t.order_code,
t.coin_num
FROM
(
SELECT
order_code,
( sum( CASE field_name WHEN 'sale_quantity' THEN field_value ELSE 0 END ) - sum( CASE field_name WHEN 'give_quantity' THEN field_value ELSE 0 END ) ) coin_num
FROM
XXXXXTB
WHERE
shop_id = 'eb14a985-f788-48b2-a73b-738e08ef158e'
AND create_time >= "2021-10-04 00:00:00"
AND create_time < "2021-10-05 00:00:00" GROUP BY order_code ) t WHERE t.coin_num > 0;
id task estRows operator info actRows execution info memory disk
Projection_12 root 1 XXXXXDB.XXXXXTB.order_code, minus(Column#9, Column#10)->Column#11 1 time:20m17.4s, loops:2, Concurrency:OFF 1.34 KB N/A
└─Limit_15 root 1 offset:0, count:1 1 time:20m17.4s, loops:2 N/A N/A
└─Selection_16 root 1 gt(minus(Column#9, Column#10), 0) 1 time:20m17.4s, loops:1 41.4 KB N/A
└─StreamAgg_21 root 1 group by:Column#26, funcs:sum(Column#23)->Column#9, funcs:sum(Column#24)->Column#10, funcs:firstrow(Column#25)->XXXXXDB.XXXXXTB.order_code 771 time:20m17.4s, loops:1 84.0 KB N/A
└─Projection_41 root 1.25 cast(case(eq(XXXXXDB.XXXXXTB.field_name, sale_quantity), XXXXXDB.XXXXXTB.field_value, 0), double BINARY)->Column#23, cast(case(eq(XXXXXDB.XXXXXTB.field_name, give_quantity), XXXXXDB.XXXXXTB.field_value, 0), double BINARY)->Column#24, XXXXXDB.XXXXXTB.order_code, XXXXXDB.XXXXXTB.order_code 18196 time:20m17.4s, loops:19, Concurrency:OFF 302.6 KB N/A
└─Projection_40 root 1.25 XXXXXDB.XXXXXTB.shop_id, XXXXXDB.XXXXXTB.order_code, XXXXXDB.XXXXXTB.field_name, XXXXXDB.XXXXXTB.field_value, XXXXXDB.XXXXXTB.create_time 18196 time:20m17.4s, loops:19, Concurrency:OFF 819.2 KB N/A
└─IndexLookUp_39 root 1.25 18196 time:20m17.4s, loops:19, index_task: {total_time: 20m17.1s, fetch_handle: 2m54.6s, build: 8m55.9s, wait: 8m26.6s}, table_task: {total_time: 1h41m25.7s, num: 56723, concurrency: 5} 265.8 MB N/A
├─IndexFullScan_36 cop[tikv] 17452.40 table:XXXXXTB, index:idx_XXXXXTB_2(order_code), keep order:true 1161519342 time:1.87s, loops:1134751, cop_task: {num: 1125, max: 5.4s, min: 643.1μs, avg: 851.6ms, p95: 1.48s, max_proc_keys: 1448429, p95_proc_keys: 1334907, tot_proc: 13m0s, tot_wait: 297ms, rpc_num: 1133, rpc_time: 15m58s, copr_cache_hit_ratio: 0.00}, backoff{regionMiss: 10ms}, tikv_task:{proc max:2.48s, min:163ms, p80:757ms, p95:1.04s, iters:1139217, tasks:1125}, scan_detail: {total_process_keys: 1160620563, total_keys: 1160623468, rocksdb: {delete_skipped_count: 0, key_skipped_count: 1178721014, block: {cache_hit_count: 904535, read_count: 136, read_byte: 851.0 KB}}} N/A N/A
└─Selection_38 cop[tikv] 1.25 eq(XXXXXDB.XXXXXTB.shop_id, "eb14a985-f788-48b2-a73b-738e08ef158e"), ge(XXXXXDB.XXXXXTB.create_time, 2021-10-04 00:00:00.000000), lt(XXXXXDB.XXXXXTB.create_time, 2021-10-05 00:00:00.000000) 18196 time:55m47.9s, loops:56963, cop_task: {num: 6648294, max: 4.33s, min: 219.3μs, avg: 2.99ms, p95: 9.1ms, max_proc_keys: 21022, p95_proc_keys: 63, tot_proc: 1h5m30.9s, tot_wait: 25m6.9s, rpc_num: 6648390, rpc_time: 5h29m59.6s, copr_cache_hit_ratio: 0.00}, backoff{regionMiss: 54ms, staleCommand: 2ms}, tikv_task:{proc max:386ms, min:0s, p80:1ms, p95:1ms, iters:9755705, tasks:6648294}, scan_detail: {total_process_keys: 1160441875, total_keys: 1185259460, rocksdb: {delete_skipped_count: 0, key_skipped_count: 1141620284, block: {cache_hit_count: 468441273, read_count: 21828, read_byte: 394.1 MB}}} N/A N/A
└─TableRowIDScan_37 cop[tikv] 17452.40 table:XXXXXTB, keep order:false 1161519342 tikv_task:{proc max:384ms, min:0s, p80:1ms, p95:1ms, iters:9755705, tasks:6648294}, scan_detail: {total_process_keys: 0, total_keys: 0, rocksdb: {delete_skipped_count: 0, key_skipped_count: 0, block: {cache_hit_count: 0, read_count: 0, read_byte: 0 Bytes}}} N/A N/A
把dashboard上的慢SQL拷贝到客户端,手动执行SQL,对应执行计划(执行计划2)
EXPLAIN ANALYZE SELECT
t.order_code,
t.coin_num
FROM
(
SELECT
order_code,
( sum( CASE field_name WHEN 'sale_quantity' THEN field_value ELSE 0 END ) - sum( CASE field_name WHEN 'give_quantity' THEN field_value ELSE 0 END ) ) coin_num
FROM
XXXXXTB
WHERE
shop_id = 'eb14a985-f788-48b2-a73b-738e08ef158e'
AND create_time >= "2021-10-04 00:00:00"
AND create_time < "2021-10-05 00:00:00" GROUP BY order_code ) t WHERE t.coin_num > 0;
Projection_9 53242.86 218 root time:85ms, loops:2, Concurrency:5 XXXXXDB.XXXXXTB.order_code, minus(Column#9, Column#10)->Column#11 21.3 KB N/A
└─Selection_10 53242.86 218 root time:84.6ms, loops:2 gt(minus(Column#9, Column#10), 0) 9.54 KB N/A
└─HashAgg_16 66553.58 771 root time:84ms, loops:7, partial_worker:{wall_time:83.484628ms, concurrency:5, task_num:1, tot_wait:406.721842ms, tot_exec:1.878203ms, tot_time:408.795545ms, max:83.418615ms, p95:83.418615ms}, final_worker:{wall_time:84.332323ms, concurrency:5, task_num:5, tot_wait:417.21916ms, tot_exec:3.550493ms, tot_time:420.781142ms, max:84.297101ms, p95:84.297101ms} group by:XXXXXDB.XXXXXTB.order_code, funcs:sum(Column#14)->Column#9, funcs:sum(Column#15)->Column#10, funcs:firstrow(XXXXXDB.XXXXXTB.order_code)->XXXXXDB.XXXXXTB.order_code 341.2 KB N/A
└─IndexLookUp_17 66553.58 776 root time:81.3ms, loops:2, index_task: {total_time: 24ms, fetch_handle: 24ms, build: 4.63μs, wait: 27.1μs}, table_task: {total_time: 275.9ms, num: 5, concurrency: 5} 480.6 KB N/A
├─IndexRangeScan_14(Build) 83191.98 18196 cop[tikv] table:XXXXXTB, index:idx_XXXXXTB_1(shop_id, create_time) time:22.2ms, loops:20, cop_task: {num: 1, max: 22.1ms, proc_keys: 18196, tot_proc: 20ms, rpc_num: 1, rpc_time: 22.1ms, copr_cache_hit_ratio: 0.00}, tikv_task:{time:20ms, loops:22}, scan_detail: {total_process_keys: 18196, total_keys: 18197, rocksdb: {delete_skipped_count: 0, key_skipped_count: 18196, block: {cache_hit_count: 23, read_count: 1, read_byte: 11.4 KB}}} range:["eb14a985-f788-48b2-a73b-738e08ef158e" 2021-10-04 00:00:00,"eb14a985-f788-48b2-a73b-738e08ef158e" 2021-10-05 00:00:00), keep order:false N/A N/A
└─HashAgg_12(Probe) 66553.58 776 cop[tikv] time:113.2ms, loops:10, cop_task: {num: 56, max: 39.9ms, min: 1.34ms, avg: 9.37ms, p95: 23ms, max_proc_keys: 2072, p95_proc_keys: 909, tot_proc: 419ms, tot_wait: 16ms, rpc_num: 56, rpc_time: 522.9ms, copr_cache_hit_ratio: 0.00}, tikv_task:{proc max:36ms, min:0s, p80:14ms, p95:20ms, iters:58, tasks:56}, scan_detail: {total_process_keys: 18196, total_keys: 21419, rocksdb: {delete_skipped_count: 0, key_skipped_count: 9232, block: {cache_hit_count: 113866, read_count: 32, read_byte: 545.0 KB}}} group by:XXXXXDB.XXXXXTB.order_code, funcs:sum(case(eq(XXXXXDB.XXXXXTB.field_name, "sale_quantity"), XXXXXDB.XXXXXTB.field_value, "0"))->Column#14, funcs:sum(case(eq(XXXXXDB.XXXXXTB.field_name, "give_quantity"), XXXXXDB.XXXXXTB.field_value, "0"))->Column#15 N/A N/A
└─TableRowIDScan_15 83191.98 18196 cop[tikv] table:XXXXXTB tikv_task:{proc max:34ms, min:0s, p80:13ms, p95:19ms, iters:58, tasks:56}, scan_detail: {total_process_keys: 0, total_keys: 0, rocksdb: {delete_skipped_count: 0, key_skipped_count: 0, block: {cache_hit_count: 0, read_count: 0, read_byte: 0 Bytes}}} keep order:false N/A N/A
3.2、可能性分析
分析有两个可能性:
- 1、用户是否设置 sql_select_limit ?如果设置了这个,会导致 SPM 不生效,这个是一个已知问题,目前已经在解决中了,issue见 planner: binding cannot work when `sql_select_limit` is enabled · Issue #27949 · pingcap/tidb · GitHub
- 2、用户运行 query 和 binding 的字符集是不是一致?如果不一致也可能导致 binding 匹配不上;
- 3、从上面执行计划1与执行计划2对比来看,执行计划里面多了一个Limit_15的算子,说明由程序发过来的SQL中包含有limit的算子,而手动在客户端执行的SQL没有limit算子,所以猜想是由于程序端可能设置session变量sql_select_limit,或者是框架中初始化启动的时候设置了该参数。
3.3、给用户的排查建议
- 1、使用抓包工具,开启抓包,然后重启一个应用程序,停止抓包,检索是否包含sql_select_limit的关键字。
- 2、通过使用他们的驱动,写一个简单的程序,执行SQL:show variable like ‘%select_limit%’ ,查看返回的值是否不是默认值:。
- 3、先binding一个简单的SQL,比如:select * from table where a=1 类似的SQL,然后使用连接数据库的驱动,比如jdbc等,写一个简单的程序,看是否可以命中前面的binding,如果不能命中,那说明不是本次案例复杂SQL的问题,因为需要排除SQL是否存在回车、换行等差异导致没有命中binding。
3.4、实锤
通过抓包工具的排查手段,的确找到相关关键字:
基础分析来源,最终在kettle自带的jdbc中找到配置项:
3.5、根因
- 1、存在一个已知的BUG:设置了session的sql_select_limit变量,在发送的SQL就会相当于加了limit,导致即使binding了,但是不会执行binding。
- 2、用户虽然没有在代码中设置session的sql_select_limit变量,但是在kettle自带的jdbc中设置了这个配置项,触发了上面的BUG。
4、解决方案
- 1、用户通过Optimizer Hints方式修改程序代码中的 SQL语句,临时解决;
- 2、找到问题根因是由于kettle自带的jdbc设置了sql_select_limit参数导致,去掉框架启动参数;
- 3、等PR出来后升级版本解决这个BUG;issue 链接:planner: binding cannot work when `sql_select_limit` is enabled · Issue #27949 · pingcap/tidb · GitHub