mysql> explain analyze select * from CLUSTER_SLOW_QUERY order by Time desc limit 1;
+----------------------------+---------+---------+-----------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------+-------------------------------------+---------+------+
| id | estRows | actRows | task | access object | execution info | operator info | memory | disk |
+----------------------------+---------+---------+-----------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------+-------------------------------------+---------+------+
| Limit_11 | 1.00 | 1 | root | | time:174.7ms, loops:2 | offset:0, count:1 | N/A | N/A |
| └─TableReader_21 | 1.00 | 1 | root | | time:174.7ms, loops:1, cop_task: {num: 1, max: 174.6ms, proc_keys: 0, rpc_num: 1, rpc_time: 174.5ms, copr_cache_hit_ratio: 0.00} | data:Limit_20 | 3.01 KB | N/A |
| └─Limit_20 | 1.00 | 0 | cop[tidb] | | | offset:0, count:1 | N/A | N/A |
| └─TableFullScan_19 | 1.00 | 0 | cop[tidb] | table:CLUSTER_SLOW_QUERY | | keep order:true, desc, stats:pseudo | N/A | N/A |
+----------------------------+---------+---------+-----------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------+-------------------------------------+---------+------+
4 rows in set (0.17 sec)
从执行计划看,它使用的 limit 算子,并且 limit 算子下推到各个 tidb server 节点,那么它的执行逻辑是:
- 接收到 cop[tidb] 处理请求的 tidb server,根据下推的 limit 算子 ”order by Time desc limit 1“ 返回 1 条记录
- 上层的 limit 算子调用 next 函数,读取到 1 条记录就满足 limit 1 要求了,这个时候执行就结束
也就是说,这里实际执行结果逻辑上是:只在单个 tidb server 日志上的 order by Time desc limit 1,并不是 cluster 的。
mysql> explain analyze select * from CLUSTER_SLOW_QUERY order by Time desc limit 10000, 1;
+--------------------------+----------+---------+-----------+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------+----------+------+
| id | estRows | actRows | task | access object | execution info | operator info | memory | disk |
+--------------------------+----------+---------+-----------+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------+----------+------+
| TopN_9 | 1.00 | 1 | root | | time:4.75s, loops:2 | information_schema.cluster_slow_query.time:desc, offset:10000, count:1 | 183.1 MB | N/A |
| └─TableReader_18 | 10000.00 | 85415 | root | | time:4.3s, loops:85, cop_task: {num: 5, max: 4.52s, min: 101.6ms, avg: 1.8s, p95: 4.52s, rpc_num: 5, rpc_time: 9.02s, copr_cache_hit_ratio: 0.00} | data:TableFullScan_17 | 308.8 MB | N/A |
| └─TableFullScan_17 | 10000.00 | 0 | cop[tidb] | table:CLUSTER_SLOW_QUERY | | keep order:false, stats:pseudo | N/A | N/A |
+--------------------------+----------+---------+-----------+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------+----------+------+
3 rows in set (4.76 sec)
查询条件改为 order by Time desc limit 10000, 1; 执行计划变为了 topN,这取出了所有的慢日志进行了全局 topN 排序,选出 1 条,这个是真正意义上的 cluster。
所以从实际执行计划看,从 CLUSTER_SLOW_QUERY 查询出来的结果,有时候并不是真正的 cluster 意义的结果。