TiDB SQL调优实战——索引问题

【是否原创】 否
【首发渠道】知乎
【首发渠道链接】 TiDB SQL调优实战——索引问题 - 知乎 (zhihu.com)
【正文】

一、前言

前一阵子,迫于业务压力,将一套拥有两个多TB数据的,并且稳定运行两年多的,TiDB v2.1的集群升级到TiDB v4.0,希望能快一点,的确,升级之后,解决了大多数问题,查询速度的确比平差快了很多。但是有些业务还是没有太大改善。决定单独将那些SQL单独拿出来做一些优化,看看问题出在哪。

这里我就拿两个比较典型的SQL语句来说明一下,两条都是与索引息息相关的SQL。

二、例子

  1. 问题SQL 1
SELECT
    a.SOID ID,
    a.SA,
    a.SB,
    SUM( a.AMOUNT ) AMOUNT,
    SUM( a.BM ) BM
FROM
(
    SELECT
        haso.SOID,
        haso.SA,
        haso.SB,
        SUM( hpar.A_M ) AMOUNT,
        SUM( hpar.A_BM ) BM
    FROM
        HPAR hpar
    INNER JOIN 
        HASO haso ON hpar.SOID = haso.SOID 
        AND haso.WO_STATUS = 'P'
        AND hpar.DISPLAY_FLAG = 'Y'
    GROUP BY
        haso.SOID 

    UNION ALL

    SELECT
        haso.SOID,
        haso.SA,
        haso.SB,
        SUM( hpar.TARGET_AM ) AMOUNT,
        SUM( hpar.TARGET_A_BM ) BM
    FROM
        HPAR hpar
    INNER JOIN 
    	HASO haso ON hpar.TARGET_SOID = haso.SOID
        AND haso.WO_STATUS = 'P'
        AND hpar.DISPLAY_FLAG = 'Y'
    GROUP BY
        haso.SOID
) a
GROUP BY
    a.SOID;

这条SQL语句是业务系统启动时的脚本SQL,其中hasr这张表在这两年多的时间积累里面,已经有两亿条数据,hapo也有将近两亿的数据。随着时间的推移,业务系统启动的时间也越来越长,甚至偶尔会出现超时的情况。升级集群虽然有所改善,但还是慢,于是决定单独将这条SQL拿出来跑。真是不跑不知道,一跑吓一跳,平均查询耗时600s,最长甚至跑了1000多s才跑出结果。

image

问题SQL 1排查过程

当时排查这个问题的时候,决定将这种长的还包含union all 的SQL语句拆开,从子查询开始入手,一段一段的拆开优化。

于是,针对这段子查询,explain analyze查看其执行计划。

EXPLAIN ANALYZE SELECT
        haso.SOID,
        haso.SA,
        haso.SB,
        SUM( hpar.A_M ) AMOUNT,
        SUM( hpar.A_BM ) BM
    FROM
        HPAR hpar
    INNER JOIN 
        HASO haso ON hpar.SOID = haso.SOID 
        AND haso.WO_STATUS = 'P'
        AND hpar.DISPLAY_FLAG = 'Y'
    GROUP BY
        haso.SOID 

image

能够发现,罪魁祸首出在扫描表hpar的时候,走的是TableFullScan,全表扫描数据,扫了两亿多行的数据,大大的拖慢了整个SQL的执行时间。那么问题来了,为什么走的是TableFullScan呢?因为不论是hpar的 SOID字段还是DISPLAY_FLAG字段均建立了索引,但从上面的执行计划上来看,是没有命中索引。

问题SQL 1解决办法

没有命中索引,第一时间想到的就是统计信息的缺失,于是决定排查统计信息,通过 show stats_healthy 查看该表的健康度,发现健康度只有54。果不其然,表健康度太低,统计信息不全,导致索引失效,带来的结果就是在查询时并没有命中索引,走的是TableFullScan,全表扫描,耗费大量的时间以及资源。最后运行 analyze HPAR 刷新统计信息,问题得以解决。将所有涉及到的表统计信息全都刷新了一遍之后,该条SQL最终的运行速度稳定在40s以内,有了显著提升。

  1. 问题SQL 2
SELECT
	DATE_FORMAT(a.A_DATE,'%Y-%m-%d'),
	a.COMPANY_NAME,
	a.CHANNEL_NAME,
	a.DEPARTMENT_NAME,
	a.ITEM_CODE,
	a.BARCODE,
	a.ITEM_NAME,
	IF('ITEM' = a.BUSINESS_UNIT,'单品','礼包'),
	a.INVOICE_TYPE,
	a.INVOICE_CODE,
	a.SPECS,
	IF('N' = a.OUT_FLAG,'否','是'),
	CONCAT(a.FIRST_CLASS,'.',a.SECOND_CLASS,'.',a.THIRD_CLASS),
	a.THIRD_CLASS,
	a.TAX_RATE,
	a.PRODUCT_NUMBER,
	SUM(
		cast(
		a.SETTLE_QUANTITY AS DECIMAL ( 18, 2 ))),
	a.ACCOUNT_NAME,
	SUM(
		cast(
		a.INPUT_AMOUNT AS DECIMAL ( 18, 2 ))),
	SUM(
		cast(
		a.SETTLE_AMOUNT AS DECIMAL ( 18, 2 ))),
	SUM(
		cast(
		a.TAX_AMOUNT AS DECIMAL ( 18, 2 )))
FROM
	(
	SELECT
		haso.SIGN_COMPANY,
		haso.DEPARTMENT_CODE,
		haso.SETTLE_MODEL,
		haso.SETTLE_ORDER_ID,
		hpssv.DESCRIPTION AS COMPANY_NAME,
		hpssv2.DESCRIPTION AS CHANNEL_NAME,
		hpssv3.DESCRIPTION AS DEPARTMENT_NAME,
		haso.A_DATE AS A_DATE,
		haso.BUSINESS_UNIT,
		hasoL.PRODUCT_NUMBER,
		hasoL.SETTLE_AMOUNT AS SETTLE_AMOUNT,
		hasoL.TAX_AMOUNT AS TAX_AMOUNT,
		hasoL.SETTLE_QUANTITY AS SETTLE_QUANTITY,
		hasoL.INPUT_AMOUNT AS INPUT_AMOUNT,
		tpii.ITEM_CODE,
		tpii.BARCODE,
		tpii.DESCRIPTION AS ITEM_NAME,
		tpii.INVOICE_TYPE,
		tpii.SPECS,
		tpii.OUT_FLAG,
		tpii.FIRST_CLASS,
		tpii.SECOND_CLASS,
		tpii.THIRD_CLASS,
		tpii.TAX_RATE,
		tpii.INVOICE_CODE,
		hpssv4.DESCRIPTION AS ACCOUNT_NAME,
		tpicm.ACCOUNT_CODE 
	FROM
		HASO haso
		LEFT JOIN HASOL hasoL ON haso.SETTLE_ORDER_ID = hasoL.SETTLE_ORDER_ID
		LEFT JOIN TPII tpii ON hasoL.PRODUCT_NUMBER = tpii.ITEM_CODE
		LEFT JOIN TPICM tpicm ON tpii.FIRST_CLASS = tpicm.FIRST_CLASS 
		AND tpii.SECOND_CLASS = tpicm.SECOND_CLASS 
		AND tpii.THIRD_CLASS = tpicm.THIRD_CLASS
		LEFT JOIN HPSSV hpssv ON haso.SIGN_COMPANY = hpssv.SEGMENT_VALUE 
		AND hpssv.TYPE_CODE = 'COM'
		LEFT JOIN HPSSV hpssv2 ON haso.SETTLE_MODEL = hpssv2.SEGMENT_VALUE 
		AND hpssv2.TYPE_CODE = 'CHANNEL'
		LEFT JOIN HPSSV hpssv3 ON haso.DEPARTMENT_CODE = hpssv3.SEGMENT_VALUE 
		AND hpssv3.TYPE_CODE = 'DEPT'
		LEFT JOIN HPSSV hpssv4 ON tpicm.ACCOUNT_CODE = hpssv4.SEGMENT_VALUE 
	WHERE
		haso.A_DATE >= '2020-10-01 00:00:00' 
		AND haso.A_DATE <= '2020-10-27 23:59:59' 
		AND haso.SIGN_COMPANY in ('3001')  
		AND haso.BUSINESS_UNIT IN ( 'ITEM', 'BAG' ) 
		UNION ALL
	SELECT
		haso.SIGN_COMPANY,
		haso.DEPARTMENT_CODE,
		haso.SETTLE_MODEL,
		haso.SETTLE_ORDER_ID,
		hpssv.DESCRIPTION AS COMPANY_NAME,
		hpssv2.DESCRIPTION AS CHANNEL_NAME,
		hpssv3.DESCRIPTION AS DEPARTMENT_NAME,
		haso.REVERSE_ACCOUNT_DATE AS A_DATE,
		haso.BUSINESS_UNIT,
		hasoL.PRODUCT_NUMBER,
		- hasoL.SETTLE_AMOUNT AS SETTLE_AMOUNT,
		- hasoL.TAX_AMOUNT AS TAX_AMOUNT,
		- hasoL.SETTLE_QUANTITY AS SETTLE_QUANTITY,
		- hasoL.INPUT_AMOUNT AS INPUT_AMOUNT,
		tpii.ITEM_CODE,
		tpii.BARCODE,
		tpii.DESCRIPTION AS ITEM_NAME,
		tpii.INVOICE_TYPE,
		tpii.SPECS,
		tpii.OUT_FLAG,
		tpii.FIRST_CLASS,
		tpii.SECOND_CLASS,
		tpii.THIRD_CLASS,
		tpii.TAX_RATE,
		tpii.INVOICE_CODE,
		hpssv4.DESCRIPTION AS ACCOUNT_NAME,
		tpicm.ACCOUNT_CODE 
	FROM
		HASO haso
		LEFT JOIN HASOL hasoL ON haso.SETTLE_ORDER_ID = hasoL.SETTLE_ORDER_ID
		LEFT JOIN TPII tpii ON hasoL.PRODUCT_NUMBER = tpii.ITEM_CODE
		LEFT JOIN TPICM tpicm ON tpii.FIRST_CLASS = tpicm.FIRST_CLASS 
		AND tpii.SECOND_CLASS = tpicm.SECOND_CLASS 
		AND tpii.THIRD_CLASS = tpicm.THIRD_CLASS
		LEFT JOIN HPSSV hpssv ON haso.SIGN_COMPANY = hpssv.SEGMENT_VALUE 
		AND hpssv.TYPE_CODE = 'COM'
		LEFT JOIN HPSSV hpssv2 ON haso.SETTLE_MODEL = hpssv2.SEGMENT_VALUE 
		AND hpssv2.TYPE_CODE = 'CHANNEL'
		LEFT JOIN HPSSV hpssv3 ON haso.DEPARTMENT_CODE = hpssv3.SEGMENT_VALUE 
		AND hpssv3.TYPE_CODE = 'DEPT'
		LEFT JOIN HPSSV hpssv4 ON tpicm.ACCOUNT_CODE = hpssv4.SEGMENT_VALUE 
	WHERE
		REVERSE_ACCOUNT_DATE IS NOT NULL 
		AND haso.REVERSE_ACCOUNT_DATE >= '2020-10-01 00:00:00' 
		AND haso.REVERSE_ACCOUNT_DATE <= '2020-10-27 23:59:59' 
		AND haso.SIGN_COMPANY in ('3001') 
		AND haso.BUSINESS_UNIT IN ( 'ITEM', 'BAG' ) 
	) a 
GROUP BY
	a.SIGN_COMPANY,
	a.DEPARTMENT_CODE,
	a.SETTLE_MODEL,
	a.A_DATE,
	a.PRODUCT_NUMBER,
	a.BUSINESS_UNIT,
	SIGN(
	a.SETTLE_AMOUNT 
	)

这条SQL语句,问题比上一个要更严重,因为压根儿就跑不出结果,跑了五六千秒还能继续跑,也不报错,继续在那跑,欸,就很离谱。由于是在navicat中运行的,后面只能通过show processlist 以及 kill tidb ‘processID’ 来终止查询。

问题SQL 2排查过程

同样的,基于上一条SQL的基本思路,将这个长SQL拆开来看,先用explain analyze看子查询的执行计划。跑的这个SQL涉及到haso、tpii、hpssv、hasol、tpicm五张表。其中,hpssv、tpii、tpicm表都不大,均只有两万多条数据,但是haso有接近两亿条数据,hasol有两千多万条。

EXPLAIN ANALYZE SELECT
		haso.SIGN_COMPANY,
		haso.DEPARTMENT_CODE,
		haso.SETTLE_MODEL,
		haso.SETTLE_ORDER_ID,
		hpssv.DESCRIPTION AS COMPANY_NAME,
		hpssv2.DESCRIPTION AS CHANNEL_NAME,
		hpssv3.DESCRIPTION AS DEPARTMENT_NAME,
		haso.A_DATE AS A_DATE,
		haso.BUSINESS_UNIT,
		hasoL.PRODUCT_NUMBER,
		hasoL.SETTLE_AMOUNT AS SETTLE_AMOUNT,
		hasoL.TAX_AMOUNT AS TAX_AMOUNT,
		hasoL.SETTLE_QUANTITY AS SETTLE_QUANTITY,
		hasoL.INPUT_AMOUNT AS INPUT_AMOUNT,
		tpii.ITEM_CODE,
		tpii.BARCODE,
		tpii.DESCRIPTION AS ITEM_NAME,
		tpii.INVOICE_TYPE,
		tpii.SPECS,
		tpii.OUT_FLAG,
		tpii.FIRST_CLASS,
		tpii.SECOND_CLASS,
		tpii.THIRD_CLASS,
		tpii.TAX_RATE,
		tpii.INVOICE_CODE,
		hpssv4.DESCRIPTION AS ACCOUNT_NAME,
		tpicm.ACCOUNT_CODE 
	FROM
		HASO haso
		LEFT JOIN HASOL hasoL ON haso.SETTLE_ORDER_ID = hasoL.SETTLE_ORDER_ID
		LEFT JOIN TPII tpii ON hasoL.PRODUCT_NUMBER = tpii.ITEM_CODE
		LEFT JOIN TPICM tpicm ON tpii.FIRST_CLASS = tpicm.FIRST_CLASS 
		AND tpii.SECOND_CLASS = tpicm.SECOND_CLASS 
		AND tpii.THIRD_CLASS = tpicm.THIRD_CLASS
		LEFT JOIN HPSSV hpssv ON haso.SIGN_COMPANY = hpssv.SEGMENT_VALUE 
		AND hpssv.TYPE_CODE = 'COM'
		LEFT JOIN HPSSV hpssv2 ON haso.SETTLE_MODEL = hpssv2.SEGMENT_VALUE 
		AND hpssv2.TYPE_CODE = 'CHANNEL'
		LEFT JOIN HPSSV hpssv3 ON haso.DEPARTMENT_CODE = hpssv3.SEGMENT_VALUE 
		AND hpssv3.TYPE_CODE = 'DEPT'
		LEFT JOIN HPSSV hpssv4 ON tpicm.ACCOUNT_CODE = hpssv4.SEGMENT_VALUE 
	WHERE
		haso.A_DATE >= '2020-10-01 00:00:00' 
		AND haso.A_DATE <= '2020-10-27 23:59:59' 
		AND haso.SIGN_COMPANY in ('3001')  
		AND haso.BUSINESS_UNIT IN ( 'ITEM', 'BAG' )

然后,就遇到一个问题,发现这个子查询也同样跑几千秒跑不出结果,无奈之下,只能结合 show processlist与explain for connection ‘processID’ 来查看正在跑的这段SQL的执行计划。

其实,从这个执行计划中我们能够发现,问题就出在最后的四个表的扫描上。

这四个表的扫描走的均是IndexLookUpReader,也就是说使用了索引的方式来扫描表中数据,但是四个表,扫描的时长,着实显得非常的不合理。没办法,一张表一张表的来看吧。先看tpii,前面提到过,这个表的数据只有两万多,但是使用ITEM_CODE这个索引扫描,花了很长的时间还没扫描结束。于是展开execution Info查看执行的详细信息,发现对于一个两万多数据的表,读这个表的时候扫了63W+的keys,接近64W。

看来问题终于是找到了,扫描的key太多,以至于SQL执行不成功,甚至都不能执行出结果。到了这里,不禁反问自己,为什么会扫描这么多的key呢?

看上面TiDB官方文档关于其部分索引算子的一些解释。在最开始的执行计划里面,能看到扫描tpii这张表,用的是IndexLookUpReader,比一般的IndexReader多了一步回表操作。那么很明显,在扫表时,错误的使用ITEM_CODE索引,导致做了没有必要的回表步骤,导致最后扫表的key的数量超出了认知,更拖慢了SQL查询的时间。

问题SQL 2解决办法

使用hint,在子查询 select后面添加 /*+ IGNORE_INDEX() */的hint

SELECT /*+ IGNORE_INDEX(tpii,ITEM_CODE)*/
	...

同样的,使用 /*+ IGNORE_INDEX() */的hint 来忽视 hpssv这四张表用到的索引,将hpssv、tpii的执行计划由原来的IndexLookUpReader改变陈TableFullScan,这样一来,SQL的运行速度比之前快了几十倍不止,稳定之后,平均执行时间只有10s不到,相比以前,要快了100倍。

三、总结

从这次SQL调优中,影射出很多系统中关于不当使用TiDB的问题。

但是话又说回来了,毕竟这是一个两年前的系统,当年开发这套系统的人都已经离职了,也许当时数据量不大,集群压力小,并且对于TiDB的了解也仅仅止步于 高度兼容MySQL5.7协议的分布式数据库,对于后面的这种问题也不会考虑到这么全面,写出来的SQL语句当时能跑,并且能稳定的跑出结果,就烧高香,万事大吉了,哪还管身后洪水滔天的,再加上平时也缺乏专门的数据库运维人员,对于这方面也缺乏重视,所以出现类似于今天SQL调优的这种问题也见怪不怪了。

很多人明明有了汽车,却用马拉着车,自己躺在车里面,气愤于别人骑着摩托车超过自己,最后赖这汽车不行。TiDB是一个很好的产品,但是怎么正确的,遵循TiDB原理的去使用这款产品,也显得尤为重要。

在AskTug中,TiDB 数据库开发规范 - 技术文章 - AskTUG,这篇文章推荐读一读,关于使用TiDB数据库的一些东西讲的挺好的,我也受益颇多。

8赞