TiFlash load is unbalanced: how to troubleshoot

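foxchan · October 19, 2021 06:09

[TiDB environment]
Many tables have TiFlash replicas, used mainly for data statistics and reports.

[Symptoms] Business and database symptoms

[Problem] All nodes have the same region weight, yet the same few nodes always show high CPU usage. I suspect a region hotspot. How can I troubleshoot it?

[TiDB version] v5.1.2

[Attachments] Relevant logs and monitoring (https://metricstool.pingcap.com/)

The load stays pinned to nodes 188 and 141; the other TiFlash nodes carry no load.

Slow queries during this period: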

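听风吹雨 · October 19, 2021 06:38

The time range selected in your slow SQL screenshot does not match the abnormal period in the monitoring graph. Please check the slow SQL for that period instead.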

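foxchan · October 19, 2021 06:49

I suspect there is a hot region. How should I troubleshoot it? Or, put differently, how can I make the TiFlash region distribution more balanced? All TiFlash nodes have the same configuration and the same region weight.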

yilong · October 19, 2021 09:41

You can refer to this post to first identify which table is causing the load, and please also capture the TiFlash profile information. Thanks.

foxchan · October 20, 2021 03:44

Profile information:
profile
Still looking into the SQL.

heming · October 21, 2021 07:58

TIDB_HOT_REGIONS does not show anything useful either.

The CPU time curve matches the Read Index OPS curve.

heming · October 21, 2021 08:35

SELECT
  cat.creative_agent_tt_Id,
  cat.agent_material_md5
FROM
  creative_agent_tt cat
WHERE
  cat.agent_creative_id = '171413xxx9480'
  AND cat.end_time IS NULL;
This SQL has an index available, but the execution plan goes to TiFlash.
I have asked them to try forcing the index. Doesn't the optimizer apply cost-based optimization (CBO) when choosing this plan?
EXPLAIN SELECT
  cat.creative_agent_tt_Id,
  cat.agent_material_md5
FROM
  creative_agent_tt cat FORCE INDEX (agent_creative_id)
WHERE
  cat.agent_creative_id = '171413xxx9480'
  AND cat.end_time IS NULL;
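
Besides FORCE INDEX, TiDB offers the READ_FROM_STORAGE optimizer hint and the tidb_isolation_read_engines session variable for steering a query away from TiFlash. The following is a minimal sketch that reuses the table alias and the elided literal from the query above; whether TiKV is the right engine for this statement is an assumption to confirm with explain analyze.

```sql
-- Ask the optimizer to read table alias `cat` from TiKV for this statement only.
EXPLAIN SELECT /*+ READ_FROM_STORAGE(TIKV[cat]) */
    cat.creative_agent_tt_Id,
    cat.agent_material_md5
FROM creative_agent_tt cat
WHERE cat.agent_creative_id = '171413xxx9480'
  AND cat.end_time IS NULL;

-- Alternatively, exclude TiFlash for every statement in the current session.
-- This affects only this session; reporting sessions can keep using TiFlash.
SET SESSION tidb_isolation_read_engines = 'tikv,tidb';
```

The hint is scoped to one statement, so it suits a single point lookup inside an application that otherwise relies on TiFlash for analytics.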

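听风吹雨 · October 21, 2021 08:48

Please provide the following information:
Cluster information from tiup cluster display
The TiFlash metrics from the Grafana TiFlash-Summary dashboard for the abnormal period
All TiFlash logs, including tiflash.log / tiflash_tikv.log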

听风吹雨 · October 21, 2021 08:49

Please provide the execution plan of this SQL from explain analyze.

heming · October 21, 2021 08:51

If this is only an execution plan issue, do you still need the TiFlash logs?
Execution plan on TiFlash:

Execution plan on TiKV with the forced index:

听风吹雨 · October 25, 2021 01:49

Please provide the following information:

Information on all stores: select store_id,address,store_state_name,label,leader_count,region_count,region_weight,region_score,region_size from information_schema.tikv_store_status;
Data distribution of the table: select c.type, a.store_id, a.address, a.db_name, a.table_name, a.is_leader, a.is_index, a.cnt from (select r.db_name, r.table_name, r.store_id, s.address, r.is_index, r.is_leader, count(*) as cnt from (select s.region_id, s.db_name, s.table_name, s.is_index, p.store_id, p.is_leader, p.status from information_schema.tikv_region_status s,information_schema.tikv_region_peers p where db_name ='<database name>' and table_name='<table name>' and s.region_id = p.region_id order by p.store_id) as r, information_schema.tikv_store_status s where r.store_id=s.store_id group by r.db_name,r.table_name, r.store_id, r.is_leader, s.address, r.is_index) a, information_schema.cluster_info c where c.instance = a.address order by c.type desc, a.store_id;
Estimated versus actual execution plan of the slow SQL: explain analyze <slow SQL>
Detailed execution information: trace <slow SQL>, for example trace select * from test1;

听风吹雨 · October 25, 2021 01:50

Is your question actually the same as the original poster's? If not, please open a new thread. Thanks.

foxchan · October 25, 2021 02:11

It is this SQL, run periodically in batches, that causes the high TiFlash CPU usage.

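foxchan · October 25, 2021 02:19

For the explain output, see the earlier post.

trace of the slow SQL: the trace now uses the index as expected; previously it did a full table scan.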

听风吹雨 · October 25, 2021 02:27

What I asked for is explain analyze on the slow SQL. You ran explain earlier, not explain analyze.

heming · October 25, 2021 02:36

heming · October 25, 2021 02:59

This may be related to the setting tidb_opt_cpu_factor=10000.
After set session tidb_opt_cpu_factor=10; the optimizer prefers the index.
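
The effect of the cost factor can be checked within a single session before changing anything cluster-wide. This is a rough sketch that assumes the query posted earlier in the thread, with its literal left elided as in the original; the value 10 is simply the one tried above, not a recommended setting.

```sql
-- Inspect the value currently in effect for this session and globally.
SHOW VARIABLES LIKE 'tidb_opt_cpu_factor';
SHOW GLOBAL VARIABLES LIKE 'tidb_opt_cpu_factor';

-- Plan chosen under the current setting.
EXPLAIN SELECT cat.creative_agent_tt_Id, cat.agent_material_md5
FROM creative_agent_tt cat
WHERE cat.agent_creative_id = '171413xxx9480'
  AND cat.end_time IS NULL;

-- Lower the CPU cost factor for this session only, then compare the plan.
SET SESSION tidb_opt_cpu_factor = 10;

EXPLAIN SELECT cat.creative_agent_tt_Id, cat.agent_material_md5
FROM creative_agent_tt cat
WHERE cat.agent_creative_id = '171413xxx9480'
  AND cat.end_time IS NULL;
```

Comparing the task column of the two plans (cop[tiflash] versus cop[tikv]) shows whether the factor alone flips the engine choice.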

foxchan · October 25, 2021 03:55

With TiDB 5.1.1 the query uses the index; with 5.1.2 it does a full table scan.

id	estRows	actRows	task	access object	execution info	operator info	memory	disk
Projection_4	23.26	1	root		time:954.8ms, loops:2, Concurrency:OFF	yixintui_operate.creative_agent_tt.creative_agent_tt_id, yixintui_operate.creative_agent_tt.agent_material_md5	2.20 KB	N/A
└─TableReader_10	23.26	1	root		time:954.8ms, loops:2, cop_task: {num: 36, max: 625ms, min: 83.1ms, avg: 314.1ms, p95: 604.3ms, rpc_num: 36, rpc_time: 11.3s, copr_cache_hit_ratio: 0.00}	data:Selection_9	408 Bytes	N/A
  └─Selection_9	23.26	1	cop[tiflash]		tiflash_task:{proc max:623ms, min:82.5ms, p80:427.1ms, p95:599.3ms, iters:1, tasks:36, threads:36}	eq(yixintui_operate.creative_agent_tt.agent_creative_id, "1714123543472237"), isnull(yixintui_operate.creative_agent_tt.end_time)	N/A	N/A
    └─TableFullScan_8	23261061.00	23427983	cop[tiflash]	table:cat	tiflash_task:{proc max:609ms, min:70.5ms, p80:416.1ms, p95:577.3ms, iters:431, tasks:36, threads:36}	keep order:false, stats:pseudo	N/A	N/A
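
The TableFullScan_8 row reports stats:pseudo, which means the optimizer estimated this plan from pseudo statistics rather than collected ones, and that can skew the choice between an index lookup and a TiFlash scan. A possible check, sketched under the assumption that the schema and table names are the ones shown in the plan (yixintui_operate.creative_agent_tt):

```sql
-- When were statistics last updated, and how many rows have changed since?
SHOW STATS_META
WHERE db_name = 'yixintui_operate' AND table_name = 'creative_agent_tt';

-- Health score of the statistics (0 to 100; low values suggest stale statistics).
SHOW STATS_HEALTHY
WHERE db_name = 'yixintui_operate' AND table_name = 'creative_agent_tt';

-- Recollect statistics. On a table of about 23 million rows this consumes
-- cluster resources, so consider running it outside peak hours.
ANALYZE TABLE yixintui_operate.creative_agent_tt;
```

After the ANALYZE completes, rerunning explain analyze on the slow SQL shows whether stats:pseudo has disappeared and whether the plan changes.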

TiDB 5.1.1 (screenshot)

heming · October 25, 2021 04:03

TiDB v5.3.0 also chooses TiFlash.
On 5.3, set tidb_opt_cpu_factor=10; also makes the query use the TiKV index.

foxchan · October 26, 2021 03:02

Is the TiFlash load related to the number of replicas? I set 2 replicas by default, so only 2 TiFlash nodes are serving requests.
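
One way to look into this question is to check how many TiFlash replicas each table actually has, then compare that with the per-store distribution query posted earlier in the thread. The following is a hedged sketch; the schema name is taken from the execution plan above, and the replica count in the commented ALTER statement is only a placeholder, not a recommendation.

```sql
-- Replica count and sync status of every TiFlash-replicated table in the schema.
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica
WHERE TABLE_SCHEMA = 'yixintui_operate'
ORDER BY TABLE_NAME;

-- Changing the replica count of one table (placeholder value; this triggers
-- data replication, so evaluate disk space and load before running it).
-- ALTER TABLE yixintui_operate.creative_agent_tt SET TIFLASH REPLICA 2;
```

If the per-store counts from the distribution query show the table's TiFlash peers concentrated on the two busy nodes, that would support the replica-placement explanation; if the peers are spread evenly, the cause more likely lies elsewhere.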