【TiDB Environment】Production
【TiDB Version】6.1.3
【Reproduction Path】Upgraded from 5.4.2 to 6.1.3
【Problem: Symptoms and Impact】
All business SQL responses became extremely slow
【Resource Configuration】
【Attachments: Screenshots / Logs / Monitoring】
Have you checked whether there are any lock waits?
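A minimal sketch of checking for lock waits from SQL (assuming v5.1+, where these information_schema tables exist):

-- Hedged sketch: current lock waits and long-running transactions across the cluster
SELECT * FROM information_schema.data_lock_waits\G
SELECT instance, session_id, start_time, state, db, current_sql_digest
FROM information_schema.cluster_tidb_trx
ORDER BY start_time
LIMIT 10\G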
Why does the .138 node have so few regions? Its count is still trending upward, which looks like a region balance in progress.
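If it helps, the per-store counts can also be checked from SQL (a sketch, assuming information_schema.tikv_store_status is available on this version):

-- Hedged sketch: compare region/leader counts per TiKV store to see whether balance is still in progress
SELECT store_id, address, leader_count, region_count
FROM information_schema.tikv_store_status
ORDER BY store_id;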
The node CPU looks fairly high. Check Top SQL in the dashboard; there is probably a large, CPU-heavy transaction.
What does the PD statistics-balance monitoring show?
Open Top SQL in the tidb-dashboard and you will see which SQL statements consume the most CPU. It would also help to share more details about the cluster. Judging from the monitoring alone, update and insert statements started to increase around 16:50, duration began to climb, and TiKV load started to rise.
I've attached a screenshot of how to enable Top SQL. Once it is on, select a TiKV node and it will show the CPU consumption at that time:
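For reference, Top SQL can also be toggled with a system variable (a sketch; tidb_enable_top_sql exists since v5.4):

-- Hedged sketch: enable Top SQL collection cluster-wide
SET GLOBAL tidb_enable_top_sql = ON;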
You can export the tikv-details Grafana data using the method below. TiKV CPU is a bit high; let's see which component is driving it.
It could be a parameter issue, or the execution plan of some SQL has gone off track.
You could first shorten the GC life time; two days is rather long, the compaction flow keeps growing, and pending bytes is also a bit high. You could also disable auto analyze for now, or move its start and end windows to 00:00 +0800 and 07:00 +0800.
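A sketch of those adjustments (the target GC value here is an assumption; tidb_enable_auto_analyze exists since v6.1.0):

-- Hedged sketch: shorten GC life time (2 days -> e.g. the 10m default; value is an assumption)
SET GLOBAL tidb_gc_life_time = '10m';
-- Either disable auto analyze temporarily ...
SET GLOBAL tidb_enable_auto_analyze = OFF;
-- ... or restrict it to a low-traffic window
SET GLOBAL tidb_auto_analyze_start_time = '00:00 +0800';
SET GLOBAL tidb_auto_analyze_end_time = '07:00 +0800';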
Current status: I created an execution plan binding for the update.
MySQL [kunpeng_dms_call]> show global bindings\G
*************************** 1. row ***************************
Original_sql: update `kunpeng_dms_call` . `tbl_device_send_cmd_log` set `device_code` = ? , `sign_no` = ? , `status` = ? where `uniq_id` = ? and `uniq_id` = ?
Bind_sql: UPDATE `kunpeng_dms_call`.`tbl_device_send_cmd_log` USE INDEX (`PRIMARY`) SET `device_code`='ZYBSJ0037581', `sign_no`='4005', `status`=1 WHERE `uniq_id` = 'rpca5a64e9b529e4497b5198ac4abe406161673513647' AND `uniq_id` = 'rpca5a64e9b529e4497b5198ac4abe406161673513647'
Default_db: kunpeng_dms_call
Status: enabled
Create_time: 2023-01-12 23:43:04.190
Update_time: 2023-01-12 23:43:04.190
Charset: utf8
Collation: utf8_general_ci
Source: manual
*************************** 2. row ***************************
Original_sql: select count ( ? ) from `kunpeng_dms_call` . `tbl_tel_call_log` where `encrypt_number` = ? and `create_time` >= ? and `call_status` in ( ... )
Bind_sql: SELECT /*+ read_from_storage(tiflash[`tbl_tel_call_log`])*/ count(1) FROM `kunpeng_dms_call`.`tbl_tel_call_log` WHERE `encrypt_number` = '8a4f5a835373df4da2154f34cdfb4b790a4b8bc2ebf95d0c' AND `create_time` >= 1673452800 AND `call_status` IN (3,5,31)
Default_db: kunpeng_dms_call
Status: enabled
Create_time: 2023-01-12 17:19:57.029
Update_time: 2023-01-12 17:19:57.029
Charset: utf8
Collation: utf8_general_ci
Source: manual
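For context, a binding like row 1 above is typically created with CREATE GLOBAL BINDING; a sketch with placeholder literals (bindings match on the normalized SQL, not on these values):

-- Hedged sketch of how the USE INDEX(PRIMARY) binding could have been created
CREATE GLOBAL BINDING FOR
  UPDATE kunpeng_dms_call.tbl_device_send_cmd_log
    SET device_code = 'x', sign_no = 'x', status = 1
    WHERE uniq_id = 'x' AND uniq_id = 'x'
USING
  UPDATE kunpeng_dms_call.tbl_device_send_cmd_log USE INDEX (`PRIMARY`)
    SET device_code = 'x', sign_no = 'x', status = 1
    WHERE uniq_id = 'x' AND uniq_id = 'x';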
Even with the binding, the update still runs with a full index scan plan; the slow log entry is as follows:
*************************** 3. row ***************************
INSTANCE: 10.105.128.60:10094
Time: 2023-01-12 23:46:14.049842
Txn_start_ts: 438707998008803343
User: kunpengdm_app
Host: 10.110.0.232
Conn_ID: 1721487763422848565
Exec_retry_count: 0
Exec_retry_time: 0
Query_time: 173.620817781
Parse_time: 0
Compile_time: 0.000460205
Rewrite_time: 0.000290375
Preproc_subqueries: 0
Preproc_subqueries_time: 0
Optimize_time: 0
Wait_TS: 0.000010595
Prewrite_time: 0
Wait_prewrite_binlog_time: 0
Commit_time: 0
Get_commit_ts_time: 0
Commit_backoff_time: 0
Backoff_types:
Resolve_lock_time: 0
Local_latch_wait_time: 0
Write_keys: 0
Write_size: 0
Prewrite_region: 0
Txn_retry: 0
Cop_time: 173.250368907
Process_time: 2465.093
Wait_time: 9.785
Backoff_time: 0
LockKeys_time: 0
Request_count: 456
Total_keys: 478359688
Process_keys: 478358738
Rocksdb_delete_skipped_count: 310
Rocksdb_key_skipped_count: 478359522
Rocksdb_block_cache_hit_count: 518277
Rocksdb_block_read_count: 101
Rocksdb_block_read_byte: 5267193
DB: kunpeng_dms_call
Index_names: [tbl_device_send_cmd_log:PRIMARY]
Is_internal: 0
Digest: b6c811b9cb1881469e00ef9d5406f8d9e20d03aaba0c65fdc85b809b8b8906a4
Stats: tbl_device_send_cmd_log:438707801073123363
Cop_proc_avg: 5.405905701
Cop_proc_p90: 14.489
Cop_proc_max: 32.277
Cop_proc_addr: 10.105.128.115:20175
Cop_wait_avg: 0.021458333
Cop_wait_p90: 0.053
Cop_wait_max: 0.122
Cop_wait_addr: 10.105.128.115:20175
Mem_max: 7210
Disk_max: 0
KV_total: 2604.177794892
PD_total: 0.000009744
Backoff_total: 0
Write_sql_response_total: 0
Result_rows: 0
Backoff_Detail:
Prepared: 1
Succ: 0
IsExplicitTxn: 0
IsWriteCacheTable: 0
Plan_from_cache: 0
Plan_from_binding: 1
Has_more_results: 0
Plan: id task estRows operator info actRows execution info memory disk
Update_4 root 0 N/A 0 time:2m53.6s, loops:1, 0 Bytes N/A
└─Selection_10 root 1.01 eq(kunpeng_dms_call.tbl_device_send_cmd_log.uniq_id, "rpc7b57f49908294744a5f1309f39aeac111673538200"), eq(kunpeng_dms_call.tbl_device_send_cmd_log.uniq_id, "rpc7b57f49908294744a5f1309f39aeac111673538200") 0 time:2m53.6s, loops:1 6.70 KB N/A
└─IndexLookUp_9 root 1.01 0 time:2m53.6s, loops:1, 354 Bytes N/A
├─Selection_8 cop[tikv] 1.01 eq(kunpeng_dms_call.tbl_device_send_cmd_log.uniq_id, "rpc7b57f49908294744a5f1309f39aeac111673538200"), eq(kunpeng_dms_call.tbl_device_send_cmd_log.uniq_id, "rpc7b57f49908294744a5f1309f39aeac111673538200") 0 time:2m53.6s, loops:1, cop_task: {num: 456, max: 32.3s, min: 228.5ms, avg: 5.43s, p95: 20.3s, max_proc_keys: 1305408, p95_proc_keys: 1212810, tot_proc: 41m5.1s, tot_wait: 9.79s, rpc_num: 456, rpc_time: 41m15.2s, copr_cache_hit_ratio: 0.00}, tikv_task:{proc max:973ms, min:226ms, p80:661ms, p95:744ms, iters:469212, tasks:456}, scan_detail: {total_process_keys: 478358738, total_process_keys_size: 33906243952, total_keys: 478359688, rocksdb: {delete_skipped_count: 310, key_skipped_count: 478359522, block: {cache_hit_count: 518277, read_count: 101, read_byte: 5.02 MB}}} N/A N/A
│ └─IndexFullScan_6 cop[tikv] 1709241802 table:tbl_device_send_cmd_log, index:PRIMARY(uniq_id), keep order:false 478358738 tikv_task:{proc max:891ms, min:169ms, p80:568ms, p95:643ms, iters:469212, tasks:456} N/A N/A
└─TableRowIDScan_7 cop[tikv] 1.01 table:tbl_device_send_cmd_log, keep order:false 0 N/A N/A
Plan_digest: 3cdd8c63b2c7ae87f199bc9169f173bd4fe22337c634f361182fc532a63fd9b6
Prev_stmt:
Query: UPDATE `tbl_device_send_cmd_log` SET `device_code`=?,`sign_no`=?,`status`=? WHERE uniq_id = ? AND `uniq_id` = ? [arguments: ("ZYBSJ0129423", 4002, 1, "rpc7b57f49908294744a5f1309f39aeac111673538200", "rpc7b57f49908294744a5f1309f39aeac111673538200")];
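As a side note, a record like this can also be pulled from the cluster slow-query table by its digest (a sketch; the column set is assumed to match the slow log fields above):

-- Hedged sketch: fetch the slow-log record for this statement by the digest shown above
SELECT time, query_time, process_keys, plan, query
FROM information_schema.cluster_slow_query
WHERE digest = 'b6c811b9cb1881469e00ef9d5406f8d9e20d03aaba0c65fdc85b809b8b8906a4'
ORDER BY time DESC
LIMIT 1\G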
When I execute the update SQL manually, it is very fast, and the execution plan is different from the one in the slow query log.
MySQL [kunpeng_dms_call]> explain UPDATE `tbl_device_send_cmd_log` SET `device_code`='ZYBSJ0086840',`sign_no`=4002,`status`=1 WHERE uniq_id = 'rpc1be6f4d4e26a4306b1bf45b5c75b65f31673537670' AND `uniq_id` = 'rpc1be6f4d4e26a4306b1bf45b5c75b65f31673537670';
+-------------------+---------+------+-------------------------------------------------------+---------------+
| id | estRows | task | access object | operator info |
+-------------------+---------+------+-------------------------------------------------------+---------------+
| Update_4 | N/A | root | | N/A |
| └─Point_Get_6 | 1.00 | root | table:tbl_device_send_cmd_log, index:PRIMARY(uniq_id) | |
+-------------------+---------+------+-------------------------------------------------------+---------------+
2 rows in set (0.00 sec)
MySQL [kunpeng_dms_call]>
MySQL [kunpeng_dms_call]>
MySQL [kunpeng_dms_call]> UPDATE `tbl_device_send_cmd_log` SET `device_code`='ZYBSJ0086840',`sign_no`=4002,`status`=1 WHERE uniq_id = 'rpc1be6f4d4e26a4306b1bf45b5c75b65f31673537670' AND `uniq_id` = 'rpc1be6f4d4e26a4306b1bf45b5c75b65f31673537670';
Query OK, 0 rows affected (0.01 sec)
Rows matched: 1 Changed: 0 Warnings: 0
update kunpeng_dms_call.tbl_device_send_cmd_log
set device_code = ?, sign_no = ?, status = ?
where uniq_id = ? and uniq_id = ?
I can't think of any scenario where this SQL would be used. You test uniq_id twice: if the two values are the same, it is equivalent to testing it once; if they differ, nothing gets updated at all. What is the business scenario here? Written twice, the execution plan goes to an index full scan; written once, it is a point get, right?
explain UPDATE tbl_device_send_cmd_log SET device_code='ZYBSJ0086840', sign_no=4002, status=1 WHERE uniq_id = 'rpc1be6f4d4e26a4306b1bf45b5c75b65f31673537670' AND uniq_id = 'rpc1be6f4d4e26a4306b1bf45b5c75b65f31673537670';
Written like this it is also a point get. But with:
explain UPDATE tbl_device_send_cmd_log SET device_code='ZYBSJ0086840', sign_no=4002, status=1 WHERE uniq_id = 'rpc1be6f4d4e26a4306b1bf45b5c75b65f31673537670' AND uniq_id = 'rpc1be6f4d4e26a4306b1bf45b5c75b65f31673537671';
where the two values differ, wouldn't it be an index full scan?
Is the SQL below written with a duplicated condition? I don't quite follow it.
Look at the execution plan I pasted above: when I run it manually with two uniq_id = xxx conditions, the plan is also a point get and it is very fast.
But when the business runs it, what shows up in the slow query log is an index full scan.
Even if the business SQL isn't written quite right, two uniq_id conditions still shouldn't scan the whole index, should they?
I'll have the business team remove the redundant uniq_id today and see.
No, what I mean is: are the two uniq_id values you pass identical or not? If they are identical, writing the condition once is enough; if they differ, there is no point updating at all. Writing the condition twice and passing different values can only go through an index full scan. For example, try passing rpc1be6f4d4e26a4306b1bf45b5c75b65f31673537670 and 112312312312312312312312123123123.
The values passed are identical, but the execution plan still shows a full scan. As follows:
Plan: id task estRows operator info actRows execution info memory disk
Update_4 root 0 N/A 0 time:2m53.6s, loops:1, 0 Bytes N/A
└─Selection_10 root 1.01 eq(kunpeng_dms_call.tbl_device_send_cmd_log.uniq_id, "rpc7b57f49908294744a5f1309f39aeac111673538200"), eq(kunpeng_dms_call.tbl_device_send_cmd_log.uniq_id, "rpc7b57f49908294744a5f1309f39aeac111673538200") 0 time:2m53.6s, loops:1 6.70 KB N/A
└─IndexLookUp_9 root 1.01 0 time:2m53.6s, loops:1, 354 Bytes N/A
├─Selection_8 cop[tikv] 1.01 eq(kunpeng_dms_call.tbl_device_send_cmd_log.uniq_id, "rpc7b57f49908294744a5f1309f39aeac111673538200"), eq(kunpeng_dms_call.tbl_device_send_cmd_log.uniq_id, "rpc7b57f49908294744a5f1309f39aeac111673538200") 0 time:2m53.6s, loops:1, cop_task: {num: 456, max: 32.3s, min: 228.5ms, avg: 5.43s, p95: 20.3s, max_proc_keys: 1305408, p95_proc_keys: 1212810, tot_proc: 41m5.1s, tot_wait: 9.79s, rpc_num: 456, rpc_time: 41m15.2s, copr_cache_hit_ratio: 0.00}, tikv_task:{proc max:973ms, min:226ms, p80:661ms, p95:744ms, iters:469212, tasks:456}, scan_detail: {total_process_keys: 478358738, total_process_keys_size: 33906243952, total_keys: 478359688, rocksdb: {delete_skipped_count: 310, key_skipped_count: 478359522, block: {cache_hit_count: 518277, read_count: 101, read_byte: 5.02 MB}}} N/A N/A
│ └─IndexFullScan_6 cop[tikv] 1709241802 table:tbl_device_send_cmd_log, index:PRIMARY(uniq_id), keep order:false 478358738 tikv_task:{proc max:891ms, min:169ms, p80:568ms, p95:643ms, iters:469212, tasks:456} N/A N/A
└─TableRowIDScan_7 cop[tikv] 1.01 table:tbl_device_send_cmd_log, keep order:false 0 N/A N/A
Plan_digest: 3cdd8c63b2c7ae87f199bc9169f173bd4fe22337c634f361182fc532a63fd9b6
Prev_stmt:
Query: UPDATE `tbl_device_send_cmd_log` SET `device_code`=?,`sign_no`=?,`status`=? WHERE uniq_id = ? AND `uniq_id` = ? [arguments: ("ZYBSJ0129423", 4002, 1, "rpc7b57f49908294744a5f1309f39aeac111673538200", "rpc7b57f49908294744a5f1309f39aeac111673538200")];
Problem resolved:
The developers rewrote the SQL to drop the redundant uniq_id condition, and after going back online the problem no longer appeared. The rewritten SQL is as follows:
UPDATE tbl_device_send_cmd_log SET device_code='xxxxxx', sign_no=4002, status=1 WHERE uniq_id = 'xxxxxx';
On v4 and v5 this was also a range scan; only after rewriting it with IN did it become a point get.
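For comparison, a sketch of the rewritten single-condition form and the IN form mentioned above (values are placeholders):

-- Hedged sketch: single-equality rewrite vs. the IN rewrite on the unique key
EXPLAIN UPDATE tbl_device_send_cmd_log
  SET device_code = 'xxxxxx', sign_no = 4002, status = 1
  WHERE uniq_id = 'xxxxxx';
EXPLAIN UPDATE tbl_device_send_cmd_log
  SET device_code = 'xxxxxx', sign_no = 4002, status = 1
  WHERE uniq_id IN ('xxxxxx');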
Problem solved. Next time remember to use your points to send over a keyboard.