Urgent! Looking for help troubleshooting database stalls

Problem description:
I am running version 5.2.1. Recently the database has been very unstable: sometimes it stalls once every few days, sometimes several times within a few hours. When I analyze the SQL through the dashboard, the execution times are all very long, but the index usage in the execution plans looks normal, so I cannot spot an obviously problematic SQL. Below is the execution plan of one of the statements:
id task estRows operator info actRows execution info memory disk
HashAgg_20 root 1 funcs:count(1)->Column#655 0 time:22.8s, loops:1, partial_worker:{wall_time:22.81680262s, concurrency:5, task_num:0, tot_wait:1m54.083899119s, tot_exec:0s, tot_time:1m54.083903907s, max:22.816785352s, p95:22.816785352s}, final_worker:{wall_time:22.816820353s, concurrency:5, task_num:0, tot_wait:1m54.084008738s, tot_exec:7.661µs, tot_time:1m54.084019576s, max:22.816808352s, p95:22.816808352s} 13.9 KB N/A
└─Sort_22 root 381.84 svc.svc_service_order.dealercode, svc.svc_service_order.createdate:desc 0 time:22.8s, loops:1 0 Bytes 0 Bytes
└─HashJoin_26 root 381.84 left outer join, equal:[eq(svc.svc_service_order.serviceordercode, svc.svc_service_order.serviceordercode) eq(svc.svc_service_order.dealercode, svc.svc_service_order.dealercode)] 0 time:22.8s, loops:1, build_hash_table:{total:22.8s, fetch:22.8s, build:0s} 0 Bytes 0 Bytes
├─HashJoin_69 root 381.84 left outer join, equal:[eq(svc.svc_service_order.serviceordercode, svc.svc_service_order.serviceordercode) eq(svc.svc_service_order.dealercode, svc.svc_service_order.dealercode)] 0 time:22.8s, loops:1, build_hash_table:{total:22.8s, fetch:22.8s, build:0s} 0 Bytes 0 Bytes
│ ├─IndexLookUp_88 root 381.84 0 time:22.8s, loops:1, index_task: {total_time: 3.5s, fetch_handle: 45.2ms, build: 5.62µs, wait: 3.45s}, table_task: {total_time: 1m52.4s, num: 8, concurrency: 5} 1.40 MB N/A
│ │ ├─IndexRangeScan_85 cop[tikv] 517.46 table:so, index:dealerCode_serviceOrderCode_index(dealerCode, serviceOrderCode, createDate), range:[“6531138”,“6531139”), keep order:false 90219 time:41.9ms, loops:91, cop_task: {num: 1, max: 41.6ms, proc_keys: 90219, tot_proc: 39ms, rpc_num: 1, rpc_time: 41.6ms, copr_cache_hit_ratio: 0.00}, tikv_task:{time:36ms, loops:93}, scan_detail: {total_process_keys: 90219, total_keys: 90220, rocksdb: {delete_skipped_count: 0, key_skipped_count: 90219, block: {cache_hit_count: 105, read_count: 0, read_byte: 0 Bytes}}} N/A N/A
│ │ └─Selection_87 cop[tikv] 381.84 ge(svc.svc_service_order.settlementdate, 2025-04-01 00:00:00.000000), le(svc.svc_service_order.settlementdate, 2025-04-13 23:59:59.000000), ne(svc.svc_service_order.status, 9) 770 time:1m52.1s, loops:10, cop_task: {num: 1420, max: 4.7s, min: 486.2µs, avg: 206.4ms, p95: 1.82s, max_proc_keys: 209, p95_proc_keys: 135, tot_proc: 1m9.6s, tot_wait: 3m42s, rpc_num: 1420, rpc_time: 4m53s, copr_cache_hit_ratio: 0.00}, tikv_task:{proc max:9ms, min:0s, p80:3ms, p95:5ms, iters:2771, tasks:1420}, scan_detail: {total_process_keys: 86776, total_keys: 86776, rocksdb: {delete_skipped_count: 0, key_skipped_count: 0, block: {cache_hit_count: 1036322, read_count: 0, read_byte: 0 Bytes}}} N/A N/A
│ │ └─TableRowIDScan_86 cop[tikv] 517.46 table:so, keep order:false 86776 tikv_task:{proc max:9ms, min:0s, p80:3ms, p95:5ms, iters:2771, tasks:1420} N/A N/A
│ └─HashAgg_89 root 49726.36 group by:svc.svc_service_order.dealercode, svc.svc_service_order.serviceordercode, funcs:firstrow(svc.svc_service_order.dealercode)->svc.svc_service_order.dealercode, funcs:firstrow(svc.svc_service_order.serviceordercode)->svc.svc_service_order.serviceordercode, funcs:count(1)->Column#658 0 time:22.8s, loops:1, partial_worker:{wall_time:22.816687421s, concurrency:5, task_num:10, tot_wait:1m54.077484488s, tot_exec:5.33879ms, tot_time:1m54.083186431s, max:22.816667291s, p95:22.816667291s}, final_worker:{wall_time:22.816877288s, concurrency:5, task_num:5, tot_wait:1m54.083130861s, tot_exec:965.158µs, tot_time:1m54.084100167s, max:22.816861682s, p95:22.816861682s} 1.28 MB N/A
│ └─IndexHashJoin_97 root 144358.11 left outer join, inner:IndexReader_94, outer key:svc.svc_service_order.id, inner key:svc.svc_service_order_part.serviceorderid, equal cond:eq(svc.svc_service_order.id, svc.svc_service_order_part.serviceorderid) 10240 time:22.8s, loops:11, inner:{total:178.2ms, concurrency:5, task:6, construct:2.53ms, fetch:167.1ms, build:638.4µs, join:8.47ms} 541.8 KB N/A
│ ├─IndexLookUp_122 root 62617.31 4064 time:22.8s, loops:8, index_task: {total_time: 3.56s, fetch_handle: 53.6ms, build: 14.5µs, wait: 3.51s}, table_task: {total_time: 1m52.5s, num: 12, concurrency: 5} 6.36 MB N/A
│ │ ├─Selection_121 cop[tikv] 62617.31 not(isnull(svc.svc_service_order.serviceordercode)) 90219 time:49.5ms, loops:95, cop_task: {num: 1, max: 49.3ms, proc_keys: 90219, tot_proc: 45ms, tot_wait: 1ms, rpc_num: 1, rpc_time: 49.3ms, copr_cache_hit_ratio: 0.00}, tikv_task:{time:44ms, loops:93}, scan_detail: {total_process_keys: 90219, total_keys: 90220, rocksdb: {delete_skipped_count: 0, key_skipped_count: 90219, block: {cache_hit_count: 105, read_count: 0, read_byte: 0 Bytes}}} N/A N/A
│ │ │ └─IndexRangeScan_119 cop[tikv] 62617.31 table:so, index:dealerCode_serviceOrderCode_index(dealerCode, serviceOrderCode, createDate), range:[“6531138”,“6531139”), keep order:false 90219 tikv_task:{time:35ms, loops:93} N/A N/A
│ │ └─TableRowIDScan_120 cop[tikv] 62617.31 table:so, keep order:false 86651 time:1m52.2s, loops:97, cop_task: {num: 1470, max: 4.73s, min: 442µs, avg: 210.9ms, p95: 1.83s, max_proc_keys: 209, p95_proc_keys: 135, tot_proc: 1m16.6s, tot_wait: 3m51.9s, rpc_num: 1470, rpc_time: 5m10s, copr_cache_hit_ratio: 0.00}, tikv_task:{proc max:10ms, min:0s, p80:3ms, p95:5ms, iters:2813, tasks:1470}, scan_detail: {total_process_keys: 86461, total_keys: 86461, rocksdb: {delete_skipped_count: 0, key_skipped_count: 0, block: {cache_hit_count: 1032763, read_count: 0, read_byte: 0 Bytes}}} N/A N/A
│ └─IndexReader_94 root 2.88 index:Selection_93 13255 time:164.5ms, loops:22, cop_task: {num: 19, max: 101.8ms, min: 1.02ms, avg: 16.8ms, p95: 101.8ms, max_proc_keys: 1831, p95_proc_keys: 1831, tot_proc: 166ms, tot_wait: 123ms, rpc_num: 19, rpc_time: 318.2ms, copr_cache_hit_ratio: 0.00} 2.28 KB N/A
│ └─Selection_93 cop[tikv] 2.88 not(isnull(svc.svc_service_order_part.serviceorderid)) 13255 tikv_task:{proc max:19ms, min:0s, p80:13ms, p95:19ms, iters:83, tasks:19}, scan_detail: {total_process_keys: 13255, total_keys: 23199, rocksdb: {delete_skipped_count: 0, key_skipped_count: 19167, block: {cache_hit_count: 46650, read_count: 0, read_byte: 0 Bytes}}} N/A N/A
│ └─IndexRangeScan_92 cop[tikv] 2.88 table:sp, index:idx_svc_service_order_part_serviceOrderId(serviceOrderId), range: decided by [eq(svc.svc_service_order_part.serviceorderid, svc.svc_service_order.id)], keep order:false 13255 tikv_task:{proc max:18ms, min:0s, p80:13ms, p95:18ms, iters:83, tasks:19} N/A N/A
└─HashAgg_125 root 49726.36 group by:svc.svc_service_order.dealercode, svc.svc_service_order.serviceordercode, funcs:firstrow(svc.svc_service_order.dealercode)->svc.svc_service_order.dealercode, funcs:firstrow(svc.svc_service_order.serviceordercode)->svc.svc_service_order.serviceordercode, funcs:count(1)->Column#659 0 time:22.8s, loops:1, partial_worker:{wall_time:22.816801246s, concurrency:5, task_num:5, tot_wait:1m54.079172349s, tot_exec:4.054589ms, tot_time:1m54.083692896s, max:22.816779147s, p95:22.816779147s}, final_worker:{wall_time:22.817043827s, concurrency:5, task_num:6, tot_wait:1m54.083613674s, tot_exec:1.205728ms, tot_time:1m54.084823667s, max:22.81701947s, p95:22.81701947s} 1.07 MB N/A
└─IndexHashJoin_133 root 80725.24 left outer join, inner:IndexReader_130, outer key:svc.svc_service_order.id, inner key:svc.svc_service_order_item.serviceorderid, equal cond:eq(svc.svc_service_order.id, svc.svc_service_order_item.serviceorderid) 5120 time:22.8s, loops:6, inner:{total:101.7ms, concurrency:5, task:6, construct:2.46ms, fetch:94.9ms, build:570.2µs, join:4.32ms} 495.7 KB N/A
├─IndexLookUp_158 root 62617.31 4064 time:22.8s, loops:8, index_task: {total_time: 3.56s, fetch_handle: 58.9ms, build: 12.5µs, wait: 3.5s}, table_task: {total_time: 1m52.5s, num: 12, concurrency: 5} 6.34 MB N/A
│ ├─Selection_157 cop[tikv] 62617.31 not(isnull(svc.svc_service_order.serviceordercode)) 90219 time:54.4ms, loops:95, cop_task: {num: 1, max: 54.2ms, proc_keys: 90219, tot_proc: 49ms, tot_wait: 1ms, rpc_num: 1, rpc_time: 54.2ms, copr_cache_hit_ratio: 0.00}, tikv_task:{time:48ms, loops:93}, scan_detail: {total_process_keys: 90219, total_keys: 90220, rocksdb: {delete_skipped_count: 0, key_skipped_count: 90219, block: {cache_hit_count: 105, read_count: 0, read_byte: 0 Bytes}}} N/A N/A
│ │ └─IndexRangeScan_155 cop[tikv] 62617.31 table:so, index:dealerCode_serviceOrderCode_index(dealerCode, serviceOrderCode, createDate), range:[“6531138”,“6531139”), keep order:false 90219 tikv_task:{time:42ms, loops:93} N/A N/A
│ └─TableRowIDScan_156 cop[tikv] 62617.31 table:so, keep order:false 86651 time:1m52.1s, loops:97, cop_task: {num: 1470, max: 4.72s, min: 430.3µs, avg: 214.9ms, p95: 1.87s, max_proc_keys: 209, p95_proc_keys: 132, tot_proc: 1m21.4s, tot_wait: 3m53.1s, rpc_num: 1470, rpc_time: 5m15.9s, copr_cache_hit_ratio: 0.03}, tikv_task:{proc max:11ms, min:0s, p80:3ms, p95:5ms, iters:2813, tasks:1470}, scan_detail: {total_process_keys: 81671, total_keys: 81671, rocksdb: {delete_skipped_count: 0, key_skipped_count: 0, block: {cache_hit_count: 975983, read_count: 0, read_byte: 0 Bytes}}} N/A N/A
└─IndexReader_130 root 1.61 index:Selection_129 7221 time:91.9ms, loops:17, cop_task: {num: 13, max: 26.8ms, min: 848.1µs, avg: 11.3ms, p95: 26.8ms, max_proc_keys: 1356, p95_proc_keys: 1356, tot_proc: 127ms, tot_wait: 7ms, rpc_num: 13, rpc_time: 146.6ms, copr_cache_hit_ratio: 0.00} 1.38 KB N/A
└─Selection_129 cop[tikv] 1.61 not(isnull(svc.svc_service_order_item.serviceorderid)) 7221 tikv_task:{proc max:24ms, min:1ms, p80:20ms, p95:24ms, iters:53, tasks:13}, scan_detail: {total_process_keys: 7221, total_keys: 12299, rocksdb: {delete_skipped_count: 0, key_skipped_count: 9179, block: {cache_hit_count: 46019, read_count: 0, read_byte: 0 Bytes}}} N/A N/A
└─IndexRangeScan_128 cop[tikv] 1.61 table:si, index:idx_svc_service_order_item_serviceOrderId(serviceOrderId), range: decided by [eq(svc.svc_service_order_item.serviceorderid, svc.svc_service_order.id)], keep order:false 7221 tikv_task:{proc max:24ms, min:1ms, p80:20ms, p95:24ms, iters:53, tasks:13} N/A N/A

However, the Coprocessor wait time and Coprocessor execution time are both very long.

Could anyone share some troubleshooting ideas, or help with the investigation? WeChat: gengquan2020 (paid, negotiable).

Start with the hardware: check CPU, memory, disk, and network at the time of the incident, especially network packet loss and disk response time.

Then check the server system logs and the TiDB logs for the slow period.
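
If it is easier to check from inside the database, here is a minimal sketch, assuming a TiDB version where the information_schema cluster tables are available; the time window below is only a placeholder for the real incident window:

-- Sketch: load metrics (CPU / memory / net / disk) reported by every node.
SELECT * FROM information_schema.CLUSTER_LOAD;

-- Sketch: error-level logs from all components during the slow period;
-- replace the time range with the actual incident window.
SELECT TIME, TYPE, INSTANCE, LEVEL, MESSAGE
FROM information_schema.CLUSTER_LOG
WHERE TIME BETWEEN '2025-04-13 10:00:00' AND '2025-04-13 10:30:00'
  AND LEVEL = 'error';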

Check the TiKV nodes' resource usage, including IO, CPU, and memory. Also, it would help if you posted the actual statement. Note that a LEFT JOIN is generally less efficient than an INNER JOIN.
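
One way to connect the SQL side with per-node TiKV load is to group recent slow queries by the TiKV address that dominated their coprocessor time. A sketch, assuming the affected statements cross the slow-query threshold and therefore appear in the slow log:

-- Sketch: which TiKV instances the slow coprocessor work landed on in the last hour.
SELECT Cop_proc_addr, Cop_wait_addr,
       COUNT(*)                    AS stmt_cnt,
       ROUND(AVG(Query_time), 2)   AS avg_query_time_s,
       ROUND(MAX(Cop_wait_max), 2) AS max_cop_wait_s
FROM information_schema.CLUSTER_SLOW_QUERY
WHERE Time > NOW() - INTERVAL 1 HOUR
GROUP BY Cop_proc_addr, Cop_wait_addr
ORDER BY stmt_cnt DESC;

If one address keeps showing up in Cop_wait_addr, that store is the one to inspect first.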

The root operator is a HashAgg, which means this SQL is a typical aggregation query. To make this kind of SQL fast you really need columnar storage, i.e., TiFlash nodes.
Version 5.2.1 is quite old. If you stay on the row store (TiKV) alone, there is no good way to shrink the amount of data that has to be scanned. The 5.2 series has also already reached end of support (EOL).

https://cn.pingcap.com/tidb-release-support-policy

So the best solution within the TiDB ecosystem is still to schedule an upgrade and try TiFlash.
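
For reference, once TiFlash nodes are deployed, adding a columnar replica is a single statement. A sketch using the main table from the plan above (svc.svc_service_order; the replica count of 1 is an assumption, adjust to your topology):

-- Sketch: add one TiFlash replica and watch the replication progress.
ALTER TABLE svc.svc_service_order SET TIFLASH REPLICA 1;

SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.TIFLASH_REPLICA
WHERE TABLE_SCHEMA = 'svc' AND TABLE_NAME = 'svc_service_order';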


It looks like your TiKV nodes can't keep up. Post your cluster topology, and deploy TiFlash if you can.

You can try TiFlash and optimize that long-running analytical SQL statement.

Under normal conditions the latency is quite low and resource utilization is not high either. The SQL whose execution plan I posted runs fast when the database is healthy, in under 1 second.


Under normal conditions latency stays below about 700 ms. But recently the latency spikes with no warning at all. When I look at the SQL statement analysis, all I see are the usual statements, and their index usage looks normal, yet the execution times are very long. I suspect some SQL is not being captured in the statistics.

I checked the hardware. During the problem window, CPU utilization on some TiKV nodes reached 82%, while others were only at 40-50%. Some TiKV nodes also logged errors:
tikv ["KvService response batch commands fail"] [err=""SendError(…)""]

tikv [deadlock.rs:773] ["leader client failed"] [err="Grpc(RpcFinished(Some(RpcStatus { code: 1-CANCELLED, message: "Cancelled", details: })))"]

tikv [router.rs:282] ["failed to send significant msg"] [msg="CaptureChange { cmd: ChangeObserver { cdc_id: None, rts_id: Some(ObserveHandle { id: ObserveID(14409), observing: true }), region_id: 53894523 }, region_epoch: conf_ver: 82090 version: 13867, callback: Callback::Read(…) }"]

Check whether the Unified read pool CPU is high in Grafana, under TiKV-Details > Thread CPU.

You need to capture the current sessions (cluster_processlist) while the latency is high. The slow query log only records SQL that has finished executing; statements that are still running are not included. Also, some queries that are resource-intensive but finish quickly will not show up in the slow query log either.
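
For example, a minimal sketch of the kind of snapshot worth taking during a spike (the 5-second threshold is arbitrary):

-- Sketch: statements still running on any TiDB instance during the spike.
SELECT INSTANCE, ID, USER, DB, TIME, STATE, MEM, INFO
FROM information_schema.CLUSTER_PROCESSLIST
WHERE COMMAND != 'Sleep' AND TIME > 5
ORDER BY TIME DESC;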

Below are screenshots of the database under normal and abnormal conditions.

This is the resource utilization and latency monitoring under normal conditions.

Below is the latency during the abnormal period. On the resource side, CPU on some TiKV nodes reached around 82%, and TiKV reports some errors while the cluster is stalling.

Also, during the high-latency windows, TiKV CPU utilization is very high on some nodes and very low on others.

v5.2.1 is already EOL; please complete the upgrade as soon as possible.

:thinking: 43 alerts, maybe start from the alerts? Also, with that many TiKV nodes, check the PD view in Grafana to see whether TiKV leader replicas are being transferred frequently.
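
Leader distribution can also be eyeballed from SQL; a sketch (frequent leader transfers themselves still need the PD panels in Grafana to confirm):

-- Sketch: leader/region counts per TiKV store; a heavily skewed leader count
-- would match one store sitting at ~82% CPU while the others stay idle.
SELECT STORE_ID, ADDRESS, LEADER_COUNT, REGION_COUNT, CAPACITY, AVAILABLE
FROM information_schema.TIKV_STORE_STATUS
ORDER BY LEADER_COUNT DESC;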


My suggestion is to start from the SQL itself, because I see a lot of joins, Sorts, and IndexLookUps.

There is not much point digging into this on a version that has already reached EOL.

Even if you nail it down with iron-clad evidence that it is a bug in 5.2, the problem is that nobody will fix it anymore.

:yum: Probably they need to give management an explanation.
