Slow SQL in a TiDB cluster

  • [TiDB version]: v3.0.5
  • [Problem description]: We use tcpcopy to replay production traffic against the TiDB cluster for load testing. Some slow queries show up, but the same SQL statements show no performance problem when executed manually.

The slow log entries are as follows:

SQL1:

# Time: 2020-01-10T11:14:45.573011107+08:00
# Txn_start_ts: 413827355604156418
# User: imrelation_bs_w@10.xx.xx.xx
# Conn_ID: 5049872
# Query_time: 3.707086752
# Parse_time: 0.000140625
# Compile_time: 0.000626766
# Backoff_time: 3.674 Request_count: 2 Total_keys: 11 Process_keys: 9 Prewrite_time: 0.013738701 Commit_time: 0.001622127 Get_commit_ts_time: 0.000437774 Resolve_lock_time: 0.021305638 Write_keys: 15 Write_size: 2190 Prewrite_region: 3
# DB: im_relation
# Index_names: [im_b_c_relation_98:from_to]
# Is_internal: false
# Digest: 4c956cc9c776d3012ac14a1e8ed6a6d4e214c03762a65dfcee9685a44ab5c94d
# Stats: im_b_c_relation_98:413827341042057272
# Num_cop_tasks: 2
# Cop_proc_avg: 0 Cop_proc_p90: 0 Cop_proc_max: 0 Cop_proc_addr: 10.16.82.6:20171
# Cop_wait_avg: 0 Cop_wait_p90: 0 Cop_wait_max: 0 Cop_wait_addr: 10.16.82.6:20171
# Mem_max: 19532
# Prepared: false
# Has_more_results: false
# Succ: true
# Plan: tidb_decode_plan('xwHwRjAJMzFfOQkwCTAJCjEJMTNfNwkxCTAJdGFibGU6aW1fYl9jX3JlbGF0aW9uXzk4LCBpbmRleDpmcm9tX3VzZXJfaWQsIHRvGQxkcmFuZ2U6Wy0zOTcwMzI5NyA2MTUxMTU3MSxGEwBIXSwga2VlcCBvcmRlcjpmYWxzZQGFCDBfOHqFAEI2AA==')
UPDATE im_b_c_relation_98 SET `from_user_id`='-39703297', `to_user_id`='61511571', `is_reply`='0', `from_first_time`='1577244878', `from_last_time`='1578626081', `to_first_time`='0', `to_last_time`='0', `last_user`='1', `last_time`='1578626081', `last_msg`='{\"extra\":{\"novisible\":\"b\",\"recall\":\"c\",\"post_id\":\"10671296\"},\"content\":\"您好,打扰您一下\"}', `last_msg_type`='RC:TxtMsg', `sort`='0001578626081', `noread_num`='0', `update_at`='1578626081' WHERE `from_user_id`='-39703297' && `to_user_id`='61511571';

SQL2:

# Time: 2020-01-10T11:15:16.130062529+08:00
# Txn_start_ts: 413827363376726057
# User: imrelation_bs_w@10.xx.xx.xx
# Conn_ID: 5049872
# Query_time: 4.585123578
# Parse_time: 0.000140222
# Compile_time: 0.000531084
# Wait_time: 0.001 Request_count: 2 Total_keys: 8 Process_keys: 6 Prewrite_time: 4.580365244 Commit_time: 0.001678787 Get_commit_ts_time: 0.000210794 Commit_backoff_time: 8.235 Backoff_types: [txnLock txnLock txnLock txnLock txnLock txnLock txnLock txnLock txnLock txnLock] Resolve_lock_time: 0.01575911 Write_keys: 15 Write_size: 3548 Prewrite_region: 3
# DB: im_relation
# Index_names: [im_b_c_relation_41:from_to]
# Is_internal: false
# Digest: 9398b355b282d35c1344af1497428d876845b15fed86e34447eaf4f4c7a8e0e4
# Stats: im_b_c_relation_41:413827295953289218
# Num_cop_tasks: 2
# Cop_proc_avg: 0 Cop_proc_p90: 0 Cop_proc_max: 0 Cop_proc_addr: 10.16.82.7:20172
# Cop_wait_avg: 0.0005 Cop_wait_p90: 0.001 Cop_wait_max: 0.001 Cop_wait_addr: 10.16.82.7:20172
# Mem_max: 18764
# Prepared: false
# Has_more_results: false
# Succ: true
# Plan: tidb_decode_plan('xwHwRjAJMzFfOQkwCTAJCjEJMTNfNwkxCTAJdGFibGU6aW1fYl9jX3JlbGF0aW9uXzQxLCBpbmRleDpmcm9tX3VzZXJfaWQsIHRvGQxkcmFuZ2U6Wy01Nzc0MDY0MCA2MjU3NTMxNSxGEwBIXSwga2VlcCBvcmRlcjpmYWxzZQGFEDBfOAkxcoUAQjYA')
UPDATE im_b_c_relation_41 SET `from_user_id`='-57740640', `to_user_id`='62575315', `is_reply`='0', `from_first_time`='1576645278', `from_last_time`='1578626111', `to_first_time`='0', `to_last_time`='0', `last_user`='1', `last_time`='1578626111', `last_msg`='{\"job_id\":\"10627168\",\"company_name\":\"上海忻美芃邦信息科技有限公司\",\"job_title\":\"英语老师\",\"job_date_type_desc\":\"全职\",\"job_desc\":\"\",\"job_location_quanguo\":\"\",\"job_locations\":[{\"address\":\"财富时代大厦\",\"lat_lng\":\"31.24936,121.448738\"}],\"tags\":[\"全职\",\"英语老师\"],\"demand_tags\":[\"男女不限\",\"20-35岁\",\"学历不限\",\"1-3年\",\"身份不限\"],\"salary_desc\":\"10.0k-15.0k元/月\",\"job_type_desc\":\"英语老师\",\"payment_type_desc\":\"月结\",\"template_type\":0,\"extra\":{\"novisible\":\"b\",\"recall\":\"c\",\"post_id\":\"10627168\"},\"message_card_name\":\"职位邀请\"}', `last_msg_type`='doumi:RecruiterInviteYouMsg', `sort`='0001578626111', `noread_num`='0', `update_at`='1578626111' WHERE `from_user_id`='-57740640' && `to_user_id`='62575315';

Most of the slow SQL time is spent in Backoff_time or Prewrite_time.
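A quick arithmetic check on the two entries above supports this; a minimal sketch using the Query_time, Backoff_time, and Prewrite_time values copied from the logs:

```python
# Share of total query time spent in backoff (SQL1) and prewrite (SQL2),
# using field values copied verbatim from the two slow-log entries above.

sql1_query_time = 3.707086752
sql1_backoff_time = 3.674          # Backoff_time of SQL1
sql2_query_time = 4.585123578
sql2_prewrite_time = 4.580365244   # Prewrite_time of SQL2

print(f"SQL1 backoff share:  {sql1_backoff_time / sql1_query_time:.1%}")
print(f"SQL2 prewrite share: {sql2_prewrite_time / sql2_query_time:.1%}")
```

Both ratios come out above 99%, i.e. almost the entire latency of these statements is lock backoff or prewrite, not execution.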

Attached is the monitoring related to lock conflicts.

Hello: When you ran the SQL manually, had the load test finished, or was it still running? Could you capture the tidb, tikv-detail, and overview dashboards with a full-page screenshot tool and send them all? Thanks.

Hello, the load test is still in progress; it is real-time traffic replay.

Is there a more convenient way to provide the Grafana dashboards? There are too many panels.

If screenshots are needed, which key metric panels should I provide?

TiDB:dashboard

Hello: You can first press d, then shift+e to expand all panels, and then use a tool like this one to capture a full-page screenshot: https://chrome.google.com/webstore/detail/full-page-screen-capture/fdpohaocaechififmbbbbbknoalclacl

Please check the attachments. overview: overview.pdf (2.0 MB)

tidb:tidb.pdf (1.1 MB)

tikv-detail:tikv-detail.pdf (3.0 MB)

Hello:

1. From the monitoring, are you running 3 TiKV instances on a single server?
2. The raft-log and raft-tikv panels, along with some other screenshots, overlap each other; please re-capture the complete monitoring. Thanks.
3. What is the hardware configuration of your TiKV machines (CPU, memory, disk size)? The disks look large and the memory relatively small.

The hardware configuration is as follows:

Component | CPU | Memory | Disk                                              | Machines | Network   | Notes
TiDB, PD  | 48C | 96G    | System: 2x 480G SSD; data: 1x 960G (regular SSD)  | 3        | 10GbE NIC | TiDB and PD components co-deployed
TiKV      | 48C | 128G   | System: 2x 480G SSD; data: 3x 1.6T (NVMe)         | 4        | 10GbE NIC | Multiple instances per machine, at most 3 per machine

Re-captured screenshots:

tidb:tidb.pdf (2.9 MB)

tikv-detail: (compressed) tikv-detail.pdf (2.0 MB)

Hello: The error occurred at 11:14, but the screenshots' start time does not seem to cover that event. Are there similar slow SQL statements in later time windows?

Similar SQL statements keep appearing in every time window, for example:

# Time: 2020-01-10T15:05:24.416953218+08:00
# Txn_start_ts: 413830983283900449
# User: imrelation_bs_w@10.16.80.247
# Conn_ID: 5652795
# Query_time: 4.028786454
# Parse_time: 0.000121275
# Compile_time: 0.000559906
# Request_count: 2 Total_keys: 2 Process_keys: 1 Prewrite_time: 4.024153052 Commit_time: 0.001630363 Get_commit_ts_time: 0.00040079 Commit_backoff_time: 7.813 Backoff_types: [txnLock txnLock txnLock txnLock txnLock txnLock txnLock txnLock txnLock txnLock] Resolve_lock_time: 0.006717238 Write_keys: 5 Write_size: 2368 Prewrite_region: 3
# DB: im_relation
# Index_names: [im_b_c_relation_76:from_to]
# Is_internal: false
# Digest: b68386fc404c734d29e9f5af688fd82da4f1fd26f9f9bd430973939ee4ecfbab
# Stats: im_b_c_relation_76:413830974384111642
# Num_cop_tasks: 2
# Cop_proc_avg: 0 Cop_proc_p90: 0 Cop_proc_max: 0 Cop_proc_addr: 10.16.82.5:20172
# Cop_wait_avg: 0 Cop_wait_p90: 0 Cop_wait_max: 0 Cop_wait_addr: 10.16.82.5:20172
# Mem_max: 20268
# Prepared: false
# Has_more_results: false
# Succ: true
# Plan: tidb_decode_plan('xwHwRjAJMzFfOQkwCTAJCjEJMTNfNwkxCTAJdGFibGU6aW1fYl9jX3JlbGF0aW9uXzc2LCBpbmRleDpmcm9tX3VzZXJfaWQsIHRvGQxkcmFuZ2U6Wy01MzIzMDA3NSAxMjY3MzUzNCxGEwBIXSwga2VlcCBvcmRlcjpmYWxzZQGFCDBfOHqFAEI2AA==')
UPDATE im_b_c_relation_76 SET `from_user_id`='-53230075', `to_user_id`='12673534', `is_reply`='0', `from_first_time`='1578639919', `from_last_time`='1578639920', `to_first_time`='0', `to_last_time`='0', `last_user`='1', `last_time`='1578639920', `last_msg`='{\"extra\":{\"novisible\":\"b\",\"system_send\":\"c\",\"post_id\":\"10710257\"},\"content\":\"您好,请问您还在找促销导购的工作么?感兴趣的话可以看下我的职位。\"}', `last_msg_type`='RC:TxtMsg', `sort`='0001578639920', `noread_num`='0', `update_at`='1578639920' WHERE `from_user_id`='-53230075' && `to_user_id`='12673534';

Hello: The time actually spent inside TiKV is quite short overall. Could you help confirm whether these update statements may concurrently update the same row? Also, are there any failures on the application side? Please send the TiKV log, /log/tikv.log. Thanks.
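One way to approximate an answer from the slow log alone is to group the logged UPDATE statements by table and WHERE clause and look for repeats; a minimal sketch (the regex, the `hot_rows` helper, and the idea of feeding it statements pulled from the slow log are all assumptions for illustration, not a TiDB tool):

```python
import re
from collections import Counter

# Match "UPDATE <table> SET ... WHERE <predicate>;" and capture the
# table name and the WHERE clause.
UPDATE_RE = re.compile(r"UPDATE\s+(\S+)\s+SET\s+.*?\s+WHERE\s+(.+?);", re.S)

def hot_rows(statements):
    """Return (table, where-clause) pairs that occur more than once.

    Repeated pairs among slow UPDATEs hint that several transactions
    were updating the same row, i.e. candidates for lock conflicts.
    """
    counts = Counter()
    for stmt in statements:
        m = UPDATE_RE.search(stmt)
        if m:
            counts[(m.group(1), m.group(2).strip())] += 1
    return [pair for pair, n in counts.items() if n > 1]

# Toy input modeled on the statements in this thread.
sample = [
    "UPDATE im_b_c_relation_98 SET `sort`='1' WHERE `from_user_id`='-39703297' && `to_user_id`='61511571';",
    "UPDATE im_b_c_relation_98 SET `sort`='2' WHERE `from_user_id`='-39703297' && `to_user_id`='61511571';",
    "UPDATE im_b_c_relation_41 SET `sort`='3' WHERE `from_user_id`='-57740640' && `to_user_id`='62575315';",
]
print(hot_rows(sample))
```

If the real slow log shows the same (table, WHERE) pair appearing repeatedly in a short window, that corroborates the concurrent-update hypothesis before involving the application team.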

Hello: We are currently using tcpcopy to replay production requests to the TiDB cluster and have not formally switched production traffic over, so we cannot tell whether the application side sees failures.

Please see the attached tikv.log (2.8 MB).

[2020/01/10 17:12:18.951 +08:00] [ERROR] [deadlock.rs:549] ["leader client failed"] [err="Grpc(RpcFinished(Some(RpcStatus { status: FailedPrecondition, details: Some(\"I'm not the leader of deadlock detector\") })))"]

Looking at the log, there are quite a few conflicts. We suggest investigating, or confirming with the application team whether concurrent updates to the same row occur. Thanks.

Thanks for the support. We will talk with the application team to check for concurrent updates to the same record.

Is an error like the following in the log also a write conflict?

[2020/01/10 17:10:06.592 +08:00] [ERROR] [endpoint.rs:454] [error-response] [err="locked primary_lock: 7480000000000004895F69800000000000000404000000005E0C2A7D03F000000004C07DB7 lock_version: 413832945746313226 key: 7480000000000004895F72F000000004C07DB7 lock_ttl: 3002 txn_size: 1"]

That is a read-write conflict: while reading, a lock was found on the key that blocks the read. The fields mean:

  • primary_lock: the primary lock of the transaction holding the lock
  • lock_version: the start TS of that transaction
  • key: the key that is locked
  • lock_ttl: the TTL of the lock
  • txn_size: the number of keys the transaction holds in its Region, which guides how the lock is resolved
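For reference, the table involved can be recovered from the locked key: in the standard TiDB key encoding, a table key starts with the prefix byte 't' (0x74) followed by the 8-byte table ID stored big-endian with its sign bit flipped to stay memcomparable. A minimal sketch (assuming that encoding; the `decode_table_id` helper is illustrative, not a TiDB utility):

```python
def decode_table_id(key_hex: str) -> int:
    """Extract the table ID from a hex-encoded TiDB table key.

    Layout assumed: byte 0 is the prefix 't' (0x74); bytes 1-8 are the
    table ID as a big-endian int64 with the sign bit flipped.
    """
    raw = bytes.fromhex(key_hex)
    assert raw[0] == ord("t"), "not a table key"
    encoded = int.from_bytes(raw[1:9], "big")
    return encoded ^ (1 << 63)  # undo the sign-bit flip

# The locked row key from the error-response log line above.
print(decode_table_id("7480000000000004895F72F000000004C07DB7"))  # → 1161
```

The decoded table ID can then be matched against `information_schema.tables` (column `TIDB_TABLE_ID`) to see which of the `im_b_c_relation_*` shards the contended row belongs to.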