UPDATE statements across the whole cluster slow down while everything else stays normal. We suspect something is holding a lock, but we cannot find which transaction holds it.

TiDB version: v3.0.11

During certain periods a large number of these UPDATEs slow down. They all update by id, and we found no concurrent conflicts on the ids. The slow logs only show that the binlog write is slow; pump currently runs on a 56-core, 256 GB server with SSDs. Data conflicts are not severe. We don't have a good direction to investigate yet.



*************************** 394. row ***************************
                 Time: 2020-07-09 16:39:52.860977
         Txn_start_ts: 417931972074799124
                 User: uic
                 Host: 10.204.9.202
              Conn_ID: 2761922
           Query_time: 41.745752207
           Parse_time: 0.000040567
         Compile_time: 0.000044442
        Prewrite_time: 41.744269941
 Binlog_prewrite_time: 41.744283277
          Commit_time: 0
   Get_commit_ts_time: 0
  Commit_backoff_time: 41.722
        Backoff_types: [txnLock]
    Resolve_lock_time: 0.007564717
Local_latch_wait_time: 0
           Write_keys: 3
           Write_size: 676
      Prewrite_region: 3
            Txn_retry: 0
         Process_time: 0
            Wait_time: 0
         Backoff_time: 0
        LockKeys_time: 0
        Request_count: 0
           Total_keys: 0
         Process_keys: 0
                   DB: uic
          Index_names: 
          Is_internal: 0
               Digest: 2f2abcc1729c7cec9284548a61e5179d5e4fde4da523f5c9d1924922d5ddcd13
                Stats: 
         Cop_proc_avg: 0
         Cop_proc_p90: 0
         Cop_proc_max: 0
        Cop_proc_addr: 
         Cop_wait_avg: 0
         Cop_wait_p90: 0
         Cop_wait_max: 0
        Cop_wait_addr: 
              Mem_max: 0
                 Succ: 0
                 Plan:  Point_Get_1     root    1       table:uic_user, handle:38378511638990848
          Plan_digest: 0cea232deba5ed3a5351e923015f60176bbddc64d94f4bd8d68b522e419b6729
            Prev_stmt: 
                Query: update uic_user          SET extend_attr = '{"defaultPaymentType":"1","boxShowModeTipTimes":"1","courier_growth_score":"4909","courier_feature_rank":"2","courier_effective_time":"20200726","boxShowMode":"2","courier_growth_rank":"3","showBoxTypeModePage":"1","boxShowModeGray":"2"}',                                                                                                                         update_time = '2020-07-09 16:35:08.065'          where id = 38378511638990848;

To confirm what happened at the time point when duration rose:

  1. Check the drainer event count and sql query time panels in the monitoring.
  2. What is drainer's worker-count (number of worker threads)?
  3. Roughly what were the QPS and server load on the downstream tidb at that time?

This TiDB cluster has no drainer.
So far we found that the performance_schema_max_rwlock_classes parameter is cleared after 40 seconds, and innodb_lock_wait_timeout is 50 seconds; neither looks related to this locking problem.
Also, each incident lasts about 10 minutes, and those 10 minutes should be the GC life time.

My guess is that a lock taken by some SQL statement was never cleaned up, which made all the other SQL statements time out, and it only got cleared once the life time had passed.
We have gone through the slow SQL and every statement that ran into the timeout, but so far we cannot find the SQL that held the lock.

hi,
Is your problem that the current tidb cluster duration suddenly rises and pump writes also suddenly slow down? If so, you can first focus on why writes on the tidb side are slow; pump slowing down along with the writes is expected, since binlog writing is done inside 2PC.

Could you upload the tidb log from 16:30 - 16:50, and use the method below to upload monitoring screenshots of tidb, tikv-detail and tikv-trouble-shooting for the same time range? We will analyze them.


Open the grafana monitoring page; pressing d and then shift+e expands all monitoring panels.

(1) Install this extension in Chrome: https://chrome.google.com/webstore/detail/full-page-screen-capture/fdpohaocaechififmbbbbbknoalclacl

(2) With the mouse focused on the Dashboard, press ? to display all shortcuts; pressing d then E expands the Panels of all Rows. Wait a while for the page to finish loading.

(3) Use the full-page-screen-capture extension to capture and save a screenshot of the page.

Pump writes are normal. We can't install that extension at the moment; is there another way?

Link: https://pan.baidu.com/s/1Bx93zRgCLXhwNcFTRNcsNg  Extraction code: ex2s

One of them has very heavy conflicts:
[INFO] [lock_resolver.go:487] ["resolveLock rollback"] [lock="key: {tableID=602, handle=185114085543665664}, primary: {tableID=588, handle=378081179053621248}, txnStartTS: 417931653255266349, ttl: 669169, type: PessimisticLock"]

Table 588 is uic_open_user_detail
Table 602 is uic_user

This log is the 2PC lock-resolving log; it is expected behavior.

You can check whether the tikv servers' disks had any problem at the time points when duration was high.

The disks are fine. We found one entry in the log; it looks like it exceeded the time limit while querying, but this SQL itself was not actually running that long.

[2020/07/09 16:30:34.831 +08:00] [WARN] [conn.go:684] ["dispatch error"] [conn=2762076] [connInfo="id:2762076, addr:10.204.9.202:39680 status:1, collation:utf8_general_ci, user:uic"] [command=Query] [status="inTxn:1, autocommit:0"] [sql="select\
     \
     \
    \
    id, business_id, associated_business_id, user_id, mobile_phone, associated_user_id, \
    associated_mobile_phone, associated_nickname, associated_wx_id, associated_cancel_type, \
    associated_cancel_having, associated_cancel_tm, associated_status, create_emp, update_emp, \
    create_sys_tm, update_sys_tm, delete_flag\
   \
    from uic_user_passive_associated\
     \
       \
    \
     WHERE (  user_id = 485910153028198400\
                \
            \
              \
                  and associated_status = 1\
                \
            \
              \
                  and delete_flag = 0 )"] [err="[tikv:9006]GC life time is shorter than transaction duration, transaction starts at 2020-07-09 16:18:54.891 +0800 CST, GC safe point is 2020-07-09 16:19:31.441 +0800 CST\
github.com/pingcap/errors.AddStack\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/pkg/mod/github.com/pingcap/errors@v0.11.4/errors.go:174\
github.com/pingcap/parser/terror.(*Error).GenWithStackByArgs\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/pkg/mod/github.com/pingcap/parser@v0.0.0-20200301155133-79ec3dee69a5/terror/terror.go:238\
github.com/pingcap/tidb/store/tikv.(*tikvStore).CheckVisibility\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/store/tikv/kv.go:180\
github.com/pingcap/tidb/store/tikv.(*copIterator).Next\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/store/tikv/coprocessor.go:623\
github.com/pingcap/tidb/distsql.(*selectResult).fetch\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/distsql/select_result.go:105\
runtime.goexit\
\t/usr/local/go/src/runtime/asm_amd64.s:1357"]
[2020/07/09 16:30:34.833 +08:00] [WARN] [conn.go:684] ["dispatch error"] [conn=2762076] [connInfo="id:2762076, addr:10.204.9.202:39680 status:1, collation:utf8_general_ci, user:uic"] [command=Query] [status="inTxn:1, autocommit:0"] [sql="select\
     \
     \
    id, user_id, audit_status, effective_time, open_time, company_id, company_name, dept_code,\
    dept_name, staff_id, head_img_url, alipay_account, wx_account, points, update_points_time,\
    real_name_auth, attachments_name, last_login_edcode, last_login_time, province, city,\
    district, audit_time, audit_emp, create_time, create_emp, update_time, update_emp,\
    settle_id, remark, status, is_del, submit_audit_source, version,is_auth_again,is_company_call,blacklist_kind,black_status_expire_time\
   \
    from uic_courier_role_detail\
     \
       \
\
     WHERE (  user_id = 504463638380404736\
                \
            \
              \
                  and status = 1\
                \
            \
              \
                  and is_del = 0 ) \
   \
     \
     \
      order by create_time desc, id desc"] [err="[tikv:9006]GC life time is shorter than transaction duration, transaction starts at 2020-07-09 16:18:54.891 +0800 CST, GC safe point is 2020-07-09 16:19:31.441 +0800 

This log means the sql execution exceeded the gc time; it was already reported at 16:30, so it will not affect later sql.

Yes, but there is no sql that slow. Is there a way to fish out all the sql statements that ran in the same transaction as this one?
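One way we can try to fish them out ourselves is the slow-log table of that TiDB instance. A sketch, using the Txn_start_ts and Conn_ID from the slow-log entry above, and assuming this v3.0 instance exposes those columns in INFORMATION_SCHEMA.SLOW_QUERY (it only reads the local slow log, and only statements above tidb_slow_log_threshold are recorded):

-- all recorded statements of that exact transaction
SELECT Time, Conn_ID, Query_time, Query
FROM INFORMATION_SCHEMA.SLOW_QUERY
WHERE Txn_start_ts = 417931972074799124
ORDER BY Time;

-- or everything the same connection ran in that window
SELECT Time, Txn_start_ts, Query_time, Query
FROM INFORMATION_SCHEMA.SLOW_QUERY
WHERE Conn_ID = 2761922
  AND Time BETWEEN '2020-07-09 16:30:00' AND '2020-07-09 16:50:00'
ORDER BY Time;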

Search the tidb log around the corresponding lock time points for entries containing "write conflict". If they exist, there are write-write conflicts; use the information printed in that log and the methods below to find the corresponding data, and please also check with the business side whether there is room for improvement at the application level.
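For example, a quick grep over the tidb log around that time point (a sketch; the log path is a placeholder for your own deployment):

grep -in "write conflict" {tidb-log-path} | grep "2020/07/09 16:3"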

The "not expired" / "resolved" in the first post corresponds to the status of the lock.

Convert the timestamp into a readable time with pd-ctl:
./pd-ctl -u https://127.0.0.1:2379 tso {TIMESTAMP}

Look up the table name from the tableID:
curl http://{TiDBIP}:10080/db-table/{tableID}

Look up the index name from the indexID:
SELECT * FROM INFORMATION_SCHEMA.TIDB_INDEXES WHERE TABLE_SCHEMA='{db_name}' AND TABLE_NAME='{table_name}' AND INDEX_ID={indexID};
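For instance, applied to the lock printed earlier in this thread (the PD and TiDB addresses are placeholders; 602 and 588 are the tableIDs from that lock):

./pd-ctl -u http://{PDIP}:2379 tso 417931653255266349
curl http://{TiDBIP}:10080/db-table/602
curl http://{TiDBIP}:10080/db-table/588

The first command prints the wall-clock time that the lock's txnStartTS corresponds to, and the two curl calls should map the tableIDs back to uic_user and uic_open_user_detail.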

OK. After switching the system to optimistic locking, it no longer locks up outright, but the gc life time problem is still there. The strangest part is that only one TiDB server hits it, and it judges the transaction start time to be 2020-07-13 09:47:06 while the machine's current time is actually 2020/07/13 10:34:22. We're looking into why; it looks like a time issue.

[2020/07/13 10:34:22.832 +08:00] [WARN] [session.go:983] ["run statement error"] [conn=2773633] [schemaVersion=340] [error="[tikv:9006]GC life time is shorter than transaction duration, transaction starts at 2020-07-13 09:47:06.141 +0800 CST, GC safe point is 2020-07-13 10:19:31.441 +0800 CST"] [errorVerbose="[tikv:9006]GC life time is shorter than transaction duration, transaction starts at 2020-07-13 09:47:06.141 +0800 CST, GC safe point is 2020-07-13 10:19:31.441 +0800 CST\
github.com/pingcap/errors.AddStack\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/pkg/mod/github.com/pingcap/errors@v0.11.4/errors.go:174\
github.com/pingcap/parser/terror.(*Error).GenWithStackByArgs\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/pkg/mod/github.com/pingcap/parser@v0.0.0-20200301155133-79ec3dee69a5/terror/terror.go:238\
github.com/pingcap/tidb/store/tikv.(*tikvStore).CheckVisibility\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/store/tikv/kv.go:180\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).Get\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:269\
github.com/pingcap/tidb/executor.(*PointGetExecutor).get\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/executor/point_get.go:194\
github.com/pingcap/tidb/executor.(*PointGetExecutor).Next\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/executor/point_get.go:129\
github.com/pingcap/tidb/executor.Next\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/executor/executor.go:192\
github.com/pingcap/tidb/executor.(*UpdateExec).fetchChunkRows\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/executor/update.go:190\
github.com/pingcap/tidb/executor.(*UpdateExec).Next\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/executor/update.go:149\
github.com/pingcap/tidb/executor.Next\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/executor/executor.go:192\
github.com/pingcap/tidb/executor.(*ExecStmt).handleNoDelayExecutor\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/executor/adapter.go:435\
github.com/pingcap/tidb/executor.(*ExecStmt).Exec\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/executor/adapter.go:299\
github.com/pingcap/tidb/session.runStmt\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/session/tidb.go:251\
github.com/pingcap/tidb/session.(*session).executeStatement\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/session/session.go:980\
github.com/pingcap/tidb/session.(*session).execute\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/session/session.go:1094\
github.com/pingcap/tidb/session.(*session).Execute\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/session/session.go:1018\
github.com/pingcap/tidb/server.(*TiDBContext).Execute\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/server/driver_tidb.go:246\
github.com/pingcap/tidb/server.(*clientConn).handleQuery\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/server/conn.go:1205\
github.com/pingcap/tidb/server.(*clientConn).dispatch\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/server/conn.go:921\
github.com/pingcap/tidb/server.(*clientConn).Run\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/server/conn.go:668\
github.com/pingcap/tidb/server.(*Server).onConn\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/server/server.go:454\
runtime.goexit\
\t/usr/local/go/src/runtime/asm_amd64.s:1357"] [session="{\
  \"currDBName\": \"uic\",\
  \"id\": 2773633,\
  \"status\": 1,\
  \"strictMode\": true,\
  \"txn\": \"418016087543906496\",\
  \"user\": {\
    \"Username\": \"uic\",\
    \"Hostname\": \"10.204.9.202\",\
    \"CurrentUser\": false,\
    \"AuthUsername\": \"uic\",\
    \"AuthHostname\": \"10.204.%.%\"\
  }\
}"]

The start ts is 2020-07-13 09:47:06.141 +0800 CST and the GC safe point is 2020-07-13 10:19:31.441 +0800 CST. Please check what gc life time is set to.
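On v3.0 it can be read from the mysql.tidb table, for example:

SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME IN ('tikv_gc_life_time', 'tikv_gc_run_interval', 'tikv_gc_safe_point');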

10 minutes. It looks like this is exactly what caused the first problem in this thread; we just don't have a lead yet.

Because this traffic is replayed from a recording, we can see the operation time is 2020-07-13 09:30:13.436.

[2020/07/13 09:29:07.278 +08:00] [WARN] [conn.go:684] ["dispatch error"] [conn=2773633] [connInfo="id:2773633, addr:10.204.9.202:52650 status:1, collation:utf8_general_ci, user:uic"] [command=Query] [status="inTxn:1, autocommit:0"] [sql="update uic_user\
         SET extend_attr = '{\"defaultPaymentType\":\"1\",\"boxShowModeTipTimes\":\"1\",\"courier_growth_score\":\"7854\",\"courier_feature_rank\":\"1\",\"courier_effective_time\":\"20200808\",\"boxShowMode\":\"2\",\"real_name_flag\":\"1\",\"courier_growth_rank\":\"4\",\"showBoxTypeModePage\":\"1\",\"boxShowModeGray\":\"2\"}',\
                update_time = '2020-07-13 09:30:13.436' \
        where id = 246881774439772160"] [err="[tikv:9006]GC life time is shorter than transaction duration, transaction starts at 2020-07-13 09:15:03.041 +0800 CST, GC safe point is 2020-07-13 09:19:31.441 +0800 CST\
github.com/pingcap/errors.AddStack\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/pkg/mod/github.com/pingcap/errors@v0.11.4/errors.go:174\
github.com/pingcap/parser/terror.(*Error).GenWithStackByArgs\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/pkg/mod/github.com/pingcap/parser@v0.0.0-20200301155133-79ec3dee69a5/terror/terror.go:238\
github.com/pingcap/tidb/store/tikv.(*tikvStore).CheckVisibility\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/store/tikv/kv.go:180\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).Get\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:269\
github.com/pingcap/tidb/executor.(*PointGetExecutor).get\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/executor/point_get.go:194\
github.com/pingcap/tidb/executor.(*PointGetExecutor).Next\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/executor/point_get.go:129\
github.com/pingcap/tidb/executor.Next\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/executor/executor.go:192\
github.com/pingcap/tidb/executor.(*UpdateExec).fetchChunkRows\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/executor/update.go:190\
github.com/pingcap/tidb/executor.(*UpdateExec).Next\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/executor/update.go:149\
github.com/pingcap/tidb/executor.Next\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/executor/executor.go:192\
github.com/pingcap/tidb/executor.(*ExecStmt).handleNoDelayExecutor\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/executor/adapter.go:435\
github.com/pingcap/tidb/executor.(*ExecStmt).Exec\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/executor/adapter.go:299\
github.com/pingcap/tidb/session.runStmt\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/session/tidb.go:251\
github.com/pingcap/tidb/session.(*session).executeStatement\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/session/session.go:980\
github.com/pingcap/tidb/session.(*session).execute\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/session/session.go:1094\
github.com/pingcap/tidb/session.(*session).Execute\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/session/session.go:1018\
github.com/pingcap/tidb/server.(*TiDBContext).Execute\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/server/driver_tidb.go:246\
github.com/pingcap/tidb/server.(*clientConn).handleQuery\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/server/conn.go:1205\
github.com/pingcap/tidb/server.(*clientConn).dispatch\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/server/conn.go:921\
github.com/pingcap/tidb/server.(*clientConn).Run\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/server/conn.go:668\
github.com/pingcap/tidb/server.(*Server).onConn\
\t/home/jenkins/agent/workspace/tidb_v3.0.11/go/src/github.com/pingcap/tidb/server/server.go:454\
runtime.goexit\
\t/usr/local/go/src/runtime/asm_amd64.s:1357"]

Hello,

Adjusting gc life time to a larger value (24h? 72h?) can work around this problem; it is caused by this sql (transaction) running for a relatively long time.
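A sketch of raising it on v3.0 through the same mysql.tidb table (the 24h below is only an example value; you can lower it again once the long transactions are sorted out):

UPDATE mysql.tidb
SET VARIABLE_VALUE = '24h'
WHERE VARIABLE_NAME = 'tikv_gc_life_time';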

This sql cannot have been running that long. It was issued at minute :30 and reached TiDB at minute :29; we checked and TiDB's clock is one minute behind the application's, so the specific times here are fine.

Could you run this sql in a shell and upload a screenshot of its execution time?

Normally it runs with no problem. At the moment this kind of sql only occurs during a certain period, and only on one particular TiDB server.

Could you provide the monitoring for tidb's duration and QPS load at that time? The load may have been high and the sql ran slowly then. You can also search the slow log for this statement, pull out the execution plan from that time, and check whether it matches the current plan.
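A sketch of pulling the historical entries from the slow-log table by the statement's digest (the Digest value is taken from the slow-log entry in the first post; whether the Plan and Prewrite_time columns are exposed depends on the exact v3.0.11 slow_query schema, so adjust as needed):

SELECT Time, Query_time, Prewrite_time, Plan
FROM INFORMATION_SCHEMA.SLOW_QUERY
WHERE Digest = '2f2abcc1729c7cec9284548a61e5179d5e4fde4da523f5c9d1924922d5ddcd13'
ORDER BY Time DESC;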