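Transactional KV: many ResolveLocks requests time out during Get

Scenario:
A large number of read-modify-write operations on the same key.

Problem:
When operating through the transactional KV interface, the Get API returns a TiKV server timeout error. The stack trace is as follows: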

ts=2020-08-11T15:07:38+08:00 level=error msg="Failed to execute HSET" key=0a28044dd2c8af463926 field=FingerprintsTable value="
\u000e
\t461253038\u0010\ufffd\u001b" error="[tikv:9002]TiKV server timeout" errorVerbose="[tikv:9002]TiKV server timeout

github.com/pingcap/errors.AddStack
/root/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20190809092503-95897b64e011/errors.go:174

github.com/pingcap/errors.Trace
/root/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20190809092503-95897b64e011/juju_adaptor.go:15

github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).onSendFail
/root/go/pkg/mod/github.com/pingcap/tidb@v1.1.0-beta.0.20200604055950-efc1c154d098/store/tikv/region_request.go:315

github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).sendReqToRegion
/root/go/pkg/mod/github.com/pingcap/tidb@v1.1.0-beta.0.20200604055950-efc1c154d098/store/tikv/region_request.go:256

github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReqCtx
/root/go/pkg/mod/github.com/pingcap/tidb@v1.1.0-beta.0.20200604055950-efc1c154d098/store/tikv/region_request.go:216

github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReq
/root/go/pkg/mod/github.com/pingcap/tidb@v1.1.0-beta.0.20200604055950-efc1c154d098/store/tikv/region_request.go:126

github.com/pingcap/tidb/store/tikv.(*tikvStore).SendReq
/root/go/pkg/mod/github.com/pingcap/tidb@v1.1.0-beta.0.20200604055950-efc1c154d098/store/tikv/kv.go:401

github.com/pingcap/tidb/store/tikv.(*LockResolver).getTxnStatus
/root/go/pkg/mod/github.com/pingcap/tidb@v1.1.0-beta.0.20200604055950-efc1c154d098/store/tikv/lock_resolver.go:531

github.com/pingcap/tidb/store/tikv.(*LockResolver).getTxnStatusFromLock
/root/go/pkg/mod/github.com/pingcap/tidb@v1.1.0-beta.0.20200604055950-efc1c154d098/store/tikv/lock_resolver.go:450

github.com/pingcap/tidb/store/tikv.(*LockResolver).resolveLocks
/root/go/pkg/mod/github.com/pingcap/tidb@v1.1.0-beta.0.20200604055950-efc1c154d098/store/tikv/lock_resolver.go:321

github.com/pingcap/tidb/store/tikv.(*LockResolver).resolveLocksLite
/root/go/pkg/mod/github.com/pingcap/tidb@v1.1.0-beta.0.20200604055950-efc1c154d098/store/tikv/lock_resolver.go:295

github.com/pingcap/tidb/store/tikv.(*clientHelper).ResolveLocks
/root/go/pkg/mod/github.com/pingcap/tidb@v1.1.0-beta.0.20200604055950-efc1c154d098/store/tikv/coprocessor.go:823

github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).get
/root/go/pkg/mod/github.com/pingcap/tidb@v1.1.0-beta.0.20200604055950-efc1c154d098/store/tikv/snapshot.go:388

github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).Get
/root/go/pkg/mod/github.com/pingcap/tidb@v1.1.0-beta.0.20200604055950-efc1c154d098/store/tikv/snapshot.go:309

github.com/pingcap/tidb/kv.(*unionStore).Get
/root/go/pkg/mod/github.com/pingcap/tidb@v1.1.0-beta.0.20200604055950-efc1c154d098/kv/union_store.go:113

github.com/pingcap/tidb/store/tikv.(*tikvTxn).Get
/root/go/pkg/mod/github.com/pingcap/tidb@v1.1.0-beta.0.20200604055950-efc1c154d098/store/tikv/txn.go:144

gitlab.vmic.xyz/daas/tula/store/txnkv.(*database).HSet.func1
/root/tula/store/txnkv/hash.go:54

gitlab.vmic.xyz/daas/tula/kv.RunInNewTxn
/root/tula/kv/txn.go:43

gitlab.vmic.xyz/daas/tula/store/txnkv.(*database).HSet
/root/tula/store/txnkv/hash.go:43

gitlab.vmic.xyz/daas/tula/command.HSet
/root/tula/command/hash.go:18

gitlab.vmic.xyz/daas/tula/command.(*Executor).Execute
/root/tula/command/command.go:166

gitlab.vmic.xyz/daas/tula.(*connection).execute
/root/tula/conn.go:143

gitlab.vmic.xyz/daas/tula.(*connection).serve
/root/tula/conn.go:112

gitlab.vmic.xyz/daas/tula.(*Server).serve.func1
/root/tula/tula.go:128

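We found that when the TiKV SDK sends a CheckTxnStatus request, it returns a timeout error, even though connectivity to the server is fine.

TiKV version: 3.0.12

How should we go about troubleshooting this? Thanks!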

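  1. This error indicates that TiKV is busy.
  2. Check the TiKV logs. The read load may be high because of too many coprocessor requests. Check the TiKV monitoring panels for thread CPU, coprocessor, scan keys and so on, for example to see whether there is a read hotspot.
  3. Check the error information for slow queries. Use table_id and txn_start_ts to locate the slow SQL in tidb_slow_query.log, then check whether the execution plan is reasonable or whether a large query scanned too many keys.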

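After investigating, we found the real cause: our TiKV is 3.0, while the client SDK uses the 4.0 interface, so the CheckTxnStatus API is not implemented on the server side. As a result, batch requests waited until they timed out. Upgrading TiKV to 4.0 resolves the issue!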
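
For anyone hitting the same thing, here is a rough sketch of a startup guard that would have flagged the mismatch early. It is not part of the TiKV SDK: `majorVersion` and `checkCompat` are hypothetical helpers, and how you obtain the two version strings (build metadata, cluster info, config) is left out. It only assumes what this thread found, namely that a client built against a newer major version than the server may send requests the server does not implement.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// majorVersion extracts the leading major number from a version string
// such as "3.0.12" or "v4.0.0". Hypothetical helper for illustration.
func majorVersion(v string) (int, error) {
	v = strings.TrimPrefix(strings.TrimSpace(v), "v")
	parts := strings.SplitN(v, ".", 2)
	n, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, fmt.Errorf("cannot parse major version from %q: %w", v, err)
	}
	return n, nil
}

// checkCompat returns an error when the client SDK targets a newer major
// version than the server, which is the situation described in this thread
// (4.0 client interface against a 3.0 TiKV).
func checkCompat(clientVersion, serverVersion string) error {
	c, err := majorVersion(clientVersion)
	if err != nil {
		return err
	}
	s, err := majorVersion(serverVersion)
	if err != nil {
		return err
	}
	if c > s {
		return fmt.Errorf("client SDK major version %d is newer than TiKV major version %d", c, s)
	}
	return nil
}

func main() {
	// Version strings are taken from this thread; in practice they would
	// come from your own build and cluster information.
	if err := checkCompat("4.0", "3.0.12"); err != nil {
		fmt.Println("incompatible:", err)
	}
}
```

Failing fast like this is easier to diagnose than a request that waits until it times out, because the timeout alone looks the same as a busy server.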

:+1:
