Occasional "tikv aborts txn" errors after migrating a TiDB cluster to another data center

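【TiDB environment】Production
【TiDB version】7.1.5
【Operating system】Debian 12
【Deployment method】Cloud deployment (which cloud) / on-machine deployment (machine specs, disk type)
【Cluster data volume】
【Cluster nodes】3 TiDB / 3 PD / 3 TiKV
【Steps to reproduce】Which operations led to the problem
The application and the TiDB cluster were both in data center A. The TiDB cluster has now been migrated from data center A to data center B, while the application remains in A. Latency between A and B is 0.5 ms.
【Problem encountered: symptoms and impact】
After the application started using the cluster in data center B, it occasionally reports the error below. The error never appeared before, and the application itself has not changed.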

err="tikv aborts txn: Error(Txn(Error(Mvcc(Error(PessimisticLockNotFound { start_ts: TimeStamp(456533202726813701), key: [116, 128, 0, 0, 0, 0, 0, 4, 0, 95, 105, 128, 0, 0, 0, 0, 0, 0, 1, 1, 50, 48, 50, 53, 48, 51, 48, 57, 255, 50, 51, 45, 119, 111, 119, 45, 48, 255, 55, 45, 51, 48, 82, 87, 71, 120, 255, 51, 81, 51, 52, 75, 84, 67, 99, 255, 0, 0, 0, 0, 0, 0, 0, 0, 247, 3, 128, 0, 0, 0, 0, 0, 0, 42], reason: LockTsMismatch })))))\ngithub.com/tikv/client-go/v2/error.ExtractKeyErr\n\t/root/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240417121055-7b4535c36317/error/error.go:310\ngithub.com/tikv/client-go/v2/txnkv/txnlock.ExtractLockFromKeyErr\n\t/root/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240417121055-7b4535c36317/txnkv/txnlock/lock.go:27\ngithub.com/tikv/client-go/v2/txnkv/transaction.actionPrewrite.handleSingleBatch\n\t/root/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240417121055-7b4535c36317/txnkv/transaction/prewrite.go:440\ngithub.com/tikv/client-go/v2/txnkv/transaction.(*batchExecutor).startWorker.func1\n\t/root/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.8-0.20240417121055-7b4535c36317/txnkv/transaction/2pc.go:1980\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598\ngithub.com/pingcap/errors.AddStack\n\t/root/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/errors.go:174\ngithub.com/pingcap/errors.Trace\n\t/root/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/juju_adaptor.go:15\ngithub.com/pingcap/tidb/store/driver/error.ToTiDBErr\n\t/workspace/source/tidb/store/driver/error/error.go:187\ngithub.com/pingcap/tidb/store/driver/txn.extractKeyErr\n\t/workspace/source/tidb/store/driver/txn/error.go:162\ngithub.com/pingcap/tidb/store/driver/txn.(*tikvTxn).extractKeyErr\n\t/workspace/source/tidb/store/driver/txn/txn_driver.go:316\ngithub.com/pingcap/tidb/store/driver/txn.(*tikvTxn).Commit\n\t/workspace/source/tidb/store/driver/txn/txn_driver.go:100\ngithub.com/pingcap/tidb/session.(*LazyTxn).Commit\n\t/workspace/source/tidb/session/txn.go:429\ngithub.com/pingcap/tidb/session.(*session).commitTxnWithTemporaryData\n\t/workspace/source/tidb/session/session.go:809\ngithub.com/pingcap/tidb/session.(*session).doCommit\n\t/workspace/source/tidb/session/session.go:690\ngithub.com/pingcap/tidb/session.(*session).doCommitWithRetry\n\t/workspace/source/tidb/session/session.go:943\ngithub.com/pingcap/tidb/session.(*session).CommitTxn\n\t/workspace/source/tidb/session/session.go:1070\ngithub.com/pingcap/tidb/session.autoCommitAfterStmt\n\t/workspace/source/tidb/session/tidb.go:297\ngithub.com/pingcap/tidb/session.finishStmt\n\t/workspace/source/tidb/session/tidb.go:259\ngithub.com/pingcap/tidb/session.runStmt\n\t/workspace/source/tidb/session/session.go:2441\ngithub.com/pingcap/tidb/session.(*session).ExecuteStmt\n\t/workspace/source/tidb/session/session.go:2271\ngithub.com/pingcap/tidb/server.(*TiDBContext).ExecuteStmt\n\t/workspace/source/tidb/server/driver_tidb.go:294\ngithub.com/pingcap/tidb/server.(*clientConn).handleStmt\n\t/workspace/source/tidb/server/conn.go:2133\ngithub.com/pingcap/tidb/server.(*clientConn).handleQuery\n\t/workspace/source/tidb/server/conn.go:1901\ngithub.com/pingcap/tidb/server.(*clientConn).dispatch\n\t/workspace/source/tidb/server/conn.go:1388\ngithub.com/pingcap/tidb/server.(*clientConn).Run\n\t/workspace/source/tidb/server/conn.go:1169\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/workspace/source/tidb/server/server.go:718\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598

【Resource configuration】Go to TiDB Dashboard - Cluster Info - Hosts and attach a screenshot of that page
【Copy and paste the ERROR log】
【Other attachments: screenshots / logs / monitoring】

PessimisticLockNotFound
{ start_ts: TimeStamp(456533202726813701),
key: [116, 128, 0, 0, 0, 0, 0, 4, 0, 95, 105, 128, 0, 0, 0, 0, 0, 0, 1, 1, 50, 48, 50, 53, 48, 51, 48, 57, 255, 50, 51, 45, 119, 111, 119, 45, 48, 255, 55, 45, 51, 48, 82, 87, 71, 120, 255, 51, 81, 51, 52, 75, 84, 67, 99, 255, 0, 0, 0, 0, 0, 0, 0, 0, 247, 3, 128, 0, 0, 0, 0, 0, 0, 42]

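I'd investigate this from the business scenario. This clearly looks like the lock on the key could no longer be found when it was time to release it…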

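Troubleshooting so far: the TiKV log contains errors for this key, and I confirmed that the key belongs to an index on one of the tables. After dropping and recreating the index, the error still occurs.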
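
For readers who want to check which table and index the key in the error belongs to, here is a minimal sketch in Python. It assumes the usual TiDB index key layout (`t` + table ID + `_i` + index ID + encoded column values) and handles only the two datum flags that appear in this particular key (1 for memcomparable bytes, 3 for a signed integer). The function names are made up for illustration and are not part of any TiDB tool.

```python
import struct

# Key bytes copied from the PessimisticLockNotFound error in this thread.
KEY = [116, 128, 0, 0, 0, 0, 0, 4, 0, 95, 105, 128, 0, 0, 0, 0, 0, 0, 1,
       1, 50, 48, 50, 53, 48, 51, 48, 57, 255, 50, 51, 45, 119, 111, 119, 45, 48, 255,
       55, 45, 51, 48, 82, 87, 71, 120, 255, 51, 81, 51, 52, 75, 84, 67, 99, 255,
       0, 0, 0, 0, 0, 0, 0, 0, 247, 3, 128, 0, 0, 0, 0, 0, 0, 42]

SIGN_BIT = 1 << 63


def decode_int(chunk):
    """Decode 8 big-endian bytes whose sign bit was flipped to keep byte order sortable."""
    (raw,) = struct.unpack(">Q", bytes(chunk))
    return raw - SIGN_BIT


def decode_bytes(buf, pos):
    """Decode memcomparable bytes: 8-byte groups, each followed by a marker (255 - padding)."""
    out = bytearray()
    while True:
        group, marker = buf[pos:pos + 8], buf[pos + 8]
        pos += 9
        pad = 255 - marker
        out += bytes(group[:8 - pad])
        if pad:  # a padded group is always the last one
            return bytes(out), pos


def decode_index_key(key):
    """Return (table_id, index_id, values) for an index key; hypothetical helper."""
    if key[0] != ord("t") or bytes(key[9:11]) != b"_i":
        raise ValueError("not an index key")
    table_id = decode_int(key[1:9])
    index_id = decode_int(key[11:19])
    pos, values = 19, []
    while pos < len(key):
        flag = key[pos]
        pos += 1
        if flag == 1:    # bytes datum
            value, pos = decode_bytes(key, pos)
        elif flag == 3:  # signed integer datum
            value, pos = decode_int(key[pos:pos + 8]), pos + 8
        else:
            raise ValueError(f"unsupported datum flag {flag}")
        values.append(value)
    return table_id, index_id, values


if __name__ == "__main__":
    print(decode_index_key(KEY))
```

Under these assumptions the key decodes to table ID 1024 and index ID 1, with one string column value followed by the integer 42 (probably the row handle of a non-unique index entry). The table name can then be looked up through the `TIDB_TABLE_ID` column of `information_schema.tables`, and TiDB's `TIDB_DECODE_KEY()` SQL function should give the same result when passed the hex-encoded key.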

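  1. How was the cluster migrated? By scaling out and in? Is the entire TiDB cluster in data center B now?
  2. Will the application servers be migrated to data center B later?
  1. The migration used BR backup and restore for the existing data, plus TiCDC for incremental replication.
  2. The application will probably move to data center B later as well, but the actual latency between the two data centers is under 1 ms.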
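
One rough way to confirm the sub-millisecond figure from the application side is to time TCP connections from an application host in data center A to a TiDB server in data center B. The sketch below is only an illustration: the host name is a placeholder, port 4000 is assumed to be the TiDB SQL port, and TCP connect time is just an approximation of one network round trip. It says nothing about TiDB-to-TiKV latency inside data center B.

```python
import socket
import statistics
import time


def tcp_connect_latencies_ms(host, port, samples=20, timeout=2.0):
    """Time `samples` TCP connects to host:port and return each duration in milliseconds."""
    results = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.perf_counter() - start
        results.append(elapsed * 1000.0)
    return results


def summarize(latencies_ms):
    """Reduce the samples to min / median / max so that outliers stand out."""
    ordered = sorted(latencies_ms)
    return {"min": ordered[0], "median": statistics.median(ordered), "max": ordered[-1]}


if __name__ == "__main__":
    # Placeholder address: replace with a real TiDB server in data center B.
    print(summarize(tcp_connect_latencies_ms("tidb.example.internal", 4000)))
```

A large gap between the median and the maximum would point to occasional latency spikes, which an average of 0.5 ms can easily hide.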

Then it may not be related to the data center migration :thinking:

See whether this document helps: https://docs.pingcap.com/zh/tidb/stable/troubleshoot-lock-conflicts/

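  • 7.2.3 TxnLockNotFound: the transaction committed too slowly and, once its TTL (Time To Live) had expired, was rolled back by another transaction. The transaction is retried automatically, so the application usually does not notice. For small transactions under 0.25 MB, the default TTL is 3 seconds. For details, see the section on the "lock cleared (LockNotFound)" error.
  • 7.2.4 PessimisticLockNotFound: similar to TxnLockNotFound. The pessimistic transaction committed too slowly and was rolled back by another transaction.
    That is how the documentation describes this error. Are you sure there is no network latency? Go through the network latency metrics in Grafana carefully. Since the error appeared after a data center migration, network latency seems the most likely cause.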
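
To line an error up with the Grafana panels, it helps to turn the `start_ts` from the error message into a wall-clock time. The sketch below assumes the standard TSO layout used by PD: the upper bits hold a physical timestamp in milliseconds since the Unix epoch, and the lowest 18 bits hold a logical counter. The function names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

LOGICAL_BITS = 18  # the low 18 bits of a TSO are the logical counter

# start_ts copied from the PessimisticLockNotFound error in this thread.
START_TS = 456533202726813701


def split_tso(ts):
    """Split a TSO into (physical milliseconds since the epoch, logical counter)."""
    return ts >> LOGICAL_BITS, ts & ((1 << LOGICAL_BITS) - 1)


def tso_to_utc(ts):
    """Convert the physical part of a TSO to a timezone-aware UTC datetime."""
    physical_ms, _ = split_tso(ts)
    seconds, millis = divmod(physical_ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


if __name__ == "__main__":
    print(split_tso(START_TS))
    print(tso_to_utc(START_TS).isoformat())
```

The resulting time marks when the transaction started, so the Grafana window to inspect is the few seconds after it. TiDB's `TIDB_PARSE_TSO()` SQL function should give the same conversion in the session time zone.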