Unexplained write-write conflicts (unrelated to business data) when replicating data through DM

To help us resolve this faster, please provide the following information; a clear problem description speeds things up:

[TiDB Version]
TiDB v4.0.10 + DM v2.0.1

[Problem Description]
We replicate data from MySQL to TiDB through 5 DM tasks. None of them are shard-merging tasks, the schemas/tables they replicate do not overlap with each other, safe-mode is not explicitly enabled in any of them, and each syncer runs 16 worker threads. Apart from the DM tasks, we have confirmed there are no other writes to the cluster.

After replication started, we observed continuous write-write conflicts on the KV Errors panel of the TiDB monitoring dashboard, as shown in the screenshot below.

Messages in the TiDB log:

[2021/02/06 18:23:12.073 +08:00] [INFO] [2pc.go:822] ["prewrite encounters lock"] [conn=0] [lock="key: {metaKey=true, key=DB:820, field=TID:862}, primary: {metaKey=true, key=DB:820, field=TID:862}, txnStartTS: 422735247324217368, lockForUpdateTS:0, ttl: 3001, type: Put"]
[2021/02/06 18:23:12.074 +08:00] [WARN] [txn.go:66] [RunInNewTxn] ["retry txn"=422735247324217370] ["original txn"=422735247324217370] [error="[kv:9007]Write conflict, txnStartTS=422735247324217370, conflictStartTS=422735247324217368, conflictCommitTS=422735247324217371, key={metaKey=true, key=DB:820, field=TID:862} primary={metaKey=true, key=DB:820, field=TID:862} [try again later]"]
[2021/02/06 18:23:12.076 +08:00] [INFO] [2pc.go:1336] ["2PC clean up done"] [txnStartTS=422735247324217370]
[2021/02/06 18:23:13.752 +08:00] [INFO] [2pc.go:822] ["prewrite encounters lock"] [conn=0] [lock="key: {metaKey=true, key=DB:820, field=TID:862}, primary: {metaKey=true, key=DB:820, field=TID:862}, txnStartTS: 422735247756755003, lockForUpdateTS:0, ttl: 3001, type: Put"]
[2021/02/06 18:23:13.754 +08:00] [WARN] [txn.go:66] [RunInNewTxn] ["retry txn"=422735247756755004] ["original txn"=422735247756755004] [error="[kv:9007]Write conflict, txnStartTS=422735247756755004, conflictStartTS=422735247756755003, conflictCommitTS=422735247756755010, key={metaKey=true, key=DB:820, field=TID:862} primary={metaKey=true, key=DB:820, field=TID:862} [try again later]"]
[2021/02/06 18:23:13.755 +08:00] [INFO] [2pc.go:1336] ["2PC clean up done"] [txnStartTS=422735247756755004]
[2021/02/06 18:23:13.797 +08:00] [INFO] [2pc.go:822] ["prewrite encounters lock"] [conn=0] [lock="key: {metaKey=true, key=DB:820, field=TID:862}, primary: {metaKey=true, key=DB:820, field=TID:862}, txnStartTS: 422735247769862173, lockForUpdateTS:0, ttl: 3001, type: Put"]
[2021/02/06 18:23:13.799 +08:00] [WARN] [txn.go:66] [RunInNewTxn] ["retry txn"=422735247769862172] ["original txn"=422735247769862172] [error="[kv:9007]Write conflict, txnStartTS=422735247769862172, conflictStartTS=422735247769862173, conflictCommitTS=422735247769862176, key={metaKey=true, key=DB:820, field=TID:862} primary={metaKey=true, key=DB:820, field=TID:862} [try again later]"]
[2021/02/06 18:23:13.800 +08:00] [INFO] [2pc.go:1336] ["2PC clean up done"] [txnStartTS=422735247769862172]
[2021/02/06 18:23:14.209 +08:00] [INFO] [2pc.go:822] ["prewrite encounters lock"] [conn=0] [lock="key: {metaKey=true, key=DB:820, field=TID:862}, primary: {metaKey=true, key=DB:820, field=TID:862}, txnStartTS: 422735247887826949, lockForUpdateTS:0, ttl: 3001, type: Put"]
[2021/02/06 18:23:14.211 +08:00] [WARN] [txn.go:66] [RunInNewTxn] ["retry txn"=422735247887826950] ["original txn"=422735247887826950] [error="[kv:9007]Write conflict, txnStartTS=422735247887826950, conflictStartTS=422735247887826949, conflictCommitTS=422735247887826952, key={metaKey=true, key=DB:820, field=TID:862} primary={metaKey=true, key=DB:820, field=TID:862} [try again later]"]
[2021/02/06 18:23:14.212 +08:00] [INFO] [2pc.go:1336] ["2PC clean up done"] [txnStartTS=422735247887826950]
[2021/02/06 18:23:14.765 +08:00] [INFO] [2pc.go:822] ["prewrite encounters lock"] [conn=0] [lock="key: {metaKey=true, key=DB:820, field=TID:862}, primary: {metaKey=true, key=DB:820, field=TID:862}, txnStartTS: 422735248032006156, lockForUpdateTS:0, ttl: 3001, type: Put"]
[2021/02/06 18:23:14.767 +08:00] [WARN] [txn.go:66] [RunInNewTxn] ["retry txn"=422735248032006158] ["original txn"=422735248032006158] [error="[kv:9007]Write conflict, txnStartTS=422735248032006158, conflictStartTS=422735248032006156, conflictCommitTS=422735248032006162, key={metaKey=true, key=DB:820, field=TID:862} primary={metaKey=true, key=DB:820, field=TID:862} [try again later]"]
[2021/02/06 18:23:14.769 +08:00] [INFO] [2pc.go:1336] ["2PC clean up done"] [txnStartTS=422735248032006158]

However, after checking all the TiKV logs we found no messages related to write conflicts at all (they are basically all about "not leader"), and by rights no write conflicts should occur here. In addition, if we stop the DM task with the heaviest traffic among the five (30 tables, replicating at roughly 1K QPS), the write-conflict errors disappear.

Questions:

  1. The conflict reported in the log is always the same: key={metaKey=true, key=DB:820, field=TID:862}. Which key is conflicting here? (It should have nothing to do with the business data.)
  2. Why does DM use optimistic transactions? Can this be changed?
  3. What are the possible causes of this problem?

Many thanks~~~

Try working through the official write-write conflict troubleshooting doc; it lets you identify which transactions are actually conflicting.
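If the conflicting key were a row key, the doc's decoding steps would point you at the row. Here it is a metaKey, where DB:820 / TID:862 are presumably TiDB's internal database ID and table ID, and a key like this usually belongs to per-table metadata such as the auto-ID allocator. One thing worth trying is mapping the table ID back to a table name; a minimal lookup sketch, assuming the TIDB_TABLE_ID extension column of information_schema.tables is available in your v4.0 cluster (the ID is copied from the log above):

-- Map the internal table ID from the conflict log back to a schema/table name.
SELECT TABLE_SCHEMA, TABLE_NAME
FROM information_schema.tables
WHERE TIDB_TABLE_ID = 862;  -- table ID from key=DB:820, field=TID:862

The TiDB status port should expose the same mapping, e.g. curl http://<tidb-ip>:10080/schema?table_id=862.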

Hi, following the "troubleshooting write-write conflicts under the optimistic transaction model" section of the official docs, we still cannot work out which transactions are conflicting; see question 1 above for details.

To add to this, the txnLock backoff alerts fire very frequently during the evening business peak:

alertname = tidb_tikvclient_backoff_seconds_count
cluster = sht-tidb-cluster-pro
env = ENV_LABELS_ENV
expr = increase( tidb_tikvclient_backoff_seconds_count[10m] ) > 250
instance = 10.0.44.245:10080
job = tidb
level = warning
monitor = prometheus
type = txnLock
Annotations
description = cluster: ENV_LABELS_ENV, instance: 10.0.44.245:10080, values:460.5128205128205
summary = TiDB tikvclient_backoff_count error
value = 460.5128205128205

alertname = tidb_tikvclient_backoff_seconds_count
cluster = sht-tidb-cluster-pro
env = ENV_LABELS_ENV
expr = increase( tidb_tikvclient_backoff_seconds_count[10m] ) > 250
instance = 10.0.44.247:10080
job = tidb
level = warning
monitor = prometheus
type = txnLock
Annotations
description = cluster: ENV_LABELS_ENV, instance: 10.0.44.247:10080, values:518.9743589743589
summary = TiDB tikvclient_backoff_count error
value = 518.9743589743589

I ran into this too and solved it by adding more tidb-server instances; it looks like tidb-server had become the write bottleneck of the cluster.

Thanks for the help. We run 3 TiDB servers (32 cores / 128 GB each) behind two Alibaba Cloud SLBs; a few thousand QPS really shouldn't be a bottleneck, should it? :joy:

@Culbr hello~ Write-write conflicts are fairly common under optimistic transactions, and DM consumes the binlog with batched, concurrent writes during replication. Please still confirm whether any of the upstream replication tasks contains writes that conflict with each other. You can take a look at this post and track down the specific conflicting tables and rows:
dm 增量 rds 数据报错:write conflict

Mine are 40 cores / 128 GB and had the same problem; anyway, that's what I did and it went away :smiley:

After digging through the source code, we found the cause: the business side explicitly specifies the ID when inserting into a table whose primary key column is AUTO_INCREMENT. For now we have removed the AUTO_INCREMENT attribute from that primary key column, and everything is back to normal. Thanks for the help~
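For anyone who hits the same symptom: when a row arrives with an explicit value in an AUTO_INCREMENT column, TiDB may have to rebase the table's auto-ID allocator, and that rebase is an internal write to the per-table meta key (the DB:<db_id> / TID:<table_id> key in the logs above), so concurrent DM syncer threads can collide on that single key. A rough sketch of the pattern and of the workaround described above, using a hypothetical table t (the exact statements and version-specific switches may differ):

-- Hypothetical table: id is an AUTO_INCREMENT primary key.
CREATE TABLE t (id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY, col VARCHAR(32));

-- Pattern replicated by DM: the application supplies the ID itself. If the value
-- is ahead of the allocator, TiDB rebases the allocator, which is a write to the
-- meta key (DB:<db_id>, TID:<table_id>) and can conflict under concurrent inserts.
INSERT INTO t (id, col) VALUES (100000, 'x');

-- Workaround applied here: drop the AUTO_INCREMENT attribute so these inserts no
-- longer touch the allocator meta key. TiDB normally requires enabling the session
-- switch below before it allows removing AUTO_INCREMENT.
SET @@tidb_allow_remove_auto_inc = 1;
ALTER TABLE t MODIFY COLUMN id BIGINT NOT NULL;

The trade-off is that the column no longer auto-fills on the TiDB side, which is generally acceptable for a replication target since DM supplies the upstream value anyway.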

Write-up of the issue: https://www.jianshu.com/p/96a9ce71b0c3

Found the root cause: the application explicitly specifies the ID when inserting into a table with an auto-increment primary key =,=


:+1:

This topic was automatically closed 1 minute after the last reply. New replies are no longer allowed.