When importing data with TiSpark, TiSpark reports an error after running for a while.

TiDB is 5.0.3. After TiSpark has been importing data into TiKV for a while, roughly 2 hours, the import fails with the following error:
TiSpark error log:

com.pingcap.tikv.exception.TiBatchWriteException: Execution exception met.
at com.pingcap.tikv.TwoPhaseCommitter.doPrewriteSecondaryKeys(TwoPhaseCommitter.java:304)
at com.pingcap.tikv.TwoPhaseCommitter.prewriteSecondaryKeys(TwoPhaseCommitter.java:257)
at com.pingcap.tispark.write.TiBatchWrite$$anonfun$doWrite$7.apply(TiBatchWrite.scala:305)
at com.pingcap.tispark.write.TiBatchWrite$$anonfun$doWrite$7.apply(TiBatchWrite.scala:288)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: com.pingcap.tikv.exception.TiBatchWriteException: > max retry number 64, oldRegion={Region[501763] ConfVer[1691] Version[29004] Store[9] KeyRange[t\200\000\000\000\000\000\002\022_i\200\000\000\000\000\000\000\001\003\200\000\000\000{\371J(\003\224\000\000\000\001\001\325\n\000]:[t\200\000\000\000\000\000\002\022_i\200\000\000\000\000\000\000\001\003\200\000\000\000|1u\024\003\330\000\000\000\005\006\364\030\000]}, currentRegion={Region[510827] ConfVer[1691] Version[29007] Store[9] KeyRange[t\200\000\000\000\000\000\002\022_i\200\000\000\000\000\000\000\001\003\200\000\000\000{\247|\021\003\254\000\000\000\002\361{2\000]:[t\200\000\000\000\000\000\002\022_i\200\000\000\000\000\000\000\001\003\200\000\000\000|\002m+\003\354\000\000\000\f\314\200\351\000]}
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at com.pingcap.tikv.TwoPhaseCommitter.doPrewriteSecondaryKeys(TwoPhaseCommitter.java:283)
… 15 more
Caused by: com.pingcap.tikv.exception.TiBatchWriteException: > max retry number 64, oldRegion={Region[501763] ConfVer[1691] Version[29004] Store[9] KeyRange[t\200\000\000\000\000\000\002\022_i\200\000\000\000\000\000\000\001\003\200\000\000\000{\371J(\003\224\000\000\000\001\001\325\n\000]:[t\200\000\000\000\000\000\002\022_i\200\000\000\000\000\000\000\001\003\200\000\000\000|1u\024\003\330\000\000\000\005\006\364\030\000]}, currentRegion={Region[510827] ConfVer[1691] Version[29007] Store[9] KeyRange[t\200\000\000\000\000\000\002\022_i\200\000\000\000\000\000\000\001\003\200\000\000\000{\247|\021\003\254\000\000\000\002\361{2\000]:[t\200\000\000\000\000\000\002\022_i\200\000\000\000\000\000\000\001\003\200\000\000\000|\002m+\003\354\000\000\000\f\314\200\351\000]}
at com.pingcap.tikv.TwoPhaseCommitter.doPrewriteSecondaryKeysInBatchesWithRetry(TwoPhaseCommitter.java:358)
at com.pingcap.tikv.TwoPhaseCommitter.retryPrewriteBatch(TwoPhaseCommitter.java:387)
at com.pingcap.tikv.TwoPhaseCommitter.doPrewriteSecondaryKeysInBatchesWithRetry(TwoPhaseCommitter.java:366)
at com.pingcap.tikv.TwoPhaseCommitter.retryPrewriteBatch(TwoPhaseCommitter.java:387)
at com.pingcap.tikv.TwoPhaseCommitter.doPrewriteSecondaryKeysInBatchesWithRetry(TwoPhaseCommitter.java:366)
at com.pingcap.tikv.TwoPhaseCommitter.retryPrewriteBatch(TwoPhaseCommitter.java:387)
at com.pingcap.tikv.TwoPhaseCommitter.doPrewriteSecondaryKeysInBatchesWithRetry(TwoPhaseCommitter.java:366)
at com.pingcap.tikv.TwoPhaseCommitter.retryPrewriteBatch(TwoPhaseCommitter.java:387)
at com.pingcap.tikv.TwoPhaseCommitter.doPrewriteSecondaryKeysInBatchesWithRetry(TwoPhaseCommitter.java:366)
at com.pingcap.tikv.TwoPhaseCommitter.retryPrewriteBatch(TwoPhaseCommitter.java:387)
at com.pingcap.tikv.TwoPhaseCommitter.doPrewriteSecondaryKeysInBatchesWithRetry(TwoPhaseCommitter.java:366)
at com.pingcap.tikv.TwoPhaseCommitter.retryPrewriteBatch(TwoPhaseCommitter.java:387)
at com.pingcap.tikv.TwoPhaseCommitter.doPrewriteSecondaryKeysInBatchesWithRetry(TwoPhaseCommitter.java:366)
at com.pingcap.tikv.TwoPhaseCommitter.retryPrewriteBatch(TwoPhaseCommitter.java:387)
at com.pingcap.tikv.TwoPhaseCommitter.doPrewriteSecondaryKeysInBatchesWithRetry(TwoPhaseCommitter.java:366)
at com.pingcap.tikv.TwoPhaseCommitter.retryPrewriteBatch(TwoPhaseCommitter.java:387)
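As a side note, the escaped oldRegion/currentRegion key ranges in the exception above follow TiDB's key layout (`t` + 8-byte memcomparable table ID, then `_i` + 8-byte index ID). The minimal Python decoder below is a sketch assuming that standard encoding, useful for identifying which table and index the conflicting region covers:

```python
import struct

SIGN_MASK = 0x8000000000000000

def decode_int64(b: bytes) -> int:
    # TiDB encodes int64 key components as big-endian uint64 with the sign
    # bit flipped, so byte-wise order matches numeric order (memcomparable).
    # This simplified decoder is valid for non-negative IDs.
    return struct.unpack(">Q", b)[0] ^ SIGN_MASK

def decode_key_prefix(key: bytes):
    """Decode the prefix of a TiDB key: t<table_id>[_i<index_id>]."""
    assert key[:1] == b"t", "not a table key"
    table_id = decode_int64(key[1:9])
    index_id = decode_int64(key[11:19]) if key[9:11] == b"_i" else None
    return table_id, index_id

# The range start from the error, with octal escapes (\200 etc.) as bytes:
key = b"t\x80\x00\x00\x00\x00\x00\x02\x12_i\x80\x00\x00\x00\x00\x00\x00\x01"
print(decode_key_prefix(key))
```

Run against the key prefix in the error, this decodes to table ID 530, index ID 1, i.e. the conflict is on an index key range of a single table.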

What versions of TiSpark and Spark are you using? Also, during the failure window, are there any ERROR-level logs on TiKV or TiDB?

+1. Run `select ti_version()` to check the TiSpark version info; that will help narrow down the problem.

TiSpark 2.4.1 and Spark 2.4.0. TiDB is not involved; the TiSpark jar communicates with TiKV directly.

There are quite a few TiKV nodes; let me check whether any of them report errors.

The TiKV error logs in that time window are as follows:
[2022/06/23 15:44:39.568 +08:00] [WARN] [endpoint.rs:639] [error-response] [err="Key is locked (will clean up) primary_lock: 7480000000000002125F72820000001AC399FF lock_version: 434102246260080641 key: 7480000000000002125F698000000000000001038000000135D43A1803FE0000001BB4D1BD lock_ttl: 20000 txn_size: 16 min_commit_ts: 434102246260080641"]

It only needs to run for a while, roughly 2 hours, and then it keeps erroring out.

Based on my current understanding, here is a reply:
Spark write process: lock table -> prewrite primary key (pk) -> prewrite secondary keys (sk) -> commit pk -> release table lock -> commit sk
Judging from the logs you have submitted, the error occurs in the prewrite-sk phase, with retries. In principle, some SQL statement holds a lock on the data being written, and the optimistic-lock retries fail after exceeding the retry limit. Whether that lock comes from a query SQL or from a region split cannot be determined yet; we will consult an expert and get back to you. In the meantime, if you have any suspicious slow SQL or other logs, please keep posting them.
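The retry behavior described above can be sketched as a bounded loop. The Python below is a hypothetical simulation, not the actual Java `TwoPhaseCommitter`: each conflicting prewrite attempt is retried (the real client refreshes its region cache first), and once the budget of 64 is exhausted the `> max retry number 64` failure surfaces:

```python
MAX_RETRY = 64  # mirrors "> max retry number 64" in the TiSpark error

class PrewriteConflict(Exception):
    """Raised when a key is locked or the region epoch has changed."""

def prewrite_batch(batch, attempt, conflict_until):
    # Simulated prewrite: keeps failing while locks / region splits persist.
    if attempt < conflict_until:
        raise PrewriteConflict("Key is locked / region epoch changed")
    return "ok"

def prewrite_with_retry(batch, conflict_until):
    for attempt in range(MAX_RETRY + 1):
        try:
            return prewrite_batch(batch, attempt, conflict_until)
        except PrewriteConflict:
            continue  # the real client refreshes region info, then retries
    raise RuntimeError(f"> max retry number {MAX_RETRY}")
```

A transient conflict (a few locked keys, one region split) resolves within the budget; a persistent one, such as continuous splitting under heavy write load, exhausts all 64 attempts and aborts the batch write.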

The application started reporting write errors a little after 13:50, and it has kept erroring since.

[2022/06/23 12:33:19.885 +08:00] [WARN] [peer.rs:3154] ["failed to propose confchange"] [error="Other(\"[components/raftstore/src/store/peer.rs:2442]: [region 446061] 446296 unsafe to perform conf change [peer { id: 501424 store_id: 5 }, peer { id: 501423 store_id: 8 }, change_type: AddLearnerNode peer { id: 465879 store_id: 6 role: Learner }, change_type: AddLearnerNode peer { id: 457835 store_id: 9 role: Learner }], before: voters: 457835 voters: 446296 voters: 465879 learners: 501424 learners: 501423, after: voters: 446296 voters: 501424 voters: 501423 voters_outgoing: 446296 voters_outgoing: 465879 voters_outgoing: 457835 learners_next: 457835 learners_next: 465879, truncated index 11569, promoted commit index 0\")"]
[2022/06/23 12:35:56.923 +08:00] [WARN] [peer.rs:2963] ["skip proposal"] [error_code=KV:Raftstore:Unknown] [err="Other(\"[components/raftstore/src/store/peer.rs:2855]: [region 476655] 476656 log gap contains conf change, skip merging.\")"] [peer_id=476656] [region_id=476655]
[2022/06/23 13:50:17.105 +08:00] [WARN] [peer.rs:3154] ["failed to propose confchange"] [error="Other(\"[components/raftstore/src/store/peer.rs:2442]: [region 503053] 503055 unsafe to perform conf change [peer { id: 511610 store_id: 8 }, change_type: AddLearnerNode peer { id: 503054 store_id: 9 role: Learner }], before: voters: 503056 voters: 503054 voters: 503055 learners: 511610, after: voters: 503056 voters: 511610 voters: 503055 voters_outgoing: 503056 voters_outgoing: 503054 voters_outgoing: 503055 learners_next: 503054, truncated index 353, promoted commit index 0\")"]
[2022/06/23 13:50:17.109 +08:00] [WARN] [peer.rs:2963] ["skip proposal"] [error_code=KV:Raftstore:Unknown] [err="Other(\"[components/raftstore/src/store/peer.rs:2855]: [region 503053] 503055 log gap contains conf change, skip merging.\")"] [peer_id=503055] [region_id=503053]
[2022/06/23 13:51:56.117 +08:00] [WARN] [peer.rs:3154] ["failed to propose confchange"] [error="Other(\"[components/raftstore/src/store/peer.rs:2442]: [region 506587] 511380 unsafe to perform conf change [peer { id: 511656 store_id: 6 }, change_type: AddLearnerNode peer { id: 506588 store_id: 4 role: Learner }], before: voters: 510503 voters: 506588 voters: 511380 learners: 511656, after: voters: 511656 voters: 510503 voters: 511380 voters_outgoing: 506588 voters_outgoing: 511380 voters_outgoing: 510503 learners_next: 506588, truncated index 2112, promoted commit index 0\")"]
[2022/06/23 13:51:56.117 +08:00] [WARN] [peer.rs:3154] ["failed to propose confchange"] [error="Other(\"[components/raftstore/src/store/peer.rs:2442]: [region 506587] 511380 unsafe to perform conf change [peer { id: 511656 store_id: 6 }, change_type: AddLearnerNode peer { id: 506588 store_id: 4 role: Learner }], before: voters: 510503 voters: 506588 voters: 511380 learners: 511656, after: voters: 511656 voters: 510503 voters: 511380 voters_outgoing: 506588 voters_outgoing: 511380 voters_outgoing: 510503 learners_next: 506588, truncated index 2112, promoted commit index 0\")"]
[2022/06/23 13:58:52.921 +08:00] [WARN] [endpoint.rs:639] [error-response] [err="Key is locked (will clean up) primary_lock: 7480000000000002125F72820000000D61D20B lock_version: 434100894789795841 key: 7480000000000002125F69800000000000000103800000000BD3934C03C40000000D82DD2C lock_ttl: 20000 txn_size: 16 min_commit_ts: 434100894789795841"]
[2022/06/23 13:58:52.921 +08:00] [WARN] [endpoint.rs:639] [error-response] [err="Key is locked (will clean up) primary_lock: 7480000000000002125F72820000000D61D20B lock_version: 434100894789795841 key: 7480000000000002125F698000000000000001038000000011F64EDE03920000000E1FB493 lock_ttl: 20000 txn_size: 16 min_commit_ts: 434100894789795841"]

This exception only shows that retries happened, not why. Retries are usually triggered by some underlying error; do you have those error messages?

That is all the information there is. I don't see any error messages.

  1. Have you checked the TiKV monitoring? Is any resource hitting a bottleneck?
  2. Can you try importing in smaller batches and see whether that succeeds?

I meant the TiSpark error messages; your paste is not complete. There should be more below it.