tispark + spark struct streaming Error reading region

tidb版本: 5.1
tispark: 2.4.1
spark: 3.0

在 tidb 集群升级到 5.0 之后频繁报 Error reading region 的错误

Lost task 251.0 in stage 278.0 (TID 57473, executor 1):
com.pingcap.tikv.exception.TiClientInternalException: Error reading region:
at com.pingcap.tikv.operation.iterator.DAGIterator.doReadNextRegionChunks(DAGIterator.java:189)
at com.pingcap.tikv.operation.iterator.DAGIterator.readNextRegionChunks(DAGIterator.java:166)
at com.pingcap.tikv.operation.iterator.DAGIterator.hasNext(DAGIterator.java:112)
at org.apache.spark.sql.tispark.TiRowRDD$$anon$1.hasNext(TiRowRDD.scala:69)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.coprocessorrdd_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:624)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: com.pingcap.tikv.exception.RegionTaskException: Handle region task failed:
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at com.pingcap.tikv.operation.iterator.DAGIterator.doReadNextRegionChunks(DAGIterator.java:184)
… 18 more
Caused by: com.pingcap.tikv.exception.RegionTaskException: Handle region task failed:
at com.pingcap.tikv.operation.iterator.DAGIterator.process(DAGIterator.java:232)
at com.pingcap.tikv.operation.iterator.DAGIterator.lambda$submitTasks$1(DAGIterator.java:90)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
… 3 more
Caused by: com.pingcap.tikv.exception.GrpcException: retry is exhausted.
at com.pingcap.tikv.util.ConcreteBackOffer.doBackOffWithMaxSleep(ConcreteBackOffer.java:148)
at com.pingcap.tikv.util.ConcreteBackOffer.doBackOff(ConcreteBackOffer.java:119)
at com.pingcap.tikv.util.RangeSplitter.splitRangeByRegion(RangeSplitter.java:191)
at com.pingcap.tikv.region.RegionStoreClient.handleCopResponse(RegionStoreClient.java:692)
at com.pingcap.tikv.region.RegionStoreClient.coprocess(RegionStoreClient.java:660)
at com.pingcap.tikv.operation.iterator.DAGIterator.process(DAGIterator.java:219)
… 7 more
Caused by: com.pingcap.tikv.exception.TiClientInternalException: Region not exist for key:*
at com.pingcap.tikv.region.RegionManager.getRegionStorePairByKey(RegionManager.java:104)
at com.pingcap.tikv.util.RangeSplitter.splitRangeByRegion(RangeSplitter.java:183)
… 10 more

Process region tasks failed, remain 0 tasks not executed due to
com.pingcap.tikv.exception.TiClientInternalException: Cannot find valid store on TiKV for region {Region[208713548] ConfVer[11789] Version[27402] Store[49139] KeyRange[t\200\000\000\000\000\001\322\306_r\200\000\000\001\377\234\227\025]:[t\200\000\000\000\000\001\322\306_r\200\000\000\001\377\235\301\237]}
at com.pingcap.tikv.region.RegionManager.getRegionStorePairByKey(RegionManager.java:134)
at com.pingcap.tikv.region.RegionManager.getRegionStorePairByKey(RegionManager.java:96)
at com.pingcap.tikv.region.RegionStoreClient.(RegionStoreClient.java:148)
at com.pingcap.tikv.region.RegionStoreClient.(RegionStoreClient.java:102)
at com.pingcap.tikv.region.RegionStoreClient$RegionStoreClientBuilder.build(RegionStoreClient.java:899)
at com.pingcap.tikv.operation.iterator.DAGIterator.process(DAGIterator.java:216)
at com.pingcap.tikv.operation.iterator.DAGIterator.lambda$submitTasks$1(DAGIterator.java:90)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Exception in task 346.0 in stage 4247.0 (TID 883372)
com.pingcap.tikv.exception.TiClientInternalException: Error reading region:
at com.pingcap.tikv.operation.iterator.DAGIterator.doReadNextRegionChunks(DAGIterator.java:189)
at com.pingcap.tikv.operation.iterator.DAGIterator.readNextRegionChunks(DAGIterator.java:166)
at com.pingcap.tikv.operation.iterator.DAGIterator.hasNext(DAGIterator.java:112)
at org.apache.spark.sql.tispark.TiRowRDD$$anon$1.hasNext(TiRowRDD.scala:69)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage0.coprocessorrdd_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage0.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:624)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:624)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: com.pingcap.tikv.exception.RegionTaskException: Handle region task failed:
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at com.pingcap.tikv.operation.iterator.DAGIterator.doReadNextRegionChunks(DAGIterator.java:184)
… 21 more
Caused by: com.pingcap.tikv.exception.RegionTaskException: Handle region task failed:
at com.pingcap.tikv.operation.iterator.DAGIterator.process(DAGIterator.java:232)
at com.pingcap.tikv.operation.iterator.DAGIterator.lambda$submitTasks$1(DAGIterator.java:90)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
… 3 more
Caused by: com.pingcap.tikv.exception.SelectException: [FLASH:Coprocessor:BadRequest] Income key ranges is illegal for region: 208438970
at com.pingcap.tikv.region.RegionStoreClient.doCoprocessor(RegionStoreClient.java:749)
at com.pingcap.tikv.region.RegionStoreClient.handleCopResponse(RegionStoreClient.java:720)
at com.pingcap.tikv.region.RegionStoreClient.coprocess(RegionStoreClient.java:664)
at com.pingcap.tikv.operation.iterator.DAGIterator.process(DAGIterator.java:219)

主要报这三种异常, 导致每过一两天流就会挂, 帮忙看下是什么原因吧

如果 spark 是 3.0 版本,那么建议使用 2.5.0 版本的 tispark:

https://github.com/pingcap/tispark

建议使用匹配的版本后,再观察看下~

这个是需要自己在master build 吗? 我看最近release 是 2.4.1

TiSpark 2.5.0 尚未正式 release,具体的情况可以关注下 github ~

在 TiSpark 2.5.0 未 release 前,当前的情况,可能需要将 Spark 版本降级到 2.3.0+ 或 2.4.0+:

image