TiSpark error when reading from TiKV: Request range exceeds bound

[Symptom] Business and database symptoms: an error is thrown and the program exits

[TiDB version] TiDB v4.0.15, TiSpark v2.5.0, Spark 3 on k8s

Executed: select * from TABLE where create_datetime >= timestamp '2020-01-01 00:00:00.000' and create_datetime < timestamp '2021-01-01 00:00:00.000'

The create_datetime column is indexed.

Error:
com.pingcap.tikv.exception.TiClientInternalException: Error reading region:
at com.pingcap.tikv.operation.iterator.DAGIterator.doReadNextRegionChunks(DAGIterator.java:189)
at com.pingcap.tikv.operation.iterator.DAGIterator.readNextRegionChunks(DAGIterator.java:166)
at com.pingcap.tikv.operation.iterator.DAGIterator.hasNext(DAGIterator.java:112)
at org.apache.spark.sql.execution.ColumnarRegionTaskExec$$anon$2.proceedNextBatchTask$1(CoprocessorRDD.scala:356)
at org.apache.spark.sql.execution.ColumnarRegionTaskExec$$anon$2.hasNext(CoprocessorRDD.scala:366)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:488)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:177)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: com.pingcap.tikv.exception.RegionTaskException: Handle region task failed:
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at com.pingcap.tikv.operation.iterator.DAGIterator.doReadNextRegionChunks(DAGIterator.java:184)
… 21 more
Caused by: com.pingcap.tikv.exception.RegionTaskException: Handle region task failed:
at com.pingcap.tikv.operation.iterator.DAGIterator.process(DAGIterator.java:233)
at com.pingcap.tikv.operation.iterator.DAGIterator.lambda$submitTasks$1(DAGIterator.java:90)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
… 3 more
Caused by: com.pingcap.tikv.exception.GrpcException: Request range exceeds bound, request range:[7480000000000003FF605F72800000001BFF90623F0000000000FA, end:7480000000000003FF605F72800000001BFF9064B80000000000FA), physical bound:[7480000000000003FF605F72800000001BFF90D7790000000000FA, 7480000000000003FF605F72800000001BFF91FD580000000000FA)
at com.pingcap.tikv.region.RegionStoreClient.handleCopResponse(RegionStoreClient.java:703)
at com.pingcap.tikv.region.RegionStoreClient.coprocess(RegionStoreClient.java:650)
at com.pingcap.tikv.operation.iterator.DAGIterator.process(DAGIterator.java:219)
… 7 more
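
The hex keys in the last `Caused by` line can be decoded to see how far apart the request range and the region's physical bound are. The following is a minimal sketch, assuming these keys follow the documented TiKV layout for row data (`t` + table ID + `_r` + row handle, wrapped in the memcomparable encoding that pads to 8-byte groups followed by a marker byte). The function names are made up for this illustration and are not part of TiSpark or the TiKV client.

```python
import struct

SIGN_MASK = 0x8000000000000000
GROUP = 8      # memcomparable bytes are written in 8-byte groups
MARKER = 0xFF  # each group is followed by (0xFF - number of padding bytes)


def decode_memcomparable(data: bytes) -> bytes:
    """Strip the group padding and marker bytes from an encoded key."""
    out = bytearray()
    for i in range(0, len(data), GROUP + 1):
        chunk = data[i:i + GROUP + 1]
        if len(chunk) < GROUP + 1:
            raise ValueError("truncated group")
        pad = MARKER - chunk[GROUP]
        if not 0 <= pad <= GROUP:
            raise ValueError("unexpected marker byte")
        out += chunk[:GROUP - pad]
        if pad:  # a padded group is always the last one
            break
    return bytes(out)


def decode_int64(raw: bytes) -> int:
    """Big-endian int64 stored with its sign bit flipped."""
    (u,) = struct.unpack(">Q", raw)
    x = u ^ SIGN_MASK
    return x - (1 << 64) if x >= (1 << 63) else x


def parse_record_key(hex_key: str):
    """Return (table_id, row_handle) for an encoded row key (assumed layout)."""
    raw = decode_memcomparable(bytes.fromhex(hex_key))
    if raw[:1] != b"t" or raw[9:11] != b"_r" or len(raw) != 19:
        raise ValueError("not a row key with an integer handle")
    return decode_int64(raw[1:9]), decode_int64(raw[11:19])


# Keys copied from the GrpcException message above.
request_start = "7480000000000003FF605F72800000001BFF90623F0000000000FA"
request_end = "7480000000000003FF605F72800000001BFF9064B80000000000FA"
bound_start = "7480000000000003FF605F72800000001BFF90D7790000000000FA"
bound_end = "7480000000000003FF605F72800000001BFF91FD580000000000FA"

for name, key in [("request start", request_start), ("request end", request_end),
                  ("bound start", bound_start), ("bound end", bound_end)]:
    print(name, parse_record_key(key))
```

Under those assumptions all four keys decode to row handles of the same table, and the whole request range lies below the region's lower bound. That would be consistent with the client sending the request to a region whose boundaries had changed since they were cached.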

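Please help confirm two things:

  1. Do other SQL queries run normally?
  2. Please post the execution plan for this SQL.

  1. Other queries run normally.

  2. [image: execution plan]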

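At first I was using a TiSpark jar that I compiled myself. I have now tested with the official jar (both are version 2.5), and it reports the following error:
Caused by: com.pingcap.tikv.exception.GrpcException: message: "region 42197 is missing"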
region_not_found {
region_id: 42197
}

Could you use pd-ctl to check the state of the region group for region id 42197? The versions look like they match, so this should not be a compatibility problem. https://github.com/pingcap/tispark#getting-tispark
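
For reference, the same lookup can also be done against PD's HTTP API, which is what pd-ctl's `region <id>` command queries. This is a rough sketch only: the PD address is a placeholder, and the helper names are invented for this example.

```python
import json
import urllib.request


def region_url(pd_addr: str, region_id: int) -> str:
    """Build the PD HTTP API URL for looking up one region by ID."""
    return f"{pd_addr.rstrip('/')}/pd/api/v1/region/id/{region_id}"


def fetch_region(pd_addr: str, region_id: int, timeout: float = 5.0) -> dict:
    """Fetch the region description from PD and parse it as JSON."""
    with urllib.request.urlopen(region_url(pd_addr, region_id), timeout=timeout) as resp:
        return json.load(resp)


# Example call (placeholder address, replace with a real PD endpoint):
#   info = fetch_region("http://127.0.0.1:2379", 42197)
#   print(json.dumps(info, indent=2))
```

If PD no longer knows the region (for example because it was merged into a neighbor), the response will not describe a live region, which would match the "region 42197 is missing" error.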

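This query looks like it targets a TiDB system table. Can you check whether it runs normally in TiDB? Also, by "other SQL" do you mean business SQL, or other system SQL such as queries on the system tables under information schema?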

No, no. TABLE is just a name I made up to stand for the table being queried. It is not a system table. :joy:

Does the same statement run normally from other SQL tools, such as the mysql client or DBeaver? Can the error be reproduced?
Also,
select * from information_schema.tikv_region_peers where REGION_ID = '42197';

select * from information_schema.tikv_region_status where REGION_ID = '42197';
What do these two queries return?

It does not reproduce there. Only the TiSpark query reports the error.
Both of those queries return an empty result.

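My guess is that the regions changed after TiSpark read the region distribution from PD. I will go through the code over the next couple of days and then give a definite answer.
Also, in this situation, does the whole Spark job fail? In principle the task should be retried, so the final result should not be affected.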

It does retry. The result: com.pingcap.tikv.exception.GrpcException: retry is exhausted.
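
To make the guess above concrete, here is a toy sketch of how a client that caches region locations might retry after a "region is missing" error. It is not TiSpark's actual code: every class and function name here is hypothetical, and region 50001 is an invented ID. It only illustrates why a retry succeeds when the stale cache entry gets refreshed, and why retries run out when it does not.

```python
import time


class RegionNotFound(Exception):
    """Raised by the (fake) store when the region ID is no longer valid."""


class RetryExhausted(Exception):
    """Raised when every attempt hit a stale region."""


def read_with_region_retry(key, locate_region, send_request, invalidate,
                           max_attempts=3, base_sleep=0.0):
    """Retry a read, dropping the cached region after each region-not-found."""
    for attempt in range(max_attempts):
        region = locate_region(key)
        try:
            return send_request(region, key)
        except RegionNotFound:
            invalidate(region)                        # forget the stale entry
            time.sleep(base_sleep * (2 ** attempt))   # simple exponential backoff
    raise RetryExhausted("retry is exhausted.")


class FakeCluster:
    """Hypothetical stand-in for PD plus one store, for demonstration only."""

    def __init__(self, stale_id, live_id, refreshable=True):
        self.cached = stale_id
        self.live_id = live_id
        self.refreshable = refreshable

    def locate(self, key):
        return self.cached

    def send(self, region, key):
        if region != self.live_id:
            raise RegionNotFound(f"region {region} is missing")
        return f"rows for {key!r} from region {region}"

    def invalidate(self, region):
        if self.refreshable:
            self.cached = self.live_id  # the next lookup sees the new region


ok = FakeCluster(stale_id=42197, live_id=50001)
print(read_with_region_retry("k", ok.locate, ok.send, ok.invalidate))
```

In this toy model, a cache that never picks up the new region (`refreshable=False`) ends with the "retry is exhausted" outcome, which is one possible reading of the error reported here.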

Hello, is there any progress on this?