OOM when pulling TiDB data into Spark

【TiDB environment】
TiDB version v5.2.1

【Overview】 Scenario and problem summary
When I pull data into Spark for complex processing, TiDB reports an error.

【Background】 Operations performed

【Symptoms】 Application and database behavior

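【Problem】 Current issue
Very large data sets cannot be pulled into Spark for processing.
【Business impact】
The business workflow cannot continue.
【TiDB version】
v5.2.1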

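Of course I can add filter conditions so that less data is pulled into Spark. But my question is this: if I fetch too much data into Spark, the OOM should happen while Spark is processing it. Why does TiKV run out of memory while the data is being fetched from TiDB?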

【Attachments】 Relevant logs and monitoring (https://metricstool.pingcap.com/)

21/10/14 11:22:08 WARN TaskSetManager: Lost task 5.0 in stage 0.0 (TID 4, 202.38.228.229, executor 0): com.pingcap.tikv.exception.TiClientInternalException: Error reading region:
	at com.pingcap.tikv.operation.iterator.DAGIterator.doReadNextRegionChunks(DAGIterator.java:189)
	at com.pingcap.tikv.operation.iterator.DAGIterator.readNextRegionChunks(DAGIterator.java:166)
	at com.pingcap.tikv.operation.iterator.DAGIterator.hasNext(DAGIterator.java:112)
	at org.apache.spark.sql.tispark.TiRowRDD$$anon$1.hasNext(TiRowRDD.scala:69)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage0.coprocessorrdd_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage0.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: com.pingcap.tikv.exception.RegionTaskException: Handle region task failed:
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at com.pingcap.tikv.operation.iterator.DAGIterator.doReadNextRegionChunks(DAGIterator.java:184)
	... 22 more
Caused by: com.pingcap.tikv.exception.RegionTaskException: Handle region task failed:
	at com.pingcap.tikv.operation.iterator.DAGIterator.process(DAGIterator.java:232)
	at com.pingcap.tikv.operation.iterator.DAGIterator.lambda$submitTasks$1(DAGIterator.java:90)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	... 3 more
Caused by: com.pingcap.tikv.exception.GrpcException: shade.io.grpc.StatusRuntimeException: CANCELLED: Failed to read message.
	at com.pingcap.tikv.policy.RetryPolicy.rethrowNotRecoverableException(RetryPolicy.java:45)
	at com.pingcap.tikv.policy.RetryPolicy.callWithRetry(RetryPolicy.java:55)
	at com.pingcap.tikv.AbstractGRPCClient.callWithRetry(AbstractGRPCClient.java:77)
	at com.pingcap.tikv.region.RegionStoreClient.coprocess(RegionStoreClient.java:663)
	at com.pingcap.tikv.operation.iterator.DAGIterator.process(DAGIterator.java:219)
	... 7 more
Caused by: shade.io.grpc.StatusRuntimeException: CANCELLED: Failed to read message.
	at shade.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:244)
	at shade.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:225)
	at shade.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:142)
	at com.pingcap.tikv.AbstractGRPCClient.lambda$callWithRetry$0(AbstractGRPCClient.java:80)
	at com.pingcap.tikv.policy.RetryPolicy.callWithRetry(RetryPolicy.java:53)
	... 10 more
Caused by: java.lang.OutOfMemoryError: Java heap space
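
One way to read the trace above is to follow the `Caused by:` lines down to the innermost exception. The sketch below is only an illustration: `cause_chain` is a hypothetical helper, and `sample` is an abbreviated copy of the `Caused by:` lines from the log (stack frames omitted). It shows that the innermost cause reported is `java.lang.OutOfMemoryError: Java heap space`, an error raised by a JVM. Which process actually ran out of memory would still need to be confirmed against the executor and TiKV logs.

```python
import re

# Matches lines such as "Caused by: some.pkg.SomeException: message".
CAUSED_BY = re.compile(r"^Caused by: ([\w.$]+)(?::\s*(.*))?$")


def cause_chain(trace):
    """Return (exception class, message) pairs, outermost cause first."""
    chain = []
    for line in trace.splitlines():
        match = CAUSED_BY.match(line.strip())
        if match:
            chain.append((match.group(1), match.group(2) or ""))
    return chain


# Abbreviated copy of the "Caused by:" lines from the log above.
sample = """\
Caused by: java.util.concurrent.ExecutionException: com.pingcap.tikv.exception.RegionTaskException: Handle region task failed:
Caused by: com.pingcap.tikv.exception.RegionTaskException: Handle region task failed:
Caused by: com.pingcap.tikv.exception.GrpcException: shade.io.grpc.StatusRuntimeException: CANCELLED: Failed to read message.
Caused by: shade.io.grpc.StatusRuntimeException: CANCELLED: Failed to read message.
Caused by: java.lang.OutOfMemoryError: Java heap space
"""

chain = cause_chain(sample)
root_class, root_message = chain[-1]  # innermost cause is listed last
print(root_class, "-", root_message)
```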



This may be the situation described at the link below: TiKV's gRPC send rate cannot keep up with the rate at which the coprocessor produces data.
https://book.tidb.io/session4/chapter7/tidb-oom.html
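
To make the idea concrete, here is a minimal, purely illustrative simulation of a producer that generates data faster than a sender can drain it. It is not TiKV code: the function name `simulate`, the rates, and the buffer limit are all made-up assumptions. It only shows the general pattern that an unbounded buffer between a fast producer and a slow sender keeps growing, while a bounded buffer (backpressure) caps memory use.

```python
def simulate(produce_rate, send_rate, ticks, max_buffer=None):
    """Return the peak number of buffered chunks over `ticks` time steps.

    produce_rate: chunks the producer generates per tick (hypothetical unit)
    send_rate:    chunks the sender can drain per tick
    max_buffer:   if set, the producer may only fill the buffer up to this size
    """
    buffered = 0
    peak = 0
    for _ in range(ticks):
        produced = produce_rate
        if max_buffer is not None:
            # Backpressure: produce only what still fits in the buffer.
            produced = min(produced, max_buffer - buffered)
        buffered += produced
        peak = max(peak, buffered)
        # The sender drains at most send_rate chunks per tick.
        buffered -= min(buffered, send_rate)
    return peak


# Made-up rates: the producer is 2.5x faster than the sender.
unbounded_peak = simulate(produce_rate=10, send_rate=4, ticks=100)
bounded_peak = simulate(produce_rate=10, send_rate=4, ticks=100, max_buffer=32)
print(unbounded_peak, bounded_peak)
```

In this toy model the unbounded peak grows with the number of ticks, whereas the bounded peak never exceeds `max_buffer`. Whether the real cause here is on the sending side or the receiving side still has to be checked in the monitoring data.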