TiKV Server Timeout error when running a TiSpark job

To get an efficient response, please provide the information below; clearly described questions are answered first.

  • [TiDB version]:
    v3.0.8
  • [Problem description]:

    When running a TiSpark job, we hit a TiKV Server timeout warning, and the monitoring graphs show interrupted lines. Could this be caused by hardware or network issues? Could it lead to lost computation data or incomplete data retrieval?
    20/01/19 03:25:57 WARN TaskSetManager: Lost task 42.0 in stage 1.0 (TID 53, 192.168.100.23, executor 3): com.pingcap.tikv.exception.TiClientInternalException: Error reading region:
    at com.pingcap.tikv.operation.iterator.DAGIterator.doReadNextRegionChunks(DAGIterator.java:163)
    at com.pingcap.tikv.operation.iterator.DAGIterator.readNextRegionChunks(DAGIterator.java:140)
    at com.pingcap.tikv.operation.iterator.DAGIterator.hasNext(DAGIterator.java:89)
    at org.apache.spark.sql.tispark.TiRDD$$anon$2.hasNext(TiRDD.scala:86)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:653)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:834)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:834)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
    Caused by: java.util.concurrent.ExecutionException: com.pingcap.tikv.exception.RegionTaskException: Handle region task failed:
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at com.pingcap.tikv.operation.iterator.DAGIterator.doReadNextRegionChunks(DAGIterator.java:158)
    … 25 more
    Caused by: com.pingcap.tikv.exception.RegionTaskException: Handle region task failed:
    at com.pingcap.tikv.operation.iterator.DAGIterator.process(DAGIterator.java:201)
    at com.pingcap.tikv.operation.iterator.DAGIterator.lambda$submitTasks$1(DAGIterator.java:67)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    … 3 more
    Caused by: com.pingcap.tikv.exception.GrpcException: retry is exhausted.
    at com.pingcap.tikv.util.ConcreteBackOffer.doBackOff(ConcreteBackOffer.java:127)
    at com.pingcap.tikv.operation.KVErrorHandler.handleRequestError(KVErrorHandler.java:247)
    at com.pingcap.tikv.policy.RetryPolicy.callWithRetry(RetryPolicy.java:58)
    at com.pingcap.tikv.AbstractGRPCClient.callWithRetry(AbstractGRPCClient.java:63)
    at com.pingcap.tikv.region.RegionStoreClient.coprocess(RegionStoreClient.java:547)
    at com.pingcap.tikv.operation.iterator.DAGIterator.process(DAGIterator.java:188)
    … 7 more
    Caused by: com.pingcap.tikv.exception.GrpcException: send tikv request error: UNAVAILABLE, try next peer later
    at com.pingcap.tikv.operation.KVErrorHandler.handleRequestError(KVErrorHandler.java:250)
    … 11 more
    Caused by: shade.io.grpc.StatusRuntimeException: UNAVAILABLE
    at shade.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:210)
    at shade.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:191)
    at shade.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:124)
    at com.pingcap.tikv.AbstractGRPCClient.lambda$callWithRetry$0(AbstractGRPCClient.java:66)
    at com.pingcap.tikv.policy.RetryPolicy.callWithRetry(RetryPolicy.java:54)
    … 10 more
    Caused by: shade.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /192.168.100.23:20160
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at shade.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
    at shade.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
    at shade.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
    at shade.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at shade.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at shade.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at shade.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at shade.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    … 1 more
    Caused by: java.net.ConnectException: Connection refused
    … 11 more

For performance-tuning or troubleshooting questions, please download and run the diagnostic script, then select all of the terminal output, copy it, and upload it.

Please provide the Spark and TiSpark version information:

// Get the Spark version
org.apache.spark.SPARK_VERSION
// Get the Spark config
sc.getConf.getAll.foreach(println)
// Get the TiSpark version
spark.sql("select ti_version()").show(false)

scala> org.apache.spark.SPARK_VERSION
res0: String = 2.4.3

scala> sc.getConf.getAll.foreach(println)
(spark.executor.extraJavaOptions,-Duser.timezone=UTC)
(spark.driver.host,kafka02)
(spark.driver.port,34179)
(spark.tispark.table.scan_concurrency,256)
(spark.tispark.grpc.timeout_in_sec,100)
(spark.executor.id,driver)
(spark.tispark.request.command.priority,Low)
(spark.executor.cores,4)
(spark.sql.extensions,org.apache.spark.sql.TiExtensions)
(spark.eventLog.enabled,False)
(spark.driver.extraJavaOptions,-Duser.timezone=UTC)
(spark.app.name,Spark shell)
(spark.sql.catalogImplementation,hive)
(spark.tispark.grpc.framesize,268435456)
(spark.home,/home/tidb/deploy/spark)
(spark.tispark.meta.reload_period_in_sec,60)
(spark.driver.memory,2g)
(spark.tispark.pd.addresses,192.168.100.24:2379)
(spark.executor.memory,8g)
(spark.submit.deployMode,client)
(spark.repl.class.uri,spark://kafka02:34179/classes)
(spark.jars,)
(spark.repl.class.outputDir,/tmp/spark-0cec0086-ce36-4347-bfbc-78e6268a660b/repl-3909a4c5-5e09-4b4f-bf8a-702dce62662c)
(spark.app.id,app-20200119114337-0007)
(spark.master,spark://192.168.100.23:7077)
(spark.ui.showConsoleProgress,true)

scala> spark.sql("select ti_version()").show(false)
[Stage 0:> (0 + 0) / 1]20/01/19 03:44:57 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
20/01/19 03:45:12 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
20/01/19 03:45:27 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
20/01/19 03:45:42 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
+--------------------------------------------------------------------------------------------------+
|UDF:ti_version() |
+--------------------------------------------------------------------------------------------------+
|Release Version: 2.1.8-spark_2.4
Supported Spark Version: spark-2.4
Git Commit Hash: 1077168611cb8534f2282d6a199025c35e3a52b3
Git Branch: release-2.1.8-spark-2.4
UTC Build Time: 2019-12-12 02:36:25
Supported Spark Version: 2.3
Current Spark Version: 2.4.3
Current Spark Major Version: unknown
TimeZone: UTC|

The tikv server timeout is most likely the TiSpark job putting too much pressure on the TiKV cluster; try adjusting the job configuration. Did the job ultimately succeed or fail?
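As a minimal sketch of "adjusting the job configuration", the pressure-related TiSpark settings already visible in the config dump above could be passed with lower values. The concrete numbers, the class name, and the jar name here are illustrative placeholders, not recommendations:

```shell
# Illustrative spark-submit flags to reduce the load TiSpark puts on TiKV.
#   scan_concurrency      - lower than the current 256 to cut parallel region scans
#   command.priority      - keep coprocessor requests at Low priority on TiKV
#   grpc.timeout_in_sec   - allow more time before a request is considered failed
spark-submit \
  --conf spark.tispark.table.scan_concurrency=64 \
  --conf spark.tispark.request.command.priority=Low \
  --conf spark.tispark.grpc.timeout_in_sec=120 \
  --class com.example.MyJob my-job.jar
```

The right values depend on the size of the TiKV cluster and the concurrent OLTP load, so tune them incrementally while watching the TiKV monitoring panels.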

The job succeeded, but it produced these warnings. Could data be lost along the way, making the computed result inaccurate? Also, does TiSpark run in Standalone mode? Do we have to set up YARN or the like ourselves?

If the job succeeded, no data is lost; Spark retries failed tasks. TiSpark only uses some of Spark's extension features and is independent of the Spark cluster's own deployment mode. If you deployed with ansible, it defaults to standalone mode.

How do I configure the history server under TiSpark, or can I only glance at the UI on port 8080?

Just configure it the same way you would configure the history server for a regular Spark cluster.
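For reference, a minimal sketch of the standard Spark history server setup; the log directory path is an illustrative placeholder:

```shell
# conf/spark-defaults.conf -- enable event logging first (the config dump
# above shows it is currently off: spark.eventLog.enabled=False):
#
#   spark.eventLog.enabled           true
#   spark.eventLog.dir               hdfs:///spark-logs
#   spark.history.fs.logDirectory    hdfs:///spark-logs
#
# Then start the history server; its web UI listens on port 18080 by default.
$SPARK_HOME/sbin/start-history-server.sh
```

The event log directory must be readable by the history server and writable by every application, so on a standalone cluster a shared filesystem (HDFS or NFS) is the usual choice.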

This topic was automatically closed 1 minute after the last reply. New replies are no longer allowed.