TiKV Server Timeout error when running a TiSpark job

To help us respond efficiently, please provide the information below when asking; clearly described questions are prioritized.

  • [TiDB version]: v3.0.8
  • [Problem description]: When running a TiSpark job, we get TiKV Server timeout warnings, and the monitoring graphs show interrupted lines. Is this caused by hardware or network issues? Could it cause computed data to be lost or pulled incompletely?

20/01/19 03:25:57 WARN TaskSetManager: Lost task 42.0 in stage 1.0 (TID 53, 192.168.100.23, executor 3): com.pingcap.tikv.exception.TiClientInternalException: Error reading region:
    at com.pingcap.tikv.operation.iterator.DAGIterator.doReadNextRegionChunks(DAGIterator.java:163)
    at com.pingcap.tikv.operation.iterator.DAGIterator.readNextRegionChunks(DAGIterator.java:140)
    at com.pingcap.tikv.operation.iterator.DAGIterator.hasNext(DAGIterator.java:89)
    at org.apache.spark.sql.tispark.TiRDD$$anon$2.hasNext(TiRDD.scala:86)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:653)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:834)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:834)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: com.pingcap.tikv.exception.RegionTaskException: Handle region task failed:
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at com.pingcap.tikv.operation.iterator.DAGIterator.doReadNextRegionChunks(DAGIterator.java:158)
    … 25 more
Caused by: com.pingcap.tikv.exception.RegionTaskException: Handle region task failed:
    at com.pingcap.tikv.operation.iterator.DAGIterator.process(DAGIterator.java:201)
    at com.pingcap.tikv.operation.iterator.DAGIterator.lambda$submitTasks$1(DAGIterator.java:67)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    … 3 more
Caused by: com.pingcap.tikv.exception.GrpcException: retry is exhausted.
    at com.pingcap.tikv.util.ConcreteBackOffer.doBackOff(ConcreteBackOffer.java:127)
    at com.pingcap.tikv.operation.KVErrorHandler.handleRequestError(KVErrorHandler.java:247)
    at com.pingcap.tikv.policy.RetryPolicy.callWithRetry(RetryPolicy.java:58)
    at com.pingcap.tikv.AbstractGRPCClient.callWithRetry(AbstractGRPCClient.java:63)
    at com.pingcap.tikv.region.RegionStoreClient.coprocess(RegionStoreClient.java:547)
    at com.pingcap.tikv.operation.iterator.DAGIterator.process(DAGIterator.java:188)
    … 7 more
Caused by: com.pingcap.tikv.exception.GrpcException: send tikv request error: UNAVAILABLE, try next peer later
    at com.pingcap.tikv.operation.KVErrorHandler.handleRequestError(KVErrorHandler.java:250)
    … 11 more
Caused by: shade.io.grpc.StatusRuntimeException: UNAVAILABLE
    at shade.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:210)
    at shade.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:191)
    at shade.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:124)
    at com.pingcap.tikv.AbstractGRPCClient.lambda$callWithRetry$0(AbstractGRPCClient.java:66)
    at com.pingcap.tikv.policy.RetryPolicy.callWithRetry(RetryPolicy.java:54)
    … 10 more
Caused by: shade.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /192.168.100.23:20160
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at shade.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
    at shade.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
    at shade.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
    at shade.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at shade.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at shade.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at shade.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at shade.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    … 1 more
Caused by: java.net.ConnectException: Connection refused
    … 11 more

For performance-tuning or troubleshooting questions, please download and run the diagnostic script, then select all of the terminal output and paste it here in full.

Please provide the Spark and TiSpark version information:

// Get the Spark version
org.apache.spark.SPARK_VERSION
// Dump the Spark configuration
sc.getConf.getAll.foreach(println)
// Get the TiSpark version
spark.sql("select ti_version()").show(false)

scala> org.apache.spark.SPARK_VERSION
res0: String = 2.4.3

scala> sc.getConf.getAll.foreach(println)
(spark.executor.extraJavaOptions,-Duser.timezone=UTC)
(spark.driver.host,kafka02)
(spark.driver.port,34179)
(spark.tispark.table.scan_concurrency,256)
(spark.tispark.grpc.timeout_in_sec,100)
(spark.executor.id,driver)
(spark.tispark.request.command.priority,Low)
(spark.executor.cores,4)
(spark.sql.extensions,org.apache.spark.sql.TiExtensions)
(spark.eventLog.enabled,False)
(spark.driver.extraJavaOptions,-Duser.timezone=UTC)
(spark.app.name,Spark shell)
(spark.sql.catalogImplementation,hive)
(spark.tispark.grpc.framesize,268435456)
(spark.home,/home/tidb/deploy/spark)
(spark.tispark.meta.reload_period_in_sec,60)
(spark.driver.memory,2g)
(spark.tispark.pd.addresses,192.168.100.24:2379)
(spark.executor.memory,8g)
(spark.submit.deployMode,client)
(spark.repl.class.uri,spark://kafka02:34179/classes)
(spark.jars,)
(spark.repl.class.outputDir,/tmp/spark-0cec0086-ce36-4347-bfbc-78e6268a660b/repl-3909a4c5-5e09-4b4f-bf8a-702dce62662c)
(spark.app.id,app-20200119114337-0007)
(spark.master,spark://192.168.100.23:7077)
(spark.ui.showConsoleProgress,true)

scala> spark.sql("select ti_version()").show(false)
[Stage 0:> (0 + 0) / 1]
20/01/19 03:44:57 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
20/01/19 03:45:12 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
20/01/19 03:45:27 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
20/01/19 03:45:42 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|UDF:ti_version()                                                                                                                                                                                                                                                                                                    |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Release Version: 2.1.8-spark_2.4 Supported Spark Version: spark-2.4 Git Commit Hash: 1077168611cb8534f2282d6a199025c35e3a52b3 Git Branch: release-2.1.8-spark-2.4 UTC Build Time: 2019-12-12 02:36:25 Supported Spark Version: 2.3 Current Spark Version: 2.4.3 Current Spark Major Version: unknown TimeZone: UTC|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The TiKV server timeout is most likely the TiSpark job putting too much pressure on the TiKV cluster; you can tune the job configuration to reduce it. Did the job ultimately succeed or fail?
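As a sketch of the kind of tuning meant here: the TiSpark parameters below already appear in the config dump earlier in the thread; the values are illustrative assumptions for reducing read pressure on TiKV, not tested recommendations.

```properties
# spark-defaults.conf -- illustrative values only, tune for your cluster
# Lower scan concurrency (currently 256) so fewer coprocessor
# requests hit each TiKV store at once
spark.tispark.table.scan_concurrency    64
# Allow slow coprocessor responses more time (currently 100)
# before the gRPC call gives up
spark.tispark.grpc.timeout_in_sec       200
# Keep TiSpark requests at low priority so online OLTP traffic
# is served first (already set in this cluster)
spark.tispark.request.command.priority  Low
```

These must be set before the SparkContext starts (e.g. in spark-defaults.conf or via --conf on spark-shell/spark-submit), not changed inside a running session.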

The job succeeded, but given these warnings, could data be lost along the way and make the computed result inaccurate? Also, is your TiSpark running in standalone mode? Do we have to set up YARN or the like ourselves?

If the job succeeded, no data is lost: Spark retries failed tasks. TiSpark only plugs into Spark through its extensions mechanism, so it is independent of the cluster's deployment mode; if you deployed with Ansible, the default is standalone mode.

How do I configure the history server under TiSpark, or can I only glance at the web UI on port 8080?

Just configure the history server the same way you would for a normal Spark cluster.
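For reference, a minimal sketch of the standard Spark History Server setup; the log directory path is an assumption, substitute one that is shared and writable in your deployment (note the config dump above shows spark.eventLog.enabled is currently False).

```properties
# spark-defaults.conf on the nodes that submit jobs
spark.eventLog.enabled           true
# hdfs:///spark-logs is a placeholder; any shared path works
spark.eventLog.dir               hdfs:///spark-logs
# spark-defaults.conf on the history-server host: read the same directory
spark.history.fs.logDirectory    hdfs:///spark-logs
```

Then start it with $SPARK_HOME/sbin/start-history-server.sh; the UI listens on port 18080 by default, separate from the master UI on 8080.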