升级tispark,从2.5.2升级至3.1.6,相同任务执行变慢

1 问题

tispark2.5执行任务挺快,但是升级到tispark3.1.6后执行观察任务会花费30s到60s处理tidb的元数据,有什么方法能不执行这段操作么?

2 任务描述

一个insert into select类型的任务,输入表为hive表和tidb表,输出表是hive表。

在tispark2和3执行的区别就是使用tispark3运行的任务语句中所有tidb的表加上了tidb_catalog前缀,tispark2没有加前缀。

tispark2版本组合:spark3.1.2+tispark2.5.2

tispark3版本组合:spark3.3.3+tispark3.1.6

tispark2执行时间:88s

tispark3执行时间:244s

两个都使用kyuubi1.9.0提交,资源参数设置都一致。

3. 问题日志

通过观察使用tispark3执行的日志可以发现有一些地方会花费将近一分钟的时间,例如11:14:

24/08/28 11:14:57 INFO ExecuteStatement:
           Spark application name: briojiobrtjiobrt8909790799078
                 application ID: application_1702896992724_28754891
                 application web UI: http://server1-pm:8080/proxy/application_1702896992724_28754891
                 master: yarn
                 deploy mode: cluster
                 version: 3.3.3
           Start time: 2024-08-28T11:13:47.633
           User: maintain
24/08/28 11:14:57 INFO TiUtil$: spark.tispark.stale_read =
24/08/28 11:15:45 WARN KVErrorHandler: NotLeader Error with region id 4146387906 and store id 4138101747, new store id 26634932
24/08/28 11:15:45 WARN KVErrorHandler: Store Not Match happened with region id 4146387906, store id 26634932, actual store id 26634932
24/08/28 11:15:45 WARN KVErrorHandler: Failed to send notification back to driver since CacheInvalidateCallBack is null in executor node.
24/08/28 11:15:45 WARN RegionStoreClient: Re-splitting region task due to region error:to store id 4138101747, mine 26634932

但是在tispark2中没有这种情况,任务正常执行:

24/08/28 11:23:24 INFO ExecuteStatement:
           Spark application name: ijgiojeiojgioerjhgio43
                 application ID: application_1702896992724_28755705
                 application web UI: http://server1-pm:8080/proxy/application_1702896992724_28755705
                 master: yarn
                 deploy mode: cluster
                 version: 3.1.2
           Start time: 2024-08-28T11:23:12.320
           User: maintain
24/08/28 11:23:29 INFO InMemoryFileIndex: It took 66 ms to list leaf files for 1 paths.
24/08/28 11:23:30 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
24/08/28 11:23:30 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
24/08/28 11:23:30 INFO CodeGenerator: Code generated in 64.830432 ms
24/08/28 11:23:30 INFO DAGScheduler: Registering RDD 6 (sql at ExecuteStatement.scala:99) as input to shuffle 0
24/08/28 11:23:30 INFO DAGScheduler: Got map stage job 0 (sql at ExecuteStatement.scala:99) with 101 output partitions
24/08/28 11:23:30 INFO DAGScheduler: Final stage: ShuffleMapStage 0 (sql at ExecuteStatement.scala:99)
24/08/28 11:23:30 INFO DAGScheduler: Parents of final stage: List()
24/08/28 11:23:30 INFO DAGScheduler: Missing parents: List()
24/08/28 11:23:30 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[6] at sql at ExecuteStatement.scala:99), which has no missing parents
24/08/28 11:23:30 INFO SQLOperationListener: Query [e3c5c291-f2e3-43b5-98b2-0f6b59d59cf5]: Job 0 started with 1 stages, 1 active jobs running
24/08/28 11:23:30 INFO CodeGenerator: Code generated in 27.225805 ms
24/08/28 11:23:30 INFO SQLOperationListener: Query [e3c5c291-f2e3-43b5-98b2-0f6b59d59cf5]: Stage 0.0 started with 101 tasks, 1 active stages running
24/08/28 11:23:30 INFO YarnAllocator: Resource profile 0 doesn't exist, adding it
24/08/28 11:23:30 INFO YarnAllocator: Driver requested a total number of 1 executor(s) for resource profile id: 0.
24/08/28 11:23:30 INFO ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 1 for resource profile id: 0)
24/08/28 11:23:30 INFO YarnAllocator: Will request 1 executor container(s) for  ResourceProfile Id: 0, each with 1 core(s) and 4505 MB memory.

看看有没有对其对应的版本~

看一下这个,我忘了tispark哪个版本修复的了。

这个问题应该是已经修了 :joy:

用 3.2.3 试试看

https://github.com/pingcap/tispark/releases/tag/v3.2.3

好的,我再试一试

之前最开始用的就是tispark3.2.3,但是也是慢

tispark-3.2.3会打印这些日志花费很长时间:
image
image

emm, 提个issue找大佬看看吧