As the title says: I want to read TiDB data with PySpark, but I have no idea how to set up the development environment.
I have little hands-on experience with PySpark, but I found a section of documentation on GitHub that may help you (reference: python/README.md in http://github.com/pingcap/tispark):
TiSpark (version >= 2.0) on PySpark:
Note: If you are using a TiSpark version earlier than 2.0, please read this document instead
Usage
There are currently two ways to use TiSpark on Python:
Directly via pyspark
This is the simplest way; a working Spark environment should be enough.
- Make sure you have the latest version of TiSpark and a jar with all of TiSpark's dependencies.
- Remember to add the needed configurations listed in the README to your $SPARK_HOME/conf/spark-defaults.conf
- For spark-2.3.x, please copy ./resources/spark-2.3/session.py to $SPARK_HOME/python/pyspark/sql/session.py. For other Spark versions, please edit the file $SPARK_HOME/python/pyspark/sql/session.py and change it from
jsparkSession = self._jvm.SparkSession(self._jsc.sc())
to
jsparkSession = self._jvm.SparkSession.builder().getOrCreate()
- Run this command in your $SPARK_HOME directory:
./bin/pyspark --jars /where-ever-it-is/tispark-${name_with_version}.jar
- To use TiSpark, run these commands:
# Query as you would in spark-shell
spark.sql("show databases").show()
spark.sql("use tpch_test")
spark.sql("show tables").show()
spark.sql("select count(*) from customer").show()
# Result
# +--------+
# |count(1)|
# +--------+
# |     150|
# +--------+
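If you would rather not edit session.py by hand, the one-line change described in the steps above can be scripted. This is only a rough sketch: the helper names `patch_session_source` and `patch_session_file` are made up for illustration, and it assumes your session.py contains the original line exactly as quoted in the README. It keeps a .bak copy before writing.

```python
from pathlib import Path

# The two lines quoted in the README step above.
OLD_LINE = "jsparkSession = self._jvm.SparkSession(self._jsc.sc())"
NEW_LINE = "jsparkSession = self._jvm.SparkSession.builder().getOrCreate()"


def patch_session_source(source: str) -> str:
    """Return the source text with the constructor call swapped for the builder call."""
    if NEW_LINE in source:
        return source  # already patched, nothing to do
    if OLD_LINE not in source:
        # Assumption violated: this Spark version words the line differently.
        raise ValueError("expected line not found; edit session.py by hand")
    return source.replace(OLD_LINE, NEW_LINE)


def patch_session_file(path: Path) -> bool:
    """Patch the file in place, keeping a .bak copy. Returns True if the file changed."""
    original = path.read_text(encoding="utf-8")
    patched = patch_session_source(original)
    if patched == original:
        return False
    path.with_name(path.name + ".bak").write_text(original, encoding="utf-8")
    path.write_text(patched, encoding="utf-8")
    return True
```

You would call it as `patch_session_file(Path(spark_home) / "python" / "pyspark" / "sql" / "session.py")`, where `spark_home` is your own $SPARK_HOME. Indentation is preserved because only the matching text is replaced.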
Via spark-submit
This way is useful when you want to execute your own Python scripts.
Because of an open issue [SPARK-25003] in Spark-2.3.x and Spark-2.4.x, using spark-submit for Python files only supports the following API:
- Use pip install pytispark in your console to install pytispark.
Note that you may need to reinstall pytispark if you meet the No plan for reation error.
- Create a Python file named test.py as below:
import pytispark.pytispark as pti
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
ti = pti.TiContext(spark)
ti.tidbMapDatabase("tpch_test")
spark.sql("select count(*) from customer").show()
# Result
# +--------+
# |count(1)|
# +--------+
# |     150|
# +--------+
- Prepare your TiSpark environment as above and execute
./bin/spark-submit --jars /where-ever-it-is/tispark-${name_with_version}.jar test.py
- Result:
+--------+
|count(1)|
+--------+
|     150|
+--------+
See pytispark for more information.
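If you want to launch the spark-submit step from Python (for example from a scheduler), here is a minimal sketch that only assembles the argument list. The function name `build_spark_submit_command` is hypothetical. The two configuration keys in the example are the ones I recall from the TiSpark README, so treat them as an assumption and check them against your version; the PD address is a placeholder.

```python
from typing import Dict, List, Optional


def build_spark_submit_command(
    tispark_jar: str,
    script: str,
    conf: Optional[Dict[str, str]] = None,
    spark_submit: str = "./bin/spark-submit",
) -> List[str]:
    """Assemble the spark-submit argument list without running anything."""
    command = [spark_submit, "--jars", tispark_jar]
    # Each setting becomes a "--conf key=value" pair, in sorted key order.
    for key, value in sorted((conf or {}).items()):
        command += ["--conf", f"{key}={value}"]
    # The application script goes after all options.
    command.append(script)
    return command


# Placeholder values: substitute your real jar file name and PD address.
# The config keys are assumed from the TiSpark README; verify them first.
example = build_spark_submit_command(
    tispark_jar="/where-ever-it-is/tispark-${name_with_version}.jar",
    script="test.py",
    conf={
        "spark.sql.extensions": "org.apache.spark.sql.TiExtensions",
        "spark.tispark.pd.addresses": "127.0.0.1:2379",
    },
)
```

To actually run it you could pass the list to `subprocess.run(example, check=True, cwd=spark_home)`. Since no shell is involved in that case, `${name_with_version}` is not expanded and must be replaced with the real jar file name. If the settings are already in spark-defaults.conf, leave `conf` out.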