How do I use TiSpark in PyCharm on Windows to connect to TiDB?

As the title says: I want to read data from TiDB with PySpark, but I have no idea how to set up the development environment.

I have little hands-on experience with PySpark, but I found some instructions on GitHub that may help you (see python/README.md in http://github.com/pingcap/tispark):

TiSpark (version >= 2.0) on PySpark:

Note: If you are using a TiSpark version earlier than 2.0, please read this document instead

Usage

There are currently two ways to use TiSpark on Python:

Directly via pyspark

This is the simplest way; a working Spark environment should be enough.

  1. Make sure you have the latest version of TiSpark and a jar with all TiSpark’s dependencies.

  2. Remember to add the required configurations listed in the README to your $SPARK_HOME/conf/spark-defaults.conf

  3. For spark-2.3.x, copy ./resources/spark-2.3/session.py to $SPARK_HOME/python/pyspark/sql/session.py. For other Spark versions, edit the file $SPARK_HOME/python/pyspark/sql/session.py and change it from

jsparkSession = self._jvm.SparkSession(self._jsc.sc())

to

jsparkSession = self._jvm.SparkSession.builder().getOrCreate()
  4. Run this command in your $SPARK_HOME directory:
./bin/pyspark --jars /where-ever-it-is/tispark-${name_with_version}.jar
  5. To use TiSpark, run these commands:
# Query as you are in spark-shell
spark.sql("show databases").show()
spark.sql("use tpch_test")
spark.sql("show tables").show()
spark.sql("select count(*) from customer").show()

# Result
# +--------+
# |count(1)|
# +--------+
# |     150|
# +--------+
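
For step 2, the two entries below are the ones I recall the TiSpark README listing for version 2.0 and later, so please check them against the README for your version. This is only a small sketch that renders those spark-defaults.conf lines: render_spark_defaults is a hypothetical helper, and 127.0.0.1:2379 is a placeholder PD address, not a value from the README.

```python
def render_spark_defaults(pd_addresses):
    """Render spark-defaults.conf lines for TiSpark (hypothetical helper)."""
    settings = {
        # Assumed from the TiSpark 2.x README: enables the TiSpark extensions.
        "spark.sql.extensions": "org.apache.spark.sql.TiExtensions",
        # Assumed from the TiSpark 2.x README: comma-separated PD endpoints.
        "spark.tispark.pd.addresses": ",".join(pd_addresses),
    }
    # spark-defaults.conf uses one "key value" pair per line.
    return "\n".join(f"{key} {value}" for key, value in settings.items())


# Placeholder address: replace it with the PD servers of your TiDB cluster.
print(render_spark_defaults(["127.0.0.1:2379"]))
```

Append the printed lines to $SPARK_HOME/conf/spark-defaults.conf (on Windows, %SPARK_HOME%\conf\spark-defaults.conf).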

Via spark-submit

This way is useful when you want to execute your own Python scripts.

Because of an open issue [SPARK-25003] in Spark-2.3.x and Spark-2.4.x, Python files run via spark-submit can only use the following API:

  1. Run pip install pytispark in your console to install pytispark

Note that you may need to reinstall pytispark if you encounter a "No plan for relation" error.

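  2. Create a Python file named test.py as below:
import pytispark.pytispark as pti
from pyspark.sql import SparkSession

# Create (or reuse) the Spark session
spark = SparkSession.builder.getOrCreate()
# Wrap the session in a TiContext to reach TiDB
ti = pti.TiContext(spark)

# Register the tables of tpch_test so Spark SQL can query them
ti.tidbMapDatabase("tpch_test")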

spark.sql("select count(*) from customer").show()

# Result
# +--------+
# |count(1)|
# +--------+
# |     150|
# +--------+
  3. Prepare your TiSpark environment as above and run:
./bin/spark-submit --jars /where-ever-it-is/tispark-${name_with_version}.jar test.py
  4. Result:
+--------+
|count(1)|
+--------+
|     150|
+--------+

See pytispark for more information.
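
As for running this from PyCharm on Windows: one option that may work is to have a plain Python script pass the TiSpark jar to Spark through the PYSPARK_SUBMIT_ARGS environment variable, which PySpark reads when it launches the JVM. Below is a minimal sketch under assumptions, not a tested setup: build_submit_args is a hypothetical helper, the jar path is a placeholder, and the session.py change and spark-defaults.conf settings described above are assumed to be in place already.

```python
def build_submit_args(jar_path):
    """Build a PYSPARK_SUBMIT_ARGS value that adds one jar (hypothetical helper)."""
    # Forward slashes avoid backslash-escaping surprises with Windows paths.
    normalized = jar_path.replace("\\", "/")
    # The value has to end with "pyspark-shell" for PySpark to start the JVM.
    return f'--jars "{normalized}" pyspark-shell'


# Placeholder path: point this at your real TiSpark jar.
submit_args = build_submit_args(r"C:\tools\tispark\tispark-assembly.jar")
print(submit_args)

# In the script you run from PyCharm, set the variable before creating the session:
# import os
# os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
```

You will probably also need SPARK_HOME set, either as a Windows environment variable or in the PyCharm run configuration, so that PySpark picks up the patched session.py and your spark-defaults.conf.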
