How do I connect PySpark to Hive and create stored tables directly in Hive?
Hey there! Let's break down your two PySpark + Hive questions step by step:
Connecting PySpark to Hive is straightforward, but there are a few key setup steps and code patterns to follow:
- First, ensure your Spark environment has Hive support: Most pre-built Spark packages come with Hive integration out of the box. The critical thing is to make sure Hive's hive-site.xml configuration file is accessible to Spark: either copy it into Spark's conf directory, or pass the relevant settings (such as hive.metastore.uris) through Spark configuration.
- Code implementation (two approaches):
  - Legacy approach (Spark 1.x, using HiveContext): This matches what you're already using, with added context for metastore setup:

        from pyspark import SparkConf, SparkContext
        from pyspark.sql import HiveContext

        # Point Spark at the Hive metastore if it is remote
        # (adjust host/port to your setup)
        conf = (
            SparkConf()
            .setAppName("data_import")
            .set("hive.metastore.uris", "thrift://your-metastore-host:9083")
        )
        sc = SparkContext(conf=conf)
        sqlContext = HiveContext(sc)

  - Modern approach (Spark 2.x+, using SparkSession, recommended): SparkSession simplifies the API and enables Hive support with a single call:

        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .appName("data_import")
            .config("hive.metastore.uris", "thrift://your-metastore-host:9083")
            .enableHiveSupport()  # enables Hive integration
            .getOrCreate()
        )
- Important check: If you're using a remote Hive metastore (common in cluster setups), make sure the metastore service (started with hive --service metastore) is running and that Spark can reach its address.
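The reachability check above can be sketched with the standard library alone. This is a minimal sketch, not part of PySpark: the helper names parse_metastore_uri and metastore_reachable are made up for illustration, and falling back to port 9083 when the URI omits a port is an assumption based on the conventional metastore port. It only tests that a TCP connection can be opened, not that the service behind it is a healthy metastore.

```python
import socket
from urllib.parse import urlparse


def parse_metastore_uri(uri):
    """Split a thrift://host:port metastore URI into (host, port)."""
    parsed = urlparse(uri)
    if parsed.scheme != "thrift" or not parsed.hostname:
        raise ValueError(f"not a thrift:// metastore URI: {uri!r}")
    # Assumption: use the conventional port 9083 when none is given
    return parsed.hostname, parsed.port or 9083


def metastore_reachable(uri, timeout=3.0):
    """Return True if a TCP connection to the metastore address succeeds."""
    host, port = parse_metastore_uri(uri)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures
        return False
```

Calling metastore_reachable("thrift://your-metastore-host:9083") from the machine that runs the Spark driver, before creating the session, should tell you whether a later failure is a network problem or a configuration problem.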
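Your testdb.db folder shows up in spark-warehouse because, unless told otherwise, Spark SQL uses its own warehouse location (spark.sql.warehouse.dir, which defaults to a spark-warehouse directory under the current working directory). To store the database and tables in Hive's native warehouse instead (so they live in one central place, the way they would in MySQL or MongoDB), you need to point Spark to Hive's warehouse directory.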
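To find out which directory Hive itself uses, one option is to read it straight from hive-site.xml. The following is a minimal, standard-library-only sketch; the function name read_hive_settings is hypothetical, and it assumes the usual Hadoop-style layout in which each property element has name and value children.

```python
import xml.etree.ElementTree as ET


def read_hive_settings(path, keys=("hive.metastore.warehouse.dir",
                                   "hive.metastore.uris")):
    """Return {name: value} for the requested properties in hive-site.xml."""
    wanted = set(keys)
    found = {}
    root = ET.parse(path).getroot()  # root element is <configuration>
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name in wanted:
            found[name] = (prop.findtext("value") or "").strip()
    return found
```

For example, read_hive_settings("/etc/hive/conf/hive-site.xml") returns a dict with whichever of the two keys the file defines (that path is an assumption; use wherever your installation keeps the file). If hive.metastore.warehouse.dir is missing from the result, Hive is using its built-in default of /user/hive/warehouse.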
Here's how to fix your code:
Hive's default warehouse is typically /user/hive/warehouse (defined in hive-site.xml via the hive.metastore.warehouse.dir setting). You have two easy options:
- Copy Hive's hive-site.xml to Spark's conf directory: Spark will automatically pick up Hive's warehouse path and metastore settings, no extra code needed.
- Explicitly set the warehouse path in your Spark configuration:
Here's the modified version of your original code with explicit configuration:
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import HiveContext

    conf = (
        SparkConf()
        .setAppName("data_import")
        .set("spark.sql.shuffle.partitions", "2")
        # Match Hive's default warehouse path
        .set("spark.sql.warehouse.dir", "/user/hive/warehouse")
        # Add this if you use a remote metastore
        .set("hive.metastore.uris", "thrift://your-metastore-host:9083")
    )
    sc = SparkContext(conf=conf)
    sqlContext = HiveContext(sc)

    # These statements now create the database and table in Hive's native warehouse
    sqlContext.sql("CREATE DATABASE testdb")
    sqlContext.sql("USE testdb")
    sqlContext.sql("create table daily_revenue(order_date string, daily_revenue float)")
If you switch to the modern SparkSession approach, it looks like this:
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("data_import")
        .config("spark.sql.shuffle.partitions", "2")
        .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
        .enableHiveSupport()
        .getOrCreate()
    )

    spark.sql("CREATE DATABASE testdb")
    spark.sql("USE testdb")
    spark.sql("create table daily_revenue(order_date string, daily_revenue float)")
After making this change, your testdb.db folder and tables will live in Hive's default warehouse directory. You'll be able to see and query them directly using the Hive CLI or any other Hive-compatible tool, just like you would with MySQL or MongoDB.
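To confirm from Spark where a table actually landed, you could inspect the output of DESCRIBE FORMATTED. The helper below is a rough sketch under a few assumptions: table_location is a made-up name, spark is a SparkSession you have already created with enableHiveSupport(), and the row label is matched with or without a trailing colon because the exact label can differ between Spark versions.

```python
def table_location(spark, qualified_name):
    """Return the storage location Spark reports for a table, or None.

    `spark` is an existing SparkSession; `qualified_name` is e.g.
    "testdb.daily_revenue".
    """
    rows = spark.sql(f"DESCRIBE FORMATTED {qualified_name}").collect()
    for row in rows:
        label = (row.col_name or "").strip().rstrip(":")
        if label == "Location":
            # For the Location row, the path is reported in the data_type column
            return (row.data_type or "").strip()
    return None
```

With the configuration above, table_location(spark, "testdb.daily_revenue") would be expected to return a path under /user/hive/warehouse/testdb.db rather than under spark-warehouse.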
This question originally comes from Stack Exchange; it was asked by Krishna.




