How do you connect to Hive with PySpark and create tables stored directly in Hive?

阿华AIGC实验室

2026-5-15

Hey there! Let's break down your two PySpark + Hive questions step by step:

1. How to Connect PySpark to Hive

Connecting PySpark to Hive is straightforward, but there are a few key setup steps and code patterns to follow:

First, ensure your Spark environment has Hive support. Most pre-built Spark packages come with Hive integration out of the box. The critical step is making Hive's hive-site.xml configuration file visible to Spark: either copy it into Spark's conf directory, or set the equivalent properties (for example hive.metastore.uris) directly in your Spark configuration.
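A quick way to confirm that Spark can actually see hive-site.xml is to look for it in the conf directory. The following is only a minimal sketch: it assumes the conf directory is given by the SPARK_CONF_DIR environment variable or by SPARK_HOME/conf, and the helper name find_hive_site is hypothetical. Installations that set neither variable (some pip-installed PySpark setups, for example) will need a different lookup.

```python
import os
from pathlib import Path


def find_hive_site(env=None):
    """Return the path of hive-site.xml in Spark's conf directory, or None.

    Assumption: the conf directory is SPARK_CONF_DIR if set,
    otherwise SPARK_HOME/conf.
    """
    env = os.environ if env is None else env
    candidates = []
    if env.get("SPARK_CONF_DIR"):
        candidates.append(Path(env["SPARK_CONF_DIR"]))
    if env.get("SPARK_HOME"):
        candidates.append(Path(env["SPARK_HOME"]) / "conf")
    for conf_dir in candidates:
        hive_site = conf_dir / "hive-site.xml"
        if hive_site.is_file():
            return hive_site
    return None
```

If find_hive_site() returns None, Spark is probably not picking up your Hive settings from a file, and you will need to pass them in code instead.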

Code implementation (two approaches):

Legacy approach (Spark 1.x, using HiveContext): This matches what you're already using, with added context for metastore setup:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# Configure Spark to point to your Hive metastore if it's remote
conf = SparkConf().setAppName("data_import") \
    .set("hive.metastore.uris", "thrift://your-metastore-host:9083")  # Adjust host/port to your setup

sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

Modern approach (Spark 2.x+, using SparkSession, recommended): SparkSession simplifies the API and includes Hive support with a single flag:

from pyspark.sql import SparkSession

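# Parenthesized chain so each builder call can carry an inline comment
spark = (
    SparkSession.builder
    .appName("data_import")
    .enableHiveSupport()  # This flag enables full Hive integration
    .config("hive.metastore.uris", "thrift://your-metastore-host:9083")  # Adjust host/port to your setup
    .getOrCreate()
)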

Important check: If you're using a remote Hive metastore (common in cluster setups), make sure the metastore service (started with hive --service metastore) is running and that Spark can reach its host and port.
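One rough way to test reachability before starting Spark is a plain TCP connection attempt. This is a sketch under stated assumptions: the helper name metastore_reachable is hypothetical, 9083 is just the port used in the examples above, and a successful connection only shows that something is listening there, not that the Thrift service behind it is healthy.

```python
import socket


def metastore_reachable(host, port=9083, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and applies the timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures
        return False


# Example call (placeholder host from the snippets above):
# metastore_reachable("your-metastore-host", 9083)
```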

2. Storing Tables in Hive's Default Location (Instead of spark-warehouse)

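Your testdb.db folder is showing up in spark-warehouse because Spark SQL falls back to its own default warehouse path (a spark-warehouse directory under the directory the application is launched from) when it isn't told where Hive's warehouse lives. To align this with Hive's native storage, so the database is managed in one central place the way it would be in MySQL or MongoDB, you need to point Spark to Hive's warehouse directory.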
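Before changing anything, it can help to confirm which warehouse directory Hive itself is configured with. The sketch below reads a property from hive-site.xml, assuming the file uses the standard Hadoop layout of property elements with name and value children; the helper name read_hive_property and the file path in the example call are hypothetical.

```python
import xml.etree.ElementTree as ET


def read_hive_property(hive_site_path, name):
    """Return the value of a property in hive-site.xml, or None if absent."""
    root = ET.parse(hive_site_path).getroot()
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Example call (path is a placeholder for your own installation):
# read_hive_property("/path/to/hive-site.xml", "hive.metastore.warehouse.dir")
```

If the property is absent, Hive is using its built-in default rather than an explicit setting.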

Here's how to fix your code:

Hive's default warehouse is typically /user/hive/warehouse (defined in hive-site.xml via the hive.metastore.warehouse.dir setting). You have two easy options:

Option 1: Copy Hive's hive-site.xml to Spark's conf directory. Spark will then pick up Hive's warehouse path and metastore settings automatically, with no extra code needed.
Option 2: Explicitly set the warehouse path in your Spark configuration.

Here's the modified version of your original code with explicit configuration:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# Parenthesized chain so each setting can carry an inline comment
conf = (
    SparkConf().setAppName("data_import")
    .set("spark.sql.shuffle.partitions", "2")
    .set("spark.sql.warehouse.dir", "/user/hive/warehouse")  # Match Hive's default warehouse path
    .set("hive.metastore.uris", "thrift://your-metastore-host:9083")  # Add if using remote metastore
)

sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

# Now these operations will create tables in Hive's native warehouse
# IF NOT EXISTS keeps the script safe to re-run
sqlContext.sql("CREATE DATABASE IF NOT EXISTS testdb")
sqlContext.sql("USE testdb")
sqlContext.sql("CREATE TABLE IF NOT EXISTS daily_revenue(order_date string, daily_revenue float)")

If you switch to the modern SparkSession approach, it looks like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("data_import") \
    .enableHiveSupport() \
    .config("spark.sql.shuffle.partitions", "2") \
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse") \
    .getOrCreate()

# IF NOT EXISTS keeps the script safe to re-run
spark.sql("CREATE DATABASE IF NOT EXISTS testdb")
spark.sql("USE testdb")
spark.sql("CREATE TABLE IF NOT EXISTS daily_revenue(order_date string, daily_revenue float)")

After making this change, your testdb.db folder and tables will live in Hive's default warehouse directory. You'll be able to see and query them directly using the Hive CLI or any other Hive-compatible tool, just like you would with MySQL or MongoDB.
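To double-check where a table actually landed, you can inspect the output of DESCRIBE FORMATTED from Spark. The helper below is a hedged sketch: table_location is a hypothetical name, and it assumes each output row is indexable as (col_name, data_type, comment) with the path stored in a row labeled Location. A trailing colon on the label is stripped in case your Spark version prints one.

```python
def table_location(describe_rows):
    """Pick the storage location out of DESCRIBE FORMATTED output rows.

    Each row is expected to be indexable as (col_name, data_type, comment).
    Returns None if no Location row is found.
    """
    for row in describe_rows:
        label = (row[0] or "").strip().rstrip(":")
        if label == "Location":
            return (row[1] or "").strip()
    return None


# Usage with the SparkSession from the snippet above:
# rows = spark.sql("DESCRIBE FORMATTED testdb.daily_revenue").collect()
# print(table_location(rows))
```

If the printed path still points at spark-warehouse, the warehouse setting was not picked up, most likely because the session was created before the configuration was applied.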

The question in this post comes from Stack Exchange; it was asked by Krishna.