Spark beginner question: how do I develop PySpark locally and submit jobs to a cluster?

阿华AIGC实验室

2026-5-21

Hey there! I've helped a few folks set up PySpark dev environments on Windows for remote Spark clusters, so let's break this down step by step for your specific setup:

1. Get Your Local Windows Environment Ready

First, let's make sure your Windows 10 laptop has all the pieces to run PySpark locally and talk to your Ubuntu cluster:

Match Python Versions: Install the same Python version (at least the same major.minor, e.g. 3.6) on Windows that's running on your Ubuntu Spark cluster. PySpark refuses to run tasks when the driver and the workers use different minor versions.
Install PySpark: Run pip install pyspark==2.3.0 (exact version matching your cluster's Spark is critical here).
Set Environment Variables:
- SPARK_HOME: Point this to the path where you unzipped Spark 2.3.0 on Windows (e.g., C:\spark-2.3.0-bin-hadoop2.7).
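- HADOOP_HOME: Windows needs this for file system operations. Download the winutils.exe built for your Hadoop version (the prebuilt Spark 2.3.0 package targets Hadoop 2.7), place it in a bin subfolder such as C:\hadoop-2.7.0\bin, and set HADOOP_HOME to the parent folder (C:\hadoop-2.7.0). Also add %HADOOP_HOME%\bin to your system PATH.
- PYSPARK_PYTHON: For local runs, set this to your local Python executable path (e.g., C:\Python36\python.exe) so PySpark uses the right interpreter. Note that the driver forwards this value to the executors, so when you target the Ubuntu cluster it has to name an interpreter that exists on the workers (for example python3) rather than a Windows path.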
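
If you want to sanity-check those variables before launching anything, a small script can do it. This is only a sketch: check_spark_env is a made-up helper name, and the only thing it assumes about Spark is that winutils.exe is expected under %HADOOP_HOME%\bin.

```python
import os


def check_spark_env(env):
    """Return a list of human-readable problems found in the given environment mapping."""
    problems = []
    for name in ("SPARK_HOME", "HADOOP_HOME"):
        path = env.get(name)
        if not path:
            problems.append(name + " is not set")
        elif not os.path.isdir(path):
            problems.append(name + " points to a missing folder: " + path)

    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home and os.path.isdir(hadoop_home):
        # Spark on Windows looks for winutils.exe under %HADOOP_HOME%\bin
        winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
        if not os.path.isfile(winutils):
            problems.append("winutils.exe not found at " + winutils)
    return problems


if __name__ == "__main__":
    # Print every problem, or a single all-clear line when the list is empty
    for problem in check_spark_env(os.environ) or ["environment looks fine"]:
        print(problem)
```

Run it from the same command prompt you plan to start pyspark from, since that is the environment Spark will actually inherit.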

2. Test Local PySpark First

Before jumping to remote clusters, confirm local PySpark works:

Open a command prompt and run pyspark. You should see the Spark shell load.

Run a quick test command:

rdd = sc.parallelize([1,2,3,4])
print(rdd.sum())

If it outputs 10, your local setup is good to go.

3. Configure Access to Your Ubuntu Spark Cluster

Next, let's get your Windows machine talking to the remote standalone cluster:

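Verify Network Access: Make sure your Windows laptop can ping the Ubuntu server, and that the cluster's firewall allows traffic on port 7077 (Spark Master default) and 8080 (Spark Web UI). Because the driver runs on your laptop, the cluster's workers also need to be able to connect back to your Windows machine, so check the Windows firewall too.
Sync Spark Configs: Copy spark-defaults.conf from your Ubuntu cluster's $SPARK_HOME/conf directory to your local Windows %SPARK_HOME%\conf directory so that your local PySpark picks up the same cluster settings. spark-env.sh is a Bash script and is not read on Windows (the Windows launcher scripts look for spark-env.cmd instead), so treat it as a reference for which values to mirror, and drop any Linux-only paths from the settings you copy.
Set Cluster Master URL: Add an environment variable SPARK_MASTER_URL on Windows with the value spark://<your-ubuntu-server-ip>:7077 (replace <your-ubuntu-server-ip> with your server's actual IP address). As far as I know, Spark does not read this variable by itself on the client side, so treat it as a convenience for your own scripts to look up; the sample script in step 4 passes the URL to .master() directly.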
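
Ping only proves ICMP gets through, while Spark talks TCP. A rough way to test the ports themselves is sketched below; port_open and check_master are hypothetical helper names, and the port numbers are just the defaults mentioned above.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names
        return False


def check_master(host, ports=(7077, 8080)):
    """Map each port to whether it accepted a connection."""
    return {port: port_open(host, port) for port in ports}
```

Calling check_master("<your-ubuntu-server-ip>") with your real address returns a dict of port to True/False. A False entry usually points at a firewall rule or at the master not listening on that interface.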

4. Set Up Eclipse/PyDev for PySpark Development

Now let's configure your IDE to write and submit jobs:

Create a PyDev Project: Open Eclipse, go to File > New > PyDev Project, and set up a project with your configured Python interpreter.
Add PySpark Libraries: In your project's properties, go to PyDev - PYTHONPATH, open the External Libraries tab, and add the following paths (the folder via "Add source folder", the archives via "Add zip/jar/egg"):
- %SPARK_HOME%\python
- All .zip files in %SPARK_HOME%\python\lib (like py4j-0.10.6-src.zip; the exact version depends on your Spark release)
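
For reference, those two entries amount to putting the same locations on sys.path. Here is a minimal sketch of doing that by hand, which can be handy outside Eclipse; pyspark_paths and add_pyspark_to_path are illustrative names, not part of PySpark.

```python
import glob
import os
import sys


def pyspark_paths(spark_home):
    """Return the folder and zip archives that make `import pyspark` resolvable."""
    python_dir = os.path.join(spark_home, "python")
    zips = sorted(glob.glob(os.path.join(python_dir, "lib", "*.zip")))
    return [python_dir] + zips


def add_pyspark_to_path(spark_home):
    """Prepend the PySpark paths to sys.path, skipping any already present."""
    # Insert in reverse so the final order on sys.path matches pyspark_paths()
    for path in reversed(pyspark_paths(spark_home)):
        if path not in sys.path:
            sys.path.insert(0, path)
```

If you installed PySpark with pip into the interpreter the project uses, the import already resolves and this step should not be needed.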

Write a Test Job: Create a Python script (e.g., remote_spark_test.py) with this sample code:

import os
from pyspark.sql import SparkSession

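# Optional: set env vars here if you didn't set them system-wide
os.environ['SPARK_HOME'] = 'C:\\spark-2.3.0-bin-hadoop2.7'
# PYSPARK_PYTHON is forwarded to the executors, so for a remote cluster it must
# name an interpreter that exists on the Ubuntu workers (assumed here: python3)
os.environ['PYSPARK_PYTHON'] = 'python3'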

if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("spark://<your-ubuntu-server-ip>:7077") \
        .appName("FirstRemotePySparkJob") \
        .getOrCreate()

    # Simple test data
    data = [("David", 30), ("Sarah", 28)]
    df = spark.createDataFrame(data, ["Name", "Age"])
    df.show()

    spark.stop()
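
The script above hard-codes the master URL. If you would rather reuse the SPARK_MASTER_URL variable from step 3, something along these lines could work. It is a sketch under a few assumptions: resolve_master and build_session are made-up helpers, and falling back to local[*] when the variable is missing is my own choice, not Spark behavior.

```python
import os

DEFAULT_MASTER = "local[*]"  # assumption: fall back to local mode when nothing is configured


def resolve_master(env=None):
    """Return the master URL from SPARK_MASTER_URL, or the local fallback."""
    env = os.environ if env is None else env
    url = (env.get("SPARK_MASTER_URL") or "").strip()
    if not url:
        return DEFAULT_MASTER
    # Only standalone (spark://) and local masters are expected in this setup
    if not (url.startswith("spark://") or url.startswith("local")):
        raise ValueError("unexpected master URL: " + url)
    return url


def build_session(app_name):
    """Create a SparkSession against the resolved master (needs pyspark installed)."""
    # Imported lazily so resolve_master() can be used without Spark on the path
    from pyspark.sql import SparkSession

    return SparkSession.builder.master(resolve_master()).appName(app_name).getOrCreate()
```

In the test job you would then replace the builder chain with spark = build_session("FirstRemotePySparkJob"), which lets the same script run locally or against the cluster depending on the environment.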

Configure Run Settings: Go to Run > Run Configurations, create a new Python Run configuration for your script. Under the Environment tab, add the same SPARK_HOME and PYSPARK_PYTHON variables if you didn't set them system-wide.

5. Submit and Verify the Job

Run your script from Eclipse. The driver runs on your laptop, so the df.show() output appears in the Eclipse console, while the tasks themselves execute on your Ubuntu cluster.
To confirm it's running on the cluster, open the Spark Web UI in your browser: http://<your-ubuntu-server-ip>:8080. You'll see your application listed under "Running Applications" or "Completed Applications".
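
If you prefer checking from a script instead of the browser, the standalone master also serves its status as JSON under /json/ on the Web UI port. The sketch below assumes that endpoint is reachable and that the payload lists applications under "activeapps" and "completedapps" with a "name" field; please verify those keys against your own master before relying on them. find_app and fetch_master_state are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def find_app(master_state, app_name):
    """Return 'running', 'completed', or None for app_name in the master's JSON state."""
    for key, label in (("activeapps", "running"), ("completedapps", "completed")):
        for app in master_state.get(key, []):
            if app.get("name") == app_name:
                return label
    return None


def fetch_master_state(host, port=8080, timeout=5.0):
    """Download and decode the standalone master's JSON status page."""
    with urlopen("http://{}:{}/json/".format(host, port), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For the test job, find_app(fetch_master_state("<your-ubuntu-server-ip>"), "FirstRemotePySparkJob") should come back as "running" or "completed" once the application has registered with the master.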

Key Notes to Avoid Headaches

Version Lock: Always keep local PySpark, cluster Spark, and Python versions identical. Mismatches cause cryptic errors.
Data Access: If your job uses input files, ensure the cluster can access them (use HDFS, a shared network drive, or copy the files to the same local path on every worker). A path that only exists on your Windows laptop will not resolve on the executors.
SSH Convenience: For easier file transfers and job management, set up SSH key-based authentication between Windows and Ubuntu so you don't have to enter passwords every time.
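
On the Version Lock point, a tiny pre-flight check can catch mismatches before a job is submitted. This is just a sketch: the helper names are invented, and the cluster-side versions are values you would look up yourself (for example from the Web UI and from python3 --version on the server).

```python
import sys


def same_minor(version_a, version_b):
    """True when two dotted version strings share major and minor components."""
    return version_a.split(".")[:2] == version_b.split(".")[:2]


def local_python_version():
    """Return the running interpreter's version as 'major.minor.micro'."""
    return "{}.{}.{}".format(*sys.version_info[:3])


def check_versions(local_spark, cluster_spark, cluster_python):
    """Return a list of mismatch messages; an empty list means the versions line up."""
    issues = []
    if local_spark != cluster_spark:
        issues.append("Spark: local {} vs cluster {}".format(local_spark, cluster_spark))
    if not same_minor(local_python_version(), cluster_python):
        issues.append(
            "Python: local {} vs cluster {}".format(local_python_version(), cluster_python)
        )
    return issues
```

With PySpark installed, the local Spark version is available as pyspark.__version__, so a call would look like check_versions(pyspark.__version__, "2.3.0", "<cluster python version>").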

The question comes from Stack Exchange; it was asked by David.