Spark newbie question: how do I develop PySpark locally and submit jobs to a cluster?
Hey there! I've helped a few folks set up PySpark dev environments on Windows for remote Spark clusters, so let's break this down step by step for your specific setup:
1. Get Your Local Windows Environment Ready
First, let's make sure your Windows 10 laptop has all the pieces to run PySpark locally and talk to your Ubuntu cluster:
- Match Python Versions: Install the same Python version on Windows that's running on your Ubuntu Spark cluster. This avoids nasty compatibility bugs.
- Install PySpark: Run pip install pyspark==2.3.0 (the version must match your cluster's Spark exactly).
- Set Environment Variables:
  - SPARK_HOME: Point this to the folder where you unzipped Spark 2.3.0 on Windows (e.g., C:\spark-2.3.0-bin-hadoop2.7).
  - HADOOP_HOME: Windows needs this for file system operations. Download the winutils.exe that matches your Hadoop version (Spark 2.3.0 usually uses Hadoop 2.7), put it in the bin subfolder of a folder like C:\hadoop-2.7.0, and set HADOOP_HOME to that folder. Also add %HADOOP_HOME%\bin to your system PATH.
  - PYSPARK_PYTHON: Set this to your local Python executable path (e.g., C:\Python36\python.exe) to ensure PySpark uses the right interpreter. Keep in mind that, as far as I know, the executors use this same value to start their Python workers, so for jobs sent to the Ubuntu cluster it has to name an interpreter that also exists on the cluster nodes.
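If you want to double-check those variables before launching anything, a small script along these lines can help. It is only a sketch: the function name check_spark_env is made up for this example, and it assumes winutils.exe is expected under %HADOOP_HOME%\bin as described above.

```python
import os


def check_spark_env(env, exists=os.path.exists):
    """Return a list of human-readable problems found in an environment mapping."""
    problems = []

    # SPARK_HOME and HADOOP_HOME must be set and must point to real folders.
    for name in ("SPARK_HOME", "HADOOP_HOME"):
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not exists(value):
            problems.append(f"{name} points to a missing path: {value}")

    # PYSPARK_PYTHON may be a full path or a bare command name, so only check that it is set.
    if not env.get("PYSPARK_PYTHON"):
        problems.append("PYSPARK_PYTHON is not set")

    # winutils.exe is looked up under HADOOP_HOME\bin.
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home:
        winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
        if not exists(winutils):
            problems.append(f"winutils.exe not found at {winutils}")

    return problems


if __name__ == "__main__":
    for line in check_spark_env(os.environ) or ["environment looks consistent"]:
        print(line)
```

Run it from the same command prompt (or Eclipse run configuration) you plan to use for PySpark, since that is the environment that actually matters.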
2. Test Local PySpark First
Before jumping to remote clusters, confirm local PySpark works:
- Open a command prompt and run pyspark. You should see the Spark shell load.
- Run a quick test command:

      rdd = sc.parallelize([1, 2, 3, 4])
      print(rdd.sum())

If it outputs 10, your local setup is good to go.
3. Configure Access to Your Ubuntu Spark Cluster
Next, let's get your Windows machine talking to the remote standalone cluster:
- Verify Network Access: Make sure your Windows laptop can ping the Ubuntu server, and that the cluster's firewall allows traffic on port 7077 (Spark Master default) and 8080 (Spark Web UI).
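- Sync Spark Configs: Copy the spark-defaults.conf and spark-env.sh files from your Ubuntu cluster's $SPARK_HOME/conf directory to your local Windows SPARK_HOME/conf directory. This ensures your local PySpark uses the same cluster settings. Note that spark-env.sh is a shell script; Spark's Windows launch scripts read spark-env.cmd instead, so any settings you need from it have to be ported by hand.
- Set Cluster Master URL: Add an environment variable SPARK_MASTER_URL on Windows with the value spark://<your-ubuntu-server-ip>:7077 (replace <your-ubuntu-server-ip> with your server's actual IP address). As far as I know, Spark does not read this variable on its own, so treat it as a convenience that your own scripts pick up.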
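The network check and the master URL from the steps above can be tied together in a few lines of standard-library Python. This is a rough sketch, not part of Spark: parse_master_url and port_open are names I made up, and the addresses in the usage note are placeholders.

```python
import socket
from urllib.parse import urlparse


def parse_master_url(url):
    """Split a standalone master URL such as spark://host:7077 into (host, port)."""
    parsed = urlparse(url)
    if parsed.scheme != "spark" or not parsed.hostname:
        raise ValueError(f"not a standalone master URL: {url!r}")
    # Fall back to the default master port when the URL does not name one.
    return parsed.hostname, parsed.port or 7077


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, host, port = parse_master_url(os.environ["SPARK_MASTER_URL"]) followed by port_open(host, port) and port_open(host, 8080) tells you whether the master and its Web UI are reachable from the laptop. A False result usually points at a firewall rule or a wrong IP rather than at Spark itself.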
4. Set Up Eclipse/PyDev for PySpark Development
Now let's configure your IDE to write and submit jobs:
- Create a PyDev Project: Open Eclipse, go to File > New > PyDev Project, and set up a project with your configured Python interpreter.
- Add PySpark Libraries: In your project's properties, go to PyDev > Interpreter/Grammar, select your interpreter, and add the following paths:
  - %SPARK_HOME%\python
  - All .zip files in %SPARK_HOME%\python\lib (like py4j-0.10.6-src.zip; the version matches your Spark setup)
- Write a Test Job: Create a Python script (e.g., remote_spark_test.py) with this sample code:

      import os
      from pyspark.sql import SparkSession

      # Optional: set env vars directly in the script if you didn't set them system-wide
      os.environ['SPARK_HOME'] = 'C:\\spark-2.3.0-bin-hadoop2.7'
      # Note: executors also start their Python workers with this value, so if they
      # cannot launch Python, point it at an interpreter that exists on the cluster.
      os.environ['PYSPARK_PYTHON'] = 'C:\\Python36\\python.exe'

      if __name__ == "__main__":
          spark = SparkSession.builder \
              .master("spark://<your-ubuntu-server-ip>:7077") \
              .appName("FirstRemotePySparkJob") \
              .getOrCreate()

          # Simple test data
          data = [("David", 30), ("Sarah", 28)]
          df = spark.createDataFrame(data, ["Name", "Age"])
          df.show()

          spark.stop()

- Configure Run Settings: Go to Run > Run Configurations and create a new Python Run configuration for your script. Under the Environment tab, add the same SPARK_HOME and PYSPARK_PYTHON variables if you didn't set them system-wide.
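If you would rather not depend on the IDE's library settings, the same two kinds of path entries (the python folder and the zips under python\lib) can be added from a script before importing pyspark. The sketch below assumes the unzipped Spark layout described above; pyspark_paths and add_pyspark_to_path are hypothetical helper names, and none of this is needed if pyspark was installed with pip.

```python
import glob
import os
import sys


def pyspark_paths(spark_home):
    """List the folder and zip archives PySpark needs on sys.path."""
    python_dir = os.path.join(spark_home, "python")
    # Picks up pyspark.zip and the bundled py4j source zip, whatever its version.
    zips = sorted(glob.glob(os.path.join(python_dir, "lib", "*.zip")))
    return [python_dir] + zips


def add_pyspark_to_path(spark_home):
    """Prepend any missing PySpark entries to sys.path and return the ones added."""
    added = [p for p in pyspark_paths(spark_home) if p not in sys.path]
    sys.path[:0] = added
    return added
```

Calling add_pyspark_to_path(os.environ["SPARK_HOME"]) at the top of a script, before the pyspark import, should have the same effect as the PyDev path entries, and it keeps the script runnable outside Eclipse.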
5. Submit and Verify the Job
- Run your script from Eclipse. You should see output in the console, and the job will be sent to your Ubuntu cluster.
- To confirm it's running on the cluster, open the Spark Web UI in your browser: http://<your-ubuntu-server-ip>:8080. You'll see your job listed under "Running Applications" or "Completed Applications".
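The Web UI check can also be scripted. The sketch below assumes the standalone master serves a JSON version of its status page at /json/ with activeapps and completedapps lists whose entries carry id, name, and state fields; please verify that against your own master before relying on it. fetch_master_state and find_app are names invented for this example.

```python
import json
from urllib.request import urlopen


def fetch_master_state(host, port=8080, timeout=5.0):
    """Download the master's JSON status page (assumed to live at /json/)."""
    with urlopen(f"http://{host}:{port}/json/", timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def find_app(state, app_name):
    """Return (section, app_id, app_state) for every application with the given name."""
    matches = []
    for section in ("activeapps", "completedapps"):
        for app in state.get(section, []):
            if app.get("name") == app_name:
                matches.append((section, app.get("id"), app.get("state")))
    return matches
```

With the test job above, find_app(fetch_master_state("<your-ubuntu-server-ip>"), "FirstRemotePySparkJob") should return at least one entry if the job really reached the cluster. An empty list suggests it ran in local mode or never connected.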
Key Notes to Avoid Headaches
- Version Lock: Always keep local PySpark, cluster Spark, and Python versions identical. Mismatches cause cryptic errors.
- Data Access: If your job uses input files, ensure the cluster can access them (use HDFS, a shared network drive, or upload files to the cluster's local storage).
- SSH Convenience: For easier file transfers and job management, set up SSH key-based authentication between Windows and Ubuntu so you don't have to enter passwords every time.
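For the version lock note, a tiny comparison helper can catch mismatches before they turn into cryptic errors. It is a sketch under a couple of assumptions: you collect the cluster's versions yourself (for instance by running spark-submit --version and python --version on the server), and the helper names are made up. Spark versions are compared exactly, while Python is compared on major.minor, which is the level at which PySpark workers refuse to run against a different driver version.

```python
def major_minor(version):
    """Reduce a version string such as '3.6.8' to its (major, minor) tuple."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def version_mismatches(local_spark, cluster_spark, local_python, cluster_python):
    """Return a list of mismatch descriptions; an empty list means the versions line up."""
    problems = []
    # Spark itself should match exactly between the laptop and the cluster.
    if local_spark != cluster_spark:
        problems.append(f"Spark: local {local_spark} vs cluster {cluster_spark}")
    # For Python, the major.minor pair is what has to agree.
    if major_minor(local_python) != major_minor(cluster_python):
        problems.append(f"Python: local {local_python} vs cluster {cluster_python}")
    return problems
```

On the laptop, pyspark.__version__ and platform.python_version() give the two local values to pass in.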
This question originally came from Stack Exchange; asked by David.




