Spark beginner question: how do I develop PySpark locally and submit jobs to a cluster?

Hey there! I've helped a few folks set up PySpark dev environments on Windows for remote Spark clusters, so let's break this down step by step for your specific setup:

1. Get Your Local Windows Environment Ready

First, let's make sure your Windows 10 laptop has all the pieces to run PySpark locally and talk to your Ubuntu cluster:

  • Match Python Versions: Install the same Python version on Windows that's running on your Ubuntu Spark cluster. This avoids nasty compatibility bugs.
  • Install PySpark: Run pip install pyspark==2.3.0 (exact version matching your cluster's Spark is critical here).
  • Set Environment Variables:
    • SPARK_HOME: Point this to the path where you unzipped Spark 2.3.0 on Windows (e.g., C:\spark-2.3.0-bin-hadoop2.7).
    • HADOOP_HOME: Spark on Windows needs this for file system operations. Download the winutils.exe build that matches your Hadoop version (the spark-2.3.0-bin-hadoop2.7 package is built against Hadoop 2.7), place it in a bin subfolder such as C:\hadoop-2.7.0\bin, and set HADOOP_HOME to the parent folder (C:\hadoop-2.7.0). Also add %HADOOP_HOME%\bin to your system PATH.
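    • PYSPARK_PYTHON: For local runs, set this to your local Python executable path (e.g., C:\Python36\python.exe) so PySpark uses the right interpreter. When a job runs against the remote cluster, the driver passes this value on to the executors, and a Windows path will not resolve on the Ubuntu workers; for remote runs, use an interpreter name or path that exists on the cluster (for example python3, if that is what the cluster uses).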
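
A quick way to catch a missing variable or a misplaced winutils.exe before launching anything is a small check script. This is only a sketch: the function name check_spark_env is made up for illustration, and it assumes winutils.exe is expected under %HADOOP_HOME%\bin.

```python
import os

# Variables the setup steps above ask you to define.
REQUIRED_VARS = ("SPARK_HOME", "HADOOP_HOME", "PYSPARK_PYTHON")


def check_spark_env(env=None):
    """Return a list of problems found; an empty list means every check passed.

    `env` is any dict-like mapping of variable names to values and
    defaults to the real process environment.
    """
    env = os.environ if env is None else env
    problems = []

    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append("{} is not set".format(name))

    # Assumption: winutils.exe must sit in the bin folder under HADOOP_HOME.
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home:
        winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
        if not os.path.isfile(winutils):
            problems.append("winutils.exe not found at {}".format(winutils))

    return problems


if __name__ == "__main__":
    for problem in check_spark_env():
        print("PROBLEM:", problem)
```

Run it from the same command prompt (or Eclipse run configuration) you plan to use for PySpark, since that is the environment that matters.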

2. Test Local PySpark First

Before jumping to remote clusters, confirm local PySpark works:

  • Open a command prompt and run pyspark. You should see the Spark shell load.
  • Run a quick test command:
    rdd = sc.parallelize([1,2,3,4])
    print(rdd.sum())
    

If it outputs 10, your local setup is good to go.

3. Configure Access to Your Ubuntu Spark Cluster

Next, let's get your Windows machine talking to the remote standalone cluster:

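  • Verify Network Access: Make sure your Windows laptop can ping the Ubuntu server, and that the cluster's firewall allows traffic on port 7077 (Spark Master default) and 8080 (Spark Web UI default). Because your script runs the driver on the laptop, the workers must also be able to open connections back to it, so check that Windows Firewall is not blocking inbound traffic from the cluster.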
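  • Sync Spark Configs: Copy spark-defaults.conf from your Ubuntu cluster's $SPARK_HOME/conf directory to your local Windows %SPARK_HOME%\conf directory so local submissions pick up the same Spark properties. spark-env.sh is a shell script and is not read on Windows (the Windows launcher looks for spark-env.cmd instead), so port any settings you need from it by hand, and check the copied file for Linux-only paths.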
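  • Set Cluster Master URL: Add an environment variable SPARK_MASTER_URL on Windows with the value spark://<your-ubuntu-server-ip>:7077 (replace <your-ubuntu-server-ip> with your server's actual IP address). As far as I know, Spark does not pick this variable up on its own, so treat it as a convenience for your own scripts to read; the sample script in step 4 passes the URL to .master() directly.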
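
Ping only proves the host is up, so it can help to test the actual ports too. Here is a minimal sketch using only the standard library; the helper names parse_master_url and port_is_open are hypothetical, and falling back to 7077 when the URL omits a port is my assumption.

```python
import socket
from urllib.parse import urlparse


def parse_master_url(url):
    """Split a standalone master URL like spark://10.0.0.5:7077 into (host, port)."""
    parsed = urlparse(url)
    if parsed.scheme != "spark" or parsed.hostname is None:
        raise ValueError("not a spark:// master URL: {!r}".format(url))
    # Assumption: fall back to the default master port when none is given.
    return parsed.hostname, parsed.port or 7077


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False
```

For example, host, port = parse_master_url(os.environ["SPARK_MASTER_URL"]) followed by port_is_open(host, port) and port_is_open(host, 8080) tells you whether the master and its Web UI are reachable from the laptop. It does not check the reverse direction (workers connecting back to your driver).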

4. Set Up Eclipse/PyDev for PySpark Development

Now let's configure your IDE to write and submit jobs:

  • Create a PyDev Project: Open Eclipse, go to File > New > PyDev Project, and set up a project with your configured Python interpreter.
  • Add PySpark Libraries: In your project's properties, go to PyDev > Interpreter/Grammar, select your interpreter, and add the following paths:
    • %SPARK_HOME%\python
    • All .zip files in %SPARK_HOME%\python\lib (like py4j-0.10.6-src.zip — version matches your Spark setup)
  • Write a Test Job: Create a Python script (e.g., remote_spark_test.py) with this sample code:
    import os
    from pyspark.sql import SparkSession
    
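    # Optional: Set env vars directly in the script if you didn't set them system-wide
    os.environ['SPARK_HOME'] = 'C:\\spark-2.3.0-bin-hadoop2.7'
    # PYSPARK_PYTHON is passed on to the executors, so it must name an interpreter
    # that exists on the Ubuntu workers ('python3' is an assumption; adjust as needed)
    os.environ['PYSPARK_PYTHON'] = 'python3'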
    
    if __name__ == "__main__":
        spark = SparkSession.builder \
            .master("spark://<your-ubuntu-server-ip>:7077") \
            .appName("FirstRemotePySparkJob") \
            .getOrCreate()
    
        # Simple test data
        data = [("David", 30), ("Sarah", 28)]
        df = spark.createDataFrame(data, ["Name", "Age"])
        df.show()
    
        spark.stop()
    
  • Configure Run Settings: Go to Run > Run Configurations, create a new Python Run configuration for your script. Under the Environment tab, add the same SPARK_HOME and PYSPARK_PYTHON variables if you didn't set them system-wide.
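
If you'd rather not hunt for the library entries by hand in the Add PySpark Libraries step, a few lines of Python can list them. This is a sketch; pydev_library_paths is a made-up helper name, and it assumes the standard layout of the Spark download (python\ and python\lib\ under SPARK_HOME).

```python
import glob
import os


def pydev_library_paths(spark_home):
    """List the entries to add in PyDev: the python folder plus every zip in python/lib."""
    python_dir = os.path.join(spark_home, "python")
    # Sorted so the output is stable between runs.
    zips = sorted(glob.glob(os.path.join(python_dir, "lib", "*.zip")))
    return [python_dir] + zips


if __name__ == "__main__":
    spark_home = os.environ.get("SPARK_HOME")
    if spark_home:
        for path in pydev_library_paths(spark_home):
            print(path)
    else:
        print("SPARK_HOME is not set")
```

Paste each printed path into the PyDev library list. If only the python folder is printed, SPARK_HOME is probably pointing at the wrong directory.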

5. Submit and Verify the Job

  • Run your script from Eclipse. You should see output in the console, and the job will be sent to your Ubuntu cluster.
  • To confirm it's running on the cluster, open the Spark Web UI in your browser: http://<your-ubuntu-server-ip>:8080. You'll see your job listed under "Running Applications" or "Completed Applications".
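
You can also do the Web UI check from a script. The sketch below rests on assumptions I have not verified against your cluster: that the standalone master serves a JSON summary at /json/ on the Web UI port, and that the summary lists applications under "activeapps" and "completedapps" with a "name" field. The helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_master_state(master_ui_url, timeout=5.0):
    """Download the master's JSON summary (assumed to live at <ui-url>/json/)."""
    url = master_ui_url.rstrip("/") + "/json/"
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def find_app(state, app_name):
    """Return 'running', 'completed', or None for the first app with this name.

    The keys used here ("activeapps", "completedapps", "name") are
    assumptions about the JSON layout.
    """
    for key, label in (("activeapps", "running"), ("completedapps", "completed")):
        for app in state.get(key, []):
            if app.get("name") == app_name:
                return label
    return None
```

With the sample job from step 4, find_app(fetch_master_state("http://<your-ubuntu-server-ip>:8080"), "FirstRemotePySparkJob") should report whether the master has seen the application. If the endpoint or keys differ on your version, fall back to checking the page in the browser.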

Key Notes to Avoid Headaches

  • Version Lock: Always keep local PySpark, cluster Spark, and Python versions identical. Mismatches cause cryptic errors.
  • Data Access: If your job uses input files, ensure the cluster can access them (use HDFS, a shared network drive, or copy the files to the same path on every worker node). A path that only exists on your Windows laptop will not be readable by the executors.
  • SSH Convenience: For easier file transfers and job management, set up SSH key-based authentication between Windows and Ubuntu so you don't have to enter passwords every time.
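
To make the Version Lock note concrete, here is a small sketch of a comparison you could run once you have noted down the versions on both sides. The function name and the dict layout are made up for illustration. It compares Spark versions exactly, and Python versions on major.minor only, since PySpark itself refuses to run when driver and worker Python differ in minor version.

```python
def version_mismatches(local, cluster):
    """Compare two dicts like {"spark": "2.3.0", "python": "3.6.5"}.

    Returns a list of mismatch descriptions; an empty list means the
    two environments are compatible under the rules described above.
    """
    problems = []

    # Spark/PySpark: require an exact match.
    if local["spark"] != cluster["spark"]:
        problems.append(
            "Spark version differs: local {} vs cluster {}".format(
                local["spark"], cluster["spark"]
            )
        )

    # Python: compare major.minor only (e.g. 3.6.5 and 3.6.9 both count as 3.6).
    local_py = local["python"].split(".")[:2]
    cluster_py = cluster["python"].split(".")[:2]
    if local_py != cluster_py:
        problems.append(
            "Python minor version differs: local {} vs cluster {}".format(
                ".".join(local_py), ".".join(cluster_py)
            )
        )

    return problems
```

Locally you can read the values from pyspark.__version__ and sys.version_info; on the cluster, spark-submit --version and python3 --version give the same information.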

This question comes from Stack Exchange; it was asked by David.
