Spark Application Development Best Practices and Cluster Development Workflow

Optimal Workflow for Developing Spark 2.1.0 Apps on Your Cluster (With IntelliJ)

Hey there! I’ve been in your shoes—building Spark apps with a great IDE like IntelliJ but no local space for big datasets. Let’s walk through the best approach to develop, test, and deploy apps that work with your cluster’s data smoothly.

1. Local Development: Test Logic with Small Datasets

You don’t need the full cluster data to write and debug core logic. Here’s how to do it:

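  • Set up a local Spark environment: Install Spark 2.1.0 on your local machine, or use Maven/Gradle to pull in the Spark dependencies. Set scope=provided so the Spark libraries the cluster already supplies are not bundled into your JAR; to run inside IntelliJ you then need the run-configuration option that adds provided-scope dependencies to the classpath.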
  • Use sample data: Download a tiny subset of your cluster data (e.g., 100 rows) to your local machine, or generate synthetic data that mimics the structure of your real dataset. Store this in your project’s src/main/resources folder.
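  • Configure SparkSession for local mode: Set the master to local[*] when running locally to use all your CPU cores:
    import org.apache.spark.sql.SparkSession

    // local[*] runs Spark inside this JVM with one worker thread per core
    val spark = SparkSession.builder()
      .appName("LocalDevTest")
      .master("local[*]")
      .getOrCreate()
    
    This lets you run and debug your code in IntelliJ with the sample data to validate transformations, filters, and business logic. A master set in code takes precedence over the --master flag of spark-submit, so remove or parameterize this line before deploying to the cluster.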
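
As a rough illustration of this local workflow, here is a minimal sketch that keeps the business logic in a function separate from session setup, so it can be exercised against a few synthetic rows. The object name LocalDevSketch, the function totalsByCustomer, and the customer/amount columns are all hypothetical stand-ins for your own logic and schema.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

object LocalDevSketch {
  // Hypothetical business logic: total amount per customer,
  // ignoring rows whose amount is not positive.
  def totalsByCustomer(orders: DataFrame): DataFrame =
    orders
      .filter(col("amount") > 0)
      .groupBy("customer")
      .agg(sum("amount").as("total"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LocalDevSketch")
      .master("local[*]") // local runs only; leave unset in code you submit to the cluster
      .getOrCreate()
    import spark.implicits._

    // Synthetic rows standing in for a small sample of the real data.
    val orders = Seq(
      ("alice", 10.0),
      ("alice", 5.0),
      ("bob", 7.5),
      ("bob", -1.0)
    ).toDF("customer", "amount")

    totalsByCustomer(orders).orderBy("customer").show()
    spark.stop()
  }
}
```

Because totalsByCustomer only takes a DataFrame, the same function can later be pointed at the cluster data without changes.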

2. Point to Cluster Data Without Local Storage

When you’re ready to run against real data, you don’t need to copy anything locally. Just reference your cluster’s storage paths directly in your code:

  • Use HDFS paths: If your data is on HDFS, use the full cluster path, e.g. hdfs://your-namenode:8020/path/to/large-dataset. The host in an hdfs:// URI is the NameNode (or the HA nameservice name), not the edge node.
  • Switch paths easily: Use command-line arguments or a config file to toggle between local sample data and cluster data. For example:
    val inputPath = args(0) // Pass path as argument when submitting
    val df = spark.read.parquet(inputPath)
    
    For local development, pass src/main/resources/sample.parquet as the argument; for a cluster run, pass the HDFS path.
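
One way to make that toggle a little more forgiving is to fall back to the local sample when no argument is given. The sketch below assumes a hypothetical helper object named JobConfig; it uses only plain Scala, and the default path is simply the sample location mentioned above.

```scala
object JobConfig {
  // Hypothetical default: the local sample file used during development.
  val DefaultSamplePath: String = "src/main/resources/sample.parquet"

  // The first non-empty argument wins; with no arguments,
  // fall back to the local sample.
  def inputPath(args: Array[String]): String =
    args.headOption.map(_.trim).filter(_.nonEmpty).getOrElse(DefaultSamplePath)

  // True when the path points at HDFS rather than a local file.
  def isClusterPath(path: String): Boolean =
    path.startsWith("hdfs://")
}
```

In the job itself you would then read with spark.read.parquet(JobConfig.inputPath(args)), and could log isClusterPath to make it obvious which data a run touched.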

3. Deploy to the Cluster from IntelliJ or Command Line

You have two solid options to get your app running on the cluster:

  • IntelliJ Remote Deployment:
    1. Package your app as a JAR (use IntelliJ’s built-in Maven/Gradle build, or run mvn package).
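    2. Set up a "Spark Submit" run configuration in IntelliJ (this configuration type comes from the Spark plugin in JetBrains' Big Data Tools, not from a stock install):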
      • Specify the Spark installation path on your edge node.
      • Set master to yarn (or spark://your-master:7077 if using standalone cluster).
      • Choose deploy mode: cluster (the driver runs inside the cluster) or client (the driver runs on the machine that launches spark-submit, which is handy for quick testing because driver output appears in your console).
      • Add your JAR path and any arguments (like the cluster data path).
  • Command Line Submit:
    1. SCP your built JAR to the edge node.
    2. Run the Spark submit command:
      spark-submit \
        --class com.yourcompany.YourSparkApp \
        --master yarn \
        --deploy-mode cluster \
        --executor-memory 4G \
        --num-executors 8 \
        your-app.jar hdfs://your-namenode:8020/path/to/data
      

4. Debugging Cluster Runs

If things go wrong on the cluster, don’t guess—use these tricks:

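  • Check Spark Logs: Access logs through the YARN ResourceManager UI or the Spark History Server, if your cluster has it enabled. On YARN you can also pull the aggregated logs of a finished app with yarn logs -applicationId <appId>. Look for errors in the driver and executor logs.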
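  • Remote Debugging: Attach IntelliJ to the driver of a running app. This is simplest in client mode, where the driver runs on the machine you submit from; in cluster mode the driver lands on whichever node YARN picks:
    1. Add debug flags to your spark-submit command:
      --driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
      
    2. In IntelliJ, create a "Remote" run configuration, set the host to the machine running the driver (your edge node’s IP if you submitted from there with --deploy-mode client), and the port to 5005. Because of suspend=y, the driver waits until the debugger attaches; run the configuration to attach and set breakpoints.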

5. Best Practices for Spark 2.1.0

  • Dependency Management: Always mark Spark core, SQL, etc., as provided in your build file—this avoids bloating your JAR with libraries the cluster already has.
  • Optimize Data Formats: Use Parquet or ORC instead of CSV/JSON for faster reads/writes and better compression.
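  • Minimize Shuffles: For aggregations, prefer reduceByKey or aggregateByKey over groupByKey. They still shuffle, but they combine values within each partition first, so far less data crosses the network. Partition data strategically to reduce data movement.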
  • Resource Tuning: Match executor memory/cores to your cluster’s available resources. Start with conservative settings and adjust based on job performance.
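
To make the shuffle advice concrete, here is a small sketch that computes the same per-key sums both ways. The object name ShuffleSketch and the sample pairs are hypothetical; the RDD calls (groupByKey, mapValues, reduceByKey) are the standard pair-RDD operations.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object ShuffleSketch {
  // Shuffles every (key, value) pair, then sums after the shuffle.
  def sumWithGroupByKey(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
    pairs.groupByKey().mapValues(_.sum)

  // Sums within each partition first, so only partial sums are shuffled.
  def sumWithReduceByKey(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
    pairs.reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ShuffleSketch")
      .master("local[*]") // local demo only
      .getOrCreate()

    val pairs = spark.sparkContext.parallelize(
      Seq("a" -> 1, "b" -> 2, "a" -> 3, "b" -> 4))

    // Both variants produce the same totals; they differ in how much
    // data is moved between executors.
    println(sumWithGroupByKey(pairs).collect().toMap)
    println(sumWithReduceByKey(pairs).collect().toMap)

    spark.stop()
  }
}
```

On a dataset this small the difference is invisible; it starts to matter when a key has many values spread across partitions.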

The question behind this content comes from Stack Exchange; question author: Taylrl
