
Running a Spark-Scala Jar on EMR Fails: Troubleshooting the "Master URL Must Be Configured" Error

You're hitting two separate issues here: the master URL configuration error, and a hidden edge case in your code that shows up in spark-shell but not in Zeppelin. Let's tackle them one by one.

1. Why the Master URL Error Keeps Happening

Your core issue boils down to two common pitfalls with Spark on EMR:

Invalid spark-submit Command Combinations

Some of the option combinations you tried are either invalid or deprecated:

  • --master local[*] + --deploy-mode cluster: A local master runs the driver and executors in a single JVM on the machine where you invoke spark-submit, while cluster deploy mode asks for the driver to be launched on a node inside the cluster (a YARN node on EMR). The two settings contradict each other, so spark-submit rejects the combination.
  • --master yarn-client: In Spark 2.x, yarn-client is deprecated. Use --master yarn --deploy-mode client instead (though cluster mode is recommended for EMR production jobs).
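
To make the two rules above concrete, here is a minimal sketch in plain Scala. SubmitArgsCheck and check are hypothetical names invented for this illustration; they are not part of Spark, and the sketch only encodes the two cases described in this answer rather than everything spark-submit validates.

```scala
// Hypothetical helper, not a Spark API: encodes only the two rules from the list above.
object SubmitArgsCheck {

  /** Left(reason) for a combination to avoid, Right(()) otherwise. */
  def check(master: String, deployMode: String): Either[String, Unit] =
    if (master.startsWith("local") && deployMode == "cluster")
      // A local master has no cluster to place the driver on.
      Left("cluster deploy mode is not compatible with a local master")
    else if (master == "yarn-client" || master == "yarn-cluster")
      // Deprecated spellings from Spark 1.x.
      Left(s"$master is deprecated in Spark 2.x; use --master yarn with --deploy-mode")
    else
      Right(())

  def main(args: Array[String]): Unit = {
    println(check("local[*]", "cluster"))
    println(check("yarn", "cluster"))
  }
}
```

The point is only that --master and --deploy-mode have to be chosen together: on EMR that normally means --master yarn with either deploy mode.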

Dependency Conflict from Fat JARs

You're using a jar-with-dependencies.jar, which packages all your dependencies (including Spark itself) into the jar. Your EMR cluster already has Spark 2.2.0 installed, so the Spark classes bundled in your jar can clash with the cluster's own. Such a clash can interfere with the settings the cluster would normally supply through spark-defaults.conf (like the yarn master URL), which is a likely cause of the "master URL not set" error.
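
If you're not sure whether your jar actually bundles Spark, you can inspect it before submitting. The sketch below uses only the JDK's java.util.jar.JarFile; FatJarCheck and bundledSparkClasses are hypothetical names made up for this example, and it assumes that any .class entry under org/apache/spark/ comes from Spark itself rather than from your own code.

```scala
import java.util.jar.JarFile
import scala.collection.JavaConverters._

// Hypothetical helper: lists Spark classes that ended up inside a fat jar.
object FatJarCheck {

  /** Names of all .class entries under org/apache/spark/ in the given jar. */
  def bundledSparkClasses(jarPath: String): List[String] = {
    val jar = new JarFile(jarPath)
    try {
      jar.entries().asScala
        .map(_.getName)
        .filter(n => n.startsWith("org/apache/spark/") && n.endsWith(".class"))
        .toList // materialize before the jar is closed
    } finally {
      jar.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val hits = bundledSparkClasses(args(0)) // args(0): local path to the jar
    if (hits.isEmpty) println("No Spark classes bundled.")
    else println(s"${hits.size} Spark classes bundled, e.g. ${hits.head}")
  }
}
```

A non-empty result means the build still packages Spark, so the Spark dependencies are not yet being excluded.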

2. Step-by-Step Fixes

Fix 1: Use the Correct spark-submit Command

Stick to valid combinations for EMR Spark on YARN. For production jobs, use cluster mode:

spark-submit --master yarn --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1.jar

If you need to see logs locally during testing, use client mode:

spark-submit --master yarn --deploy-mode client --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1.jar

Fix 2: Package Your Jar Without Spark Dependencies

EMR provides all Spark core libraries, so you don't need to include them in your Jar. Here's how to adjust your build:

  • If using Maven: Mark Spark dependencies as <scope>provided</scope> so they're excluded from the fat Jar.
  • If using SBT: Add % Provided to Spark dependencies:
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.2.0" % Provided,
      "org.apache.spark" %% "spark-sql" % "2.2.0" % Provided
    )
    
  • In Eclipse: When creating your Jar, exclude all org.apache.spark packages from the build.

Fix 3: Refactor Code to Use SparkSession (Spark 2.x Best Practice)

Your code uses the older SparkContext + SQLContext setup. In Spark 2.x, SparkSession is the recommended entry point. If you build it without a hard-coded .master(...), it picks up the master URL that spark-submit supplies (from --master or from the cluster's spark-defaults.conf). Replace your initialization code with:

import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  println("Entering Spark Mode")
  // No .master(...) here: the master URL comes from spark-submit
  val spark = SparkSession.builder()
    .appName("FinancialLineItem")
    .getOrCreate()
  println("SparkSession initialized")

  import spark.implicits._
  val sc = spark.sparkContext
  // Rest of your code (replace sqlContext with spark)
}

This eliminates manual SparkConf setup that might override cluster configurations.
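
If you also want to run the same main class from an IDE, where nothing sets the master, one possible pattern is a fallback that applies only when no master has been supplied. This is a sketch under assumptions: SessionFactory and build are hypothetical names, and it relies on new SparkConf() loading the spark.* system properties that spark-submit sets for the driver.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical helper: never overrides a master supplied by spark-submit.
object SessionFactory {

  def build(appName: String): SparkSession = {
    // new SparkConf() reads spark.* JVM system properties by default.
    val conf = new SparkConf().setAppName(appName)

    // Fall back to local mode only if nothing supplied a master (e.g. IDE runs).
    if (!conf.contains("spark.master")) {
      conf.setMaster("local[*]")
    }

    SparkSession.builder().config(conf).getOrCreate()
  }
}
```

Usage would be val spark = SessionFactory.build("FinancialLineItem"). On the cluster the fallback branch should never fire; if it does, the job is not receiving its configuration from spark-submit, and that is the thing to investigate.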

3. Fixing the "empty collection" Error in spark-shell

The java.lang.UnsupportedOperationException: empty collection error happens because your code calls first() on an empty RDD, which is what the filter returns when no header line is found. Zeppelin most likely works because your test data there contains the header, whereas the input on the cluster may be missing it, or the path may be incorrect.

Add safety checks to avoid this:

// For MAIN path
val headerRDD = rdd.filter(_.contains("LineItem.organizationId")).map(line => line.split("\\|\\^\\|"))
if (headerRDD.isEmpty()) {
  throw new IllegalArgumentException("No header line found in MAIN path files!")
}
val header = headerRDD.first()

// Repeat the same check for INCR path's header1
val header1RDD = rdd1.filter(_.contains("LineItem.organizationId")).map(line => line.split("\\|\\^\\|"))
if (header1RDD.isEmpty()) {
  throw new IllegalArgumentException("No header line found in INCR path files!")
}
val header1 = header1RDD.first()
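
Since the two checks above are identical apart from the RDD and the message, they could be folded into one helper. The sketch below is one way to do that; HeaderUtil and extractHeader are hypothetical names. It uses take(1) instead of isEmpty() followed by first(), so each header lookup runs one Spark job instead of two.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper combining the emptiness check and the header lookup.
object HeaderUtil {

  /** Returns the split header line, or fails with a clear message if none exists. */
  def extractHeader(lines: RDD[String], marker: String, label: String): Array[String] =
    lines
      .filter(_.contains(marker))
      .take(1)      // at most one matching line is brought to the driver
      .headOption   // None when the filter matched nothing
      .map(_.split("\\|\\^\\|"))
      .getOrElse(
        throw new IllegalArgumentException(
          s"No header line containing '$marker' found in $label path files!"))
}
```

With that in place, the two lookups become val header = HeaderUtil.extractHeader(rdd, "LineItem.organizationId", "MAIN") and val header1 = HeaderUtil.extractHeader(rdd1, "LineItem.organizationId", "INCR").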

Also, double-check that your S3 paths (s3://path/FinancialLineItem/MAIN and s3://path/FinancialLineItem/INCR) point to the correct files with the expected header line.
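
The path check can also be done from code before any RDD is read. The following is a rough sketch using the Hadoop FileSystem API that ships with Spark; PathCheck and hasFiles are hypothetical names, and it assumes the files sit directly under the given directory and that the cluster's Hadoop configuration can resolve the s3:// scheme (as it normally does on EMR).

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper: true if the path exists and holds at least one file.
object PathCheck {

  def hasFiles(pathStr: String, hadoopConf: Configuration): Boolean = {
    val path = new Path(pathStr)
    // Resolves the file system from the path's scheme (s3://, hdfs://, file://).
    val fs = path.getFileSystem(hadoopConf)
    fs.exists(path) && fs.listStatus(path).exists(_.isFile)
  }
}
```

In your job this could be called as PathCheck.hasFiles("s3://path/FinancialLineItem/MAIN", spark.sparkContext.hadoopConfiguration), failing fast with a clear message when it returns false.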

Final Notes

  • The EMR sample job works because it uses the cluster's native Spark libraries and is submitted with a valid --master / --deploy-mode combination.
  • Always test your code with the same Spark version as EMR (2.2.0 in your case) to avoid version mismatches.

The question in this post comes from Stack Exchange; question author: Sudarshan kumar
