Troubleshooting the "master URL not set" error when running a Spark-Scala JAR on EMR
Let's break down your problem step by step. You're hitting two separate issues: the master URL configuration error, and a hidden edge case in your code that surfaces in spark-shell but not in Zeppelin. Let's tackle them one by one.
1. Why the Master URL Error Keeps Happening
Your core issue boils down to two common pitfalls with Spark on EMR:
Invalid spark-submit Command Combinations
Some of the commands you tried are mutually exclusive and will never work:
- --master local[*] + --deploy-mode cluster: A local master runs the driver and executors in a single JVM on the machine where you call spark-submit, while cluster mode asks the cluster manager to launch the driver on a YARN node. The two settings conflict, and spark-submit rejects the combination.
- --master yarn-client: In Spark 2.x, yarn-client is deprecated; use --master yarn --deploy-mode client instead (though cluster mode is recommended for EMR production jobs).
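The two rules above can be sketched as a small check. This is only an illustration: SubmitOptions and normalize are hypothetical names, not part of Spark, and the real validation happens inside spark-submit itself.

```scala
// Hypothetical helper, NOT a Spark API: it only encodes the two rules described above.
object SubmitOptions {
  // Deprecated Spark 1.x master strings that carry the deploy mode in their name.
  private val legacyMasters = Map("yarn-client" -> "client", "yarn-cluster" -> "cluster")

  /** Returns Right(arguments in the Spark 2.x form) or Left(why the combination is invalid). */
  def normalize(master: String, deployMode: Option[String] = None): Either[String, Seq[String]] = {
    val impliedMode = legacyMasters.get(master)
    val effectiveMaster = if (impliedMode.isDefined) "yarn" else master
    // When no deploy mode is given, fall back to the implied one, then to client.
    val effectiveMode = deployMode.orElse(impliedMode).getOrElse("client")

    if (effectiveMode != "client" && effectiveMode != "cluster")
      Left(s"unknown deploy mode: $effectiveMode")
    else if (impliedMode.exists(_ != effectiveMode))
      Left(s"$master conflicts with --deploy-mode $effectiveMode")
    else if (effectiveMaster.startsWith("local") && effectiveMode == "cluster")
      Left("a local master cannot be combined with --deploy-mode cluster")
    else
      Right(Seq("--master", effectiveMaster, "--deploy-mode", effectiveMode))
  }
}
```

For example, SubmitOptions.normalize("local[*]", Some("cluster")) yields a Left, while SubmitOptions.normalize("yarn-client") yields the arguments in the --master yarn --deploy-mode client form.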
Dependency Conflict from Fat JARs
You're using a jar-with-dependencies.jar, which bundles all of your dependencies, including Spark itself, into the JAR. The EMR cluster already has Spark 2.2.0 installed, so the bundled Spark classes can clash with the cluster's own. That can keep the cluster's spark-defaults.conf settings (such as the yarn master URL) from taking effect, which surfaces as the "master URL not set" error.
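To confirm whether a JAR really bundles Spark, you can list its entries and look for classes under org/apache/spark/. The following is a minimal sketch using only the JDK's java.util.jar.JarFile; JarInspector and bundledSparkClasses are hypothetical names chosen for illustration.

```scala
import java.util.jar.JarFile
import scala.collection.mutable.ListBuffer

// Hypothetical helper for illustration; it relies only on the JDK's JarFile API.
object JarInspector {
  /** Returns the names of all .class entries under org/apache/spark/ in the given JAR. */
  def bundledSparkClasses(jarPath: String): List[String] = {
    val jar = new JarFile(jarPath)
    try {
      val found = ListBuffer.empty[String]
      val entries = jar.entries()
      while (entries.hasMoreElements) {
        val name = entries.nextElement().getName
        if (name.startsWith("org/apache/spark/") && name.endsWith(".class")) found += name
      }
      found.toList
    } finally {
      jar.close() // always release the file handle
    }
  }
}
```

A non-empty result for your jar-with-dependencies.jar means Spark is packaged inside it; after the packaging fix, the list should be empty.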
2. Step-by-Step Fixes
Fix 1: Use the Correct spark-submit Command
Stick to valid combinations for EMR Spark on YARN. For production jobs, use cluster mode:
spark-submit --master yarn --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1.jar
If you need to see logs locally during testing, use client mode:
spark-submit --master yarn --deploy-mode client --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1.jar
Fix 2: Package Your Jar Without Spark Dependencies
EMR provides all Spark core libraries, so you don't need to include them in your Jar. Here's how to adjust your build:
- If using Maven: Mark Spark dependencies as <scope>provided</scope> so they're excluded from the fat JAR.
- If using SBT: Add % Provided to the Spark dependencies:
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.2.0" % Provided,
      "org.apache.spark" %% "spark-sql" % "2.2.0" % Provided
    )
- In Eclipse: When creating your JAR, exclude all org.apache.spark packages from the build.
Fix 3: Refactor Code to Use SparkSession (Spark 2.x Best Practice)
Your code uses the older SparkContext + SQLContext setup. In Spark 2.x, SparkSession is the recommended entry point. As long as you don't hard-code a master in the code, getOrCreate() picks up the master URL that spark-submit supplies from the command line or from EMR's spark-defaults.conf. Replace your initialization code with:
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  println("Entering Spark Mode")
  // No .master(...) call here: the master comes from spark-submit
  val spark = SparkSession.builder()
    .appName("FinancialLineItem")
    .getOrCreate()
  println("SparkSession initialized")

  import spark.implicits._
  val sc = spark.sparkContext
  // Rest of your code (replace sqlContext with spark)
}
This eliminates manual SparkConf setup that might override cluster configurations.
3. Fixing the "empty collection" Error in spark-shell
The java.lang.UnsupportedOperationException: empty collection error happens because your code calls first() on an empty RDD (when no header line is found). Zeppelin works because your test data has the header, but your cluster input might not, or the path is incorrect.
Add safety checks to avoid this:
// For MAIN path
val headerRDD = rdd.filter(_.contains("LineItem.organizationId")).map(line => line.split("\\|\\^\\|"))
if (headerRDD.isEmpty()) {
  throw new IllegalArgumentException("No header line found in MAIN path files!")
}
val header = headerRDD.first()

// Repeat the same check for INCR path's header1
val header1RDD = rdd1.filter(_.contains("LineItem.organizationId")).map(line => line.split("\\|\\^\\|"))
if (header1RDD.isEmpty()) {
  throw new IllegalArgumentException("No header line found in INCR path files!")
}
val header1 = header1RDD.first()
Also, double-check that your S3 paths (s3://path/FinancialLineItem/MAIN and s3://path/FinancialLineItem/INCR) point to the correct files with the expected header line.
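The header check can also be expressed with Option instead of an exception. The sketch below runs on a plain local collection so it can be tried without a cluster; HeaderCheck and findHeader are hypothetical names, and the second column name in the usage note is made up for illustration. On an RDD, rdd.filter(...).take(1).headOption should give the same Option shape.

```scala
// Hypothetical helper for illustration; it works on local lines, not on an RDD.
object HeaderCheck {
  // Same delimiter regex as in the code above: the literal three characters |^|
  private val Delimiter = "\\|\\^\\|"

  /** Returns the split header columns, or None when no line contains the marker. */
  def findHeader(lines: Iterable[String],
                 marker: String = "LineItem.organizationId"): Option[Array[String]] =
    lines.find(_.contains(marker)).map(_.split(Delimiter))
}
```

Calling HeaderCheck.findHeader(Seq("LineItem.organizationId|^|LineItem.lineItemId")) returns the two column names, while input without a header line returns None, which the caller can turn into a clear error message or a fallback.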
Final Notes
- The EMR sample job works because it uses the cluster's native Spark libraries and follows valid submit commands.
- Always test your code with the same Spark version as EMR (2.2.0 in your case) to avoid version mismatches.
The question comes from Stack Exchange; asked by Sudarshan kumar.




