Spark on YARN: in which scenarios does an application exit with exitCode: -104, and why does the job restart?
Hey there, let's tackle your two Spark on YARN questions step by step. I've run into similar headaches with CDH distributions before, so here's what I've learned:
Exit code -104 in Spark on YARN signals that a NodeManager killed your Executor container for exceeding its physical memory limit; in YARN's ContainerExitStatus, -104 is KILLED_EXCEEDED_PMEM. Here's how that relates to the usual suspects:
- Memory limit violations: YARN enables checks for physical (yarn.nodemanager.pmem-check-enabled) and virtual (yarn.nodemanager.vmem-check-enabled) memory by default. If your Executor's process tree uses more physical memory than the allocated container size (heap plus Spark's off-heap memory overhead), the NodeManager kills the container, resulting in this exit code. Exceeding the virtual memory limit is reported separately as -103.
- Node resource exhaustion: If the host node runs out of disk space, the NodeManager may also terminate running containers to protect the node, but those cases normally carry a different exit code (for example -101 for failed disks), so a -104 should point you at memory first.
- Corrupted container state: Rarely, a corrupted container working directory or metadata makes the NodeManager shut a container down. I wouldn't expect -104 from that, so treat it as the last thing to check.
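If it helps when scanning logs, here's a minimal lookup sketch for YARN container exit statuses. The numeric values are the ones defined in YARN's ContainerExitStatus class; the helper name describe_exit_code is just something I made up for illustration, and anything outside the table is reported as unknown because it's probably the process's own exit code rather than a YARN status.

```python
# Container exit statuses as defined in YARN's ContainerExitStatus class.
YARN_CONTAINER_EXIT_STATUS = {
    0: "SUCCESS",
    -100: "ABORTED",
    -101: "DISKS_FAILED",
    -102: "PREEMPTED",
    -103: "KILLED_EXCEEDED_VMEM",
    -104: "KILLED_EXCEEDED_PMEM",
    -105: "KILLED_BY_APPMASTER",
    -106: "KILLED_BY_RESOURCEMANAGER",
    -107: "KILLED_AFTER_APP_COMPLETION",
    -1000: "INVALID",
}


def describe_exit_code(code: int) -> str:
    """Hypothetical helper: map a container exit code to YARN's name for it."""
    return YARN_CONTAINER_EXIT_STATUS.get(code, "UNKNOWN")


if __name__ == "__main__":
    print(-104, describe_exit_code(-104))
```

So a -104 in the AM log is worth reading as "physical memory limit exceeded" before chasing anything else.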
Looking at your log line RECEIVED SIGNAL 15: SIGTERM, this is a termination signal delivered to the ApplicationMaster (AM) process, typically because YARN's ResourceManager (RM) or the NodeManager decided to stop its container. The job restarting from scratch indicates the AM was terminated and a new application attempt was started (the number of attempts is capped by YARN's yarn.resourcemanager.am.max-attempts configuration). Here's why this might be happening, plus fixes tailored to your setup:
Common Root Causes
- AM heartbeat timeout: The AM failed to send heartbeats to the RM within the configured interval (check yarn.am.liveness-monitor.expiry-interval-ms, which defaults to 10 minutes). This can happen if the AM is stuck processing heavy metadata or the cluster is under extreme load.
- Resource preemption: If your YARN cluster uses the Capacity Scheduler with preemption enabled, other queues reclaiming their guaranteed capacity (or higher-priority applications, where intra-queue preemption is configured) can force YARN to kill your containers, the AM's included.
- Mismatched memory configurations: You set EXECUTOR_MEMORY=4G and EXECUTOR_CORES=6, but Spark requires additional overhead memory for off-heap usage (like JVM internals, native libraries). The default spark.yarn.executor.memoryOverhead is the larger of 384MB and 10% of executor memory (about 409MB here), which is likely too low for six concurrent tasks per executor running heavy transformations. If the executor process outgrows its container (heap + overhead), NodeManagers kill executors, and the AM may fail/restart if too many executors die.
- Disk write failures: Your job writes to multiple directories. If any target directory has permission issues, or the node's disk is full, executors might crash, triggering AM instability and termination.
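To make the memory arithmetic concrete, here's a rough sketch of how the container request for one executor comes together. The 10% factor and the 384MB floor are Spark's documented defaults for the overhead; the 1024MB rounding increment is an assumption on my part (it depends on your scheduler settings, so check your cluster), and the function names are made up for illustration.

```python
from typing import Optional


def default_overhead_mb(executor_memory_mb: int,
                        factor: float = 0.10,
                        floor_mb: int = 384) -> int:
    """Default overhead: the larger of factor * heap and a fixed floor."""
    return max(int(executor_memory_mb * factor), floor_mb)


def container_request_mb(executor_memory_mb: int,
                         overhead_mb: Optional[int] = None) -> int:
    """Memory Spark asks YARN for per executor: heap plus overhead."""
    if overhead_mb is None:
        overhead_mb = default_overhead_mb(executor_memory_mb)
    return executor_memory_mb + overhead_mb


def round_up_to_increment(request_mb: int, increment_mb: int = 1024) -> int:
    """YARN rounds requests up to a multiple of its allocation increment.

    The 1024MB default here is an assumed cluster setting, not a given.
    """
    return -(-request_mb // increment_mb) * increment_mb


if __name__ == "__main__":
    heap_mb = 4 * 1024  # EXECUTOR_MEMORY=4G
    print(default_overhead_mb(heap_mb))    # 409
    print(container_request_mb(heap_mb))   # 4505
    print(round_up_to_increment(container_request_mb(heap_mb)))  # 5120
```

With the defaults, only about 409MB of the container is left for everything outside the heap, which is not much for six concurrent tasks.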
Recommended Fixes
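- Boost memory overhead: Add --conf spark.yarn.executor.memoryOverhead=1024 to your spark-submit command (setting it to 1GB; on Spark 2.3 and later the property is named spark.executor.memoryOverhead). This gives critical headroom for off-heap memory usage during your transformations.
- Extend AM wait time: Set --conf spark.yarn.am.waitTime=3600000 (1 hour) only if the AM log shows it giving up while waiting for the SparkContext to initialize. That startup wait in cluster mode is all this setting covers; it does not relax the RM's heartbeat expiry, which is a cluster-side YARN setting.
- Verify queue resource quotas: Ensure your YARN queue has enough resources to accommodate 15 executors (each needing 6 cores + 4GB heap + 1GB overhead, so 90 cores and 75GB of memory in total, plus the AM container). If the queue's quota is lower, the RM cannot grant all the requested containers and the job runs short of executors.
- Increase task parallelism: Your input files are small (7MB, 40MB, 100MB), so the initial partition count is likely low, and individual tasks can become large and memory-heavy once the transformations expand the data. Set --conf spark.default.parallelism=100 for RDD operations, and make sure spark.sql.shuffle.partitions hasn't been lowered below its default of 200 (--conf spark.sql.shuffle.partitions=200), to split work into smaller, more manageable tasks.
- Check executor logs: Don't just rely on AM logs. Dig into individual executor logs (via the YARN UI, or with yarn logs -applicationId followed by your application ID) to find root causes like OutOfMemoryErrors or write failures.
- Disable virtual memory checks (temporarily): If you're confident physical memory is configured correctly, set yarn.nodemanager.vmem-check-enabled=false in yarn-site.xml on the NodeManagers to avoid false positives from virtual memory calculations. Note that this only prevents virtual memory kills (-103); a -104 kill comes from the physical memory check and still needs more overhead or a smaller memory footprint.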
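Pulling the submit-time settings together, here's a small sketch that assembles a spark-submit command line and sanity-checks the total resource ask against your queue. The flags and property names are real spark-submit options; the main class com.example.MyJob and the jar my-job.jar are placeholders I invented, and the numbers are simply the ones from your setup plus the 1GB overhead suggested above.

```python
import shlex
from typing import List, Tuple


def total_resources(num_executors: int, cores: int,
                    heap_gb: int, overhead_gb: int) -> Tuple[int, int]:
    """Total (cores, memory in GB) for all executors, AM container excluded."""
    return num_executors * cores, num_executors * (heap_gb + overhead_gb)


def build_spark_submit(num_executors: int = 15, cores: int = 6,
                       heap_gb: int = 4, overhead_mb: int = 1024) -> List[str]:
    """Assemble the argument list; class and jar names are hypothetical."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--num-executors", str(num_executors),
        "--executor-cores", str(cores),
        "--executor-memory", f"{heap_gb}G",
        "--conf", f"spark.yarn.executor.memoryOverhead={overhead_mb}",
        "--conf", "spark.default.parallelism=100",
        "--conf", "spark.sql.shuffle.partitions=200",
        "--class", "com.example.MyJob",  # placeholder main class
        "my-job.jar",                    # placeholder application jar
    ]


if __name__ == "__main__":
    print(total_resources(15, 6, 4, 1))        # (90, 75)
    print(shlex.join(build_spark_submit()))    # needs Python 3.8+
```

If the (cores, memory) pair is larger than what your queue can give you, fix that first; no amount of overhead tuning helps when the containers are never granted.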
This question originally comes from Stack Exchange; it was asked by Surender Raja.




