Spark on YARN: when does an application exit with exitCode: -104, and why does the job restart?

阿华AIGC实验室

2026-5-21

Hey there, let's tackle your two Spark on YARN questions step by step. I've run into similar headaches with CDH distributions before, so here's what I've learned:

1. When does a Spark on YARN application exit with exitCode: -104?

Exit code -104 in Spark on YARN means that a NodeManager killed a container (usually an Executor) because it exceeded its physical memory limit. In Hadoop's ContainerExitStatus this value is KILLED_EXCEEDED_PMEM. Here is how it relates to the usual suspects:

Memory limit violations: YARN enables checks for physical (yarn.nodemanager.pmem-check-enabled) and virtual (yarn.nodemanager.vmem-check-enabled) memory by default. If the Executor's process tree uses more physical memory than the allocated container size (heap plus Spark's memory overhead), the NodeManager logs "is running beyond physical memory limits" and kills the container (SIGTERM first, then SIGKILL if it does not exit), which produces -104. Exceeding the virtual memory limit is reported as -103 (KILLED_EXCEEDED_VMEM) instead.
Node resource exhaustion: Full or failed local disks and other node-level problems can also take containers down, but they normally surface under different exit statuses (for example -101, DISKS_FAILED), so treat -104 as a memory problem first.
Corrupted container state: A corrupted working directory or container metadata can make a container fail to launch or exit abnormally, but that is rare and is not what -104 indicates.
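To make the codes easier to tell apart when you're scanning logs, here's a small lookup sketch. The numeric values are the ones defined in Hadoop's ContainerExitStatus class; the helper name describe_exit_code is just something I made up for illustration.

```python
# Exit statuses as defined in Hadoop's ContainerExitStatus class.
YARN_CONTAINER_EXIT_STATUS = {
    -100: "ABORTED",
    -101: "DISKS_FAILED",
    -102: "PREEMPTED",
    -103: "KILLED_EXCEEDED_VMEM",
    -104: "KILLED_EXCEEDED_PMEM",
    -105: "KILLED_BY_APPMASTER",
    -106: "KILLED_BY_RESOURCEMANAGER",
    -107: "KILLED_AFTER_APP_COMPLETION",
}


def describe_exit_code(code: int) -> str:
    """Hypothetical helper: map a YARN container exit code to its status name."""
    # Anything outside the table is most likely the process's own exit code.
    return YARN_CONTAINER_EXIT_STATUS.get(code, "UNKNOWN")


print(describe_exit_code(-104))
```

If your executors report -104 but the AM container reports something else (say -102), you are looking at two separate problems.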

2. Troubleshooting SIGTERM and job restarts in your Spark application

Looking at your log line RECEIVED SIGNAL 15: SIGTERM, the process that logged it was asked to terminate. The signal normally comes from the NodeManager hosting the container, acting either on its own checks or on instructions from the ResourceManager (RM). The job restarting from scratch suggests that the ApplicationMaster (AM) attempt failed and YARN launched a new attempt (the number of attempts is capped by yarn.resourcemanager.am.max-attempts and, on the Spark side, spark.yarn.maxAppAttempts). Here's why this might be happening, plus fixes tailored to your setup:

Common Root Causes

AM heartbeat timeout: The AM failed to send heartbeats to the RM within the configured interval (check yarn.am.liveness-monitor.expiry-interval-ms, 10 minutes by default). This can happen if the AM is stuck in long GC pauses or the cluster is under extreme load.
Resource preemption: If your YARN scheduler has preemption enabled, containers belonging to your application can be reclaimed so that other queues get their guaranteed capacity back. Preempted containers are reported with exit status -102 (PREEMPTED), so this is easy to confirm or rule out from the logs.
Mismatched memory configurations: You set EXECUTOR_MEMORY=4G and EXECUTOR_CORES=6, but Spark requests additional overhead memory on top of the heap for off-heap usage (JVM internals, native libraries). The default spark.yarn.executor.memoryOverhead is the larger of 384MB and 10% of executor memory (about 409MB here), which is likely too low for your heavy transformation workload, especially with 6 concurrent tasks sharing one executor. If the executor's actual memory usage exceeds the container size (heap + overhead), the NodeManager kills it, and the application may fail and restart if too many executors die.
Disk write failures: Your job writes to multiple directories. If any target directory has permission issues, or a node's disk is full, executors might crash, and repeated executor failures can bring down the whole application attempt.
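To put numbers on the memory point above, here's a rough sketch of how the executor container size comes together. It assumes the default overhead rule (10% of executor memory with a 384MB floor) and assumes YARN rounds each request up to a multiple of 1024MB (the default yarn.scheduler.minimum-allocation-mb); check both against your cluster. The function name container_request_mb is mine, not a Spark API.

```python
import math


def container_request_mb(executor_memory_mb, overhead_mb=None,
                         overhead_factor=0.10, overhead_min_mb=384,
                         yarn_increment_mb=1024):
    """Hypothetical helper: estimate the container size for one executor.

    Returns (overhead_mb, requested_mb, granted_mb).
    """
    if overhead_mb is None:
        # Assumed default rule: max(10% of executor memory, 384 MB).
        overhead_mb = max(int(executor_memory_mb * overhead_factor), overhead_min_mb)
    requested = executor_memory_mb + overhead_mb
    # Assumption: YARN rounds the request up to a multiple of the increment.
    granted = math.ceil(requested / yarn_increment_mb) * yarn_increment_mb
    return overhead_mb, requested, granted


for label, overhead in (("default overhead", None), ("1 GB overhead", 1024)):
    ovh, req, granted = container_request_mb(4096, overhead_mb=overhead)
    print(f"{label}: overhead={ovh} MB, requested={req} MB, granted={granted} MB")
```

Under those assumptions both settings end up in a 5120MB container, so if executors are still killed with 1GB of overhead, the value probably needs to go higher (or the number of cores per executor lower).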

Recommended Fixes

Boost memory overhead: Add --conf spark.yarn.executor.memoryOverhead=1024 to your spark-submit command (the value is in MB, so this is 1GB). On Spark 2.3 and later the property is named spark.executor.memoryOverhead. This gives more headroom for off-heap memory usage during your transformations.
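Extend AM timeout: --conf spark.yarn.am.waitTime (default 100s) controls how long the AM in cluster mode waits for the SparkContext to be initialized, so raise it only if your driver is slow to start up. It does not change the RM's heartbeat expiry; that is the cluster-side setting yarn.am.liveness-monitor.expiry-interval-ms in yarn-site.xml.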
Verify queue resource quotas: Ensure your YARN queue can accommodate 15 executors at 6 cores and 5GB each (4GB heap + 1GB overhead), i.e. 90 cores and 75GB of memory in total, plus the AM/driver container. If the queue's quota is lower, some executors will simply stay pending and the job will run with fewer executors than requested (or sit in ACCEPTED state if even the AM cannot be scheduled).
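Increase task parallelism: Your input files are small (7MB, 40MB, 100MB), so the initial read probably produces only a handful of partitions, and a few tasks end up doing all the heavy transformation work. Note that spark.sql.shuffle.partitions already defaults to 200, so --conf spark.sql.shuffle.partitions=200 only matters if something in your setup lowers it, and --conf spark.default.parallelism=100 only affects RDD shuffles and parallelize. If the heavy stage runs before any shuffle, an explicit repartition() right after reading the input is what actually splits the work into smaller, more manageable tasks.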
Check executor logs: Don't just rely on AM logs. Dig into individual executor logs (via the YARN UI, or yarn logs -applicationId <application ID> once the application has finished) to find root causes like OutOfMemoryErrors or write failures.
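Disable virtual memory checks (temporarily): If containers are being killed for virtual memory (exit code -103, "is running beyond virtual memory limits" in the NodeManager log) while physical usage is fine, set yarn.nodemanager.vmem-check-enabled=false in yarn-site.xml and restart the NodeManagers. This does not help with -104, which comes from the physical memory check.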
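As a quick sanity check on the quota arithmetic and the flags above, here's a minimal sketch that totals the executor footprint and assembles the resource-related part of the spark-submit command. The numbers come from your settings plus the 1GB overhead suggested above; the function names are made up, and I'm assuming --master yarn without guessing your deploy mode or application artifact.

```python
def executor_footprint(num_executors, cores_per_executor, heap_mb, overhead_mb):
    """Total cores and memory (MB) requested for executors only (AM/driver excluded)."""
    total_cores = num_executors * cores_per_executor
    total_mb = num_executors * (heap_mb + overhead_mb)
    return total_cores, total_mb


def spark_submit_flags(num_executors, cores_per_executor, heap_mb, overhead_mb):
    """Hypothetical helper: build the resource-related spark-submit flags."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--num-executors", str(num_executors),
        "--executor-cores", str(cores_per_executor),
        "--executor-memory", f"{heap_mb}m",
        # On Spark 2.3+ the property is spark.executor.memoryOverhead.
        "--conf", f"spark.yarn.executor.memoryOverhead={overhead_mb}",
        "--conf", "spark.sql.shuffle.partitions=200",
        "--conf", "spark.default.parallelism=100",
        # The application JAR or script and its arguments go after these flags.
    ]


cores, mem_mb = executor_footprint(15, 6, 4096, 1024)
print(f"executors need {cores} cores and {mem_mb / 1024:.0f} GB")
print(" ".join(spark_submit_flags(15, 6, 4096, 1024)))
```

Compare the printed totals with the queue's configured capacity in the scheduler page of the YARN UI before submitting.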