Question: how does backfill differ from catchup in Airflow, and how exactly do you use backfill?
Hey there! Since you already have a handle on catchup in Airflow, let's dig into backfill: what it is, how to use it, and how it differs from catchup, with concrete examples that should make it click.
First, a quick recap of catchup to set the stage (since you know this, just a quick refresh):
When catchup=True is set on a DAG (the default in Airflow 2.x; Airflow 3 changed the default to False), Airflow automatically creates a run for every past schedule interval between the DAG's start_date and now that has not run yet, as soon as you enable (unpause) the DAG. With catchup=False, it skips those past runs and schedules only the most recent interval plus future ones.
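To make the recap concrete, here is a minimal sketch of that behavior for a daily schedule. It is a simplified model in plain Python, not Airflow's scheduler code; the function name runs_created_on_enable is made up for illustration, and it assumes a run for a given date only becomes eligible once that day is over.

```python
from datetime import date, timedelta


def runs_created_on_enable(start_date, today, catchup):
    """Return the logical dates a daily DAG would run when first enabled.

    Simplified model: the run for logical date d covers [d, d + 1 day)
    and only becomes eligible once that interval has ended.
    """
    eligible = []
    current = start_date
    while current + timedelta(days=1) <= today:
        eligible.append(current)
        current += timedelta(days=1)
    if catchup:
        return eligible       # every missed interval gets a run
    return eligible[-1:]      # only the most recent completed interval


start = date(2024, 3, 1)
today = date(2024, 3, 5)
print(runs_created_on_enable(start, today, catchup=True))   # Mar 1 through Mar 4
print(runs_created_on_enable(start, today, catchup=False))  # Mar 4 only
```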
What Exactly is Backfill?
Backfill is a manual, on-demand tool to run specific historical DAG runs for a time range you define. Unlike catchup (which is automatic when you turn on a DAG), backfill gives you granular control over exactly which dates, tasks, or even failed runs you want to process. It's your go-to when:
- You fix a bug in your DAG and need to reprocess past data
- You add a new task to an existing DAG and want to run it for historical dates
- You initially disabled catchup but later realize you need to run some past instances
- You need to rerun only failed tasks from a specific time period
How to Use Backfill (Concrete CLI Examples)
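Backfill is primarily run via the Airflow CLI. The commands below use the Airflow 2.x syntax (airflow dags backfill); Airflow 3 moved backfills to a separate airflow backfill create command with different options, so check the CLI help for your version. Here are common scenarios: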
Example 1: Basic Backfill for a Date Range
Say your DAG is named ecommerce_sales_pipeline, and you need to run all scheduled instances between March 1, 2024 and March 10, 2024. Run this:
airflow dags backfill -s 2024-03-01 -e 2024-03-10 ecommerce_sales_pipeline
This will trigger all missing DAG runs in that range, following your DAG's schedule interval (e.g., daily runs if your DAG is scheduled @daily).
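If you want to sanity-check how many runs a range will produce before kicking it off, a small sketch like this helps. It assumes an @daily schedule and that both the start and end dates are included in the range; daily_logical_dates is a hypothetical helper, not part of Airflow.

```python
from datetime import date, timedelta


def daily_logical_dates(start, end):
    """List daily logical dates from start to end, both inclusive."""
    days = (end - start).days
    return [start + timedelta(days=offset) for offset in range(days + 1)]


dates = daily_logical_dates(date(2024, 3, 1), date(2024, 3, 10))
print(len(dates))           # 10
print(dates[0], dates[-1])  # 2024-03-01 2024-03-10
```

For other schedules (hourly, weekly, cron expressions) the count changes accordingly, so treat this only as a rough check for the daily case.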
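Example 2: Rerun Failed Tasks in a Range
If some task instances in the range failed earlier, a plain backfill does not retry them; it reports them as failed and stops. Add the --rerun-failed-tasks flag so the backfill clears and reruns them:
airflow dags backfill -s 2024-03-01 -e 2024-03-10 --rerun-failed-tasks ecommerce_sales_pipeline
Task instances that already succeeded are left alone and failed ones are retried. Note that runs in the range that do not exist yet are still created, so the flag does not limit the backfill to failures only.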
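One way to picture what the flag changes is the sketch below. It is a deliberately simplified model of the decision per task instance, not Airflow's actual backfill code; the plan_backfill function, the state strings, and the load_sales task ID are all made up for illustration.

```python
def plan_backfill(task_states, rerun_failed_tasks):
    """Classify task instances for a backfill under a simplified model.

    task_states maps a hypothetical (logical_date, task_id) key to
    "success", "failed", or None (no previous attempt).
    """
    to_run, skipped, blocked = [], [], []
    for key, state in task_states.items():
        if state == "success":
            skipped.append(key)   # already done, left alone
        elif state == "failed" and not rerun_failed_tasks:
            blocked.append(key)   # stays failed; the backfill reports it
        else:
            to_run.append(key)    # never ran, or failed with the flag set
    return to_run, skipped, blocked


states = {
    ("2024-03-01", "load_sales"): "success",
    ("2024-03-02", "load_sales"): "failed",
    ("2024-03-03", "load_sales"): None,
}
print(plan_backfill(states, rerun_failed_tasks=True))
print(plan_backfill(states, rerun_failed_tasks=False))
```

In this model the March 3 instance, which never ran, is executed whether or not the flag is set; only the handling of the failed March 2 instance changes.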
Example 3: Backfill a Single Specific Date
If you need to run the DAG for just one specific date (e.g., March 5, 2024), set the start and end date to the same value:
airflow dags backfill -s 2024-03-05 -e 2024-03-05 ecommerce_sales_pipeline
Example 4: Backfill Only Specific Tasks
Suppose you added a new task compute_customer_ltv to your DAG and need to run it for all of March 2024, without rerunning the entire DAG. Use the -t (--task-regex) flag, which takes a regular expression that is matched against task IDs:
airflow dags backfill -s 2024-03-01 -e 2024-03-31 -t compute_customer_ltv ecommerce_sales_pipeline
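One thing to watch: in Airflow 2.x, -t is short for --task-regex, so the value is treated as a pattern rather than an exact task ID. The sketch below shows the difference using Python's re module, assuming search-style matching (the pattern may match anywhere in the ID). The extra task IDs and the select_tasks helper are hypothetical.

```python
import re

# Hypothetical task IDs in the DAG, for illustration only.
task_ids = ["extract_orders", "compute_customer_ltv", "compute_customer_ltv_v2"]


def select_tasks(pattern, ids):
    """Return the ids where the pattern matches anywhere in the id."""
    regex = re.compile(pattern)
    return [task_id for task_id in ids if regex.search(task_id)]


print(select_tasks("compute_customer_ltv", task_ids))
# ['compute_customer_ltv', 'compute_customer_ltv_v2']
print(select_tasks("^compute_customer_ltv$", task_ids))
# ['compute_customer_ltv']
```

If your DAG has tasks with similar names, anchoring the pattern with ^ and $ is a cheap way to avoid picking up more than you intended.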
By default the backfill also runs the upstream tasks of whatever -t matches. If you know the upstream data is already there, add --ignore-dependencies (which only works together with -t) to run just the matched task:
airflow dags backfill -s 2024-03-01 -e 2024-03-31 -t compute_customer_ltv --ignore-dependencies ecommerce_sales_pipeline
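If you end up scripting these backfills, a small wrapper that assembles the arguments can save typos. This is only a sketch: build_backfill_command is a hypothetical helper, it uses just the Airflow 2.x flags shown in the examples above, and it only builds the command rather than running it.

```python
import shlex


def build_backfill_command(dag_id, start, end, task_regex=None,
                           rerun_failed=False, ignore_dependencies=False):
    """Assemble an Airflow 2.x `airflow dags backfill` argument list."""
    if ignore_dependencies and task_regex is None:
        # --ignore-dependencies only makes sense together with -t
        raise ValueError("--ignore-dependencies needs a task regex")
    args = ["airflow", "dags", "backfill", "-s", start, "-e", end]
    if task_regex is not None:
        args += ["-t", task_regex]
    if rerun_failed:
        args.append("--rerun-failed-tasks")
    if ignore_dependencies:
        args.append("--ignore-dependencies")
    args.append(dag_id)
    return args


cmd = build_backfill_command(
    "ecommerce_sales_pipeline", "2024-03-01", "2024-03-31",
    task_regex="^compute_customer_ltv$", ignore_dependencies=True,
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

Passing the list to subprocess.run(cmd, check=True) would execute it without going through a shell, which sidesteps quoting issues with regex characters such as ^ and $.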
Key Differences at a Glance
To wrap up, here's a quick comparison of backfill vs catchup:
- Trigger: Catchup is automatic when enabling a DAG (if catchup=True); Backfill is manual via CLI/API.
- Control: Catchup runs all missing past instances; Backfill lets you pick exact date ranges, specific tasks, and whether to rerun failed tasks.
- Use Case: Catchup is for initial DAG setup to catch up on all missed runs; Backfill is for targeted reprocessing or running specific historical ranges later.
The question originally comes from Stack Exchange; it was asked by Saurabh Sharma.




