
Question: the difference between backfill and catchup in Airflow, and how to use backfill

Hey there! Since you already have a handle on catchup in Airflow, let's dive deep into backfill — what it is, how to use it, and how it differs from catchup with real-world examples that should make it click.

Airflow: Backfill vs Catchup

First, a quick recap of catchup to set the stage (you know this already, so it's just a refresher):
When a DAG has catchup=True (the default in Airflow 2.x; Airflow 3 changed the default to False), the scheduler automatically creates a run for every past schedule interval since the DAG's start_date that has no run yet, as soon as the DAG is enabled. With catchup=False, it skips those past intervals and only executes the latest scheduled instance plus future ones.
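To make the recap concrete, here is a small standard-library sketch of which runs a daily DAG would get when you enable it. It is a simplified model, not Airflow's scheduler code: the function name daily_runs_on_unpause is made up for illustration, and it assumes a @daily schedule with no runs recorded yet.

```python
from datetime import date, timedelta


def daily_runs_on_unpause(start_date: date, today: date, catchup: bool) -> list[date]:
    """Logical dates a @daily DAG would be scheduled for when enabled on `today`.

    Simplified model (an assumption, not Airflow internals): the run for
    logical date d covers [d, d + 1 day) and is due once that interval ended.
    """
    due = []
    d = start_date
    while d + timedelta(days=1) <= today:
        due.append(d)
        d += timedelta(days=1)
    # catchup=True: every past interval; catchup=False: only the latest one.
    return due if catchup else due[-1:]


start, today = date(2024, 3, 1), date(2024, 3, 5)
print(daily_runs_on_unpause(start, today, catchup=True))   # March 1 through March 4
print(daily_runs_on_unpause(start, today, catchup=False))  # March 4 only
```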

What Exactly is Backfill?

Backfill is a manual, on-demand tool to run specific historical DAG runs for a time range you define. Unlike catchup (which is automatic when you turn on a DAG), backfill gives you granular control over exactly which dates, tasks, or even failed runs you want to process. It's your go-to when:

  • You fix a bug in your DAG and need to reprocess past data
  • You add a new task to an existing DAG and want to run it for historical dates
  • You initially disabled catchup but later realize you need to run some past instances
  • You need to rerun only failed tasks from a specific time period

How to Use Backfill (Concrete CLI Examples)

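Backfill is primarily run via the Airflow CLI. The commands below use the Airflow 2.x syntax (airflow dags backfill); Airflow 3 moved backfills to a separate airflow backfill command group with different options, so check the CLI help for your version. Here are common scenarios with commands you can copy-paste: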

Example 1: Basic Backfill for a Date Range

Say your DAG is named ecommerce_sales_pipeline, and you need to run all scheduled instances between March 1, 2024 and March 10, 2024. Run this:

airflow dags backfill -s 2024-03-01 -e 2024-03-10 ecommerce_sales_pipeline

This will trigger all missing DAG runs in that range, following your DAG's schedule interval (e.g., daily runs if your DAG is scheduled @daily).
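If you want to sanity-check how many runs a range covers before kicking it off, a quick standard-library sketch like this can help. It assumes a @daily schedule and treats both the start and end date as inclusive; daily_logical_dates is a hypothetical helper, not part of Airflow.

```python
from datetime import date, timedelta


def daily_logical_dates(start: date, end: date) -> list[date]:
    """Logical dates of a @daily DAG from start to end, both inclusive."""
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]


dates = daily_logical_dates(date(2024, 3, 1), date(2024, 3, 10))
print(len(dates), dates[0], dates[-1])  # 10 2024-03-01 2024-03-10
```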

Example 2: Rerun Failed Tasks in a Date Range

If some task instances in the range are already in a failed state, a plain backfill reports an error instead of retrying them. Add the --rerun-failed-tasks flag to have the backfill rerun those failed tasks automatically:

airflow dags backfill -s 2024-03-01 -e 2024-03-10 --rerun-failed-tasks ecommerce_sales_pipeline

Task instances that already succeeded are left alone; only the failed ones, plus any runs still missing in the range, are executed.
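Here is a rough mental model of what the flag changes, written as plain Python. It is a simplified sketch and not Airflow's backfill code: tasks_to_run and the state strings are assumptions for illustration, and the real backfill surfaces the "already failed" case differently than the immediate exception used here.

```python
from typing import Optional


def tasks_to_run(states: dict[str, Optional[str]], rerun_failed: bool) -> list[str]:
    """Pick the task instances a backfill would execute in this simplified model.

    states maps a hypothetical task-instance key to "success", "failed",
    or None (no attempt recorded yet).
    """
    to_run = []
    for key, state in states.items():
        if state is None:
            to_run.append(key)  # never ran: always executed
        elif state == "failed":
            if not rerun_failed:
                raise RuntimeError(f"{key} already failed; pass rerun_failed=True")
            to_run.append(key)  # failed: retried only when the flag is set
        # "success": left untouched
    return to_run


states = {"2024-03-01": "success", "2024-03-02": "failed", "2024-03-03": None}
print(tasks_to_run(states, rerun_failed=True))  # ['2024-03-02', '2024-03-03']
```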

Example 3: Backfill a Single Specific Date

If you need to run the DAG for just one specific date (e.g., March 5, 2024), set the start and end date to the same value:

airflow dags backfill -s 2024-03-05 -e 2024-03-05 ecommerce_sales_pipeline

Example 4: Backfill Only Specific Tasks

Suppose you added a new task compute_customer_ltv to your DAG and need to run it for all of March 2024, without rerunning the entire DAG. Use the -t (--task-regex) flag, which takes a regular expression that is matched against task IDs:

airflow dags backfill -s 2024-03-01 -e 2024-03-31 -t compute_customer_ltv ecommerce_sales_pipeline
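Because -t takes a regular expression rather than an exact task ID, it's worth checking what a pattern would pick up. The sketch below assumes the pattern is applied as an unanchored search, which is my understanding of Airflow 2.x; the task list and the matching_tasks helper are made up for illustration.

```python
import re

# Hypothetical task IDs; only compute_customer_ltv comes from the example above.
task_ids = ["extract_orders", "compute_customer_ltv", "compute_customer_ltv_v2"]


def matching_tasks(pattern: str, ids: list[str]) -> list[str]:
    """Task IDs in which the pattern is found anywhere (unanchored search)."""
    return [t for t in ids if re.search(pattern, t)]


# A bare task ID also matches longer IDs that contain it.
print(matching_tasks("compute_customer_ltv", task_ids))
# Anchoring with ^ and $ restricts the match to exactly one task ID.
print(matching_tasks("^compute_customer_ltv$", task_ids))
```

If similarly named tasks exist in your DAG, anchoring the pattern is the safer choice.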

By default, the backfill also includes the upstream tasks of every task that matches. If you know the upstream data is already there, add --ignore-dependencies to run only the matched task on its own:

airflow dags backfill -s 2024-03-01 -e 2024-03-31 -t compute_customer_ltv --ignore-dependencies ecommerce_sales_pipeline
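If you end up scripting these backfills, a thin wrapper that assembles the command can save some typing. This is a minimal sketch using only the flags shown above; backfill_command is a hypothetical helper, and the check that --ignore-dependencies needs a task regex reflects my reading of the Airflow 2.x CLI help.

```python
import shlex
from typing import Optional


def backfill_command(
    dag_id: str,
    start: str,
    end: str,
    task_regex: Optional[str] = None,
    rerun_failed: bool = False,
    ignore_dependencies: bool = False,
) -> list[str]:
    """Build an `airflow dags backfill` argument list (Airflow 2.x syntax)."""
    cmd = ["airflow", "dags", "backfill", "-s", start, "-e", end]
    if task_regex:
        cmd += ["-t", task_regex]
    if rerun_failed:
        cmd.append("--rerun-failed-tasks")
    if ignore_dependencies:
        if not task_regex:
            raise ValueError("ignore_dependencies only makes sense with task_regex")
        cmd.append("--ignore-dependencies")
    cmd.append(dag_id)
    return cmd


cmd = backfill_command(
    "ecommerce_sales_pipeline",
    "2024-03-01",
    "2024-03-31",
    task_regex="compute_customer_ltv",
    ignore_dependencies=True,
)
print(shlex.join(cmd))
```

On a machine with Airflow 2.x installed, the list can be handed to subprocess.run(cmd, check=True) to start the backfill.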

Key Differences at a Glance

To wrap up, here's a quick comparison of backfill vs catchup:

  • Trigger: Catchup is automatic when you enable a DAG that has catchup=True; Backfill is manual, started from the CLI as in the examples above.
  • Control: Catchup runs all missing past instances; Backfill lets you pick exact date ranges, specific tasks, and whether to rerun only failures.
  • Use Case: Catchup is for initial DAG setup to catch up on all missed runs; Backfill is for targeted reprocessing or running specific historical ranges later.

This question originally comes from Stack Exchange; it was asked by Saurabh Sharma.
