Celery+Redis架构下Acks_late机制及节点故障后任务重入队问题

阿华AIGC实验室

2026-5-15

How Tasks Get Re-Queued When a Celery Worker Crashes (Redis Broker/Backend)

Great question! This ties into some key details of how Celery and Redis handle task lifecycle, especially when acks_late is enabled. Let's break it down clearly:

First, a Quick Recap of Task Retrieval with Redis

When a Celery worker uses Redis as the broker, it doesn’t just "pop" a task from the main queue (like celery) and call it a day. Instead, it relies on Redis’s RPOPLPUSH command to:

Remove the task from the main queue
Immediately push it to a special "unacked" queue (typically named celery.unacked)

This is a critical safety step—it ensures the task isn’t lost if the worker crashes mid-execution. The RedisWorkerController you mentioned creates the task request right after this transfer is completed.

The Role of `acks_late` in This Flow

As you observed, acks_late is handled during task_trace execution. Here’s how it alters the acknowledgment behavior:

Default behavior (acks_late=False): The worker sends an acknowledgment to the broker the moment it retrieves the task (right after the RPOPLPUSH step). At this point, the task is removed from the celery.unacked queue. If the worker crashes now, the task is already marked as acknowledged and won’t be re-queued.
With acks_late=True: The acknowledgment is delayed until after the task finishes executing (whether it succeeds or fails). Until that ack is sent, the task stays in the celery.unacked queue.

What Happens When the Worker Crashes?

Redis broker has a built-in "reaper" mechanism that monitors the celery.unacked queue to recover stuck tasks. Here’s the step-by-step process:

Every healthy worker sends regular heartbeats to the broker to signal it’s alive and processing tasks.
If a worker crashes, it stops sending these heartbeats.
The broker checks tasks in celery.unacked against the visibility_timeout setting (default is 3600 seconds, configurable via broker_transport_options).
Any task in celery.unacked that hasn’t been acknowledged and whose associated worker is unresponsive (or has exceeded the timeout) is moved back to the main queue (like celery).
The task is now available for any healthy worker to pick up and execute.

Key Notes

The visibility_timeout acts as a safety net: even if a worker is slow but not crashed, tasks will be re-queued after this period to prevent them from being stuck indefinitely.
Using acks_late=True gives you better fault tolerance for long-running tasks, since a crash mid-execution won’t result in lost work.

内容的提问来源于stack exchange，提问作者vin