Celery+Redis架构下Acks_late机制及节点故障后任务重入队问题
Great question! This ties into some key details of how Celery and Redis handle task lifecycle, especially when acks_late is enabled. Let's break it down clearly:
First, a Quick Recap of Task Retrieval with Redis
When a Celery worker uses Redis as the broker, it doesn’t just "pop" a task from the main queue (like celery) and call it a day. Instead, it relies on Redis’s RPOPLPUSH command to:
- Remove the task from the main queue
- Immediately push it to a special "unacked" queue (typically named
celery.unacked)
This is a critical safety step—it ensures the task isn’t lost if the worker crashes mid-execution. The RedisWorkerController you mentioned creates the task request right after this transfer is completed.
The Role of acks_late in This Flow
As you observed, acks_late is handled during task_trace execution. Here’s how it alters the acknowledgment behavior:
- Default behavior (
acks_late=False): The worker sends an acknowledgment to the broker the moment it retrieves the task (right after theRPOPLPUSHstep). At this point, the task is removed from thecelery.unackedqueue. If the worker crashes now, the task is already marked as acknowledged and won’t be re-queued. - With
acks_late=True: The acknowledgment is delayed until after the task finishes executing (whether it succeeds or fails). Until that ack is sent, the task stays in thecelery.unackedqueue.
What Happens When the Worker Crashes?
Redis broker has a built-in "reaper" mechanism that monitors the celery.unacked queue to recover stuck tasks. Here’s the step-by-step process:
- Every healthy worker sends regular heartbeats to the broker to signal it’s alive and processing tasks.
- If a worker crashes, it stops sending these heartbeats.
- The broker checks tasks in
celery.unackedagainst thevisibility_timeoutsetting (default is 3600 seconds, configurable viabroker_transport_options). - Any task in
celery.unackedthat hasn’t been acknowledged and whose associated worker is unresponsive (or has exceeded the timeout) is moved back to the main queue (likecelery). - The task is now available for any healthy worker to pick up and execute.
Key Notes
- The
visibility_timeoutacts as a safety net: even if a worker is slow but not crashed, tasks will be re-queued after this period to prevent them from being stuck indefinitely. - Using
acks_late=Truegives you better fault tolerance for long-running tasks, since a crash mid-execution won’t result in lost work.
内容的提问来源于stack exchange,提问作者vin




