Apache Flink FsStateBackend: State Recovery After a TaskManager Failure, and Configuration Questions
Let's walk through each of your questions based on your setup (2 JobManagers with ZK HA, 3 TaskManagers, FsStateBackend using local file:/data storage):
1. How does job state recover if a TaskManager crashes and its local checkpoint data is inaccessible?
In this scenario, you won't be able to recover from that specific checkpoint. Flink requires full access to all checkpoint shards (each generated by a TaskManager) to restore the job's state. When a TaskManager crashes, its local checkpoint shard is lost, making the entire checkpoint incomplete.
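Flink will automatically fall back to the most recent complete checkpoint, provided one exists and all of its files are still readable. With node-local storage that is rarely the case: every earlier checkpoint also has a shard on the lost TaskManager's disk, and by default only the latest completed checkpoint is retained. If no restorable checkpoint is available, the job has to restart from its initial state and reprocess all data.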
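The "all shards or nothing" rule can be sketched with a small toy model. This is not Flink code: the `Checkpoint` class, the `latest_restorable` function, and the host names are made up for illustration. It only shows why losing one host's disk invalidates every checkpoint that kept a file there.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Checkpoint:
    """Toy stand-in for a completed checkpoint (not a Flink class)."""
    checkpoint_id: int
    # Maps each subtask index to the host whose storage holds its state file.
    shard_hosts: Dict[int, str]


def latest_restorable(checkpoints: List[Checkpoint],
                      reachable_hosts: Set[str]) -> Optional[Checkpoint]:
    """Return the newest checkpoint whose every state file is still readable."""
    for cp in sorted(checkpoints, key=lambda c: c.checkpoint_id, reverse=True):
        if all(host in reachable_hosts for host in cp.shard_hosts.values()):
            return cp
    return None  # nothing usable: the job restarts with empty state


# Local disks: every checkpoint has one file on each TaskManager's host.
local = [Checkpoint(i, {0: "tm-1", 1: "tm-2", 2: "tm-3"}) for i in (455, 456)]
# Shared storage: all files live on one store that every host can read.
shared = [Checkpoint(i, {0: "dfs", 1: "dfs", 2: "dfs"}) for i in (455, 456)]

after_crash = {"tm-1", "tm-3"}  # tm-2's disk is gone
print(latest_restorable(local, after_crash))             # None
print(latest_restorable(shared, after_crash | {"dfs"}))  # checkpoint 456
```

In the local-disk case, falling back to checkpoint 455 does not help, because it is missing the same host's file as checkpoint 456.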
2. Does the checkpoint process send state information to the JobManager?
No, not the bulk of it. With FsStateBackend, each TaskManager writes its state data directly to the configured storage path (your local file:/data directory) without sending the actual state data to the JobManager. The one exception is very small state: chunks below a configurable size threshold are embedded in the acknowledgement message itself instead of being written to a separate file. Otherwise the JobManager only acts as a coordinator: it triggers checkpoints, tracks which TaskManagers have finished their checkpoint tasks, and collects metadata about the checkpoints (more on that next).
3. What specific metadata do TaskManagers send to the JobManager during checkpointing?
The metadata is all the information the JobManager needs to locate and validate checkpoint shards later. It includes:
- The storage path of the checkpoint shard generated by the TaskManager (e.g., file:/data/tm-2/checkpoint-456/keyed-state)
- The size of the shard (to help with resource planning during recovery)
- The subtask ID the shard corresponds to, so Flink knows which operator task should read it during restore
- A status marker indicating whether the checkpoint shard was successfully written
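- Where needed, byte offsets and key-group ranges within the file, so each subtask can locate its slice of the state on restore. As far as I know, Flink's state handles do not carry content hashes; data integrity is left to the underlying file system.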
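To make the coordinator role concrete, here is a minimal sketch of how acknowledgements could be collected. It is a toy model, not Flink's implementation: the `Ack` and `PendingCheckpoint` classes, their fields, and the example paths are hypothetical. It only mirrors the idea that a checkpoint counts as complete once every expected subtask has reported where it wrote its state.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass(frozen=True)
class Ack:
    """Hypothetical shape of one subtask's acknowledgement (not Flink's class)."""
    checkpoint_id: int
    subtask_index: int
    state_path: str  # where the subtask wrote its state file
    state_size: int  # bytes written


@dataclass
class PendingCheckpoint:
    checkpoint_id: int
    expected_subtasks: Set[int]
    acks: Dict[int, Ack] = field(default_factory=dict)

    def acknowledge(self, ack: Ack) -> bool:
        """Record one ack; return True once every expected subtask has reported."""
        if ack.checkpoint_id != self.checkpoint_id:
            raise ValueError("ack belongs to a different checkpoint")
        if ack.subtask_index not in self.expected_subtasks:
            raise ValueError("ack from an unknown subtask")
        self.acks[ack.subtask_index] = ack
        return self.is_complete()

    def is_complete(self) -> bool:
        return set(self.acks) == self.expected_subtasks


pending = PendingCheckpoint(checkpoint_id=456, expected_subtasks={0, 1, 2})
results = [
    # Paths below are illustrative only.
    pending.acknowledge(Ack(456, i, f"file:/data/chk-456/subtask-{i}", 1024 * (i + 1)))
    for i in range(3)
]
print(results)  # [False, False, True]
total_bytes = sum(a.state_size for a in pending.acks.values())
print(total_bytes)  # 6144
```

Note that the coordinator only ever holds paths and sizes. If a path points at a disk it cannot reach later, the metadata is intact but the state behind it is not.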
4. Is a distributed file system (like NFS, S3) mandatory?
It's not technically mandatory, but it's strongly recommended for production environments. Local storage works fine for development or testing where you don't need robust fault tolerance. However, for production jobs that require high availability and the ability to recover from failures, distributed storage is essential.
Distributed storage ensures all TaskManagers can access all checkpoint shards, even if one TaskManager goes down. Without it, you risk losing checkpoint data and being unable to recover your job.
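As a rough pre-flight check, you could inspect the scheme of the configured checkpoint directory before deploying. The sketch below is a heuristic under stated assumptions: the function name `may_be_node_local` and the example URIs (host name, bucket name) are invented for illustration, and it looks only at the URI scheme.

```python
from urllib.parse import urlparse

# Schemes that resolve to whatever disk the current process runs on.
NODE_LOCAL_SCHEMES = {"", "file"}


def may_be_node_local(checkpoint_dir: str) -> bool:
    """Heuristic: True if the URI scheme points at the local file system.

    A file: path can still be shared (for example an NFS mount present at the
    same path on every node), so True means "verify this", not "this is wrong".
    """
    return urlparse(checkpoint_dir).scheme.lower() in NODE_LOCAL_SCHEMES


for uri in ("file:/data",
            "hdfs://namenode:8020/flink/checkpoints",
            "s3://my-bucket/checkpoints"):
    print(uri, may_be_node_local(uri))
```

For your setup, file:/data would be flagged. Whether that is a problem depends on whether /data is a shared mount or a separate disk on each machine.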
5. What are the impacts of using local storage for checkpoint storage?
Using local storage for checkpoints comes with several critical limitations:
- Poor fault tolerance: As you noted, if a TaskManager crashes, its checkpoint shard is lost, rendering the entire checkpoint useless. Your job will either have to fall back to an older checkpoint or restart from scratch.
- No support for job migration: You can't easily move your job to a different cluster or re-schedule TaskManagers to new nodes, since the new nodes won't have access to the old local checkpoint data.
- Only suitable for non-critical workloads: It's fine for testing or short-lived jobs where reprocessing data isn't a big issue, but it's not viable for production jobs that need to maintain state consistency and minimize downtime.
- Disparate data management: Checkpoint data is scattered across individual TaskManager disks, making it hard to monitor, back up, or manage centrally.
The question originally comes from Stack Exchange; it was asked by Raghavendar.




