Checkpointing is a technique that saves the current state of a running process at specific points. If something fails, the process can resume from the last checkpoint instead of starting over. For businesses, this means long-running jobs that fail at hour 3 can pick up at hour 3, not hour 0. Without checkpointing, any failure means losing all progress.
Your 3-hour data migration process fails at hour 2:47.
You have no choice but to start from scratch. Another 3 hours.
The same row that caused the failure? It fails again at 2:47.
Long-running processes without save points are time bombs: every failure costs you hours of finished work.
ORCHESTRATION LAYER - Makes long-running processes recoverable instead of fragile.
Layer 4: Orchestration & Control / Category 4.1: Process Control
Save your place so you can pick up where you left off
Checkpointing saves the current state of a running process at specific points. If something fails, the process can resume from the last checkpoint instead of starting over. A 3-hour job that fails at 2:47 picks up at 2:47, not 0:00.
The mechanism is straightforward: before processing each batch or completing each step, the system writes its current position and any accumulated results to persistent storage. On restart, it reads that state and continues forward.
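A minimal sketch of that loop in Python, assuming a JSON file as the persistent store; `fetch_batch` and `process` are hypothetical stand-ins for your data source and per-record work, and the write-then-rename keeps a crash mid-write from corrupting the checkpoint.

```python
import json
import os

CHECKPOINT_FILE = "migration.checkpoint"  # any durable location works

def load_checkpoint():
    """Return the saved position, or a fresh state if none exists."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return {"last_id": 0, "migrated": 0}

def save_checkpoint(state):
    """Write to a temp file, then rename, so a partial write can't corrupt the checkpoint."""
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)

def run_migration(fetch_batch, process):
    state = load_checkpoint()  # on restart this picks up the old position
    while True:
        batch = fetch_batch(after_id=state["last_id"], limit=100)
        if not batch:
            break
        for record in batch:
            process(record)
            state["last_id"] = record["id"]
            state["migrated"] += 1
        save_checkpoint(state)  # persist only after the batch succeeded
```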
Checkpointing converts "all-or-nothing" operations into resumable work. The longer a process runs, the more value checkpointing provides. Without it, failure at 99% means losing 99% of the work.
Checkpointing solves a universal problem: how do you protect hours of work from being lost to a single failure? The same pattern appears anywhere long-running operations need resilience.
Save state at regular intervals. Record what has been completed. On failure, read the saved state. Resume from where you stopped, not from the beginning.
Toggle checkpointing OFF first to see what happens without save points. Then turn it back on, start the migration, watch it fail at record 14, and click Resume.
Three approaches to saving and resuming work
Track where you are in a list
Record the ID or offset of the last successfully processed item. On resume, query for items after that position. Simple and effective for ordered datasets.
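A sketch of position tracking, assuming the dataset has an ordered `id` column; `POSITION_FILE`, the `records` table, and `handle_row` are illustrative names, and the `WHERE id > ?` query is what makes resuming cheap.

```python
import sqlite3

POSITION_FILE = "last_id.txt"  # a database row works just as well

def read_position() -> int:
    try:
        with open(POSITION_FILE) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return 0  # first run: start from the beginning

def write_position(last_id: int):
    with open(POSITION_FILE, "w") as f:
        f.write(str(last_id))

def resume(conn: sqlite3.Connection, handle_row):
    """Process only the rows after the last recorded position."""
    last_id = read_position()
    rows = conn.execute(
        "SELECT id, payload FROM records WHERE id > ? ORDER BY id",
        (last_id,),
    )
    for row_id, payload in rows:
        handle_row(payload)
        write_position(row_id)  # advance the marker only after success
```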
Capture everything needed to continue
Serialize the entire working state: processed items, accumulated results, configuration, counters. On resume, deserialize and continue exactly where you left off.
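A sketch of full-state snapshots using Python's `pickle`; the `JobState` fields are illustrative. The point is that one object holds everything the job needs, so saving and restoring is a single call each way.

```python
import os
import pickle
from dataclasses import dataclass, field

@dataclass
class JobState:
    """Everything needed to continue: position, results, config, counters."""
    last_id: int = 0
    totals_by_region: dict = field(default_factory=dict)
    errors: int = 0
    batch_size: int = 100

STATE_FILE = "job_state.pkl"  # illustrative path

def save_state(state: JobState):
    with open(STATE_FILE + ".tmp", "wb") as f:
        pickle.dump(state, f)
    os.replace(STATE_FILE + ".tmp", STATE_FILE)  # atomic swap

def load_state() -> JobState:
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, "rb") as f:
            return pickle.load(f)  # resume exactly where we left off
    return JobState()              # first run: start fresh
```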
Mark items as done
Maintain a set of completed item IDs. On each iteration, check if already done and skip. Idempotent by design. Works even if items are processed out of order.
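A sketch of the completed-set approach with an append-only log of finished IDs; `DONE_LOG` and the item shape are assumptions. Because the rule is "skip if already done", rerunning the whole job is safe.

```python
import os

DONE_LOG = "completed_ids.log"  # one finished item ID per line

def load_completed() -> set:
    if not os.path.exists(DONE_LOG):
        return set()
    with open(DONE_LOG) as f:
        return {line.strip() for line in f if line.strip()}

def mark_completed(item_id: str):
    with open(DONE_LOG, "a") as f:
        f.write(item_id + "\n")
        f.flush()
        os.fsync(f.fileno())  # make sure the marker survives a crash

def run(items, process):
    done = load_completed()
    for item in items:          # order does not matter
        if item["id"] in done:
            continue            # already processed: skip, don't redo
        process(item)
        mark_completed(item["id"])
```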
"Migrating 50,000 customer records to a new CRM"
The migration job processes thousands of records over several hours. At record 35,000, the destination API goes down briefly. Checkpointing saves the last successful position, so when the job restarts, it picks up at 35,001 instead of starting over from 1.
You update the database, then save the checkpoint. The checkpoint write fails. On restart, the system thinks it needs to redo the update. Now you have duplicate records or double-counted transactions.
Instead: Use two-phase checkpointing: mark as "in progress" before the action, mark as "complete" after. On restart, treat in-progress items as suspect and check whether the action actually happened before redoing it.
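A sketch of that two-phase pattern using a small SQLite status table; `apply_change`, `already_applied`, and `redo` are hypothetical hooks for the real side effect and for asking the destination what actually happened.

```python
import sqlite3

def init(conn: sqlite3.Connection):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoint "
        "(item_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.commit()

def process_item(conn, item, apply_change):
    # Phase 1: record intent before touching the destination.
    conn.execute(
        "INSERT OR REPLACE INTO checkpoint VALUES (?, 'in_progress')", (item["id"],)
    )
    conn.commit()

    apply_change(item)  # the real side effect (API call, destination write, ...)

    # Phase 2: record completion only after the side effect succeeded.
    conn.execute(
        "UPDATE checkpoint SET status = 'complete' WHERE item_id = ?", (item["id"],)
    )
    conn.commit()

def recover(conn, already_applied, redo):
    """Items stuck 'in_progress' are suspect: verify before redoing them."""
    rows = conn.execute(
        "SELECT item_id FROM checkpoint WHERE status = 'in_progress'"
    ).fetchall()
    for (item_id,) in rows:
        if not already_applied(item_id):  # ask the destination what really happened
            redo(item_id)
        conn.execute(
            "UPDATE checkpoint SET status = 'complete' WHERE item_id = ?", (item_id,)
        )
        conn.commit()
```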
You checkpoint once per hour to minimize overhead. The process fails at minute 59. You lose 59 minutes of work. The optimization cost more than it saved.
Instead: Checkpoint based on work done, not time elapsed. Every 100 items or every step, not every hour.
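A sketch of work-based checkpoint frequency; `CHECKPOINT_EVERY` and the item shape are illustrative. The worst case after a crash is redoing at most one interval of items, no matter how long the job has been running.

```python
CHECKPOINT_EVERY = 100  # measured in items processed, not minutes elapsed

def run(items, process, save_checkpoint):
    last_id = None
    for count, item in enumerate(items, start=1):
        process(item)
        last_id = item["id"]
        if count % CHECKPOINT_EVERY == 0:
            save_checkpoint(last_id)  # worst case: redo at most 99 items
    if last_id is not None:
        save_checkpoint(last_id)      # cover the final partial batch
```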
Your checkpoints are fast because they are in memory. The server restarts. All checkpoint data is gone. The process starts over from the beginning.
Instead: Checkpoints must survive restarts. Use persistent storage: database, file system, or distributed cache with persistence enabled.
You have long-running jobs but no recovery mechanism.
Add a simple position tracker: after each batch, write the last processed ID to a database table or file. On restart, read that ID and query for records after it.
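One possible shape for that tracker, using a one-row-per-job SQLite table; the table name, column names, and job key are illustrative.

```python
import sqlite3

def init(conn: sqlite3.Connection):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_position (job TEXT PRIMARY KEY, last_id INTEGER)"
    )
    conn.commit()

def get_last_id(conn: sqlite3.Connection, job: str) -> int:
    row = conn.execute(
        "SELECT last_id FROM job_position WHERE job = ?", (job,)
    ).fetchone()
    return row[0] if row else 0

def set_last_id(conn: sqlite3.Connection, job: str, last_id: int):
    conn.execute(
        "INSERT INTO job_position (job, last_id) VALUES (?, ?) "
        "ON CONFLICT(job) DO UPDATE SET last_id = excluded.last_id",
        (job, last_id),
    )
    conn.commit()
```

On restart, `get_last_id` supplies the starting point for a "records after this ID" query against the source data, and `set_last_id` is called after each successful batch.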
You have some checkpointing but jobs still lose work on failure.
Audit your checkpoint timing: are you saving before or after the action? Move checkpoints to happen after successful completion, and add an "in progress" marker before starting each item.
Checkpointing works but you want better visibility and reliability.
Add checkpoint metadata: timestamp, items processed, error counts, estimated time remaining. Build a dashboard that shows active jobs and their checkpoint status in real time.
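One way to attach that metadata is to widen the checkpoint record itself rather than build a separate tracking system; the field names below are illustrative, and the ETA is simple rate arithmetic over what has been processed so far.

```python
import time

def build_checkpoint(last_id, processed, total, errors, started_at):
    """A checkpoint record with enough metadata to drive a status dashboard."""
    elapsed = time.time() - started_at
    rate = processed / elapsed if elapsed > 0 else 0.0
    remaining = (total - processed) / rate if rate > 0 else None
    return {
        "last_id": last_id,
        "items_processed": processed,
        "items_total": total,
        "error_count": errors,
        "updated_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "eta_seconds": remaining,  # estimated time remaining at the current rate
    }
```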
You have learned how to make long-running processes recoverable. The natural next step is understanding how to handle loops and iteration patterns that often use checkpointing.