CONCEPT Cited by 1 source
Status-based LLM pipeline checkpointing
Definition
Status-based LLM pipeline checkpointing is the discipline of tracking each record's processing state in a status column (or equivalent state field) inside the pipeline's tabular state substrate. On failure or restart the pipeline resumes at per-record granularity: it re-runs only records whose state shows they have not been fully processed, and never re-pays the LLM cost of records that already have results.
Canonicalised verbatim in the VF Match FDR pipeline:
"Status-based checkpointing: Every record tracks its processing state, enabling pipelines to resume from any point without reprocessing rows with expensive LLM calls."
Why this matters specifically for LLM pipelines
Conventional batch-job checkpointing (Spark RDD checkpoints, Airflow task retries) operates at task granularity — when a task fails, the entire task re-runs. For an LLM pipeline at 25M+ records, re-running an entire task means re-paying every record's LLM call even though most records succeeded.
The dollar cost of LLM invocation is what makes per-record state-tracking load-bearing:
- A 10M-page extraction pass at $0.01/page is $100,000 of OpenAI / Bedrock / Vertex spend.
- If the pipeline fails after 9.5M pages and re-runs from scratch on the entire 10M, the second run costs another $100,000.
- With status-based checkpointing, the second run pays only the remaining 500k pages → $5,000 — a 20× cost reduction on the recovery scenario.
Per-record state-tracking is therefore economic resumability, not just operational resumability.
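The arithmetic in the bullets above can be checked with a few lines of Python. This is only a worked example of the figures already quoted (10M pages, $0.01/page, failure after 9.5M pages); the function name `recovery_cost_cents` is hypothetical, and costs are kept in integer cents to avoid floating-point rounding.

```python
def recovery_cost_cents(total: int, completed: int, cents_per_record: int,
                        checkpointed: bool) -> int:
    """Cost of the recovery run, in cents, after a mid-run failure."""
    # Without per-record status, every record is paid for again.
    # With it, only the records that never reached DONE are paid for.
    remaining = total - completed if checkpointed else total
    return remaining * cents_per_record


TOTAL_PAGES = 10_000_000
DONE_BEFORE_FAILURE = 9_500_000
CENTS_PER_PAGE = 1  # $0.01 per page

full_rerun_cents = recovery_cost_cents(
    TOTAL_PAGES, DONE_BEFORE_FAILURE, CENTS_PER_PAGE, checkpointed=False)
resumed_cents = recovery_cost_cents(
    TOTAL_PAGES, DONE_BEFORE_FAILURE, CENTS_PER_PAGE, checkpointed=True)

print(full_rerun_cents / 100)            # 100000.0 dollars
print(resumed_cents / 100)               # 5000.0 dollars
print(full_rerun_cents / resumed_cents)  # 20.0
```

The ratio depends only on the fraction of records completed before the failure, so the saving grows the later the failure occurs.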
Implementation shape
The state column is typically a small enum on the per-record fact table (or fact-equivalent table in a star schema):
record_id | step1_status | step2_status | step3_status | ...
----------+--------------+--------------+--------------+----
url_001   | DONE         | DONE         | DONE         | ...
url_002   | DONE         | DONE         | RUNNING      | ...
url_003   | DONE         | DONE         | FAILED       | ...
url_004   | DONE         | NOT_STARTED  | NOT_STARTED  | ...
Re-runs select rows by state predicate, for example every row whose status for the step is NOT_STARTED, FAILED, or RUNNING. RUNNING is included on the assumption that any in-flight work from a crashed worker is lost. If the worker actually completed the LLM call but failed before committing the result, the retry makes processing at-least-once; idempotency at the LLM-extraction layer (stable schema-constrained output) makes this safe.
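A minimal sketch of such a state predicate, using Python's built-in `sqlite3` as a stand-in for the pipeline's real table store. The table name `records`, the helper `rerunnable_ids`, and the exact set of re-runnable states are assumptions made for illustration; the rows are the four from the example table above.

```python
import sqlite3

# Assumed set of states that a re-run should pick up again.
RERUNNABLE = ("NOT_STARTED", "FAILED", "RUNNING")
STATUS_COLUMNS = {"step1_status", "step2_status", "step3_status"}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    " record_id TEXT PRIMARY KEY,"
    " step1_status TEXT, step2_status TEXT, step3_status TEXT)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?, ?)",
    [
        ("url_001", "DONE", "DONE", "DONE"),
        ("url_002", "DONE", "DONE", "RUNNING"),
        ("url_003", "DONE", "DONE", "FAILED"),
        ("url_004", "DONE", "NOT_STARTED", "NOT_STARTED"),
    ],
)


def rerunnable_ids(conn: sqlite3.Connection, status_column: str) -> list:
    """Return ids of records whose status for one step is re-runnable."""
    # Column names cannot be bound as parameters, so validate against
    # an allow-list before interpolating into the SQL text.
    if status_column not in STATUS_COLUMNS:
        raise ValueError(f"unknown status column: {status_column}")
    placeholders = ", ".join("?" for _ in RERUNNABLE)
    sql = (
        f"SELECT record_id FROM records "
        f"WHERE {status_column} IN ({placeholders}) ORDER BY record_id"
    )
    return [row[0] for row in conn.execute(sql, RERUNNABLE)]


print(rerunnable_ids(conn, "step3_status"))  # ['url_002', 'url_003', 'url_004']
print(rerunnable_ids(conn, "step1_status"))  # []
```

Only the selected rows would be sent to the LLM on the recovery run; `url_001` is never touched again.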
Composes with
- concepts/multi-step-llm-extraction. Each step gets its own status column; a record can be DONE on step 1 but NOT_STARTED on step 3, allowing partial advancement of the population.
- concepts/star-schema. The fact table is the natural home of the status columns; dimension tables hold reference data that doesn't change per re-run.
- Idempotency. Status-based checkpointing is the per-record-granularity instance of idempotency: once a row is DONE, re-runs do not re-process it.
- systems/databricks-ai-functions. SQL-native LLM invocation makes status-based checkpointing trivial to express: SELECT ai_query(...) WHERE step1_status = 'NOT_STARTED'.
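The partial advancement that per-step status columns allow can be sketched in plain Python. This is an illustration only: the helper `next_pending_step` is hypothetical, and it assumes steps run strictly in order, so a record's next unit of work is the first step that is not yet DONE.

```python
from collections import defaultdict
from typing import Optional

# Assumed fixed step order for the pipeline.
STEPS = ("step1_status", "step2_status", "step3_status")

records = [
    {"record_id": "url_001", "step1_status": "DONE",
     "step2_status": "DONE", "step3_status": "DONE"},
    {"record_id": "url_002", "step1_status": "DONE",
     "step2_status": "DONE", "step3_status": "RUNNING"},
    {"record_id": "url_003", "step1_status": "DONE",
     "step2_status": "DONE", "step3_status": "FAILED"},
    {"record_id": "url_004", "step1_status": "DONE",
     "step2_status": "NOT_STARTED", "step3_status": "NOT_STARTED"},
]


def next_pending_step(record: dict) -> Optional[str]:
    """Return the first step column that is not DONE, or None if finished."""
    for step in STEPS:
        if record[step] != "DONE":
            return step
    return None


# Group records into one work queue per step.
queues = defaultdict(list)
for rec in records:
    step = next_pending_step(rec)
    if step is not None:
        queues[step].append(rec["record_id"])

print(dict(queues))
# {'step3_status': ['url_002', 'url_003'], 'step2_status': ['url_004']}
```

Each record resumes from its own position: `url_004` re-enters at step 2 while `url_002` and `url_003` re-enter at step 3, and no completed step is paid for twice.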
Failure modes
- Status not committed atomically with output. If the LLM output is written but the status update fails (or vice versa), state and result diverge. Mitigation: write status + output in the same transaction (a single MERGE or INSERT).
- State enum sprawl. Each new step adds new state values; without discipline the state model becomes one-off-code hell. Mitigation: keep state values to a small, principled set (NOT_STARTED / RUNNING / DONE / FAILED) and reuse across steps.
- Stuck RUNNING records. A worker crash leaves rows in RUNNING state forever; the next run treats them as re-runnable, but repeated crashes across a long-running fleet can leave state inconsistent. Mitigation: include a running_started_at column and treat rows that have been RUNNING for too long as re-runnable.
- No backward-recovery semantics. If step 3 emits bad output for some records, status-based checkpointing lets you re-run step 3 on those records, but only if you have a separate signal flagging them as bad. The status column doesn't track output quality, only step completion.
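Two of the mitigations above, the atomic status + output write and the stale-RUNNING timeout, can be sketched with Python's built-in `sqlite3`. The schema, the `step1_output` column, the one-hour threshold, and the helper names are all assumptions for illustration; a production store would use its own transactional write (such as the MERGE mentioned above).

```python
import sqlite3

STALE_AFTER_SECONDS = 3600.0  # hypothetical threshold for a stuck RUNNING row

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    " record_id TEXT PRIMARY KEY,"
    " step1_status TEXT NOT NULL DEFAULT 'NOT_STARTED',"
    " step1_output TEXT,"
    " running_started_at REAL)"
)
conn.executemany(
    "INSERT INTO records (record_id) VALUES (?)",
    [("url_001",), ("url_002",), ("url_003",)],
)


def claim(conn, record_id, now):
    """Mark a record RUNNING and remember when the attempt started."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "UPDATE records SET step1_status = 'RUNNING',"
            " running_started_at = ? WHERE record_id = ?",
            (now, record_id),
        )


def commit_result(conn, record_id, output):
    """Write output and DONE status in one statement, one transaction."""
    with conn:
        conn.execute(
            "UPDATE records SET step1_status = 'DONE', step1_output = ?,"
            " running_started_at = NULL WHERE record_id = ?",
            (output, record_id),
        )


def rerunnable(conn, now):
    """Rows never finished, plus RUNNING rows older than the threshold."""
    cutoff = now - STALE_AFTER_SECONDS
    rows = conn.execute(
        "SELECT record_id FROM records"
        " WHERE step1_status IN ('NOT_STARTED', 'FAILED')"
        "    OR (step1_status = 'RUNNING' AND running_started_at < ?)"
        " ORDER BY record_id",
        (cutoff,),
    )
    return [r[0] for r in rows]


claim(conn, "url_001", now=0.0)
commit_result(conn, "url_001", '{"name": "example"}')  # finished normally
claim(conn, "url_002", now=0.0)     # worker crashed, never committed
claim(conn, "url_003", now=5000.0)  # still legitimately in flight

print(rerunnable(conn, now=5400.0))  # ['url_002']
```

Because status and output change in a single UPDATE, a failure leaves the row either fully RUNNING or fully DONE, never DONE without output. The timestamps are passed in explicitly so the staleness rule is easy to test.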
Seen in
- sources/2026-05-20-databricks-virtue-foundation-medical-volunteers-72-countries — canonical wiki source. VF Match FDR's per-record state tracking enables resume on a 25M+-page extraction without re-paying LLM cost. "Every record tracks its processing state, enabling pipelines to resume from any point without reprocessing rows with expensive LLM calls."
Related
- concepts/multi-step-llm-extraction — the LLM-pipeline decomposition this checkpointing serves.
- concepts/star-schema — the table substrate the status columns live on.
- concepts/idempotent-operations — the broader principle.
- concepts/idempotent-job-design — the job-altitude sibling pattern.
- patterns/multi-step-llm-extraction-pipeline — the named pattern this checkpointing is a sub-property of.