
CONCEPT Cited by 1 source

Status-based LLM pipeline checkpointing

Definition

Status-based LLM pipeline checkpointing is the discipline of tracking each record's processing state in a status column (or equivalent state field) inside the pipeline's tabular state substrate. On failure or restart the pipeline resumes at per-record granularity: it re-runs only the records whose state shows they have not been fully processed, and does not pay again for LLM calls on records that already have results.

Canonicalised verbatim in the VF Match FDR pipeline:

"Status-based checkpointing: Every record tracks its processing state, enabling pipelines to resume from any point without reprocessing rows with expensive LLM calls."

Why this matters specifically for LLM pipelines

Conventional batch-job checkpointing (Spark RDD checkpoints, Airflow task retries) operates at task granularity — when a task fails, the entire task re-runs. For an LLM pipeline at 25M+ records, re-running an entire task means re-paying every record's LLM call even though most records succeeded.

The dollar cost of LLM invocation is what makes per-record state-tracking load-bearing:

  • A 10M-page extraction pass at $0.01/page is $100,000 of OpenAI / Bedrock / Vertex spend.
  • If the pipeline fails after 9.5M pages and re-runs from scratch on the entire 10M, the second run costs another $100,000.
  • With status-based checkpointing, the second run pays only the remaining 500k pages → $5,000 — a 20× cost reduction on the recovery scenario.

Per-record state-tracking is therefore economic resumability, not just operational resumability.
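The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the cost comparison; the function name `rerun_cost` and its parameters are illustrative, and the flat per-record price is the same simplifying assumption the bullets make.

```python
from typing import Optional


def rerun_cost(total_records: int, done_records: int, cost_per_record: float) -> dict:
    """Compare the recovery cost of a full re-run with a checkpointed resume.

    Assumes a flat price per record and that every DONE record is skipped.
    """
    remaining = total_records - done_records
    from_scratch = total_records * cost_per_record
    resumed = remaining * cost_per_record
    # The ratio is undefined when nothing is left to process.
    ratio: Optional[float] = from_scratch / resumed if resumed > 0 else None
    return {"from_scratch": from_scratch, "resumed": resumed, "ratio": ratio}


# Figures from the example: 10M pages, failure after 9.5M, $0.01 per page.
costs = rerun_cost(10_000_000, 9_500_000, 0.01)
print(f"from scratch: ${costs['from_scratch']:,.0f}")
print(f"resumed:      ${costs['resumed']:,.0f}")
print(f"ratio:        {costs['ratio']:.0f}x")
```

The ratio depends only on how far the failed run got: the later the failure, the larger the saving from resuming.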

Implementation shape

The state column is typically a small enum on the per-record fact table (or fact-equivalent table in a star schema):

record_id  | step1_status | step2_status | step3_status | ...
-----------+--------------+--------------+--------------+----
url_001    | DONE         | DONE         | DONE         | ...
url_002    | DONE         | DONE         | RUNNING      | ...
url_003    | DONE         | DONE         | FAILED       | ...
url_004    | DONE         | NOT_STARTED  | NOT_STARTED  | ...

Re-runs select rows by state predicate:

WHERE step3_status IN ('NOT_STARTED', 'FAILED', 'RUNNING')

RUNNING is included on the assumption that any in-flight work from a crashed worker is lost. If the worker actually completed the LLM call but failed before committing the result, the retry repeats that call, so processing is at-least-once and that one call is paid for twice. Idempotency at the LLM-extraction layer (stable, schema-constrained output) makes the repeated write safe.
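The resume predicate can be exercised end to end with Python's built-in sqlite3 module. This is a minimal sketch, not the pipeline's actual code: the table name `records`, the `step3_output` column, and the `fake_llm_extract` stand-in for a paid LLM call are all assumptions made for illustration, and only step 3 is modelled.

```python
import sqlite3

# States that mark a row as not fully processed (same set as the predicate above).
RERUNNABLE = ("NOT_STARTED", "FAILED", "RUNNING")


def fake_llm_extract(record_id: str) -> str:
    """Hypothetical stand-in for an expensive LLM call."""
    return f"extracted:{record_id}"


def resume_step3(conn: sqlite3.Connection) -> list:
    """Re-run step 3 only for rows whose status says they are unfinished."""
    placeholders = ",".join("?" for _ in RERUNNABLE)
    rows = conn.execute(
        f"SELECT record_id FROM records "
        f"WHERE step3_status IN ({placeholders}) ORDER BY record_id",
        RERUNNABLE,
    ).fetchall()
    processed = []
    for (record_id,) in rows:
        output = fake_llm_extract(record_id)
        with conn:  # one transaction per record: output and status commit together
            conn.execute(
                "UPDATE records SET step3_output = ?, step3_status = 'DONE' "
                "WHERE record_id = ?",
                (output, record_id),
            )
        processed.append(record_id)
    return processed


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "record_id TEXT PRIMARY KEY, step3_status TEXT NOT NULL, step3_output TEXT)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        ("url_001", "DONE", "extracted:url_001"),
        ("url_002", "RUNNING", None),
        ("url_003", "FAILED", None),
        ("url_004", "NOT_STARTED", None),
    ],
)
conn.commit()

first_pass = resume_step3(conn)   # picks up the three unfinished rows
second_pass = resume_step3(conn)  # finds nothing left to do
```

Calling `resume_step3` a second time makes no LLM calls, which is the property the status column exists to provide.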

Composes with

  • concepts/multi-step-llm-extraction. Each step gets its own status column; a record can be DONE on step 1 but NOT_STARTED on step 3, allowing partial advancement of the population.
  • concepts/star-schema. The fact table is the natural home of the status columns; dimension tables hold reference data that doesn't change per re-run.
  • Idempotency. Status-based checkpointing is the per-record-granularity instance of idempotency: once a row is DONE, re-runs do not re-process it.
  • systems/databricks-ai-functions. SQL-native LLM invocation makes status-based checkpointing trivial to express: SELECT ai_query(...) FROM ... WHERE step1_status = 'NOT_STARTED'.
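Per-step status columns can be combined into a selection predicate for each step. The sketch below, again using sqlite3 and a hypothetical `records` table, adds one assumption the text does not state explicitly: a record becomes eligible for a step only once every earlier step is DONE. The helper name `ready_for` is illustrative.

```python
import sqlite3

# Ordered step names; each has a matching <step>_status column.
STEPS = ("step1", "step2", "step3")


def ready_for(conn: sqlite3.Connection, step: str) -> list:
    """Return record_ids that still need `step` and have finished all earlier steps."""
    idx = STEPS.index(step)  # raises ValueError for an unknown step name
    clauses = [f"{step}_status IN ('NOT_STARTED', 'FAILED', 'RUNNING')"]
    # Assumed gating rule: every earlier step must already be DONE.
    clauses += [f"{prev}_status = 'DONE'" for prev in STEPS[:idx]]
    sql = (
        "SELECT record_id FROM records WHERE "
        + " AND ".join(clauses)
        + " ORDER BY record_id"
    )
    return [row[0] for row in conn.execute(sql)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (record_id TEXT PRIMARY KEY, "
    "step1_status TEXT, step2_status TEXT, step3_status TEXT)"
)
# Same rows as the status table shown under "Implementation shape".
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?, ?)",
    [
        ("url_001", "DONE", "DONE", "DONE"),
        ("url_002", "DONE", "DONE", "RUNNING"),
        ("url_003", "DONE", "DONE", "FAILED"),
        ("url_004", "DONE", "NOT_STARTED", "NOT_STARTED"),
    ],
)
conn.commit()

pending = {step: ready_for(conn, step) for step in STEPS}
```

With these rows, url_004 is selected for step 2 but not yet for step 3, which is the partial advancement described in the first bullet.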

Failure modes

  • Status not committed atomically with output. If the LLM output is written but the status update fails (or vice versa), state and result diverge. Mitigation: write status and output in the same transaction (a single MERGE or INSERT).
  • State enum sprawl. Each new step adds new state values; without discipline the state model becomes one-off-code hell. Mitigation: keep state values to a small, principled set (NOT_STARTED / RUNNING / DONE / FAILED) and reuse across steps.
  • Stuck RUNNING records. A worker crash leaves rows in RUNNING state indefinitely. Treating every RUNNING row as re-runnable is only safe when no other worker is active; with concurrent workers, a crashed row is indistinguishable from one that is genuinely in flight, and repeated crashes can leave state inconsistent. Mitigation: include a running_started_at column and treat rows that have been RUNNING for too long as re-runnable.
  • No backward-recovery semantics. If step 3 emits bad output for some records, status-based checkpointing lets you re-run step 3 on those records — but only if you have a separate signal flagging them as bad. The status column doesn't track output quality, only step completion.
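Two of the mitigations above, the atomic status-plus-output write and the running_started_at timeout, can be sketched with sqlite3. The table layout, the one-hour `STALE_AFTER_SECONDS` value, and the function names are assumptions for illustration; a production table format would use its own transactional write (for example a single MERGE) in place of the UPDATE shown here.

```python
import sqlite3

STALE_AFTER_SECONDS = 3600  # hypothetical timeout; tune to the slowest expected call


def commit_result(conn: sqlite3.Connection, record_id: str, output: str) -> None:
    """Write the output and flip the status in one statement, inside one transaction."""
    with conn:  # commits on success, rolls back if the statement raises
        conn.execute(
            "UPDATE records SET step3_output = ?, step3_status = 'DONE', "
            "running_started_at = NULL WHERE record_id = ?",
            (output, record_id),
        )


def rerunnable(conn: sqlite3.Connection, now: float) -> list:
    """Select unfinished rows, counting RUNNING rows only once they are stale."""
    cutoff = now - STALE_AFTER_SECONDS
    rows = conn.execute(
        "SELECT record_id FROM records "
        "WHERE step3_status IN ('NOT_STARTED', 'FAILED') "
        "   OR (step3_status = 'RUNNING' AND running_started_at <= ?) "
        "ORDER BY record_id",
        (cutoff,),
    )
    return [row[0] for row in rows]


NOW = 1_000_000.0  # fixed clock value (epoch seconds) to keep the example deterministic

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (record_id TEXT PRIMARY KEY, step3_status TEXT, "
    "step3_output TEXT, running_started_at REAL)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?, ?)",
    [
        ("url_001", "DONE", "extracted:url_001", None),
        ("url_002", "RUNNING", None, NOW - 7200),  # stale: started two hours ago
        ("url_003", "FAILED", None, None),
        ("url_005", "RUNNING", None, NOW - 60),    # fresh: probably still in flight
    ],
)
conn.commit()

candidates = rerunnable(conn, NOW)
commit_result(conn, "url_002", "extracted:url_002")
after_commit = rerunnable(conn, NOW)
```

Because the output and the status change in the same UPDATE, a crash either leaves both untouched or writes both, so the two cannot diverge. The fresh RUNNING row (url_005) is left alone until it crosses the timeout.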

Seen in
