CONCEPT
Checkpoint Intermediate DataFrame for Debugging¶
Definition¶
Checkpointing intermediate DataFrames is the technique of materialising a distributed Spark DataFrame to durable storage (typically S3, HDFS, or a scratch path) so it can be inspected interactively after the job runs. It substitutes for breakpoint-based debugging, which isn't practical on a distributed, lazily-evaluated engine.
This is distinct from Spark's own .checkpoint() method, which also writes to HDFS/S3 but serves as a lineage-truncation mechanism for long query plans. This concept focuses on the debugging workflow that checkpointing to a scratch path enables, not on the lineage-truncation effect.
Why Spark makes debugging hard¶
Two fundamental properties of Spark's execution model make traditional breakpoint debugging impractical:
- Distributed execution — DataFrames live across many executor JVMs. You can't set a breakpoint that pauses all executors coherently, and attaching a debugger to one executor shows you 1/N of the data.
- Lazy evaluation — a chain of select/filter/join operations produces a logical plan, not execution. Computation only happens when an action (.collect(), .count(), .write()) forces it. Stepping through the code steps through plan construction, not data processing.
The combination: when something goes wrong, you have a DataFrame that hasn't computed yet, living across N executors, with millions of rows you can't inspect.
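Lazy evaluation can be illustrated with a toy model in plain Python (no Spark required — `LazyFrame` here is a hypothetical stand-in, not Spark's API): transformations only record a plan, and nothing executes until an action forces it.

```python
# Toy sketch of lazy evaluation. LazyFrame is a hypothetical class for
# illustration only; transformations append to a plan, and only the
# action (.collect) runs that plan over the data.
class LazyFrame:
    def __init__(self, rows, plan=None):
        self._rows = rows          # "source" data
        self._plan = plan or []    # recorded transformations, not yet run

    def filter(self, predicate):
        # No data is touched -- we just return a frame with a longer plan.
        return LazyFrame(self._rows, self._plan + [("filter", predicate)])

    def select(self, key):
        return LazyFrame(self._rows, self._plan + [("select", key)])

    def collect(self):
        # The action: only now does the recorded plan execute.
        rows = self._rows
        for op, arg in self._plan:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            elif op == "select":
                rows = [{arg: r[arg]} for r in rows]
        return rows

df = LazyFrame([{"id": 1, "ok": True}, {"id": 2, "ok": False}])
plan = df.filter(lambda r: r["ok"]).select("id")   # nothing computed yet
result = plan.collect()                            # computation happens here
```

Setting a breakpoint inside `filter` or `select` would pause plan construction on the driver, not data processing — which is exactly why stepping through real Spark code is unhelpful.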
The checkpoint-to-scratch workflow¶
Yelp's spark-etl package builds checkpointing into the framework:
spark-submit \
  /path/to/spark_etl_runner.py \
  --team-name my_team \
  --notify-email my_email@example.com \
  --feature-config /path/to/feature_config.yaml \
  --publish-path s3a://my-bucket/publish/ \
  --scratch-path s3a://my-bucket/scratch/ \
  --start-date 2024-02-29 \
  --end-date 2024-02-29 \
  --checkpoint feature1,feature2,feature3
The --checkpoint flag names features whose output should be
materialised to the scratch path. When the job runs:
- Each named feature's output DataFrame is written to <scratch-path>/<feature-name>/<date>/.
- The job continues using the materialised path rather than re-computing from source.
- After the job, engineers open JupyterHub notebooks and read the Parquet at the scratch path for interactive inspection.
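The per-feature decision the framework makes can be sketched as follows (plain Python; the helper names, the dict standing in for S3, and the path layout are assumptions based on the convention above, not spark-etl's actual code):

```python
def scratch_path(scratch_root, feature, date):
    # Per-feature, per-date layout: <scratch-path>/<feature-name>/<date>/
    return "/".join((scratch_root.rstrip("/"), feature, date))

def maybe_checkpoint(rows, feature, date, checkpoint_set, scratch_root, store):
    """Sketch of the checkpoint decision: if `feature` was named via
    --checkpoint, materialise its output and hand downstream the
    materialised copy; otherwise pass the output straight through.
    `store` is a dict standing in for S3/HDFS; real code would call
    df.write.parquet(path) then spark.read.parquet(path)."""
    if feature not in checkpoint_set:
        return rows
    path = scratch_path(scratch_root, feature, date)
    store[path] = list(rows)   # "write" to the scratch path
    return store[path]         # downstream reads the checkpointed copy

store = {}
out = maybe_checkpoint(
    iter([{"x": 1}]), "feature1", "2024-02-29",
    {"feature1", "feature2", "feature3"},
    "s3a://my-bucket/scratch", store,
)
```

Because the checkpointed copy survives in `store` (the scratch path) after the job, it remains available for notebook inspection and for resume on retry.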
Yelp's verbatim framing: "Checkpointing intermediate data frames to a scratch path would be a convenient way to inspect data for debugging and resuming pipeline faster by specifying computational expensive features' paths."
Benefits beyond debugging¶
- Pipeline resume — on a retry, expensive upstream features can read from checkpointed scratch rather than re-computing from source. Faster iteration when debugging downstream.
- Shareability — checkpointed Parquet is team-readable. One engineer finds a suspicious row; others can reproduce the investigation in their own notebooks by reading the same scratch path.
- Post-incident forensics — after a production incident, scratch-path data is available for root-cause analysis even if the pipeline state is gone.
The scratch-path discipline¶
- Separate from publish — scratch is ephemeral; published outputs are final. Keep them at different S3 prefixes so a lifecycle policy can aggressively expire scratch without risking production data.
- Per-date, per-feature partitioning — scratch paths need a date suffix so multiple job runs don't collide and engineers can target a specific run for investigation.
- Lifecycle policies — scratch fills fast. A 7-day or 30-day expiration policy keeps cost contained.
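On S3, the aggressive-expiration discipline can be expressed as a lifecycle rule scoped to the scratch prefix. A sketch (the prefix is a placeholder, and the 7-day window is one of the options mentioned above):

```json
{
  "Rules": [
    {
      "ID": "expire-scratch",
      "Filter": { "Prefix": "scratch/" },
      "Status": "Enabled",
      "Expiration": { "Days": 7 }
    }
  ]
}
```

Because the rule filters on the scratch prefix only, published outputs at a different prefix are untouched — which is the point of keeping the two separate.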
Comparison to alternatives¶
| Approach | Pros | Cons |
|---|---|---|
| Breakpoint debugger | Familiar, fine-grained control | Impractical on distributed + lazy Spark |
| df.show(n) | Fast for small N | Limited to the driver's truncated view; doesn't help for deeper investigation |
| df.collect() to driver | Full data in memory | OOM risk at scale; still ephemeral |
| Checkpoint to scratch + Jupyter | Full data, durable, shareable | Scratch storage cost; requires framework support |
The Yelp-canonical pattern scales to DataFrames with millions of rows — the scratch Parquet is read lazily by the notebook, so only the rows the engineer actually queries are pulled to the driver.
Seen in¶
- sources/2025-02-19-yelp-revenue-automation-series-building-revenue-data-pipeline — canonical wiki instance. CLI flag --checkpoint on spark_etl_runner.py; scratch path convention; JupyterHub as downstream inspection surface.
Related¶
- systems/apache-spark — the engine whose distributed + lazy model motivates the technique
- systems/yelp-spark-etl — canonical framework implementation
- systems/jupyterhub — the downstream debugging surface
- systems/aws-s3 — typical scratch substrate
- concepts/spark-etl-feature-dag — the broader feature-DAG abstraction that makes per-feature checkpointing natural
- companies/yelp