
Checkpoint Intermediate DataFrame for Debugging

Definition

Checkpointing intermediate DataFrames is the technique of materialising a distributed Spark DataFrame to durable storage (typically S3 / HDFS / scratch paths) so it can be inspected interactively after the job runs, substituting for breakpoint-based debugging that isn't practical on distributed + lazy-evaluation engines.

Distinguished from Spark's own .checkpoint() method, which writes to HDFS/S3 as a lineage-truncation mechanism for long query plans. This concept focuses on the debugging workflow that checkpointing to a scratch path enables, not on the lineage-truncation effect.

Why Spark makes debugging hard

Two fundamental properties of Spark's execution model make traditional breakpoint debugging impractical:

  1. Distributed execution — DataFrames live across many executor JVMs. You can't set a breakpoint that pauses all executors coherently, and attaching a debugger to one executor shows you 1/N of the data.
  2. Lazy evaluation — a chain of select / filter / join operations produces a logical plan, not execution. Computation only happens when an action (.collect(), .count(), .write()) forces it. Stepping through the code steps through plan construction, not data processing.

The combination: when something goes wrong, you have a DataFrame that hasn't computed yet, living across N executors, with millions of rows you can't inspect.
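The lazy-evaluation point above can be made concrete with a toy sketch in plain Python (deliberately not Spark itself): transformations only record a plan, and nothing executes until an action forces it, which is why stepping through the code steps through plan construction rather than data processing.

```python
# Toy illustration of lazy evaluation -- NOT Spark's implementation.
class LazyFrame:
    def __init__(self, rows, plan=()):
        self._rows = rows   # source data (in Spark: distributed partitions)
        self._plan = plan   # recorded transformations, not yet applied

    def filter(self, predicate):
        # Returns a new frame with an extended plan; no data is touched.
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def select(self, key):
        return LazyFrame(self._rows, self._plan + (("select", key),))

    def collect(self):
        # The "action": only here does the recorded plan actually run.
        rows = self._rows
        for op, arg in self._plan:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            elif op == "select":
                rows = [{arg: r[arg]} for r in rows]
        return rows

df = LazyFrame([{"id": 1, "ok": True}, {"id": 2, "ok": False}])
plan_only = df.filter(lambda r: r["ok"]).select("id")  # builds a plan only
result = plan_only.collect()                           # computation happens here
```

A breakpoint set inside `filter` would fire during plan construction, long before any row is examined; in real Spark the row-level work additionally happens in executor JVMs you aren't attached to.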

The checkpoint-to-scratch workflow

Yelp's spark-etl package builds checkpointing into the framework:

spark-submit \
    /path/to/spark_etl_runner.py \
    --team-name my_team \
    --notify-email my_email@example.com \
    --feature-config /path/to/feature_config.yaml \
    --publish-path s3a://my-bucket/publish/ \
    --scratch-path s3a://my-bucket/scratch/ \
    --start-date 2024-02-29 \
    --end-date 2024-02-29 \
    --checkpoint feature1,feature2,feature3

The --checkpoint flag names features whose output should be materialised to the scratch path. When the job runs:

  1. Each named feature's output DataFrame is written to <scratch-path>/<feature-name>/<date>/.
  2. The job continues using the materialised path rather than re-computing from source.
  3. After the job, engineers open JupyterHub notebooks and read the Parquet at the scratch path for interactive inspection.
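The path layout in step 1 can be sketched as a small helper; the function name and arguments are illustrative, not spark-etl's actual API.

```python
def checkpoint_path(scratch_root: str, feature: str, date: str) -> str:
    """Build the per-feature, per-date checkpoint layout:
        <scratch-path>/<feature-name>/<date>/
    so multiple runs and features never collide. (Illustrative sketch,
    not spark-etl's internal path logic.)"""
    return f"{scratch_root.rstrip('/')}/{feature}/{date}/"

path = checkpoint_path("s3a://my-bucket/scratch/", "feature1", "2024-02-29")
```

A notebook can then point a Parquet reader at exactly that prefix to inspect one feature from one run.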

Yelp's verbatim framing: "Checkpointing intermediate data frames to a scratch path would be a convenient way to inspect data for debugging and resuming pipeline faster by specifying computational expensive features' paths."

Benefits beyond debugging

  • Pipeline resume — on a retry, expensive upstream features can read from checkpointed scratch rather than re-computing from source. Faster iteration when debugging downstream.
  • Shareability — checkpointed Parquet is team-readable. One engineer finds a suspicious row; others can reproduce the investigation in their own notebooks by reading the same scratch path.
  • Post-incident forensics — after a production incident, scratch-path data is available for root-cause analysis even if the pipeline state is gone.
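The pipeline-resume behaviour can be sketched as a read-if-checkpointed wrapper. Everything here is an assumption for illustration: the function names are invented, and a local JSON file stands in for Parquet on S3/HDFS.

```python
import json
import tempfile
from pathlib import Path

def compute_or_restore(checkpoint_dir: Path, compute):
    """If a checkpoint exists, read it and skip recomputation; otherwise
    run the expensive computation and materialise its result for later
    runs. (Sketch only -- a local JSON file stands in for Parquet.)"""
    marker = checkpoint_dir / "part-0.json"
    if marker.exists():
        return json.loads(marker.read_text())      # resume from checkpoint
    rows = compute()                               # expensive upstream work
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    marker.write_text(json.dumps(rows))            # checkpoint for next run
    return rows

# Usage: the second call reads the checkpoint instead of recomputing.
calls = []
def _expensive_feature():
    calls.append(1)
    return [{"id": 1, "score": 0.9}]

_dir = Path(tempfile.mkdtemp()) / "feature1" / "2024-02-29"
first = compute_or_restore(_dir, _expensive_feature)
second = compute_or_restore(_dir, _expensive_feature)
```

On a retry, only the features downstream of the checkpoint are re-executed, which is what makes debugging iteration faster.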

The scratch-path discipline

  • Separate from publish — scratch is ephemeral; published outputs are final. Keep them at different S3 prefixes so a lifecycle policy can aggressively expire scratch without risking production data.
  • Per-date, per-feature partitioning — scratch paths need a date suffix so multiple job runs don't collide and engineers can target a specific run for investigation.
  • Lifecycle policies — scratch fills fast. A 7-day or 30-day expiration policy keeps cost contained.
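On S3, a 7-day expiry on the scratch prefix can be expressed as a bucket lifecycle rule; this is the JSON shape accepted by `aws s3api put-bucket-lifecycle-configuration`, with the rule ID and prefix chosen here for illustration.

```json
{
  "Rules": [
    {
      "ID": "expire-scratch",
      "Filter": { "Prefix": "scratch/" },
      "Status": "Enabled",
      "Expiration": { "Days": 7 }
    }
  ]
}
```

Because scratch and publish live under different prefixes, this rule can expire checkpoints aggressively without touching published outputs.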

Comparison to alternatives

Approach | Pros | Cons
--- | --- | ---
Breakpoint debugger | Familiar, fine-grained control | Impractical on distributed + lazy Spark
df.show(n) | Fast for small N | Limited to the driver's truncated view; doesn't help for deeper investigation
df.collect() to driver | Full data in memory | OOM risk at scale; still ephemeral
Checkpoint to scratch + Jupyter | Full data, durable, shareable | Scratch storage cost; requires framework support

The Yelp-canonical pattern scales to DataFrames with millions of rows — the scratch Parquet is read lazily by the notebook, so only the rows the engineer actually queries are pulled to the driver.
