
Checkpoint Intermediate DataFrame for Debugging

Definition

Checkpointing intermediate DataFrames is the technique of materialising a distributed Spark DataFrame to durable storage (typically S3 / HDFS / scratch paths) so it can be inspected interactively after the job runs, substituting for breakpoint-based debugging that isn't practical on distributed + lazy-evaluation engines.

Distinguished from Spark's own .checkpoint() method, which writes to HDFS/S3 as a lineage-truncation mechanism for long query plans. This concept focuses on the debugging workflow that checkpointing to a scratch path enables, not on the lineage-truncation effect.

Why Spark makes debugging hard

Two fundamental properties of Spark's execution model make traditional breakpoint debugging impractical:

  1. Distributed execution — DataFrames live across many executor JVMs. You can't set a breakpoint that pauses all executors coherently, and attaching a debugger to one executor shows you 1/N of the data.
  2. Lazy evaluation — a chain of select / filter / join operations produces a logical plan, not execution. Computation only happens when an action (.collect(), .count(), .write()) forces it. Stepping through the code steps through plan construction, not data processing.

The combination: when something goes wrong, you have a DataFrame that hasn't computed yet, living across N executors, with millions of rows you can't inspect.
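The lazy-evaluation point above can be made concrete with a toy sketch in plain Python (deliberately not Spark itself): transformations only record a plan, and nothing executes until an action forces it, which is why stepping through the code steps through plan construction rather than data processing.

```python
# Toy illustration of lazy evaluation -- NOT Spark's implementation.
class LazyFrame:
    def __init__(self, rows, plan=()):
        self._rows = rows   # source data (in Spark: distributed partitions)
        self._plan = plan   # recorded transformations, not yet applied

    def filter(self, predicate):
        # Returns a new frame with an extended plan; no data is touched.
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def select(self, key):
        return LazyFrame(self._rows, self._plan + (("select", key),))

    def collect(self):
        # The "action": only here does the recorded plan actually run.
        rows = self._rows
        for op, arg in self._plan:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            elif op == "select":
                rows = [{arg: r[arg]} for r in rows]
        return rows

df = LazyFrame([{"id": 1, "ok": True}, {"id": 2, "ok": False}])
plan_only = df.filter(lambda r: r["ok"]).select("id")  # builds a plan only
result = plan_only.collect()                           # computation happens here
```

A breakpoint set inside `filter` would fire during plan construction, long before any row is examined; in real Spark the row-level work additionally happens in executor JVMs you aren't attached to.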

The checkpoint-to-scratch workflow

Yelp's spark-etl package builds checkpointing into the framework:

spark-submit \
    /path/to/spark_etl_runner.py \
    --team-name my_team \
    --notify-email my_email@example.com \
    --feature-config /path/to/feature_config.yaml \
    --publish-path s3a://my-bucket/publish/ \
    --scratch-path s3a://my-bucket/scratch/ \
    --start-date 2024-02-29 \
    --end-date 2024-02-29 \
    --checkpoint feature1,feature2,feature3

The --checkpoint flag names features whose output should be materialised to the scratch path. When the job runs:

  1. Each named feature's output DataFrame is written to <scratch-path>/<feature-name>/<date>/.
  2. The job continues using the materialised path rather than re-computing from source.
  3. After the job, engineers open JupyterHub notebooks and read the Parquet at the scratch path for interactive inspection.
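The path layout in step 1 can be sketched as a small helper; the function name and arguments are illustrative, not spark-etl's actual API.

```python
def checkpoint_path(scratch_root: str, feature: str, date: str) -> str:
    """Build the per-feature, per-date checkpoint layout:
        <scratch-path>/<feature-name>/<date>/
    so multiple runs and features never collide. (Illustrative sketch,
    not spark-etl's internal path logic.)"""
    return f"{scratch_root.rstrip('/')}/{feature}/{date}/"

path = checkpoint_path("s3a://my-bucket/scratch/", "feature1", "2024-02-29")
```

A notebook can then point a Parquet reader at exactly that prefix to inspect one feature from one run.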

Yelp's verbatim framing: "Checkpointing intermediate data frames to a scratch path would be a convenient way to inspect data for debugging and resuming pipeline faster by specifying computational expensive features' paths."

Benefits beyond debugging

  • Pipeline resume — on a retry, expensive upstream features can read from checkpointed scratch rather than re-computing from source. Faster iteration when debugging downstream.
  • Shareability — checkpointed Parquet is team-readable. One engineer finds a suspicious row; others can reproduce the investigation in their own notebooks by reading the same scratch path.
  • Post-incident forensics — after a production incident, scratch-path data is available for root-cause analysis even if the pipeline state is gone.
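The pipeline-resume behaviour can be sketched as a read-if-checkpointed wrapper. Everything here is an assumption for illustration: the function names are invented, and a local JSON file stands in for Parquet on S3/HDFS.

```python
import json
import tempfile
from pathlib import Path

def compute_or_restore(checkpoint_dir: Path, compute):
    """If a checkpoint exists, read it and skip recomputation; otherwise
    run the expensive computation and materialise its result for later
    runs. (Sketch only -- a local JSON file stands in for Parquet.)"""
    marker = checkpoint_dir / "part-0.json"
    if marker.exists():
        return json.loads(marker.read_text())      # resume from checkpoint
    rows = compute()                               # expensive upstream work
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    marker.write_text(json.dumps(rows))            # checkpoint for next run
    return rows

# Usage: the second call reads the checkpoint instead of recomputing.
calls = []
def _expensive_feature():
    calls.append(1)
    return [{"id": 1, "score": 0.9}]

_dir = Path(tempfile.mkdtemp()) / "feature1" / "2024-02-29"
first = compute_or_restore(_dir, _expensive_feature)
second = compute_or_restore(_dir, _expensive_feature)
```

On a retry, only the features downstream of the checkpoint are re-executed, which is what makes debugging iteration faster.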

The scratch-path discipline

  • Separate from publish — scratch is ephemeral; published outputs are final. Keep them at different S3 prefixes so a lifecycle policy can aggressively expire scratch without risking production data.
  • Per-date, per-feature partitioning — scratch paths need a date suffix so multiple job runs don't collide and engineers can target a specific run for investigation.
  • Lifecycle policies — scratch fills fast. A 7-day or 30-day expiration policy keeps cost contained.
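On S3, a 7-day expiry on the scratch prefix can be expressed as a bucket lifecycle rule; this is the JSON shape accepted by `aws s3api put-bucket-lifecycle-configuration`, with the rule ID and prefix chosen here for illustration.

```json
{
  "Rules": [
    {
      "ID": "expire-scratch",
      "Filter": { "Prefix": "scratch/" },
      "Status": "Enabled",
      "Expiration": { "Days": 7 }
    }
  ]
}
```

Because scratch and publish live under different prefixes, this rule can expire checkpoints aggressively without touching published outputs.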

Comparison to alternatives

Approach | Pros | Cons
--- | --- | ---
Breakpoint debugger | Familiar, fine-grained control | Impractical on distributed + lazy Spark
df.show(n) | Fast for small N | Limited to the driver's truncated view; doesn't help for deeper investigation
df.collect() to driver | Full data in memory | OOM risk at scale; still ephemeral
Checkpoint to scratch + Jupyter | Full data, durable, shareable | Scratch storage cost; requires framework support

The Yelp-canonical pattern scales to DataFrames with millions of rows — the scratch Parquet is read lazily by the notebook, so only the rows the engineer actually queries are pulled to the driver.
