PATTERN Cited by 1 source

WAL before lakehouse publish

Definition

WAL before lakehouse publish is the pattern of interposing a latency-optimized write-ahead log (WAL) between the ingestion endpoint and the final lakehouse storage layer (e.g., Delta tables). The WAL provides the low-latency durability guarantee and is the point at which the client is acknowledged, while the downstream commit to the lakehouse (which may involve transaction coordination, file compaction, and metadata updates) proceeds asynchronously.

Why it exists

Lakehouse writes (Delta, Iceberg) are not instantaneous. A commit involves:

  • Acquiring transaction locks or optimistic concurrency checks
  • Writing Parquet data files
  • Committing metadata (Delta log / manifest)
  • Potentially triggering compaction or liquid clustering

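If a streaming service waited for a full lakehouse commit before acknowledging the producer, every ack would carry the latency of all the steps above, which is too slow for real-time workloads. The WAL decouples the durability SLO (milliseconds) from the queryability SLO (seconds): a record is safe once it is in the WAL, and becomes visible to queries only after the later lakehouse commit.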
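To illustrate why the optimistic concurrency step can add latency, here is a minimal sketch of a versioned commit log in which a writer wins a version by atomically creating its commit file and retries on conflict. It is not the Delta or Iceberg implementation; the function names, the zero-padded file naming, and the retry policy are assumptions chosen for illustration.

```python
import os
import tempfile


def try_commit(log_dir: str, version: int, payload: bytes) -> bool:
    """Claim `version` by creating its commit file; False if it already exists."""
    path = os.path.join(log_dir, f"{version:020d}.json")  # hypothetical naming
    try:
        # O_EXCL makes creation fail if another writer already claimed this version.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    return True


def commit_with_retry(log_dir: str, read_version: int, payload: bytes,
                      max_attempts: int = 5) -> int:
    """Try read_version + 1, then later versions, until one is claimed."""
    version = read_version + 1
    for _ in range(max_attempts):
        if try_commit(log_dir, version, payload):
            return version
        # A real writer would re-validate its changes against the winning
        # commit before retrying; this sketch simply moves to the next version.
        version += 1
    raise RuntimeError("gave up after repeated commit conflicts")


def demo_commit_race():
    with tempfile.TemporaryDirectory() as log_dir:
        first = commit_with_retry(log_dir, -1, b"{}")
        # Second writer starts from the same stale snapshot and must retry once.
        second = commit_with_retry(log_dir, -1, b"{}")
        return first, second
```

Each conflict costs the losing writer another round trip, which is one reason the commit path is kept off the acknowledgement path.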

Mechanism (Zerobus instantiation)

  1. Producer pushes data via gRPC bidirectional stream.
  2. Zerobus writes to a latency-optimized WAL.
  3. Once the data is durable in the WAL, Zerobus returns the highest offset committed to the WAL on the stream (async ack loop).
  4. Client purges its in-flight buffer up to that offset.
  5. Asynchronously, Delta Kernel Rust reads from the WAL and commits to Delta tables.

This mirrors the classic WAL commit-before-ack invariant applied at the service boundary rather than within a single database.
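The five steps above can be sketched as a toy, single-process model. This is not Zerobus code: the class `WalIngestSketch`, its methods, the length-prefixed framing, and the in-memory list standing in for a Delta table are all assumptions made for illustration, and a plain function call replaces the gRPC stream.

```python
import os
import tempfile
from collections import deque


class WalIngestSketch:
    """Hypothetical model: durable append, offset ack, deferred publish."""

    def __init__(self, wal_path: str):
        self._wal = open(wal_path, "ab")
        self._next_offset = 0
        self._unpublished = deque()  # records in the WAL, not yet in the table
        self.table = []              # stands in for the lakehouse table

    def append(self, record: bytes) -> int:
        """Steps 2-3: write to the WAL, fsync, then return the acked offset."""
        self._wal.write(len(record).to_bytes(4, "big") + record)
        self._wal.flush()
        os.fsync(self._wal.fileno())  # durable before the ack is returned
        offset = self._next_offset
        self._next_offset += 1
        self._unpublished.append((offset, record))
        return offset

    def publish_batch(self, max_records: int) -> int:
        """Step 5: drain WAL records into the table, off the ack path."""
        batch = []
        while self._unpublished and len(batch) < max_records:
            batch.append(self._unpublished.popleft())
        self.table.extend(batch)
        return len(batch)

    def close(self) -> None:
        self._wal.close()


def run_demo():
    with tempfile.TemporaryDirectory() as tmp:
        svc = WalIngestSketch(os.path.join(tmp, "ingest.wal"))
        in_flight = {}  # client-side buffer keyed by offset
        acked = -1
        for i, payload in enumerate([b"a", b"b", b"c"]):
            in_flight[i] = payload        # step 1: producer sends a record
            acked = svc.append(payload)   # steps 2-3: durable, then acked
            for off in [o for o in in_flight if o <= acked]:
                del in_flight[off]        # step 4: purge up to acked offset
        visible_before = len(svc.table)   # acked but not yet queryable
        published = svc.publish_batch(max_records=10)
        visible_after = len(svc.table)
        svc.close()
        return acked, len(in_flight), visible_before, published, visible_after
```

In this model the producer's buffer is empty as soon as the last ack arrives, while the table stays empty until `publish_batch` runs, which is the gap between durability and queryability that the pattern accepts.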

Trade-offs

Advantage | Cost
Low-latency ack (ms-level durability) | Data not immediately queryable in lakehouse
Producer buffer can be freed quickly | Two-phase durability (WAL → Delta) adds complexity
Decouples ingestion rate from commit rate | WAL must be sized for burst → drain mismatch
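The sizing cost in the last row can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope model, not a formula from the source: it assumes constant rates, and the function names and example figures are hypothetical.

```python
def wal_backlog(burst_ingest_rate: float, drain_rate: float,
                burst_seconds: float) -> float:
    """Backlog accumulated in the WAL while ingest outpaces the drain."""
    return max(0.0, (burst_ingest_rate - drain_rate) * burst_seconds)


def catch_up_seconds(backlog: float, drain_rate: float,
                     steady_ingest_rate: float) -> float:
    """Time to drain the backlog once ingest falls back to its steady rate."""
    headroom = drain_rate - steady_ingest_rate
    if headroom <= 0:
        return float("inf")  # the drain never catches up
    return backlog / headroom


# Hypothetical figures in MB/s: a 60 s burst at 500 against a 200 drain.
backlog_mb = wal_backlog(500, 200, 60)
recovery_s = catch_up_seconds(backlog_mb, 200, 100)
```

With these assumed numbers the WAL needs room for 18,000 MB of backlog, and records written at the end of the burst wait about 180 s before becoming queryable, so the WAL's capacity bounds both burst tolerance and worst-case queryability lag.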

Seen in
