Skip to content

DATABRICKS

Read original ↗

Ingesting the Milky Way: Petabyte-Scale with Zerobus Ingest

Summary

Databricks discloses the internal architecture of Zerobus Ingest — their fully managed, serverless, push-based streaming ingestion service that writes directly into Delta tables governed by Unity Catalog. The post details three critical design decisions that enabled petabyte-scale ingestion: (1) dynamic partitioning with stream-connection-level ordering guarantees instead of static Kafka-style partition topology, (2) a custom zero-copy protobuf decoder ("Zeroparser") achieving ~1 GB/s per CPU core with dynamic descriptors, and (3) a latency-optimized write-ahead log for durability before lakehouse publish. The benchmark ingests NASA's NEOWISE dataset (200 billion data points / 1 PB) in under 24 hours at a sustained 12 GB/s to a single table from 2,048 concurrent streams.

Key takeaways

  1. Stream-connection-level ordering replaces partition-level ordering. Zerobus decouples ordering from partitions — each producer's stream connection is the ordering unit. This eliminates the static partition problem where adding/removing partitions breaks consumer ordering guarantees (Source: §Autoscaling achieved through dynamic partitioning).

  2. True autoscaling via hot routing. Because ordering is per-stream (not per-partition), pods can be added/removed dynamically. New streams route away from hot pods; draining streams finish gracefully on shrinking pods. The system achieves elastic compute utilization without the "provision for peak, carry forever" anti-pattern of Kafka (Source: §Hot routing and true autoscaling).

  3. Zero-copy protobuf decoding at ~1 GB/s per core. Zeroparser is a custom Rust-based protobuf decoder that accepts dynamic descriptors at runtime (like reflection) but achieves codegen-level performance by doing single-pass parsing with zero memory allocations. Rust's lifetime system guarantees compile-time safety while keeping raw wire bytes under exclusive network ownership (Source: §Zero-copy high-performance data handling).

  4. Zeroparser outperforms industry-standard codegen implementations despite being in the "dynamic" category — benchmarked against the NEOWISE schema on single-core parsing (Source: §Results show that Zeroparser…).

  5. WAL provides durability guarantee with low-latency handoff. Zerobus implements a latency-optimized WAL; messages are durable before ack is sent to client. Rather than per-record acks, the server returns the highest committed offset on the stream — an async ack loop that keeps clients lean (Source: §Write Ahead Log).

  6. gRPC bidirectional streaming enables the async ack pattern: one channel for sending messages, another for receiving offset acknowledgements. Clients purge in-flight buffers up to the acknowledged offset (Source: §This async design is key for clients…).

  7. Delta Kernel Rust is used for the core logic of writing to Delta tables from the WAL (Source: §Delta Kernel Rust is then used for the core logic for writing to Delta).

  8. Sustained 12 GB/s / 12M rows/sec to a single table for 24 hours from 2,048 concurrent Locust workers, ingesting 1.04 trillion records total. Each worker runs as a separate Kubernetes pod with 1.5 cores / 2 GiB memory (Source: §The results).

  9. Fan-in pattern is the correct stress model. A single powerful host saturates its own bandwidth before pressuring Zerobus — the service scales with concurrent stream count, not single-client throughput. Locust on Kubernetes with 2,048 pods emulates real-world fan-in (Source: §Why Locust? The Fan-In Problem).

  10. No infrastructure to provision. Create a table and push data. No partitions, no brokers, no connector pipelines. Data lands in the lakehouse queryable in seconds (Source: §Introduction).

Operational numbers

Metric Value
Sustained throughput (bytes) 12 GB/s (11.8 GB/s proto2 wire)
Sustained throughput (rows) 12,000,000 rows/sec
Total rows ingested 1.04 trillion
Test duration ~25 hours (incl. 1h ramp)
Concurrent streams 2,048
Worker spec 1.5 cores / 2 GiB / 10 GiB ephemeral
Zeroparser throughput ~1 GB/s per CPU core
Message format Protocol Buffer 2 (proto2) binary
In-flight records per stream 50,000 (max)
Spawn rate 0.5 users/sec

Architecture diagram (textual)

2,048 Locust workers (K8s pods)
        │  gRPC bidirectional streams
   Zerobus Ingest (serverless, dynamic pod pool)
        │  hot-routing: streams → pods (heuristic-based)
        │  per-stream ordering guarantee
   Write-Ahead Log (latency-optimized)
        │  async ack loop (highest committed offset)
   Delta Kernel Rust → Delta tables (Unity Catalog governed)

Supported formats

Format Use case Performance
Protobuf Generic, fast record-by-record Fastest
Arrow Fast batch ingestion Fast
JSON Batch or row-by-row; convenient Slower than protobuf/Arrow

Caveats

  • Tier-3 source (Databricks Blog) — product/benchmark marketing mixed with real architecture. The 12 GB/s number is benchmark-optimal with 2,048 coordinated workers; real-world fan-in from heterogeneous producers may behave differently.
  • Internals still partially opaque. Pod-level durability semantics (sync vs async WAL fsync), replication topology, failure modes on pod crash mid-stream, and back-pressure behaviour on slow Delta writes are not disclosed.
  • No latency SLO disclosed. "Queryable in seconds" is the only end-to-end claim.
  • Kafka comparison is structural, not benchmarked. The post argues architectural superiority (no static partitions) but does not benchmark against a Kafka cluster under equivalent load.
  • Open-source scope is limited. Zeroparser is OSS (in the Zerobus SDK), but the service itself is proprietary managed infrastructure.

Source

Last updated · 542 distilled / 1,571 read