
DATABRICKS 2026-01-13 Tier 3


Open Sourcing Dicer: Databricks' Auto-Sharder

Summary

Databricks open-sourced Dicer, the auto-sharder that underlies "every major Databricks product". Dicer is an intelligent control plane that continuously, asynchronously reshards a service's keyspace in response to pod health, load, termination notices, and other environmental signals. It sits in the space between two architectures that have long dominated Databricks services: stateless (cache/DB on the hot path, constant network and serialization cost, overread waste) and statically sharded (consistent-hashing style — simple, in-memory fast, but fragile under restarts / failures / hot keys). Dicer makes dynamic sharding a first-class primitive so teams stop picking the worse of the two. It is explicitly positioned in the lineage of Google's Slicer, Microsoft's Centrifuge, and Meta's Shard Manager. Use cases include in-memory serving, LLM KV-cache and LoRA-adapter placement on GPUs, cluster/query control planes, distributed caches (Softstore), background-work partitioning, batch aggregation on write paths, soft leader selection, and rendezvous coordination. Two headline results are published: Unity Catalog dropped DB load ~90–95% via a Dicer-backed in-memory cache, and Softstore's state-transfer feature preserves a ~85% cache hit rate through rolling restarts, which account for 99.9% of all restarts and would otherwise churn the full keyspace.

Key takeaways

  1. Stateless and statically sharded architectures are both bad defaults at scale; dynamic sharding is the missing third option. Stateless services pay the DB-on-every-request tax; even with a remote cache, you're still paying network RTT, (de)serialization CPU, and the "overread" waste of fetching whole objects to use a fraction. Static sharding (e.g. consistent hashing) solves memory locality but introduces downtime during restarts/autoscale, split-brain during failures, and has no answer to hot keys. Dicer treats dynamic resharding as the primitive that both lack. (Source: sources/2026-01-13-databricks-open-sourcing-dicer-auto-sharder)
  2. Sharding unit is a range, not a key. Applications hash keys to SliceKeys; a contiguous range of SliceKeys is a Slice; an Assignment is a set of Slices covering the whole keyspace, each Slice assigned to one or more Resources (pods). The Assigner splits, merges, replicates, and reassigns Slices — all minimal adjustments, not full reshuffles — as signals change. This is the same range-based shape used by systems/slicer and systems/shard-manager. Key-level is too fine; shard-level is too coarse; slice-range is the sweet spot. (Source: sources/2026-01-13-databricks-open-sourcing-dicer-auto-sharder)
  3. Hot keys are solved by per-key slice isolation + replication, not by smarter hashing. When load on a single key spikes, Dicer splits it into its own one-key slice and assigns that slice to multiple pods to absorb the load. Static hashing structurally cannot do this — a hot key is always pinned to one pod. The example in the post: user ID 42 → SliceKey K10 gets broken out and assigned to P1 and P2. (Source: sources/2026-01-13-databricks-open-sourcing-dicer-auto-sharder)
  4. Clients read and servers publish the Assignment via library SDKs on a background path — never on the request hot path. The Clerk (client-side) and Slicelet (server-side) libraries maintain local caches of the current Assignment, watch the Assigner for updates, and notify the app via listener API. The Slicelet also aggregates per-key load locally and reports summaries to the Assigner asynchronously. All signal collection and assignment distribution is off the critical path; a request still goes Clerk.lookup(key) → pod with no RPC. (Source: sources/2026-01-13-databricks-open-sourcing-dicer-auto-sharder)
  5. Dicer chose eventual-consistency of assignments; Slicer and Centrifuge chose strong key ownership. This is a deliberate availability-over-strong-ownership trade-off. Clerks and Slicelets may briefly disagree on ownership during transitions; the post says "in our experience this has been the right model for the vast majority of applications, though we do plan to support stronger guarantees in the future." Slicer and Centrifuge made the opposite call, using leases for strong ownership. (Source: sources/2026-01-13-databricks-open-sourcing-dicer-auto-sharder)
  6. Unity Catalog case study: Dicer replaced a remote-cache plan with a sharded in-memory cache and achieved a 90–95% hit rate. Originally a stateless service; scaling hit a DB-read-volume wall. Remote caching was rejected because the cache had to be incrementally updated and snapshot-consistent with storage, and customer catalogs can be gigabytes — partial/replicated snapshots in a remote cache were prohibitively expensive. With Dicer, local method calls replaced remote network calls and the cache hit rate reached 90–95%. (Source: sources/2026-01-13-databricks-open-sourcing-dicer-auto-sharder)
  7. SQL Query Orchestration case study: Dicer killed the manual-reshard toil and the restart-induced availability dips. The query orchestration engine on Spark was previously in-memory + static-sharded, requiring manual resharding to scale and losing availability during rolling restarts. Dicer delivered zero-downtime restarts/scaling plus dynamic rebalancing that resolved chronic CPU throttling. (Source: sources/2026-01-13-databricks-open-sourcing-dicer-auto-sharder)
  8. Softstore case study: state transfer lets a distributed cache survive rolling restarts with no hit-rate collapse. Softstore is Databricks' remote distributed KV cache, built on Dicer. 99.9% of restarts are planned (rolling), and without state transfer a planned restart churns the whole keyspace and drops cache hit rate ~30%. Dicer's state-transfer migrates data between pods across resharding so Softstore holds a steady ~85% hit rate through the restart. This is a pattern worth naming: a cache that reshards across restarts without re-warming. (Source: sources/2026-01-13-databricks-open-sourcing-dicer-auto-sharder)
  9. LLM/GPU serving is an auto-sharder use case, not just a stateless-service one. Two concrete examples are named: (a) stateful chat sessions accumulating per-session KV cache — affinity keeps the session hitting the same pod so the cache stays warm; (b) serving many LoRA adapters — the adapters themselves are sharded across constrained GPU resources. Both are an affinity-by-key problem, which is Dicer's native shape. (Source: sources/2026-01-13-databricks-open-sourcing-dicer-auto-sharder)
  10. "Soft" leader election is a byproduct of affinity-based sharding. If all requests for a given key land on one pod, that pod is the de facto coordinator for that key — no Paxos/Raft needed. The post is explicit that Dicer provides affinity-based leader selection without the overhead of consensus protocols; stronger mutual-exclusion guarantees are called out as future work. Useful when the cost of occasional double-leader is acceptable (e.g., the operation is idempotent or commutative). (Source: sources/2026-01-13-databricks-open-sourcing-dicer-auto-sharder)
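The hot-key mechanics in takeaway 3 can be sketched in a few lines. This is an illustrative model only; `Slice`, `isolate_hot_key`, and the pod names are assumptions, not Dicer's actual API (which the post does not publish):

```python
# Hypothetical sketch of hot-key isolation: split the hot SliceKey into its
# own one-key slice, then replicate that slice across pods to absorb load.
from dataclasses import dataclass

@dataclass
class Slice:
    lo: int     # inclusive start of the SliceKey range
    hi: int     # exclusive end of the range
    pods: list  # pods serving this slice

def isolate_hot_key(assignment, hot_key, replica_pods):
    """Split the slice containing hot_key; assign the one-key slice to replicas."""
    result = []
    for s in assignment:
        if not (s.lo <= hot_key < s.hi):
            result.append(s)  # untouched slices keep their pods
            continue
        if s.lo < hot_key:    # left remainder stays on the original pods
            result.append(Slice(s.lo, hot_key, s.pods))
        result.append(Slice(hot_key, hot_key + 1, replica_pods))
        if hot_key + 1 < s.hi:  # right remainder
            result.append(Slice(hot_key + 1, s.hi, s.pods))
    return result

# user ID 42 hashes to SliceKey K10; break it out onto P1 and P2
assignment = [Slice(0, 100, ["P0"])]
assignment = isolate_hot_key(assignment, 10, ["P1", "P2"])
for s in assignment:
    print(s.lo, s.hi, s.pods)
# three slices now cover the keyspace: [0,10)->P0, [10,11)->P1+P2, [11,100)->P0
```

Note the minimal-adjustment property: only the slice containing the hot key changes; every other slice-to-pod mapping is preserved, which is the post's contrast with full reshuffles.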

Architecture

Concepts

  • Key — application-level entity id (user ID, tenant ID, chat-room ID, …).
  • SliceKey — a hash of the Key. Gives Dicer a uniform keyspace to operate on.
  • Slice — a contiguous range of SliceKeys.
  • Resource — a pod (or equivalent compute unit).
  • Assignment — a partition of the full SliceKey space into Slices, each mapped to one or more Resources. Slices can be split, merged, replicated, or moved.
  • Target — a sharded application served by Dicer. A multi-tenant Assigner can serve many Targets within a region.
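One way to picture how these concepts fit together (illustrative shapes only; the post does not publish Dicer's actual types, and the hash and keyspace size here are assumptions):

```python
# Illustrative data model for Key -> SliceKey -> Slice -> Resources.
import hashlib
from dataclasses import dataclass

def slice_key(key: str, space: int = 2**32) -> int:
    """Key -> SliceKey: hash the application key into a uniform keyspace."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % space

@dataclass(frozen=True)
class Slice:
    lo: int  # inclusive SliceKey range start
    hi: int  # exclusive range end

# An Assignment maps each Slice to one or more Resources (pods) and must
# cover the whole SliceKey space; a slice may be replicated.
assignment = {
    Slice(0, 2**31): ["pod-a"],
    Slice(2**31, 2**32): ["pod-b", "pod-c"],  # replicated slice
}

def lookup(key: str) -> list:
    """Find the pods currently serving the slice that contains this key."""
    sk = slice_key(key)
    for s, pods in assignment.items():
        if s.lo <= sk < s.hi:
            return pods
    raise KeyError(key)

print(lookup("tenant-42"))  # pods owning the slice for this key
```

Because the Assigner adjusts slice boundaries rather than the hash function, splitting, merging, and replicating a range changes only the `assignment` table, never which SliceKey a given Key maps to.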

Components

  • Assigner (Dicer service) — the controller. Multi-tenant, region-scoped. Consumes health + per-key load signals; runs the sharding algorithm; publishes Assignments. Computes minimal adjustments (individual slice splits/merges/moves) rather than global reshuffles.
  • Slicelet — server-side library embedded in app pods. Two responsibilities: (1) cache the current Assignment, watch for updates from the Assigner, notify the app via listener API; (2) record per-key load locally as the app handles requests, asynchronously report summaries to the Assigner. Both run off the request critical path.
  • Clerk — client-side library. Maintains a cached Assignment so Clerk.lookup(key) → pod is a local call. Watches for updates.
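A rough sense of how the Clerk keeps the hot path local while updates arrive in the background. Only `Clerk.lookup(key)` is named in the post; the constructor, the update callback, and the locking scheme below are assumptions for illustration:

```python
# Sketch: Clerk.lookup is a pure local computation over a cached Assignment;
# assignment updates land on a background path, never on a request.
import threading

class Clerk:
    def __init__(self, initial_assignment):
        self._assignment = initial_assignment  # local cached Assignment
        self._lock = threading.Lock()

    def lookup(self, slice_key: int) -> str:
        """Hot path: no RPC, just a scan of the cached slice table."""
        with self._lock:
            for (lo, hi), pod in self._assignment.items():
                if lo <= slice_key < hi:
                    return pod
        raise KeyError(slice_key)

    def on_assignment_update(self, new_assignment):
        """Background path: invoked when the watch on the Assigner delivers
        a new Assignment; swaps the local cache atomically."""
        with self._lock:
            self._assignment = new_assignment

clerk = Clerk({(0, 50): "pod-a", (50, 100): "pod-b"})
print(clerk.lookup(7))   # -> pod-a, with no network call
clerk.on_assignment_update({(0, 100): "pod-c"})
print(clerk.lookup(7))   # -> pod-c after the background update
```

The Slicelet mirrors this shape on the server side, with the addition of local per-key load accounting that is summarized and reported to the Assigner asynchronously.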

Signal → decision → propagation

app health     ─┐
pod load       ─┤
term notices   ─┼──▶ Assigner ──(push)──▶ Slicelets (servers) ──▶ app listener
per-key load   ─┤                         Clerks (clients)
hot-key detect ─┘

Consistency model: Slicelets and Clerks converge eventually on the latest assignment; during transitions clients/servers may briefly disagree. Trade-off explicitly chosen for availability + fast recovery over strong ownership (contrast Slicer / Centrifuge, which used leases for stronger guarantees).

Prior art (cited)

  • Centrifuge (Microsoft, NSDI 2010) — integrated lease management + partitioning for cloud services.
  • Slicer (Google, OSDI 2016) — auto-sharding for datacenter applications. Closest ancestor.
  • Shard Manager (Meta, SOSP 2021) — generic shard-management framework for geo-distributed apps.

Use cases

(Named in the article as first-class Dicer scenarios — each is a sharding-affinity problem in disguise.)

  • In-memory and GPU serving — KV stores; LLM per-session KV cache; LoRA-adapter placement on constrained GPUs.
  • Control and scheduling systems — cluster managers, query orchestration engines. They need local state + multi-tenant routing.
  • Remote caches — Softstore. Sharded KV cache that rides rolling restarts via state transfer.
  • Work partitioning and background work — garbage collection / cleanup jobs over a large keyspace. Avoids redundant work + lock contention.
  • Batching and aggregation on write paths — route related records to the same pod so the pod can batch writes in memory before committing, reducing IOPS and downstream load.
  • Soft leader selection — key → primary pod as a lightweight alternative to consensus-based leader election.
  • Rendezvous / coordination — real-time chat rooms, multi-client session coordination — same key routes to same pod → in-memory state instead of a shared DB / back-plane.
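The soft-leader-selection use case can be made concrete with a small sketch. Everything here (`leader_for`, `maybe_coordinate`) is hypothetical; the point is the invariant the post describes, that during an assignment transition two pods may briefly both believe they lead, so the coordinated action must be idempotent or commutative:

```python
# Sketch of "soft" leader selection via key affinity: whichever pod the
# current Assignment maps a key to acts as that key's coordinator.
def leader_for(key_hash: int, assignment: dict) -> str:
    """The pod owning the key's slice is the de facto leader for that key."""
    for (lo, hi), pod in assignment.items():
        if lo <= key_hash < hi:
            return pod
    raise KeyError(key_hash)

def maybe_coordinate(self_pod: str, key_hash: int, assignment: dict, action):
    """Run `action` only if this pod currently owns the key's slice.
    During a transition, two pods may both pass this check, so `action`
    must be safe to run twice (idempotent or commutative)."""
    if leader_for(key_hash, assignment) == self_pod:
        action()

assignment = {(0, 100): "pod-a"}
ran = []
maybe_coordinate("pod-a", 7, assignment, lambda: ran.append("flush"))
maybe_coordinate("pod-b", 7, assignment, lambda: ran.append("flush"))
print(ran)  # only pod-a ran the action -> ['flush']
```

No Paxos/Raft round trips are involved; the trade is that mutual exclusion is probabilistic across assignment transitions, which is exactly the "stronger guarantees as future work" caveat in the post.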

Operational numbers

  • Unity Catalog: cache hit rate 90–95% after Dicer integration; catalogs can be gigabytes per customer.
  • Softstore: ~85% cache hit rate preserved across rolling restarts with state transfer; without it, hit rate drops ~30%.
  • Rolling restarts account for 99.9% of planned restarts in the production fleet — making state transfer's benefit a near-constant win.
  • SQL query orchestration: "availability dips eliminated"; chronic CPU throttling resolved via dynamic load rebalancing. Specific numbers not given.
  • Blog reports "hundreds of services" operating sharded at Databricks; no fleet-wide TPS/QPS number given in this post.

Systems / concepts / patterns extracted

Systems

  • systems/dicer — the auto-sharder itself (open-sourced 2026-01-13).
  • systems/centrifuge — prior art; Microsoft NSDI 2010 lease-based partitioning.
  • systems/slicer — prior art; Google OSDI 2016 auto-sharding.
  • systems/shard-manager — prior art; Meta SOSP 2021 generic shard management.
  • systems/unity-catalog — Databricks governance service; Dicer-backed in-memory sharded cache (Unity Catalog case study).
  • systems/softstore — Databricks distributed KV cache built on Dicer; highlights state-transfer.

Concepts

  • concepts/static-sharding — consistent-hashing-style pinning of keys to nodes; the anti-model Dicer replaces.
  • concepts/dynamic-sharding — continuously-adjusted Assignment driven by health + load signals.
  • concepts/hot-key — single-key load spike; static sharding can't escape it; Dicer isolates-and-replicates.
  • concepts/split-brain — uncoordinated clients develop inconsistent views of shard ownership; failure mode specifically cited as a static-sharding pain point.
  • concepts/eventual-consistency — Assignment consistency model chosen by Dicer (vs leases in Slicer/Centrifuge).
  • concepts/soft-leader-election — key-affinity-as-coordinator, without consensus.
  • concepts/stateless-compute — the foil: Dicer's motivating contrast ("Hidden Costs of Stateless Architectures") — context expands that concept.
  • concepts/control-plane-data-plane-separation — Dicer Assigner (decide) vs Slicelet/Clerk (deliver).
  • concepts/tail-latency-at-scale — motivation for stateful-in-memory serving vs remote caching (network + serialization tax + overread).

Patterns

  • patterns/state-transfer-on-reshard — migrate per-key state across pods when the Assignment changes, so caches / buffered work survive resharding without cold-start.
  • patterns/shard-replication-for-hot-keys — isolate hot key into its own slice, replicate that slice across N pods, load-split across replicas.
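The state-transfer-on-reshard pattern reduces to a simple idea, sketched below. The actual transfer protocol is not described in the post; this is only an illustration of why the new owner starts warm (all names are assumptions):

```python
# Sketch of patterns/state-transfer-on-reshard: when a slice moves between
# pods, hand the slice's cached entries to the new owner instead of letting
# it start cold and re-warm from the database.
def transfer_slice(old_pod_cache: dict, new_pod_cache: dict, lo: int, hi: int):
    """Move every cached entry whose SliceKey falls in [lo, hi)."""
    moving = {k: v for k, v in old_pod_cache.items() if lo <= k < hi}
    new_pod_cache.update(moving)  # new owner starts warm
    for k in moving:
        del old_pod_cache[k]      # old owner releases the slice's state

pod_a = {5: "v5", 42: "v42", 80: "v80"}
pod_b = {}
transfer_slice(pod_a, pod_b, 0, 50)  # slice [0, 50) reassigned a -> b
print(pod_a, pod_b)
# pod_b now serves keys 5 and 42 without a cache miss; pod_a keeps 80
```

This is the mechanism behind the Softstore numbers: without a transfer, a rolling restart empties the moved ranges and hit rate drops ~30%; with it, the fleet holds a steady ~85%.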

Caveats

  • Tier 3 source. The post is an open-source announcement, not a full design paper. It is explicitly promissory ("Stay tuned for a technical deep dive") — internal sharding algorithm, Assigner availability model, and state-transfer protocol details are not in this post.
  • Eventual-consistency trade-off cuts both ways. Applications with correctness requirements on "only one pod owns this key" need to layer their own mutex, accept Dicer's planned future strong-ownership mode, or use a different system (Slicer, a consensus-based leader). Soft leader election is named as a use case, but the post is careful: "Dicer currently provides affinity-based leader selection."
  • State transfer is described as a Softstore feature; the post mentions it as a capability to be released more broadly, but it's not clear at publication time which applications can use it yet.
  • Rolling-restart hit-rate retention is the only quantified performance result of its kind in the post. Assigner latency, assignment propagation time, rebalance-convergence time, split/merge cost, and hot-key-detection delay are all absent — future posts promised.
  • No direct comparison to Slicer numbers. The post positions Dicer in the same family but doesn't claim strict improvement over Slicer / Shard Manager on any specific axis. Positioning is "we built it, we open-sourced it, here's how it's wired."
  • Client library coverage. Only initial client libraries ship at release, with Java and Rust bindings named as upcoming; services in languages without bindings can't adopt the auto-sharder yet.
  • No distributed-transactions or multi-key semantics. Dicer is an affinity/routing primitive — cross-slice transactions are out of scope. Applications that need them must add their own coordination layer.
