PATTERN Cited by 1 source
State transfer on reshard¶
State transfer on reshard preserves per-key in-memory state when an auto-sharder reassigns a slice from one pod to another. Instead of the successor pod starting cold and rebuilding state from the source of truth, the predecessor pod transfers its state to the successor as part of the resharding operation. Applies to caches, work buffers, batchers — anything holding key-scoped memory that would otherwise be thrown away at restart / rebalance.
Why it matters¶
Planned rolling restarts are extremely frequent in production fleets — systems/dicer reports 99.9% of restarts at Databricks are planned rolling restarts. A rolling restart churns the entire keyspace over time. Without state transfer, every slice reassignment during the restart starts cold:
- Cache hit rate collapses. systems/softstore measured a ~30% drop without state transfer.
- Downstream load spikes as cold successor pods hit the backing store.
- User-visible latency regresses for the full duration of the restart window.
With state transfer: Softstore maintains a steady ~85% hit rate across the same restart.
Mechanics (per Dicer)¶
- The auto-sharder's Assigner computes a reassignment (slice S moves from pod A to pod B).
- Before (or as part of) cutover, pod A ships slice S's state to pod B.
- Pod B installs the state before beginning to serve S.
- Clerks' assignment update points traffic at pod B — which now answers from warm memory.
Mechanism details (on-wire format, flow control, failure handling of mid-transfer failures) are not covered in the public open-source post — a future Databricks post is promised.
Preconditions¶
- The application must expose a serializable per-slice state snapshot the auto-sharder can ship.
- The application's semantics must tolerate that state moves between pods at assignment-change time.
- The transfer path itself must not be on the request-serving critical path — or must be small enough to complete quickly.
Seen in¶
- sources/2026-01-13-databricks-open-sourcing-dicer-auto-sharder — Dicer state-transfer is described as Softstore's enabler for ride-through rolling restarts. Figure 5 of the post shows the before/after hit-rate comparison. The paper frames it as broadly-impactful since the restart workload (99.9% planned) is near-constant.
Related¶
- concepts/dynamic-sharding
- systems/dicer
- systems/softstore
- patterns/nondisruptive-migration — parallel: moving live tenant state without customer-visible disruption, at a different layer (EBS volumes) but same shape.
- patterns/shard-replication-for-hot-keys