
CONCEPT Cited by 1 source

Critical-path dependency minimization

Definition

Critical-path dependency minimization is the reliability discipline of reducing the count and depth of external dependencies on the synchronous request path of a load-bearing operation. In a serial chain of N dependencies, each at availability A and failing independently, the operation's effective availability is at most A^N, a direct application of availability multiplication. Removing a dependency from the critical path is mathematically equivalent to making it 100% available; the practical implication is that the fewer external services on the hot path, the higher the achievable availability ceiling.
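The A^N relationship can be checked with a few lines of Python. This is a minimal sketch that assumes a purely serial chain with independent failures; the function name and the 99.99% figures are illustrative, not taken from the source.

```python
from math import prod

def composite_availability(availabilities):
    """Availability of a serial chain, assuming each link fails independently."""
    return prod(availabilities)

# Five hypothetical dependencies, each at 99.99% availability.
chain = [0.9999] * 5
# The same chain with one dependency moved off the critical path.
shorter_chain = chain[:-1]

print(f"5 links: {composite_availability(chain):.6f}")
print(f"4 links: {composite_availability(shorter_chain):.6f}")
```

Dropping a link from the list has the same effect as setting its availability to 1.0, which is the equivalence stated above.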

The discipline is most valuable when:

  • The operation is on a synchronous user-visible path (its outage is observed as customer impact, not as queue depth).
  • The dependencies are themselves complex services with their own reliability budgets (cloud-provider control planes, Kubernetes system services, IAM, DNS).
  • The operation runs at high frequency (the dependency chain's composite outage rate is multiplied by the operation rate to get customer-impact rate).
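The third condition is simple arithmetic: the expected number of failed operations scales with the operation rate. A rough sketch, assuming failures are spread uniformly over time (real outages are bursty) and using hypothetical names and numbers:

```python
SECONDS_PER_DAY = 86_400

def expected_failed_ops_per_day(ops_per_second, composite_availability):
    """Expected failed operations per day for a given rate and chain availability."""
    return ops_per_second * SECONDS_PER_DAY * (1.0 - composite_availability)

# Hypothetical: 10 database starts per second behind a 99.95% composite chain.
failures = expected_failed_ops_per_day(10, 0.9995)
print(f"{failures:.0f} failed starts per day")
```

The same composite availability that looks acceptable for a rare operation becomes hundreds of visible failures per day once the operation runs many times a second.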

Canonical Lakebase framing

Verbatim from the systems/lakebase reliability roadmap (Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures):

"Serving agentic workloads means creating and resuming databases must be highly reliable. Reliability is strongly correlated with the dependency chain and the amount of machinery involved in the flow. In a traditional setup with Postgres in cloud provider VMs, this goes well beyond the data plane:

  • Cloud provider's compute control plane to provision VMs
  • Available VM capacity (where the cloud provider controls the policy of who gets it)
  • Cloud provider's block store control plane to provision local storage
  • Cloud provider's networking control plane to allocate IPs, configure firewalls and network routes to the new VM
  • If using Kubernetes (K8s) - an additional dependency on the K8s system services."

That's a 5-link chain on the cold-start critical path of a single Postgres database — and each link is itself a multi-component service with its own outage budget and capacity policies.

The architectural reply

The Lakebase architectural answer is to collapse the chain by pre-completing the dependencies rather than calling them on the hot path (verbatim, Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures):

"In Lakebase, we take a different approach that drastically reduces the amount of control plane machinery involved in critical database flows:

  • We allocate a pool of big (often bare metal) instances from the cloud provider. We carry buffers to sustain cloud provider provisioning outages.
  • We built our own vertically autoscaling virtualization layer that schedules multiple Postgres instances onto those cloud instances.
  • We don't rely on cloud block store devices, but instead store data in our own zone-resilient storage that is ultimately backed in object stores like S3 or Azure Blob storage."

The structural transformations:

| Original critical-path dependency | Lakebase replacement |
| --- | --- |
| Cloud-provider compute control plane | Pre-allocated bare-metal pool with provisioning buffer |
| Cloud-provider VM capacity | Buffered headroom in already-allocated pool |
| Cloud-provider block-store control plane | In-house zone-resilient storage on object stores |
| Cloud-provider networking control plane | Fewer per-Postgres IPs to allocate (multiple Postgres on one host) |
| Kubernetes system services for per-Postgres scheduling | In-house vertical-autoscaling virtualization layer |

Each replacement either moves the dependency off the hot path (pre-completion) or replaces a complex external service with a purpose-built simpler one tuned for this specific workload.

The buffer-of-bare-metal-instances primitive

The pool-with-buffer primitive is the statically stable realisation of critical-path dependency minimisation: keep enough headroom that the cloud-provider control plane is not on the critical path of any individual Postgres start.

Three properties make it statically stable:

  1. Buffer covers expected demand over the expected provisioning-outage duration. It is sized so that even during a cloud-provider compute-control-plane outage, Lakebase can keep starting Postgres instances from already-allocated headroom for the duration of the outage.
  2. Buffer is replenished off the hot path. Replenishment uses the cloud-provider control plane but does not block any user request; at worst, replenishment stalls and the buffer drains without being refilled until the cloud provider recovers.
  3. Buffer is shared across customers. Per-customer pre-allocation would scale linearly with tenancy; shared buffer amortises across the fleet.
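The three properties above can be illustrated with a toy model. This is a sketch under stated assumptions, not Lakebase's implementation: `BufferedPool`, `start_instance`, `replenish`, and the `provision` callback are all hypothetical names, and a "slot" stands in for whatever unit of headroom the real system tracks.

```python
class BufferedPool:
    """Toy model of a shared pool of pre-allocated slots (hypothetical names)."""

    def __init__(self, target_free_slots, provision):
        self.target_free_slots = target_free_slots
        self.free_slots = target_free_slots
        # provision() stands in for a cloud-provider call; it returns True
        # if the provider granted one more slot, False if it is unavailable.
        self._provision = provision

    def start_instance(self):
        """Hot path: consumes local headroom only and never calls the provider."""
        if self.free_slots == 0:
            return False  # buffer exhausted: the outage now reaches the user
        self.free_slots -= 1
        return True

    def replenish(self):
        """Background path: the only place the provider control plane is called."""
        while self.free_slots < self.target_free_slots:
            if not self._provision():
                break  # provider unavailable; try again on the next background tick
            self.free_slots += 1


# Simulate a provider outage: starts keep succeeding until the buffer is empty.
provider = {"up": False}
pool = BufferedPool(target_free_slots=3, provision=lambda: provider["up"])
print([pool.start_instance() for _ in range(4)])
```

Because `start_instance` never touches `provision`, a provider outage changes nothing on the hot path until the headroom runs out; only `replenish` notices the outage.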

See patterns/preallocated-bare-metal-pool-with-virtualization for the operational pattern.

When this discipline is load-bearing

The discipline applies to any operation on a critical path with a multi-dependency chain. It is load-bearing specifically when:

  • The operation has agentic / on-demand / scale-to-zero shape — see concepts/control-plane-as-the-new-data-plane for the workload-shape forcing function.
  • The dependency chain's composite unavailability consumes or exceeds the operation's error budget. (E.g. five independent 99.99% dependencies ≈ 99.95% composite, which alone consumes essentially the entire error budget of a 99.95% target SLO.)
  • The dependencies have correlated failure modes — a cloud-provider control-plane regional outage takes out compute / block / network control planes simultaneously. See concepts/blast-radius for the correlation framing.
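The error-budget condition above can be made concrete. The sketch below assumes independent failures, so it does not model the correlated case; the function names are hypothetical and the figures are the ones from the example.

```python
from math import prod

MINUTES_PER_30_DAYS = 30 * 24 * 60

def budget_fraction_consumed(dependency_availabilities, target_slo):
    """Share of the error budget used up by the dependency chain alone."""
    chain_unavailability = 1.0 - prod(dependency_availabilities)
    return chain_unavailability / (1.0 - target_slo)

def allowed_downtime_minutes(target_slo):
    """Downtime allowed by the SLO over a 30-day window."""
    return (1.0 - target_slo) * MINUTES_PER_30_DAYS

fraction = budget_fraction_consumed([0.9999] * 5, 0.9995)
print(f"budget consumed by dependencies: {fraction:.1%}")
print(f"allowed downtime: {allowed_downtime_minutes(0.9995):.1f} min / 30 days")
```

When the fraction is close to or above 1, the operation's own code has no budget left to fail in, which is the point at which removing links from the chain stops being optional.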

Generalises beyond databases

The same pattern shape recurs whenever a service runs on a cloud provider but cannot afford the full cloud-provider-control-plane dependency chain on the request path:

  • Container platforms that pre-allocate node pools with capacity buffer, then schedule containers onto them locally — same shape, different verticalisation.
  • Serverless function platforms (AWS Lambda) — pre-warm executor pools, fast in-process container start.
  • Bursty-workload databases generally — see Lakebase / Neon / PlanetScale for database-tier instances.

Caveats

  • Not free. Bare-metal pool + buffer + virtualisation layer are capital-intensive; they pay off when fleet scale + reliability target both demand it. A small fleet on a fat dependency chain may be better off accepting the chain.
  • Buffer sizing is a calibration problem. Too small → cloud-provider outages bleed through; too large → wasted capacity. Sizing depends on assumed outage-duration distribution and provisioning-request rate.
  • In-house virtualisation tax. The vertical-autoscaling virtualisation layer takes engineering effort to build and operate, whereas cloud-provider VMs would have provided the equivalent capability for free. Justified only by the reliability + density payoff.
  • Off-hot-path is not no-path. The dependency chain is still on the replenishment path; a sustained cloud-provider outage will eventually deplete the buffer.
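The sizing and depletion caveats reduce to two back-of-envelope formulas. This is a simplified sketch that assumes a constant start rate and no replenishment during the outage; the function names and numbers are hypothetical, and a real calibration would work from the outage-duration distribution rather than a single value.

```python
import math

def buffer_slots_needed(starts_per_minute, outage_minutes):
    """Headroom needed to ride out an outage of the given length."""
    return math.ceil(starts_per_minute * outage_minutes)

def minutes_until_depleted(free_slots, starts_per_minute):
    """How long the existing headroom lasts once replenishment stalls."""
    if starts_per_minute <= 0:
        return math.inf
    return free_slots / starts_per_minute

# Hypothetical: 2.5 instance starts per minute, planning for a 90-minute outage.
slots = buffer_slots_needed(2.5, 90)
print(slots, "slots needed")
print(minutes_until_depleted(slots, 2.5), "minutes of cover")
```

Planning for a longer outage or a higher start rate raises the slot count linearly, which is where the trade-off against wasted capacity comes from.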

Seen in
