CONCEPT Cited by 1 source

Critical-path dependency minimization¶

Definition¶

Critical-path dependency minimization is the reliability discipline of reducing the count and depth of external dependencies on the synchronous request path of a load-bearing operation. In a serial chain of N dependencies, each with availability A and failing independently, the operation's effective availability is at most A^N, a direct application of availability multiplication. Removing a dependency from the critical path is mathematically equivalent to making it 100% available, so the fewer external services on the hot path, the higher the achievable availability ceiling.
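
The arithmetic can be sketched in a few lines. This is a minimal illustration, assuming a serial chain with independent failures; the function name and the 99.9% figures are hypothetical and not taken from the source.

```python
def chain_availability(availabilities):
    """Composite availability of a serial chain, assuming independent failures."""
    composite = 1.0
    for a in availabilities:
        composite *= a
    return composite


# Five hypothetical dependencies, each 99.9% available.
before = chain_availability([0.999] * 5)

# Removing two links from the critical path gives the same result as
# keeping them but making them perfectly available (availability 1.0).
after_removed = chain_availability([0.999] * 3)
after_perfect = chain_availability([0.999] * 3 + [1.0, 1.0])

print(f"5 links: {before:.5f}")
print(f"3 links: {after_removed:.5f}")
print(f"3 links + 2 perfect links: {after_perfect:.5f}")
```

The last two lines print the same value, which is the sense in which removal and perfect availability are equivalent.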

The discipline is most valuable when:

The operation is on a synchronous user-visible path (its outage is observed as customer impact, not as queue depth).
The dependencies are themselves complex services with their own reliability budgets (cloud-provider control planes, Kubernetes system services, IAM, DNS).
The operation runs at high frequency (the dependency chain's composite outage rate is multiplied by the operation rate to get customer-impact rate).
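
The frequency point above can be made concrete with a rough calculation. The sketch below assumes identical, independent links and a steady request rate; the one-million-starts-per-day workload and the 99.99% per-link figure are hypothetical.

```python
def expected_failed_ops(ops_per_day, link_availability, num_links):
    """Expected failed operations per day for a serial chain of identical links."""
    composite = link_availability ** num_links
    return ops_per_day * (1.0 - composite)


# Hypothetical workload: one million database starts per day.
for links in (5, 2, 1):
    failed = expected_failed_ops(1_000_000, 0.9999, links)
    print(f"{links} link(s): ~{failed:.0f} failed starts/day")
```

The same per-link availability produces several times as many customer-visible failures on the long chain as on the short one, and the gap scales linearly with the operation rate.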

Canonical Lakebase framing¶

Verbatim from the systems/lakebase reliability roadmap (Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures):

"Serving agentic workloads means creating and resuming databases must be highly reliable. Reliability is strongly correlated with the dependency chain and the amount of machinery involved in the flow. In a traditional setup with Postgres in cloud provider VMs, this goes well beyond the data plane:

Cloud provider's compute control plane to provision VMs

Available VM capacity (where the cloud provider controls the policy of who gets it)

Cloud provider's block store control plane to provision local storage

Cloud provider's networking control plane to allocate IPs, configure firewalls and network routes to the new VM

If using Kubernetes (K8s) - an additional dependency on the K8s system services."

That's a 5-link chain on the cold-start critical path of a single Postgres database — and each link is itself a multi-component service with its own outage budget and capacity policies.

The architectural reply¶

The Lakebase architectural answer is to collapse the chain by pre-completing the dependencies rather than calling them on the hot path (verbatim, Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures):

"In Lakebase, we take a different approach that drastically reduces the amount of control plane machinery involved in critical database flows:

We allocate a pool of big (often bare metal) instances from the cloud provider. We carry buffers to sustain cloud provider provisioning outages.

We built our own vertically autoscaling virtualization layer that schedules multiple Postgres instances onto those cloud instances.

We don't rely on cloud block store devices, but instead store data in our own zone-resilient storage that is ultimately backed in object stores like S3 or Azure Blob storage."

The structural transformations:

Original critical-path dependency	Lakebase replacement
Cloud-provider compute control plane	Pre-allocated bare-metal pool with provisioning buffer
Cloud-provider VM capacity	Buffered headroom in already-allocated pool
Cloud-provider block-store control plane	In-house zone-resilient storage on object stores
Cloud-provider networking control plane	Fewer per-Postgres IPs to allocate (multiple Postgres on one host)
Kubernetes system services for per-Postgres scheduling	In-house vertical-autoscaling virtualization layer

Each replacement either moves the dependency off the hot path (pre-completion) or replaces a complex external service with a purpose-built simpler one tuned for this specific workload.
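
One way to reason about the table is to model each dependency with a flag saying whether it is still called synchronously. The sketch below is illustrative only: the `Dependency` type, the availability figures, and the choice of which in-house components sit on the hot path are assumptions made for the example, not measurements or details from the source.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Dependency:
    name: str
    availability: float  # hypothetical figure, not from the source
    on_hot_path: bool


def hot_path_availability(deps):
    """Product of availabilities over dependencies still called synchronously."""
    return prod(d.availability for d in deps if d.on_hot_path)


traditional = [
    Dependency("compute control plane", 0.9999, True),
    Dependency("VM capacity", 0.9999, True),
    Dependency("block-store control plane", 0.9999, True),
    Dependency("networking control plane", 0.9999, True),
    Dependency("Kubernetes system services", 0.9999, True),
]

# Pre-completion: the cloud dependencies still exist but are moved off the
# synchronous path. Two in-house components (assumed) take their place on it.
pooled = [Dependency(d.name, d.availability, False) for d in traditional] + [
    Dependency("in-house virtualization layer", 0.9999, True),
    Dependency("in-house zone-resilient storage", 0.9999, True),
]

print(f"traditional hot path: {hot_path_availability(traditional):.6f}")
print(f"pooled hot path:      {hot_path_availability(pooled):.6f}")
```

The model only holds while the buffer has headroom; once it is exhausted, the off-path dependencies are effectively back on the critical path.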

The buffer-of-bare-metal-instances primitive¶

The pool-with-buffer primitive is the statically stable realisation of critical-path dependency minimization: keep enough headroom that the cloud-provider control plane is not on the critical path of any individual Postgres start.

Three properties make it statically stable:

Buffer capacity covers the expected provisioning-outage duration. The buffer is sized so that, even during a cloud-provider compute-control-plane outage, Lakebase can keep starting Postgres instances from already-allocated headroom for the duration of the outage.
Buffer is replenished off the hot path. Replenishment uses the cloud-provider control plane but does not block any user request; at worst, replenishment stalls and the buffer depletes faster than usual until the cloud provider recovers.
Buffer is shared across customers. Per-customer pre-allocation would scale linearly with tenancy; shared buffer amortises across the fleet.
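
The first two properties can be sketched as a small pool abstraction. This is a simplified illustration, not Lakebase's implementation: `BufferedSlotPool`, `ProvisioningUnavailable`, and `fake_provision` are hypothetical names, and each "slot" stands in for capacity to start one Postgres instance.

```python
from collections import deque


class ProvisioningUnavailable(Exception):
    """Raised by the (hypothetical) cloud control-plane client during an outage."""


class BufferedSlotPool:
    def __init__(self, provision_slot, target_free):
        self._provision_slot = provision_slot  # slow call into the cloud control plane
        self._target_free = target_free
        self._free = deque()

    @property
    def free(self):
        return len(self._free)

    def acquire(self):
        """Hot path: touches only local state, never the control plane."""
        if not self._free:
            raise RuntimeError("buffer exhausted")
        return self._free.popleft()

    def replenish(self):
        """Background path: top up toward the target and tolerate outages."""
        added = 0
        while len(self._free) < self._target_free:
            try:
                self._free.append(self._provision_slot())
            except ProvisioningUnavailable:
                break  # give up for now; a later background tick retries
            added += 1
        return added


# Simulated control plane that can be switched into an outage.
state = {"outage": False}
counter = iter(range(1_000))


def fake_provision():
    if state["outage"]:
        raise ProvisioningUnavailable()
    return f"slot-{next(counter)}"


pool = BufferedSlotPool(fake_provision, target_free=3)
pool.replenish()          # fills the buffer while the provider is healthy
state["outage"] = True
first = pool.acquire()    # still succeeds: no control-plane call on this path
pool.replenish()          # adds nothing during the outage, but does not raise
```

Keeping `acquire` free of any remote call is the design choice that matters here: an outage shows up as a shrinking `free` count rather than as failed user requests.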

See patterns/preallocated-bare-metal-pool-with-virtualization for the operational pattern.

When this discipline is load-bearing¶

The discipline applies to any operation on a critical path with a multi-dependency chain. It is load-bearing specifically when:

The operation has agentic / on-demand / scale-to-zero shape — see concepts/control-plane-as-the-new-data-plane for the workload-shape forcing function.
The dependency chain's composite unavailability consumes most or all of the operation's error budget. (E.g. five 99.99% dependencies give roughly 99.95% composite availability, since 0.9999^5 ≈ 0.9995, which alone consumes essentially the entire error budget of a 99.95% target SLO.)
The dependencies have correlated failure modes — a cloud-provider control-plane regional outage takes out compute / block / network control planes simultaneously. See concepts/blast-radius for the correlation framing.
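
The error-budget example above can be checked numerically. The sketch below assumes identical, independent links and a 30-day window; the function name is hypothetical.

```python
def error_budget_consumed(link_availability, num_links, slo):
    """Fraction of the error budget (1 - slo) used up by the chain alone."""
    composite = link_availability ** num_links
    return (1.0 - composite) / (1.0 - slo)


MINUTES_PER_30_DAYS = 30 * 24 * 60

for links in (5, 3, 1):
    used = error_budget_consumed(0.9999, links, 0.9995)
    downtime = (1.0 - 0.9999 ** links) * MINUTES_PER_30_DAYS
    print(f"{links} link(s): {used:.0%} of budget, ~{downtime:.1f} min per 30 days")
```

With five links the chain uses almost the whole budget before the operation's own failures are counted; with one link it uses about a fifth.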

Generalises beyond databases¶

The same pattern shape recurs whenever a service runs on a cloud provider but cannot afford the full cloud-provider-control-plane dependency chain on the request path:

Container platforms that pre-allocate node pools with capacity buffer, then schedule containers onto them locally — same shape, different verticalisation.
Serverless function platforms (AWS Lambda) — pre-warm executor pools, fast in-process container start.
Bursty-workload databases generally — see Lakebase / Neon / PlanetScale for database-tier instances.

Caveats¶

Not free. The bare-metal pool, buffer, and virtualisation layer are capital-intensive; they pay off only when fleet scale and reliability target both demand them. A small fleet on a fat dependency chain may be better off accepting the chain.
Buffer sizing is a calibration problem. Too small → cloud-provider outages bleed through; too large → wasted capacity. Sizing depends on assumed outage-duration distribution and provisioning-request rate.
In-house virtualisation tax. The vertical-autoscaling virtualisation layer takes engineering effort to build and operate, whereas cloud-provider VMs would have provided that capability for free. Justified only by the reliability + density payoff.
Off-hot-path is not no-path. The dependency chain is still on the replenishment path; a sustained cloud-provider outage will eventually deplete the buffer.
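
The sizing and depletion caveats reduce to simple rate arithmetic under strong simplifying assumptions. The sketch below assumes a constant start rate, no replenishment at all during the outage, and no capacity returning to the pool; the function names and figures are hypothetical.

```python
import math


def min_buffer_slots(starts_per_minute, outage_minutes):
    """Smallest buffer that absorbs an outage with zero replenishment."""
    return math.ceil(starts_per_minute * outage_minutes)


def minutes_until_depleted(buffer_slots, starts_per_minute):
    """How long a buffer lasts once replenishment has stalled."""
    if starts_per_minute <= 0:
        return math.inf
    return buffer_slots / starts_per_minute


# Hypothetical: 20 starts per minute, planning for a 90-minute outage.
needed = min_buffer_slots(20, 90)
print(f"slots needed: {needed}")
print(f"a 600-slot buffer lasts {minutes_until_depleted(600, 20):.0f} minutes")
```

In practice the outage duration is a distribution rather than a single number, so the planning figure would be a chosen percentile of that distribution, traded off against the cost of idle capacity.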

Seen in¶

sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures — canonical wiki framing on serverless-Postgres-startup. The five-link cloud-provider-control-plane chain enumerated verbatim; the bare-metal-pool-plus-virtualisation-layer architectural reply enumerated verbatim.

Related¶

concepts/availability-multiplication-of-dependencies — the mathematical framing of why dependency-chain length matters
concepts/control-plane-as-the-new-data-plane — the workload-shape forcing function that makes start-path reliability load-bearing
concepts/control-plane-data-plane-separation — the architectural parent
concepts/static-stability — the buffer-of-pool primitive is a statically-stable realisation
concepts/scale-to-zero — workload property that puts start on the critical path in the first place
concepts/availability-dependency — neighbouring concept
systems/lakebase / systems/neon — canonical instances
systems/aws-ec2 — the cloud-provider compute primitive being buffered against
systems/kubernetes — alternative dependency that the pattern also bypasses for the hot path
patterns/preallocated-bare-metal-pool-with-virtualization — the operational pattern