PATTERN Cited by 1 source

Per-database availability attainment¶

The pattern¶

Measure database reliability per individual database (not as a fleet aggregate) and report what percentage of databases in the fleet met a given availability bar during a defined window (typically monthly). Use two-bar reporting (a "good" bar like 99.95% and a "goal" bar like 99.99%) to give operators a calibrated tail-customer-impact view.

The pattern is the operational realisation of database availability attainment — see that concept for the rationale; this page covers how to instantiate the measurement.

Components¶

Per-database availability calculator. For each database D in the fleet, compute availability(D, window) = uptime / total_time over the measurement window. Defined precisely:
Uptime: minutes during the window in which D was reachable and serving queries successfully.
Window: typically a calendar month; can be rolling.
Attainment aggregator. For each target bar B (e.g. 99.95%, 99.99%), compute attainment(B, window) = |{D : availability(D, window) ≥ B}| / |fleet|.
Reporting surface. Publish the attainment numbers per bar per window. Lakebase's published shape (Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures):

Month	Met 99.95%	Met 99.99%
2026-01	99.96%	99.85%
2026-02	99.95%	99.84%
2026-03	99.96%	99.81%
2026-04	99.93%	99.75%
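
The calculator and aggregator above can be sketched in a few lines of Python. This is a minimal illustration, not the source's implementation: the `DatabaseWindow` record, its field names, and the four-database sample fleet are all hypothetical, and uptime is assumed to be already measured in minutes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseWindow:
    """Hypothetical per-database record for one measurement window."""
    database_id: str
    uptime_minutes: float
    total_minutes: float


def availability(record: DatabaseWindow) -> float:
    """availability(D, window) = uptime / total_time."""
    if record.total_minutes <= 0:
        raise ValueError("window must have positive length")
    return record.uptime_minutes / record.total_minutes


def attainment(records: list[DatabaseWindow], bar: float) -> float:
    """Fraction of databases whose availability is at or above the bar."""
    if not records:
        raise ValueError("fleet is empty")
    met = sum(1 for r in records if availability(r) >= bar)
    return met / len(records)


MINUTES_IN_30_DAYS = 30 * 24 * 60  # 43,200 minutes

# Illustrative fleet (made-up data): downtime of 0, 4, 10 and 60 minutes.
fleet = [
    DatabaseWindow("db-a", MINUTES_IN_30_DAYS, MINUTES_IN_30_DAYS),
    DatabaseWindow("db-b", MINUTES_IN_30_DAYS - 4, MINUTES_IN_30_DAYS),
    DatabaseWindow("db-c", MINUTES_IN_30_DAYS - 10, MINUTES_IN_30_DAYS),
    DatabaseWindow("db-d", MINUTES_IN_30_DAYS - 60, MINUTES_IN_30_DAYS),
]

report = {bar: attainment(fleet, bar) for bar in (0.9995, 0.9999)}
```

Over a 30-day window, 99.95% allows about 21.6 minutes of downtime and 99.99% about 4.3 minutes, so in this sample three of four databases meet the good bar and two of four meet the goal bar.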

Two-bar discipline¶

The two-bar (good + goal) shape is structurally important:

Good bar (e.g. 99.95%) answers "how many databases had a bad month?" A move in this number signals a tail-customer outage cluster, typically a single bad cell, a regional control-plane issue, or a sustained per-customer abuse pattern.
Goal bar (e.g. 99.99%) answers "how many databases hit the designed target?" A move in this number signals broader reliability drift, even when the absolute outage time per database is small.

A single bar at 99.99% would conflate these signals. No bar at all (just fleet-aggregate) hides both.
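
To see why the two bars separate these signals, consider two hypothetical 100-database fleets, sketched below. All numbers are invented for illustration: one fleet has a small cluster with a severe outage, the other has many databases that slip just under the goal bar.

```python
def attainment(availabilities, bar):
    """Fraction of databases at or above the bar."""
    return sum(a >= bar for a in availabilities) / len(availabilities)


def two_bar_report(availabilities, good=0.9995, goal=0.9999):
    """Attainment at the good bar and at the goal bar."""
    return {
        "good": attainment(availabilities, good),
        "goal": attainment(availabilities, goal),
    }


# Tail cluster (hypothetical): 5 databases at 99.7%, the rest at 100%.
tail_cluster = [0.997] * 5 + [1.0] * 95
# Broad drift (hypothetical): 30 databases at 99.97%, the rest at 100%.
broad_drift = [0.9997] * 30 + [1.0] * 70

tail = two_bar_report(tail_cluster)
drift = two_bar_report(broad_drift)

# Fleet-aggregate availability for comparison: both are above 99.98%.
tail_mean = sum(tail_cluster) / len(tail_cluster)
drift_mean = sum(broad_drift) / len(broad_drift)
```

In the tail-cluster fleet both bars read 95%, so the misses are severe enough to fail the good bar. In the broad-drift fleet the good bar stays at 100% while the goal bar falls to 70%. The fleet-aggregate means are both above 99.98% and show neither problem.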

Per-SLI generalisation¶

The attainment shape generalises beyond availability to any per-database SLI in the source's disclosed menu:

Database availability (this pattern's primary instance).
Database startup time — attainment-of-startup-time = % of databases meeting startup-time goal this month.
Database switchover/failover frequency + latency — attainment measured the same way.
Storage availability + latency on page reads + durable writes — see concepts/storage-io-latency-sli.
Control-plane API success rates + latencies.

Each gets its own attainment number per window per bar.
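
One way to express this generalisation is a single attainment function parameterised by the SLI's threshold and by the direction in which "better" runs. The sketch below is an assumption-laden illustration: the `SliBar` type, the SLI names, the 1-second startup threshold, and the sample values are all hypothetical and do not come from the source.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class SliBar:
    """Hypothetical bar definition for one per-database SLI."""
    name: str
    threshold: float
    # operator.ge for higher-is-better SLIs, operator.le for lower-is-better.
    meets: Callable[[float, float], bool]


def sli_attainment(values: Mapping[str, float], bar: SliBar) -> float:
    """Fraction of databases whose SLI value meets the bar."""
    if not values:
        raise ValueError("no databases measured")
    met = sum(1 for v in values.values() if bar.meets(v, bar.threshold))
    return met / len(values)


availability_goal = SliBar("availability", 0.9999, operator.ge)
startup_goal = SliBar("startup_seconds", 1.0, operator.le)  # assumed threshold

# Made-up per-database measurements for one window.
availability_by_db = {"db-a": 1.0, "db-b": 0.99995, "db-c": 0.9991, "db-d": 0.9999}
startup_by_db = {"db-a": 0.4, "db-b": 0.9, "db-c": 2.5, "db-d": 1.2}

availability_attainment = sli_attainment(availability_by_db, availability_goal)
startup_attainment = sli_attainment(startup_by_db, startup_goal)
```

Keeping the comparison direction in the bar definition lets availability (higher is better) and latency-style SLIs (lower is better) share one aggregator.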

Composability with cell-based architecture¶

Cell-bounded outages produce a tractable signature in attainment: when each cell holds about 1/N of the fleet, a miss rate close to 1/N (e.g. "3% of databases missed the bar this month") points to roughly one bad cell. The 2026-05-08 us-east-1 thermal-event incident affected ~13% of databases in the region. That is an attainment-impact signature directly readable from the metric: "in May, ~13% of the us-east-1 fleet had a bad day". The attainment metric makes cell-bounded outages directly observable in the SLO substrate.
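
The cell arithmetic can be made concrete with a short sketch. It assumes equal-sized cells and a known database-to-cell mapping. The function names, the four-cell fleet, and the availability values are hypothetical.

```python
from collections import Counter


def misses_by_cell(db_cell, availabilities, bar):
    """Return {cell: fraction of that cell's databases below the bar}."""
    totals = Counter(db_cell.values())
    missed = Counter(db_cell[d] for d, a in availabilities.items() if a < bar)
    return {cell: missed[cell] / totals[cell] for cell in totals}


def estimated_bad_cells(missed_fraction, cell_count):
    """With equal-sized cells, each fully affected cell contributes 1/N."""
    return missed_fraction * cell_count


BAR = 0.9995

# Hypothetical fleet: 4 cells of 25 databases; every database in cell-2 had an outage.
db_cell = {f"db-{c}-{i}": f"cell-{c}" for c in range(4) for i in range(25)}
availabilities = {
    d: (0.998 if cell == "cell-2" else 1.0) for d, cell in db_cell.items()
}

per_cell = misses_by_cell(db_cell, availabilities, BAR)
missed_fraction = sum(a < BAR for a in availabilities.values()) / len(availabilities)
bad_cells = estimated_bad_cells(missed_fraction, cell_count=4)
```

Here 25% of the fleet misses the bar, which with four equal cells corresponds to one fully affected cell, and the per-cell breakdown identifies which one.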

Operational discipline¶

Publish externally. Lakebase / Neon publish the table on neonstatus.com (high-level customer-facing) and inside the engineering organisation (high-resolution per-database). External publication is a commitment device.
Monthly cadence with rolling complement. Monthly attainment is the "published" number; engineering teams typically also monitor rolling-7-day and rolling-30-day attainment for faster-feedback signal.
Window-vs-database-age policy. Newly-created databases need a "qualifying lifetime" threshold to avoid skewing the metric with partial-month data.
Alerting on attainment drop. A week-over-week (WoW) or month-over-month (MoM) drop in the attainment number triggers the "why did the tail get worse?" investigation, regardless of fleet-aggregate uptime.
Per-region reporting. Splitting attainment by region helps diagnose regional issues invisible in the fleet-wide number.
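
As an illustration of alerting on an attainment drop, the sketch below runs a month-over-month check over the published table from the Components section. The 0.02 percentage-point threshold and the function name are assumptions chosen for the example. The source does not state an alerting threshold.

```python
# Published attainment, in percent of databases: (met 99.95%, met 99.99%).
published = {
    "2026-01": (99.96, 99.85),
    "2026-02": (99.95, 99.84),
    "2026-03": (99.96, 99.81),
    "2026-04": (99.93, 99.75),
}


def month_over_month_drops(series, threshold_pp):
    """Return (month, delta) pairs where attainment fell by more than
    threshold_pp percentage points relative to the previous month."""
    months = sorted(series)
    drops = []
    for prev, cur in zip(months, months[1:]):
        delta = series[cur] - series[prev]
        if delta < -threshold_pp:
            drops.append((cur, round(delta, 2)))
    return drops


good_series = {month: good for month, (good, _) in published.items()}
goal_series = {month: goal for month, (_, goal) in published.items()}

THRESHOLD_PP = 0.02  # hypothetical alerting threshold

good_drops = month_over_month_drops(good_series, THRESHOLD_PP)
goal_drops = month_over_month_drops(goal_series, THRESHOLD_PP)
```

With this threshold the good bar flags only April, while the goal bar flags both March and April. The same function could be applied to rolling-7-day or per-region series.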

Caveats¶

Definition rigour. "Available" must be precisely defined (e.g. "≥99.5% query success rate per minute"). Without precision, attainment numbers are not comparable across windows.
Customer-database skew. A customer with 100 databases at 99.99% has different SLA implications than a customer with 1 database at 99.99%. Per-database attainment doesn't directly capture per-customer impact.
Survivorship bias on suspended databases. Auto-suspended databases that can't fail (because they're not running) artificially inflate the attainment number. Mature regimes measure availability only over the time the database was active.
Attainment is lagging. Monthly attainment is a backward-looking metric; engineering cycles need faster signal (rolling windows, alarm-driven SLO burn-rate detection).
Window-edge artifacts. Outages that span window boundaries may register on either side; reporting policy must define the attribution.
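
The suspended-database caveat can be shown numerically. In the sketch below, a hypothetical database is active for only 600 minutes of a 30-day window and is down for 3 of them. The function names and figures are illustrative assumptions.

```python
from typing import Optional

WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window


def naive_availability(downtime_minutes: float,
                       window_minutes: float = WINDOW_MINUTES) -> float:
    """Counts suspended time as if it were uptime."""
    return (window_minutes - downtime_minutes) / window_minutes


def active_availability(active_minutes: float,
                        downtime_minutes: float) -> Optional[float]:
    """Availability over active time only; None if never active,
    so the database can be excluded from the attainment denominator."""
    if active_minutes <= 0:
        return None
    return (active_minutes - downtime_minutes) / active_minutes


naive = naive_availability(downtime_minutes=3)
active = active_availability(active_minutes=600, downtime_minutes=3)
never_ran = active_availability(active_minutes=0, downtime_minutes=0)
```

Measured over the whole window, the database appears to clear the 99.99% goal bar. Measured over active time, it is at 99.5% and misses both bars.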

Seen in¶

sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures — canonical wiki framing as the operational shape of Lakebase / Neon's reliability measurement substrate. The 2026 H1 attainment table is the disclosed instance. The two-bar (99.95% / 99.99%) reporting shape. The attainment-vs-fleet-aggregate framing ("individual customer doesn't care if the fleet had great availability").

concepts/database-availability-attainment — the concept this pattern operationalises
concepts/database-startup-time-sli — sibling SLI; attainment shape applies
concepts/operation-based-slo — sibling per-customer-metric shape at the user-journey altitude
concepts/storage-io-latency-sli — sibling per-component SLI
concepts/control-plane-as-the-new-data-plane — the workload-shape forcing function for the multi-SLI menu
systems/lakebase / systems/neon — canonical instances