PATTERN Cited by 1 source
Per-database availability attainment¶
The pattern¶
Measure database reliability per individual database (not as a fleet aggregate) and report what percentage of databases in the fleet met a given availability bar during a defined window (typically monthly). Use two-bar reporting (a "good" bar like 99.95% and a "goal" bar like 99.99%) to give operators a calibrated tail-customer-impact view.
The pattern is the operational realisation of database availability attainment — see that concept for the rationale; this page covers how to instantiate the measurement.
Components¶
- Per-database availability calculator. For each database `D` in the fleet, compute `availability(D, window) = uptime / total_time` over the measurement window. Defined precisely:
  - Uptime: minutes during the window in which D was reachable and serving queries successfully.
  - Window: typically a calendar month; can be rolling.
- Attainment aggregator. For each target bar `B` (e.g. 99.95%, 99.99%), compute `attainment(B, window) = |{D : availability(D, window) ≥ B}| / |fleet|`.
- Reporting surface. Publish the attainment numbers per bar per window. Lakebase's published shape (Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures):
| Month | Met 99.95% | Met 99.99% |
|---|---|---|
| 2026-01 | 99.96% | 99.85% |
| 2026-02 | 99.95% | 99.84% |
| 2026-03 | 99.96% | 99.81% |
| 2026-04 | 99.93% | 99.75% |
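
The calculator and aggregator components can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the source's implementation: the function names, the `uptime_minutes` mapping, and the sample database ids are hypothetical, and uptime is assumed to be already measured in whole minutes per database. Exact fractions are used so that comparisons against the bar do not depend on float rounding.

```python
from fractions import Fraction

def availability(uptime_minutes: int, total_minutes: int) -> Fraction:
    """availability(D, window) = uptime / total_time, as an exact ratio."""
    if total_minutes <= 0:
        raise ValueError("window must have positive length")
    return Fraction(uptime_minutes, total_minutes)

def attainment(availabilities: dict[str, Fraction], bar: Fraction) -> Fraction:
    """Share of databases whose availability meets or exceeds the bar."""
    if not availabilities:
        raise ValueError("fleet is empty")
    met = sum(1 for a in availabilities.values() if a >= bar)
    return Fraction(met, len(availabilities))

# Hypothetical 30-day window (43,200 minutes) and hypothetical uptime data.
WINDOW = 30 * 24 * 60
uptime_minutes = {"db-a": 43_200, "db-b": 43_197, "db-c": 43_185, "db-d": 43_100}
fleet = {db: availability(up, WINDOW) for db, up in uptime_minutes.items()}

GOOD = Fraction(9995, 10_000)  # 99.95% bar
GOAL = Fraction(9999, 10_000)  # 99.99% bar
good_attainment = attainment(fleet, GOOD)
goal_attainment = attainment(fleet, GOAL)
print(f"met 99.95%: {float(good_attainment):.2%}")
print(f"met 99.99%: {float(goal_attainment):.2%}")
```

In a 30-day window the 99.95% bar allows 21.6 minutes of downtime and the 99.99% bar allows 4.32 minutes, so in this toy fleet three of four databases meet the good bar and two of four meet the goal bar.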
Two-bar discipline¶
The two-bar (good + goal) shape is structurally important:
- Good bar (e.g. 99.95%) answers "how many databases had a bad month?" A drop in this number signals a tail-customer outage cluster, typically a single bad cell, a regional control-plane issue, or a sustained per-customer abuse pattern.
- Goal bar (e.g. 99.99%) answers "how many databases hit the designed target?" A drop in this number signals broader reliability drift even when the absolute outage time per database is small.
A single bar at 99.99% would conflate these signals. No bar at all (just fleet-aggregate) hides both.
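
A small worked example may help show why the fleet aggregate hides both signals. The fleet below is entirely hypothetical (1,000 databases, a 30-day window, invented downtime figures), and `share_meeting` is an illustrative helper rather than anything from the source.

```python
WINDOW = 30 * 24 * 60  # minutes in a hypothetical 30-day window

# Hypothetical downtime minutes per database:
# 940 with no downtime, 50 with 10 minutes, 10 with 4 hours.
downtime = [0] * 940 + [10] * 50 + [240] * 10

def share_meeting(bar_pct: float) -> float:
    """Fraction of databases whose downtime fits within the bar's budget."""
    allowed = WINDOW * (100 - bar_pct) / 100  # downtime budget in minutes
    return sum(1 for d in downtime if d <= allowed) / len(downtime)

# Fleet aggregate: total uptime over total database-minutes.
fleet_aggregate = 1 - sum(downtime) / (WINDOW * len(downtime))

print(f"fleet aggregate: {fleet_aggregate:.4%}")
print(f"met 99.95%:      {share_meeting(99.95):.1%}")
print(f"met 99.99%:      {share_meeting(99.99):.1%}")
```

Here the fleet aggregate stays above 99.99%, yet 1% of databases missed the good bar and 6% missed the goal bar. The good bar isolates the ten databases with a long outage; the goal bar additionally picks up the fifty with a short one.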
Per-SLI generalisation¶
The attainment shape generalises beyond availability to any per-database SLI in the source's disclosed menu:
- Database availability (this pattern's primary instance).
- Database startup time — attainment-of-startup-time = % of databases meeting startup-time goal this month.
- Database switchover/failover frequency + latency — attainment measured the same way.
- Storage availability + latency on page reads + durable writes — see concepts/storage-io-latency-sli.
- Control-plane API success rates + latencies.
Each gets its own attainment number per window per bar.
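
One way to express the generalisation is to parameterise the aggregator by a threshold and a comparison direction, since some SLIs are "higher is better" (availability, success rate) and others "lower is better" (startup time, latency). The sketch below is an assumption-laden illustration: the `Bar` type, the sample values, and the 1-second startup goal are all hypothetical and not taken from the source.

```python
import operator
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Bar:
    name: str
    threshold: float
    meets: Callable[[float, float], bool]  # (observed, threshold) -> bool

def attainment(values: dict[str, float], bar: Bar) -> float:
    """Share of databases whose observed SLI value meets the bar."""
    if not values:
        raise ValueError("no databases to measure")
    met = sum(bar.meets(v, bar.threshold) for v in values.values())
    return met / len(values)

# Hypothetical per-database observations for one window.
availability = {"db-a": 1.0, "db-b": 0.9999, "db-c": 0.9996, "db-d": 0.9990}
startup_p99_s = {"db-a": 0.4, "db-b": 0.9, "db-c": 2.5, "db-d": 0.7}

availability_good = Bar("availability >= 99.95%", 0.9995, operator.ge)
startup_goal = Bar("startup p99 <= 1s", 1.0, operator.le)  # hypothetical goal

availability_attainment = attainment(availability, availability_good)
startup_attainment = attainment(startup_p99_s, startup_goal)
print(availability_good.name, f"{availability_attainment:.0%}")
print(startup_goal.name, f"{startup_attainment:.0%}")
```

Keeping the comparison inside the bar definition means the same aggregator serves every SLI in the menu; only the per-database measurement differs.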
Composability with cell-based architecture¶
Cell-bounded outages produce a tractable signature in attainment. If each cell hosts roughly 1/N of the fleet, a month in which about 1/N of databases missed the bar (say 3%) points to roughly one bad cell. The 2026-05-08 us-east-1 thermal-event incident affected ~13% of databases in the region, an impact directly readable from the metric: "in May, ~13% of the us-east-1 fleet had a bad day". The attainment metric thus makes cell-bounded outages directly observable in the SLO substrate.
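
Reading the cell signature amounts to grouping the databases that missed the bar by their cell and checking whether the misses concentrate. The following sketch assumes a database-to-cell mapping is available; the names `misses_by_cell` and `concentrated_cells`, the `min_share` threshold, and the sample fleet are hypothetical.

```python
from collections import Counter

def misses_by_cell(availability: dict[str, float],
                   cell_of: dict[str, str],
                   bar: float) -> Counter:
    """Count databases below the bar, keyed by the cell that hosts them."""
    return Counter(cell_of[db] for db, a in availability.items() if a < bar)

def concentrated_cells(misses: Counter, cell_sizes: Counter,
                       min_share: float = 0.5) -> list[str]:
    """Cells in which at least min_share of the hosted databases missed the bar."""
    return sorted(c for c, n in misses.items() if n / cell_sizes[c] >= min_share)

# Hypothetical fleet: 12 databases spread evenly over 3 cells.
cell_of = {f"db-{i}": f"cell-{i % 3}" for i in range(12)}
availability = {db: 0.9999 for db in cell_of}
for db, cell in cell_of.items():
    if cell == "cell-2":
        availability[db] = 0.9980  # cell-wide outage
availability["db-1"] = 0.9990      # isolated miss in cell-1

misses = misses_by_cell(availability, cell_of, bar=0.9995)
cell_sizes = Counter(cell_of.values())
suspects = concentrated_cells(misses, cell_sizes)
print(dict(misses), suspects)
```

An isolated miss leaves its cell well below the concentration threshold, while a cell-wide outage shows up as a cell in which nearly every database missed the bar.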
Operational discipline¶
- Publish externally. Lakebase / Neon publish the table on neonstatus.com (high-level customer-facing) and inside the engineering organisation (high-resolution per-database). External publication is a commitment device.
- Monthly cadence with rolling complement. Monthly attainment is the "published" number; engineering teams typically also monitor rolling-7-day and rolling-30-day attainment for faster-feedback signal.
- Window-vs-database-age policy. Newly-created databases need a "qualifying lifetime" threshold to avoid skewing the metric with partial-month data.
- Alerting on attainment drop. A week-over-week (WoW) or month-over-month (MoM) drop in the attainment number triggers the "why did the tail get worse?" investigation, regardless of fleet-aggregate uptime.
- Per-region reporting. Splitting attainment by region helps diagnose regional issues invisible in the fleet-wide number.
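
Three of the practices above (the qualifying-lifetime policy, per-region splitting, and alerting on a drop) can be combined in one small reporting step. The sketch below is illustrative only: the `DbMonth` record, the 7-day `MIN_AGE_DAYS` threshold, the `tolerance` value, and the sample rows are assumptions, not values disclosed by the source.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class DbMonth:
    db: str
    region: str
    age_days: int        # database age at the end of the window
    availability: float  # fraction in [0, 1]

MIN_AGE_DAYS = 7  # hypothetical qualifying lifetime

def attainment_by_region(rows: list[DbMonth], bar: float) -> dict[str, float]:
    """Per-region attainment over databases old enough to qualify."""
    met: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for r in rows:
        if r.age_days < MIN_AGE_DAYS:
            continue  # too young: partial-window data would skew the metric
        total[r.region] += 1
        met[r.region] += r.availability >= bar
    return {region: met[region] / total[region] for region in total}

def dropped(previous: dict[str, float], current: dict[str, float],
            tolerance: float = 0.001) -> list[str]:
    """Regions whose attainment fell by more than tolerance since the prior window."""
    return sorted(r for r in current
                  if r in previous and previous[r] - current[r] > tolerance)

rows = [
    DbMonth("db-a", "us-east-1", 90, 0.9999),
    DbMonth("db-b", "us-east-1", 90, 0.9980),
    DbMonth("db-c", "us-east-1", 2, 0.9000),   # excluded: below qualifying age
    DbMonth("db-d", "eu-west-1", 30, 0.9999),
]
previous = {"us-east-1": 1.0, "eu-west-1": 1.0}
current = attainment_by_region(rows, bar=0.9995)
alerts = dropped(previous, current)
print(current, alerts)
```

The same `dropped` comparison works for week-over-week or rolling windows; only the pair of attainment snapshots passed in changes.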
Caveats¶
- Definition rigour. "Available" must be precisely defined (e.g. "≥99.5% query success rate per minute"). Without precision, attainment numbers are not comparable across windows.
- Customer-database skew. A customer with 100 databases at 99.99% has different SLA implications than a customer with 1 database at 99.99%. Per-database attainment doesn't directly capture per-customer impact.
- Survivorship bias on suspended databases. Auto-suspended databases that can't fail (because they're not running) artificially inflate the attainment number. Mature regimes measure only over time the database was active.
- Attainment is lagging. Monthly attainment is a backward-looking metric; engineering cycles need faster signal (rolling windows, alarm-driven SLO burn-rate detection).
- Window-edge artifacts. Outages that span window boundaries may register on either side; reporting policy must define the attribution.
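
Two of these caveats, suspended databases and window-edge outages, can be handled by the same interval arithmetic. The sketch below shows one possible policy, not the source's: it measures availability only over time the database was active, and it splits an outage that crosses a window boundary pro rata between the two windows. The function names, the half-open interval convention, and the sample timestamps are hypothetical, and outages are assumed to fall inside active intervals.

```python
from datetime import datetime
from typing import Optional

Interval = tuple[datetime, datetime]  # half-open [start, end)

def clipped_minutes(interval: Interval, window: Interval) -> float:
    """Minutes of the interval that fall inside the window."""
    start = max(interval[0], window[0])
    end = min(interval[1], window[1])
    return max((end - start).total_seconds(), 0.0) / 60

def active_availability(active: list[Interval], outages: list[Interval],
                        window: Interval) -> Optional[float]:
    """Availability over active time only; None if never active in the window."""
    active_min = sum(clipped_minutes(i, window) for i in active)
    if active_min == 0:
        return None  # exclude from the attainment denominator
    down_min = sum(clipped_minutes(o, window) for o in outages)
    return 1 - down_min / active_min

april = (datetime(2026, 4, 1), datetime(2026, 5, 1))
may = (datetime(2026, 5, 1), datetime(2026, 6, 1))

# Hypothetical database: active for two days around the month boundary,
# with a 30-minute outage that starts 10 minutes before midnight.
active = [(datetime(2026, 4, 30), datetime(2026, 5, 2))]
outage = (datetime(2026, 4, 30, 23, 50), datetime(2026, 5, 1, 0, 20))

april_availability = active_availability(active, [outage], april)
may_availability = active_availability(active, [outage], may)
print(clipped_minutes(outage, april), clipped_minutes(outage, may))
print(april_availability, may_availability)
```

Under this policy the outage contributes 10 minutes to April and 20 minutes to May. Other attribution rules (for example, charging the whole outage to the window in which it started) are equally valid as long as the reporting policy states which one applies.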
Seen in¶
- sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures — canonical wiki framing as the operational shape of Lakebase / Neon's reliability measurement substrate. The 2026 H1 attainment table is the disclosed instance. The two-bar (99.95% / 99.99%) reporting shape. The attainment-vs-fleet-aggregate framing ("individual customer doesn't care if the fleet had great availability").
Related¶
- concepts/database-availability-attainment — the concept this pattern operationalises
- concepts/database-startup-time-sli — sibling SLI; attainment shape applies
- concepts/operation-based-slo — sibling per-customer-metric shape at the user-journey altitude
- concepts/storage-io-latency-sli — sibling per-component SLI
- concepts/control-plane-as-the-new-data-plane — the workload-shape forcing function for the multi-SLI menu
- systems/lakebase / systems/neon — canonical instances