
PATTERN Cited by 1 source

Per-database availability attainment

The pattern

Measure database reliability per individual database (not as a fleet aggregate) and report what percentage of databases in the fleet met a given availability bar during a defined window (typically monthly). Use two-bar reporting (a "good" bar like 99.95% and a "goal" bar like 99.99%) to give operators a calibrated tail-customer-impact view.

The pattern is the operational realisation of database availability attainment — see that concept for the rationale; this page covers how to instantiate the measurement.

Components

  1. Per-database availability calculator. For each database D in the fleet, compute availability(D, window) = uptime / total_time over the measurement window, where:
     • Uptime: minutes during the window in which D was reachable and serving queries successfully.
     • Window: typically a calendar month; can be rolling.
  2. Attainment aggregator. For each target bar B (e.g. 99.95%, 99.99%), compute attainment(B, window) = |{D : availability(D, window) ≥ B}| / |fleet|.
  3. Reporting surface. Publish the attainment numbers per bar per window. Lakebase's published shape (Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures):
Month     Met 99.95%   Met 99.99%
2026-01   99.96%       99.85%
2026-02   99.95%       99.84%
2026-03   99.96%       99.81%
2026-04   99.93%       99.75%
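
The three components above can be sketched in a few lines of Python. This is a minimal illustration, not Lakebase's implementation: the `DatabaseUptime` record, the field names, and the sample fleet are all hypothetical, and the uptime minutes are assumed to have been measured already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseUptime:
    """Hypothetical per-database record for one measurement window."""
    database_id: str
    uptime_minutes: int  # minutes reachable and serving queries successfully
    total_minutes: int   # minutes in the measurement window


def availability(record: DatabaseUptime) -> float:
    """availability(D, window) = uptime / total_time."""
    return record.uptime_minutes / record.total_minutes


def attainment(records: list[DatabaseUptime], bar: float) -> float:
    """Fraction of databases whose availability is at or above the bar."""
    if not records:
        raise ValueError("attainment is undefined for an empty fleet")
    met = sum(1 for r in records if availability(r) >= bar)
    return met / len(records)


# Illustrative data only: a 30-day window is 43,200 minutes.
WINDOW = 30 * 24 * 60
fleet = [
    DatabaseUptime("db-a", WINDOW, WINDOW),       # no downtime
    DatabaseUptime("db-b", WINDOW - 3, WINDOW),   # 3 min down: meets both bars
    DatabaseUptime("db-c", WINDOW - 10, WINDOW),  # 10 min down: meets 99.95% only
    DatabaseUptime("db-d", WINDOW - 60, WINDOW),  # 60 min down: misses both bars
]

# One attainment number per bar for this window.
report = {bar: attainment(fleet, bar) for bar in (0.9995, 0.9999)}
```

With this sample fleet, three of four databases meet the 99.95% bar and two of four meet the 99.99% bar. A production calculator would likely compare integer minute counts rather than floats to avoid rounding surprises exactly at the bar.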

Two-bar discipline

The two-bar (good + goal) shape is structurally important:

  • Good bar (e.g. 99.95%) answers "how many databases had a bad month?" A move in this number signals a tail-customer outage cluster, typically a single bad cell, a regional control-plane issue, or a sustained per-customer abuse pattern.
  • Goal bar (e.g. 99.99%) answers "how many databases hit the designed target?" A move in this number signals broader reliability drift, even when the absolute outage time per database is small.

A single bar at 99.99% would conflate these signals. No bar at all (just fleet-aggregate) hides both.
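
To make the gap between the two bars concrete, the sketch below converts each bar into a downtime budget. It assumes a 30-day window; the function name and default are illustrative.

```python
from fractions import Fraction


def downtime_budget_minutes(bar_percent: str, window_days: int = 30) -> float:
    """Minutes of downtime a database can accrue and still meet the bar."""
    window_minutes = window_days * 24 * 60
    # Exact rational arithmetic avoids float error in (1 - bar).
    allowed_fraction = 1 - Fraction(bar_percent) / 100
    return float(window_minutes * allowed_fraction)


good_budget = downtime_budget_minutes("99.95")  # 21.6 minutes
goal_budget = downtime_budget_minutes("99.99")  # 4.32 minutes
```

Under that assumption, a database with between roughly 4 and 22 minutes of downtime in the month misses the goal bar but still meets the good bar, which is why the two numbers move independently.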

Per-SLI generalisation

The attainment shape generalises beyond availability to any per-database SLI in the source's disclosed menu:

  • Database availability (this pattern's primary instance).
  • Database startup time — attainment-of-startup-time = % of databases meeting startup-time goal this month.
  • Database switchover/failover frequency + latency — attainment measured the same way.
  • Storage availability + latency on page reads + durable writes — see concepts/storage-io-latency-sli.
  • Control-plane API success rates + latencies.

Each gets its own attainment number per window per bar.
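
One way to express the generalisation is an attainment function that takes the goal as a predicate, so that "higher is better" SLIs (availability) and "lower is better" SLIs (startup time, latency) share the same aggregation. The measurements and goal values below are hypothetical.

```python
from typing import Callable, Mapping


def sli_attainment(
    values: Mapping[str, float],
    meets_goal: Callable[[float], bool],
) -> float:
    """Fraction of databases whose SLI value satisfies the goal predicate."""
    if not values:
        raise ValueError("attainment is undefined for an empty fleet")
    met = sum(1 for v in values.values() if meets_goal(v))
    return met / len(values)


# Hypothetical per-database startup times in seconds; lower is better.
startup_seconds = {"db-a": 0.4, "db-b": 0.9, "db-c": 2.5, "db-d": 0.7}
startup_attainment = sli_attainment(startup_seconds, lambda s: s <= 1.0)

# Hypothetical per-database availability; higher is better.
availability_by_db = {"db-a": 1.0, "db-b": 0.99993, "db-c": 0.9997, "db-d": 0.9986}
availability_attainment = sli_attainment(availability_by_db, lambda a: a >= 0.9995)
```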

Composability with cell-based architecture

Cell-bounded outages produce a tractable signature in attainment. If each cell holds roughly 1/N of the fleet and 1/N ≈ 3%, then "3% of databases missed the bar this month" points to about one bad cell. The 2026-05-08 us-east-1 thermal-event incident affected ~13% of databases in the region, an impact directly readable from the metric: "in May, ~13% of the us-east-1 fleet had a bad day". The attainment metric makes cell-bounded outages directly observable in the SLO substrate.
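
A sketch of how that signature might be checked: group the databases that missed the bar by cell and compare the per-cell miss share with the fleet-wide miss rate. The cell names, the database-to-cell mapping, and the equal cell sizes are assumptions for illustration.

```python
from collections import Counter


def misses_by_cell(db_to_cell: dict[str, str], missed: set[str]) -> dict[str, float]:
    """For each cell, the share of its databases that missed the bar."""
    totals = Counter(db_to_cell.values())
    misses = Counter(db_to_cell[db] for db in missed)
    # Counter returns 0 for cells with no misses.
    return {cell: misses[cell] / totals[cell] for cell in totals}


# Hypothetical fleet: 4 equal cells of 5 databases each, so 1/N = 25%.
cells = ["cell-a", "cell-b", "cell-c", "cell-d"]
db_to_cell = {f"{cell}-db{i}": cell for cell in cells for i in range(5)}

# Every database in cell-c missed the bar; no other database did.
missed = {db for db, cell in db_to_cell.items() if cell == "cell-c"}

fleet_miss_rate = len(missed) / len(db_to_cell)
per_cell = misses_by_cell(db_to_cell, missed)
```

Here the fleet-wide miss rate equals 1/N and the misses concentrate entirely in one cell, which is the "one bad cell" pattern. Misses spread evenly across cells would suggest a fleet-wide cause instead.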

Operational discipline

  • Publish externally. Lakebase / Neon publish the table on neonstatus.com (high-level customer-facing) and inside the engineering organisation (high-resolution per-database). External publication is a commitment device.
  • Monthly cadence with rolling complement. Monthly attainment is the "published" number; engineering teams typically also monitor rolling-7-day and rolling-30-day attainment for faster-feedback signal.
  • Window-vs-database-age policy. Newly-created databases need a "qualifying lifetime" threshold to avoid skewing the metric with partial-month data.
  • Alerting on attainment drop. A week-over-week (WoW) or month-over-month (MoM) drop in the attainment number triggers the "why did the tail get worse?" investigation, regardless of fleet-aggregate uptime.
  • Per-region reporting. Splitting attainment by region helps diagnose regional issues invisible in the fleet-wide number.
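
Three of the practices above (qualifying lifetime, per-region reporting, and alerting on a drop) can be combined in a short sketch. The 7-day qualifying lifetime, the alert threshold, the record shape, and the sample data are all hypothetical choices, not values from the source.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MonthlyRecord:
    database_id: str
    region: str
    age_days: int        # database age at the end of the window
    availability: float  # availability over the window, 0.0 to 1.0


def attainment_by_region(
    records: list[MonthlyRecord], bar: float, min_age_days: int = 7
) -> dict[str, float]:
    """Per-region attainment, skipping databases below the qualifying lifetime."""
    met: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for r in records:
        if r.age_days < min_age_days:
            continue  # too new to count in this window
        total[r.region] += 1
        if r.availability >= bar:
            met[r.region] += 1
    return {region: met[region] / total[region] for region in total}


def dropped(previous: float, current: float, threshold: float) -> bool:
    """True when attainment fell by more than the threshold between windows."""
    return previous - current > threshold


records = [
    MonthlyRecord("db-1", "us-east-1", 200, 0.9999),
    MonthlyRecord("db-2", "us-east-1", 90, 0.9990),
    MonthlyRecord("db-3", "us-east-1", 2, 0.9000),  # excluded: too new
    MonthlyRecord("db-4", "eu-west-1", 30, 1.0),
]
regional = attainment_by_region(records, bar=0.9995)

# Month-over-month check using the published 99.95% column:
# 2026-03 (99.96%) to 2026-04 (99.93%), with a hypothetical 0.02-point threshold.
alert = dropped(previous=0.9996, current=0.9993, threshold=0.0002)
```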

Caveats

  • Definition rigour. "Available" must be precisely defined (e.g. "≥99.5% query success rate per minute"). Without precision, attainment numbers are not comparable across windows.
  • Customer-database skew. A customer with 100 databases at 99.99% has different SLA implications than a customer with 1 database at 99.99%. Per-database attainment doesn't directly capture per-customer impact.
  • Survivorship bias on suspended databases. Auto-suspended databases that can't fail (because they're not running) artificially inflate the attainment number. Mature regimes measure only over time the database was active.
  • Attainment is lagging. Monthly attainment is a backward-looking metric; engineering cycles need faster signal (rolling windows, alarm-driven SLO burn-rate detection).
  • Window-edge artifacts. Outages that span window boundaries may register on either side; reporting policy must define the attribution.
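
The definition-rigour and suspended-database caveats can be illustrated together. The sketch below treats a minute as available when its query success rate is at least 99.5% (the example threshold above) and computes availability over active minutes only. The sample shape and the handling of idle minutes are assumptions that a real policy would need to settle.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MinuteSample:
    active: bool    # False while the database is auto-suspended
    queries: int
    successes: int


def minute_is_available(sample: MinuteSample, min_success_rate: float = 0.995) -> bool:
    if sample.queries == 0:
        # Assumption: an active minute with no traffic counts as available.
        return True
    return sample.successes / sample.queries >= min_success_rate


def active_time_availability(samples: list[MinuteSample]) -> Optional[float]:
    """Availability over active minutes only; None if the database never ran."""
    active = [s for s in samples if s.active]
    if not active:
        return None  # exclude from the attainment denominator
    up = sum(1 for s in active if minute_is_available(s))
    return up / len(active)


# Hypothetical window: 8 suspended minutes, then 1 good and 1 bad active minute.
samples = [MinuteSample(False, 0, 0)] * 8 + [
    MinuteSample(True, 100, 100),
    MinuteSample(True, 100, 90),
]
active_only = active_time_availability(samples)

# Naive variant that counts suspended minutes as "up".
naive = sum(1 for s in samples if not s.active or minute_is_available(s)) / len(samples)
```

In this example the naive calculation reports 90% while the active-time calculation reports 50%, which is the inflation the suspended-database caveat describes.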

Seen in
