CONCEPT Cited by 1 source

Golden signals for RDS

Definition

Golden signals for RDS is Zalando's empirical decomposition of RDS Postgres health into twelve metrics across four buckets (CPU, Memory, Disk, Workload), each with a specific AWS Performance Insights metric path and an empirically calibrated threshold derived from past incidents. The method is a specialisation of the USE method for managed Postgres on cloud block storage: instead of enumerating every OS resource, it picks twelve signals that are sufficient for health conclusions at the altitude managed databases expose.

(Source: sources/2024-02-19-zalando-twelve-golden-signals.)

The twelve signals

CPU bucket

  • C1 CPU Utilisation: os.cpuUtilization.total. Threshold: >40-60% is an incident precursor, not a healthy saturation target. Database workloads are memory/IO-bound by design; high CPU is an anomaly signal. See concepts/cpu-utilisation-ceiling-database.
  • C2 CPU Await: os.cpuUtilization.await. Linux kernel IO-wait metric. Threshold: >5-10% indicates instance is IO-bandwidth-bound.

Memory bucket

  • M1/M2 Swap In/Out: os.swap.in, os.swap.out. Threshold: any non-zero activity is a low-memory symptom. Because disk is orders of magnitude slower than RAM, swap activity slows the OS and its applications simultaneously.

Disk bucket

  • D1 Storage Read IOPS: os.diskIO.rdsdev.readIOsPS. Interpretation requires the provisioned IOPS baseline (GP2 = 3 IOPS per GB of storage, min 100; IO1/IO2 = explicit at deployment). Very low → dataset served from memory; near-provisioned → IO-bound.
  • D2 Storage Write IOPS: os.diskIO.rdsdev.writeIOsPS. High value indicates write-mostly workload; watch against provisioned IOPS.
  • D3 Storage Latency: os.diskIO.rdsdev.await. Three-tier ladder: <5ms healthy, 5-10ms SLO impact, >10ms incident precursor. See concepts/storage-io-latency-sli.
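The disk-bucket arithmetic can be sketched in a few lines. This is a minimal illustration, not code from the source: the function names are hypothetical, the GP2 formula covers only the baseline stated above (it ignores burst credits and the volume-type maximum), and treating exactly 10ms as still inside the middle tier is an assumption about the ladder's boundary.

```python
def gp2_baseline_iops(storage_gb: float) -> float:
    """Baseline IOPS for a GP2 volume: 3 IOPS per GB, with a floor of 100.

    Assumption: burst credits and the volume-type maximum are ignored.
    """
    return max(100.0, 3.0 * storage_gb)


def iops_utilisation(observed_iops: float, provisioned_iops: float) -> float:
    """Fraction of provisioned IOPS in use (near 0 = served from memory, near 1 = IO-bound)."""
    return observed_iops / provisioned_iops


def classify_storage_latency(await_ms: float) -> str:
    """Map os.diskIO.rdsdev.await (milliseconds) onto the three-tier ladder."""
    if await_ms < 5.0:
        return "healthy"
    if await_ms <= 10.0:  # assumption: the 5-10ms band includes both edges
        return "slo-impact"
    return "incident-precursor"
```

For a 500 GB GP2 volume the baseline works out to 1,500 IOPS, so an observed 750 read IOPS is half of the provisioned budget.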

Workload / Postgres bucket

  • P1 Cache Hit Ratio: db.Cache.blks_hit / (db.Cache.blks_hit + db.IO.blk_read). Threshold: <80% = shared-buffers or physical RAM insufficient; working set has overflowed memory. See concepts/cache-hit-ratio-memory-pressure.
  • P2 Block Read Latency: db.IO.blk_read_time. Threshold: >10ms impacts application SLOs. Postgres-altitude analogue of os.diskIO.rdsdev.await.
  • P3 Database Deadlocks: db.Concurrency.deadlocks. Ideal: 0. Non-zero → schema/IO-logic review.
  • P4 Transaction Rate: db.Transactions.xact_commit. Low value indicates the instance is standby / no workload.
  • P5 SQL Efficiency: db.SQL.tup_fetched / db.SQL.tup_returned. Ratio of rows consumed by clients to rows read from storage. Low ratio signals query-design malpractice (missing indexes, overly broad scans). See concepts/sql-efficiency-ratio.
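The two derived ratios in this bucket (P1 and P5) follow directly from the formulas above. The sketch below is illustrative only: the function names are hypothetical, and returning None when the denominator is zero (an idle instance) is an assumption, not something the source specifies.

```python
from typing import Optional


def cache_hit_ratio(blks_hit: float, blks_read: float) -> Optional[float]:
    """P1: blks_hit / (blks_hit + blks_read); None if no blocks were touched."""
    total = blks_hit + blks_read
    if total == 0:
        return None  # assumption: an idle instance has no meaningful ratio
    return blks_hit / total


def sql_efficiency(tup_fetched: float, tup_returned: float) -> Optional[float]:
    """P5: tup_fetched / tup_returned; None if no rows were returned."""
    if tup_returned == 0:
        return None  # assumption: undefined rather than 0 or 1
    return tup_fetched / tup_returned
```

With 700 buffer hits and 300 disk reads, the cache hit ratio is 0.7, which falls below the 80% threshold and would flag memory pressure.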

Why this decomposition (vs USE, vs Google's 4 golden signals)

  • Vs USE: USE enumerates every resource × every dimension (utilisation, saturation, errors). For RDS, where AWS manages the underlying hardware, most of USE's extensibility is not useful: you cannot observe switch queue depth or measure disk-controller saturation separately from block-device latency. Zalando's 12 signals are the projection of USE onto the managed-Postgres observability surface, with P-series signals (cache hit, block read time, SQL efficiency) adding engine-internal visibility USE doesn't name.
  • Vs Google's four golden signals (latency, traffic, errors, saturation): Google's four are symptom-level metrics for request-response services. Zalando's twelve are cause-level metrics for a database engine. The two are complementary altitudes — symptom-based CBO alerts (see concepts/symptom-based-alerting) page the on-call; the golden signals for RDS are what the on-call looks at to diagnose.

Why thresholds are empirical, not absolute

Each threshold in the methodology is Zalando-specific, derived from production incidents on their own RDS Postgres fleet. The direction of each signal generalises (CPU hot = bad for DBs, cache hit < 80% = RAM pressure, await > 10ms = SLO violation), but the exact numbers should be calibrated per-fleet. The methodology's contribution is structural: which signals to care about, not what the number should be.
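One way to keep the structure fixed while letting each fleet calibrate its own numbers is to treat the thresholds as overridable defaults. The sketch below is an assumption-laden illustration rather than the source's implementation: the class, field, and sample-key names are hypothetical, the defaults take the lower edge of each quoted band, and signals without a numeric threshold in the list above (D1, D2, P4, P5) are left out.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class Thresholds:
    """Defaults mirror the numbers quoted above; override them per fleet."""
    cpu_total_pct: float = 40.0        # C1: lower edge of the 40-60% band
    cpu_await_pct: float = 5.0         # C2: lower edge of the 5-10% band
    storage_await_ms: float = 10.0     # D3: incident-precursor tier
    cache_hit_ratio_min: float = 0.80  # P1: breached when the ratio is below
    blk_read_time_ms: float = 10.0     # P2


def breached_signals(sample: Dict[str, float], t: Thresholds = Thresholds()) -> List[str]:
    """Return the IDs of signals whose sample value crosses its threshold."""
    checks = [
        ("C1", sample["cpu_total_pct"] > t.cpu_total_pct),
        ("C2", sample["cpu_await_pct"] > t.cpu_await_pct),
        ("M1/M2", sample["swap_in"] > 0 or sample["swap_out"] > 0),  # any non-zero
        ("D3", sample["storage_await_ms"] > t.storage_await_ms),
        ("P1", sample["cache_hit_ratio"] < t.cache_hit_ratio_min),
        ("P2", sample["blk_read_time_ms"] > t.blk_read_time_ms),
        ("P3", sample["deadlocks"] > 0),  # ideal is 0
    ]
    return [signal for signal, breached in checks if breached]


# Hypothetical sample: only CPU is elevated.
sample = {
    "cpu_total_pct": 55.0, "cpu_await_pct": 2.0, "swap_in": 0.0, "swap_out": 0.0,
    "storage_await_ms": 3.0, "cache_hit_ratio": 0.97, "blk_read_time_ms": 2.0,
    "deadlocks": 0.0,
}
print(breached_signals(sample))                                        # ['C1']
print(breached_signals(sample, replace(Thresholds(), cpu_total_pct=60.0)))  # []
```

The same sample breaches C1 under the default 40% ceiling but passes once a fleet calibrates the ceiling to 60%, which is the sense in which the signal set is fixed and the numbers are not.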

Packaged as a utility

Zalando ships the methodology as systems/rds-health — an open-source Go CLI that queries AWS Performance Insights for every RDS instance in an account/region and produces a fleet-wide report against these twelve thresholds. See patterns/fleet-wide-methodology-via-cli for the pattern this instantiates.
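To make the Performance Insights side concrete, the sketch below assembles request parameters for the AWS `GetResourceMetrics` API covering the metric paths listed in this page. It is not how rds-health is implemented (the source does not show that); the helper name, the `db-EXAMPLE` identifier, the one-hour window, and the choice of the `.avg` statistic suffix are all assumptions, and the metric paths are copied as written above without verification against AWS.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict

# Metric paths as named in the twelve signals (ratios need both operands).
SIGNAL_METRICS = [
    "os.cpuUtilization.total", "os.cpuUtilization.await",
    "os.swap.in", "os.swap.out",
    "os.diskIO.rdsdev.readIOsPS", "os.diskIO.rdsdev.writeIOsPS",
    "os.diskIO.rdsdev.await",
    "db.Cache.blks_hit", "db.IO.blk_read", "db.IO.blk_read_time",
    "db.Concurrency.deadlocks", "db.Transactions.xact_commit",
    "db.SQL.tup_fetched", "db.SQL.tup_returned",
]


def build_pi_request(resource_id: str, end: datetime,
                     window_minutes: int = 60, period_seconds: int = 60,
                     statistic: str = "avg") -> Dict[str, Any]:
    """Build keyword arguments for a Performance Insights GetResourceMetrics call."""
    return {
        "ServiceType": "RDS",
        "Identifier": resource_id,  # the instance's DbiResourceId
        "MetricQueries": [{"Metric": f"{m}.{statistic}"} for m in SIGNAL_METRICS],
        "StartTime": end - timedelta(minutes=window_minutes),
        "EndTime": end,
        "PeriodInSeconds": period_seconds,
    }


# Hypothetical identifier; with boto3 installed the request could be sent as
#   boto3.client("pi").get_resource_metrics(**request)
request = build_pi_request("db-EXAMPLE", datetime(2024, 2, 19, 12, 0, tzinfo=timezone.utc))
```

Building the request as plain data keeps the signal list in one place, so a fleet-wide run is a loop over instance identifiers with the same metric set.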
