CONCEPT Cited by 1 source
Golden signals for RDS
Definition
Golden signals for RDS is Zalando's empirical decomposition of RDS Postgres health into twelve metrics across four buckets — CPU, Memory, Disk, Workload — each with a specific AWS Performance Insights metric path and an empirically-calibrated threshold derived from past incidents. The method is a specialisation of the USE method for managed Postgres on cloud block storage: instead of enumerating every OS resource, it picks twelve signals that are sufficient for health conclusions at the altitude managed databases expose.
(Source: sources/2024-02-19-zalando-twelve-golden-signals.)
The twelve signals
CPU bucket
- C1 CPU Utilisation: `os.cpuUtilization.total`. Threshold: >40-60% is an incident precursor, not a healthy saturation target. Database workloads are memory/IO-bound by design; high CPU is an anomaly signal. See concepts/cpu-utilisation-ceiling-database.
- C2 CPU Await: `os.cpuUtilization.await`. Linux kernel IO-wait metric. Threshold: >5-10% indicates the instance is IO-bandwidth-bound.
Memory bucket
- M1/M2 Swap In/Out: `os.swap.in`, `os.swap.out`. Threshold: any non-zero activity is a low-memory symptom. Because disk is orders of magnitude slower than RAM, swap activity slows the OS and its applications simultaneously.
Disk bucket
- D1 Storage Read IOPS: `os.diskIO.rdsdev.readIOsPS`. Interpretation requires the provisioned IOPS baseline (GP2 = 3 IOPS per GB of storage, min 100; IO1/IO2 = explicit at deployment). Very low → dataset served from memory; near-provisioned → IO-bound.
- D2 Storage Write IOPS: `os.diskIO.rdsdev.writeIOsPS`. A high value indicates a write-mostly workload; watch against provisioned IOPS.
- D3 Storage Latency: `os.diskIO.rdsdev.await`. Three-tier ladder: **<5ms healthy, 5-10ms SLO impact, >10ms incident precursor**. See concepts/storage-io-latency-sli.
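The disk-bucket arithmetic can be sketched in a few lines. This is a minimal illustration, not code from the source: the function names are hypothetical, the GP2 formula uses only the "3 IOPS per GB, min 100" rule quoted above (any upper ceiling AWS applies is deliberately not modelled), and the boundary handling at exactly 5ms and 10ms is an assumption.

```python
def gp2_baseline_iops(storage_gb: int) -> int:
    """Baseline IOPS for GP2 per the rule above: 3 IOPS per GB, minimum 100."""
    return max(100, 3 * storage_gb)


def iops_utilisation(observed_iops: float, provisioned_iops: float) -> float:
    """Fraction of the provisioned IOPS budget in use (D1/D2 against the baseline)."""
    return observed_iops / provisioned_iops


def classify_storage_latency(await_ms: float) -> str:
    """Map a D3 latency sample onto the three-tier ladder.

    Assumption: exactly 5ms and exactly 10ms both fall in the middle tier.
    """
    if await_ms < 5:
        return "healthy"
    if await_ms <= 10:
        return "slo_impact"
    return "incident_precursor"
```

For a 500 GB GP2 volume, `gp2_baseline_iops(500)` gives the baseline against which an observed read or write rate is judged; a utilisation close to 1.0 corresponds to the "near-provisioned → IO-bound" reading.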
Workload / Postgres bucket
- P1 Cache Hit Ratio: `db.Cache.blks_hit / (db.Cache.blks_hit + db.IO.blk_read)`. Threshold: <80% = shared-buffers or physical RAM insufficient; working set has overflowed memory. See concepts/cache-hit-ratio-memory-pressure.
- P2 Block Read Latency: `db.IO.blk_read_time`. Threshold: >10ms impacts application SLOs. Postgres-altitude analogue of `os.diskIO.rdsdev.await`.
- P3 Database Deadlocks: `db.Concurrency.deadlocks`. Ideal: 0. Non-zero → schema/IO-logic review.
- P4 Transaction Rate: `db.Transactions.xact_commit`. A low value indicates the instance is a standby or carries no workload.
- P5 SQL Efficiency: `db.SQL.tup_fetched / db.SQL.tup_returned`. Ratio of rows consumed by clients to rows read from storage. A low ratio signals query-design malpractice (missing indexes, overly broad scans). See concepts/sql-efficiency-ratio.
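The two ratio signals (P1 and P5) are simple quotients of counters, and the main practical detail is guarding against an idle instance where the denominator is zero. The sketch below is illustrative only: the function names and the choice to return `None` for an undefined ratio are assumptions, and the 80% constant is the P1 threshold quoted above.

```python
from typing import Optional

CACHE_HIT_MIN = 0.80  # P1 threshold quoted above; calibrate per fleet


def cache_hit_ratio(blks_hit: int, blks_read: int) -> Optional[float]:
    """P1: blocks served from cache over all blocks requested."""
    total = blks_hit + blks_read
    if total == 0:
        return None  # no block activity, ratio is undefined
    return blks_hit / total


def sql_efficiency(tup_fetched: int, tup_returned: int) -> Optional[float]:
    """P5: tup_fetched over tup_returned, as defined above."""
    if tup_returned == 0:
        return None  # nothing was read, ratio is undefined
    return tup_fetched / tup_returned


def cache_under_pressure(blks_hit: int, blks_read: int) -> bool:
    """True when P1 is defined and below the threshold."""
    ratio = cache_hit_ratio(blks_hit, blks_read)
    return ratio is not None and ratio < CACHE_HIT_MIN
```

Returning `None` rather than 0 or 1 keeps an idle standby (see P4) from being reported as either perfectly healthy or badly degraded.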
Why this decomposition (vs USE, vs Google's 4 golden signals)
- Vs USE: USE enumerates every resource × every dimension (utilisation, saturation, errors). For RDS, where AWS manages the underlying hardware, most of USE's extensibility is not useful: you cannot observe switch queue depth, nor measure disk-controller saturation separately from block-device latency. Zalando's 12 signals are the projection of USE onto the managed-Postgres observability surface, with P-series signals (cache hit, block read time, SQL efficiency) adding engine-internal visibility that USE doesn't name.
- Vs Google's four golden signals (latency, traffic, errors, saturation): Google's four are symptom-level metrics for request-response services. Zalando's twelve are cause-level metrics for a database engine. The two are complementary altitudes — symptom-based CBO alerts (see concepts/symptom-based-alerting) page the on-call; the golden signals for RDS are what the on-call looks at to diagnose.
Why thresholds are empirical, not absolute
Each threshold in the methodology is Zalando-specific, derived from production incidents on their own RDS Postgres fleet. The direction of each signal generalises (CPU hot = bad for DBs, cache hit < 80% = RAM pressure, await > 10ms = SLO violation), but the exact numbers should be calibrated per-fleet. The methodology's contribution is structural: which signals to care about, not what the number should be.
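One way to keep the structure fixed while letting the numbers vary is to treat thresholds as data. The following is a sketch under stated assumptions: the `Thresholds` class, its field names, and the `breached` helper are hypothetical, only three of the twelve signals are shown, and the defaults take the upper end of the ranges quoted in this page.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class Thresholds:
    """Per-fleet numbers; the direction of each comparison lives in breached()."""
    cpu_utilisation_pct: float = 60.0   # C1: upper end of the 40-60% band
    storage_latency_ms: float = 10.0    # D3: top of the latency ladder
    cache_hit_ratio_min: float = 0.80   # P1: minimum acceptable ratio


def breached(sample: Dict[str, float], t: Thresholds) -> List[str]:
    """Return the signal IDs whose threshold the sample crosses."""
    out: List[str] = []
    if sample["cpu_utilisation_pct"] > t.cpu_utilisation_pct:
        out.append("C1")
    if sample["storage_latency_ms"] > t.storage_latency_ms:
        out.append("D3")
    if sample["cache_hit_ratio"] < t.cache_hit_ratio_min:
        out.append("P1")
    return out


defaults = Thresholds()
# A fleet that has seen incidents start at lower CPU overrides only that number.
stricter_fleet = replace(defaults, cpu_utilisation_pct=40.0)
```

The same sample can then pass under one fleet's calibration and fail under another's without any change to which signals are inspected or in which direction.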
Packaged as a utility
Zalando ships the methodology as systems/rds-health — an open-source Go CLI that queries AWS Performance Insights for every RDS instance in an account/region and produces a fleet-wide report against these twelve thresholds. See patterns/fleet-wide-methodology-via-cli for the pattern this instantiates.
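For readers who want to see what querying Performance Insights for these paths might involve, here is a rough Python sketch. It is not how rds-health is implemented (that tool is written in Go and its internals are not described here). The helper names are hypothetical, the `.avg` statistic suffix and the one-minute period are assumptions, and the parameter and response shapes follow the AWS Performance Insights `GetResourceMetrics` API as exposed by boto3's `pi` client.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean
from typing import Any, Dict, List, Optional


def build_pi_request(
    resource_id: str,
    metric_paths: List[str],
    window_minutes: int = 60,
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for a GetResourceMetrics call.

    Assumption: each path is queried with the ".avg" statistic.
    """
    end = now or datetime.now(timezone.utc)
    return {
        "ServiceType": "RDS",
        "Identifier": resource_id,  # the instance's DbiResourceId
        "MetricQueries": [{"Metric": f"{path}.avg"} for path in metric_paths],
        "StartTime": end - timedelta(minutes=window_minutes),
        "EndTime": end,
        "PeriodInSeconds": 60,
    }


def average_by_metric(response: Dict[str, Any]) -> Dict[str, Optional[float]]:
    """Collapse each returned series to its mean, skipping empty data points."""
    result: Dict[str, Optional[float]] = {}
    for series in response.get("MetricList", []):
        values = [
            point["Value"]
            for point in series.get("DataPoints", [])
            if point.get("Value") is not None
        ]
        result[series["Key"]["Metric"]] = mean(values) if values else None
    return result
```

With boto3 installed and credentials configured, the request would be sent as `boto3.client("pi").get_resource_metrics(**build_pi_request(resource_id, paths))`, and the averaged values compared against the thresholds for each instance in the account and region.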
Seen in
- sources/2024-02-19-zalando-twelve-golden-signals — canonical article; names each signal with its exact AWS Performance Insights path.
Related
- concepts/use-method · concepts/observability
- concepts/cpu-utilisation-ceiling-database · concepts/cache-hit-ratio-memory-pressure · concepts/storage-io-latency-sli · concepts/sql-efficiency-ratio — the four signals with dedicated wiki pages
- concepts/database-fleet-standardisation — the motivating problem
- systems/rds-health · systems/aws-rds · systems/aws-performance-insights · systems/postgresql
- patterns/fleet-wide-methodology-via-cli