CONCEPT

Golden signals for RDS

Definition

Golden signals for RDS is Zalando's empirical decomposition of RDS Postgres health into twelve metrics across four buckets — CPU, Memory, Disk, Workload — each with a specific AWS Performance Insights metric path and an empirically-calibrated threshold derived from past incidents. The method is a specialisation of the USE method for managed Postgres on cloud block storage: instead of enumerating every OS resource, it picks twelve signals that are sufficient for health conclusions at the altitude managed databases expose.

(Source: .)

The twelve signals

CPU bucket

C1 CPU Utilisation — os.cpuUtilization.total. Threshold: >40-60% is an incident precursor, not a healthy saturation target. Database workloads are memory/IO-bound by design; high CPU is an anomaly signal. See concepts/cpu-utilisation-ceiling-database.
C2 CPU Await — os.cpuUtilization.await. Linux kernel IO-wait metric. Threshold: >5-10% indicates the instance is IO-bandwidth-bound.

Memory bucket

M1/M2 Swap In/Out — os.swap.in, os.swap.out. Threshold: any non-zero activity is a low-memory symptom. Because disk is orders of magnitude slower than RAM, swap activity slows the OS and its applications simultaneously.

Disk bucket

D1 Storage Read IOPS — os.diskIO.rdsdev.readIOsPS. Interpretation requires the provisioned IOPS baseline (GP2 = 3 IOPS per GB of storage, min 100; IO1/IO2 = explicit at deployment). Very low → dataset served from memory; near-provisioned → IO-bound.
D2 Storage Write IOPS — os.diskIO.rdsdev.writeIOsPS. High value indicates write-mostly workload; watch against provisioned IOPS.
D3 Storage Latency — os.diskIO.rdsdev.await. Three-tier ladder: <5ms healthy, 5-10ms SLO impact, >10ms incident precursor. See concepts/storage-io-latency-sli.
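As a rough illustration of how the disk signals might be read together, the sketch below computes the GP2 baseline from the 3 IOPS per GB rule (minimum 100) and classifies D3 latency using the three-tier ladder. The function names are hypothetical, and the sketch ignores any upper cap on GP2 IOPS; it is not taken from the methodology's own tooling.

```python
def gp2_baseline_iops(storage_gb: float) -> float:
    """GP2 baseline as described above: 3 IOPS per GB of storage, minimum 100."""
    return max(100.0, 3.0 * storage_gb)


def iops_utilisation(observed_iops: float, provisioned_iops: float) -> float:
    """Fraction of the provisioned IOPS currently consumed (D1 or D2)."""
    if provisioned_iops <= 0:
        raise ValueError("provisioned_iops must be positive")
    return observed_iops / provisioned_iops


def classify_storage_latency(await_ms: float) -> str:
    """Apply the three-tier D3 ladder to a latency sample in milliseconds."""
    if await_ms < 5.0:
        return "healthy"
    if await_ms <= 10.0:
        return "slo-impact"
    return "incident-precursor"


# Example: a 500 GB GP2 volume observed at 1400 read IOPS.
baseline = gp2_baseline_iops(500)
usage = iops_utilisation(1400, baseline)
```

A utilisation close to 1.0 corresponds to the "near-provisioned" case in D1; where exactly to draw that line is left to the operator.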

Workload / Postgres bucket

P1 Cache Hit Ratio — db.Cache.blks_hit / (db.Cache.blks_hit + db.IO.blk_read). Threshold: <80% = shared-buffers or physical RAM insufficient; working set has overflowed memory. See concepts/cache-hit-ratio-memory-pressure.
P2 Block Read Latency — db.IO.blk_read_time. Threshold: >10ms impacts application SLOs. Postgres-altitude analogue of os.diskIO.rdsdev.await.
P3 Database Deadlocks — db.Concurrency.deadlocks. Ideal: 0. Non-zero → schema/IO-logic review.
P4 Transaction Rate — db.Transactions.xact_commit. A low value indicates the instance is a standby or has no workload.
P5 SQL Efficiency — db.SQL.tup_fetched / db.SQL.tup_returned. Ratio of rows consumed by clients to rows read from storage. Low ratio signals query-design malpractice (missing indexes, overly broad scans). See concepts/sql-efficiency-ratio.
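The two ratio signals in this bucket (P1 and P5) are simple to compute once the underlying counters are available. The following is a minimal sketch, assuming the counters have already been fetched as plain numbers; the function and parameter names are hypothetical, and the 0.80 default reflects the P1 threshold stated above.

```python
from typing import Optional


def cache_hit_ratio(blks_hit: float, blks_read: float) -> Optional[float]:
    """P1: blocks served from cache divided by all blocks requested."""
    total = blks_hit + blks_read
    if total == 0:
        return None  # no block activity, so the ratio is undefined
    return blks_hit / total


def sql_efficiency(tup_fetched: float, tup_returned: float) -> Optional[float]:
    """P5: tup_fetched divided by tup_returned, as defined above."""
    if tup_returned == 0:
        return None  # nothing was read, so the ratio is undefined
    return tup_fetched / tup_returned


def indicates_memory_pressure(ratio: Optional[float], threshold: float = 0.80) -> bool:
    """True when a defined cache hit ratio falls below the threshold."""
    return ratio is not None and ratio < threshold


# Example: 900 cache hits against 100 blocks read from disk.
ratio = cache_hit_ratio(900, 100)
```

Returning `None` for an idle instance avoids reporting a misleading ratio when there is no workload to measure, which pairs naturally with the P4 transaction-rate signal.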

Why this decomposition (vs USE, vs Google's 4 golden signals)

Vs USE: USE enumerates every resource × every dimension (utilisation, saturation, errors). For RDS, where AWS manages the underlying hardware, most of USE's extensibility is not useful: you cannot observe switch queue depth or measure disk-controller saturation separately from block-device latency. Zalando's 12 signals are the projection of USE onto the managed-Postgres observability surface, with P-series signals (cache hit, block read time, SQL efficiency) adding engine-internal visibility that USE doesn't name.
Vs Google's four golden signals (latency, traffic, errors, saturation): Google's four are symptom-level metrics for request-response services. Zalando's twelve are cause-level metrics for a database engine. The two are complementary altitudes — symptom-based CBO alerts (see concepts/symptom-based-alerting) page the on-call; the golden signals for RDS are what the on-call looks at to diagnose.

Why thresholds are empirical, not absolute

Each threshold in the methodology is Zalando-specific, derived from production incidents on their own RDS Postgres fleet. The direction of each signal generalises (CPU hot = bad for DBs, cache hit < 80% = RAM pressure, await > 10ms = SLO violation), but the exact numbers should be calibrated per-fleet. The methodology's contribution is structural: which signals to care about, not what the number should be.
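One way to keep the signal set fixed while calibrating the numbers per fleet is to treat the thresholds as configuration. The sketch below is an assumption about how that could look, not how any existing tool does it: the `Thresholds` class and its field names are hypothetical, and the defaults use the lower edge of each documented band.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Thresholds:
    """Per-fleet thresholds; defaults follow the values documented on this page."""
    cpu_total_pct: float = 40.0     # C1: lower edge of the 40-60% band
    cpu_await_pct: float = 5.0      # C2: lower edge of the 5-10% band
    swap_activity: float = 0.0      # M1/M2: any non-zero activity
    storage_await_ms: float = 10.0  # D3: incident-precursor tier
    cache_hit_ratio: float = 0.80   # P1: below this suggests RAM pressure
    blk_read_time_ms: float = 10.0  # P2: above this impacts SLOs
    deadlocks: float = 0.0          # P3: ideal is zero


DOCUMENTED_DEFAULTS = Thresholds()

# A fleet with different incident history overrides only the numbers it has
# evidence for; the set of signals stays the same.
my_fleet = replace(DOCUMENTED_DEFAULTS, cpu_total_pct=55.0, storage_await_ms=8.0)
```

Freezing the dataclass makes each calibrated set an immutable value, so a per-fleet override cannot silently alter the documented defaults.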

Packaged as a utility

Zalando ships the methodology as systems/rds-health — an open-source Go CLI that queries AWS Performance Insights for every RDS instance in an account/region and produces a fleet-wide report against these twelve thresholds. See patterns/fleet-wide-methodology-via-cli for the pattern this instantiates.
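To make the idea of a fleet-wide report concrete, here is a small sketch of the evaluation step only. It is not the rds-health implementation: the metric keys, the `CHECKS` table, and the instance names are hypothetical, and the samples are assumed to have been fetched from Performance Insights beforehand.

```python
from typing import Dict, List

# (metric key, comparison, threshold, finding). Keys are hypothetical short
# names; thresholds are the ones documented on this page.
CHECKS = [
    ("cpu_total_pct", "gt", 40.0, "C1 CPU utilisation above 40%"),
    ("swap_in", "gt", 0.0, "M1 swap-in activity"),
    ("swap_out", "gt", 0.0, "M2 swap-out activity"),
    ("storage_await_ms", "gt", 10.0, "D3 storage latency above 10ms"),
    ("cache_hit_ratio", "lt", 0.80, "P1 cache hit ratio below 80%"),
    ("deadlocks", "gt", 0.0, "P3 deadlocks observed"),
]


def evaluate_instance(metrics: Dict[str, float]) -> List[str]:
    """Return the findings for one instance, in CHECKS order."""
    findings = []
    for key, op, limit, message in CHECKS:
        if key not in metrics:
            continue  # metric not collected; skip rather than guess
        value = metrics[key]
        breached = value > limit if op == "gt" else value < limit
        if breached:
            findings.append(message)
    return findings


def fleet_report(fleet: Dict[str, Dict[str, float]]) -> Dict[str, List[str]]:
    """Evaluate every instance in the fleet against the same checks."""
    return {name: evaluate_instance(m) for name, m in fleet.items()}


# Illustrative samples for two made-up instances.
fleet = {
    "orders-db": {"cpu_total_pct": 72.0, "cache_hit_ratio": 0.95, "deadlocks": 0},
    "catalog-db": {"cpu_total_pct": 12.0, "cache_hit_ratio": 0.71, "swap_in": 3.0},
}
report = fleet_report(fleet)
```

Applying one shared table of checks to every instance is what makes the report comparable across the fleet; an instance with an empty findings list is simply healthy by these signals.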

Seen in

— canonical article; names each signal with its exact AWS Performance Insights path.

concepts/use-method · concepts/observability
concepts/cpu-utilisation-ceiling-database · concepts/cache-hit-ratio-memory-pressure · concepts/storage-io-latency-sli · concepts/sql-efficiency-ratio — the four signals with dedicated wiki pages
concepts/database-fleet-standardisation — the motivating problem
systems/rds-health · systems/aws-rds · systems/aws-performance-insights · systems/postgresql
patterns/fleet-wide-methodology-via-cli