
ZALANDO 2024-02-19


Zalando — 12 Golden Signals To Discover Anomalies And Performance Issues on Your AWS RDS Fleet

Summary

Dmitry Kolesnikov (Zalando, 2024-02-19) publishes the platform team's consolidated methodology for observing a fleet of AWS RDS Postgres instances at microservices scale. Zalando's database-per-service posture — each microservice deploys its own RDS instance — turns what would be a single-DB operations problem into a fleet-health problem, generating enough toil that teams were burning sprints or even months of engineer time on database anomaly detection. The response is two-pronged: (1) collapse the large space of RDS / Postgres / kernel metrics into twelve "golden signals" grouped into four buckets (CPU, Memory, Disk, Workload), each with a specific AWS Performance Insights metric name and an empirically calibrated threshold derived from Zalando's past incidents; (2) package the methodology as an open-source CLI utility (systems/rds-health) that automates the analysis across all RDS instances and clusters in a given AWS account, so that "a single click of the button" yields a fleet-wide health report instead of per-instance ad-hoc scripts. The post is anchored in a concrete production lesson: the fragmentation that comes with microservices only becomes tractable at scale via standardisation at both the methodology altitude (which signals) and the tooling altitude (which utility).

Key takeaways

  1. Database-per-service turns anomaly detection into a fleet problem. Once every microservice owns its own RDS instance, "complex anomaly detection tasks, such as byzantine failures or issues with SQL statements, takes a noticeable investment all over the place. [...] some teams are required to allocate engineers for sprint or even months for such activities". Manual processes and ad-hoc scripts fail at fleet scale; the bottleneck is methodology fragmentation — each team defines its own thresholds and signals. Standardisation is the forcing function: "if teams use the same frameworks or design pattern then making changes at scale becomes easier. Same concept is extendable into the operation domain." (Source: this article.) See concepts/database-fleet-standardisation.

  2. The twelve signals across four buckets. Zalando names each signal with its exact AWS Performance Insights metric path. CPU (C1/C2): os.cpuUtilization.total, os.cpuUtilization.await. Memory (M1/M2): os.swap.in, os.swap.out. Disk (D1/D2/D3): os.diskIO.rdsdev.readIOsPS, os.diskIO.rdsdev.writeIOsPS, os.diskIO.rdsdev.await. Workload/Postgres (P1/P2/P3/P4/P5): cache hit ratio db.Cache.blks_hit / (db.Cache.blks_hit + db.IO.blk_read), db.IO.blk_read_time, db.Concurrency.deadlocks, db.Transactions.xact_commit, db.SQL.tup_fetched / db.SQL.tup_returned. The four-bucket decomposition (CPU / Memory / Disk / Workload) is Zalando's canonicalisation of the USE method specialised for managed-Postgres on cloud block storage. See concepts/golden-signals-rds.
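To make the grouping concrete, the catalogue can be held as a small lookup table — a sketch in which the bucket names, signal codes (C1..P5), and metric paths are taken from the post as quoted above, while the dictionary shape itself is ours:

```python
# The twelve golden signals, keyed by bucket and Zalando's short codes.
# Metric paths are the AWS Performance Insights names the article quotes;
# P1 and P5 are derived ratios over two counters rather than single metrics.
GOLDEN_SIGNALS = {
    "CPU": {
        "C1": "os.cpuUtilization.total",
        "C2": "os.cpuUtilization.await",
    },
    "Memory": {
        "M1": "os.swap.in",
        "M2": "os.swap.out",
    },
    "Disk": {
        "D1": "os.diskIO.rdsdev.readIOsPS",
        "D2": "os.diskIO.rdsdev.writeIOsPS",
        "D3": "os.diskIO.rdsdev.await",
    },
    "Workload": {
        "P1": "db.Cache.blks_hit / (db.Cache.blks_hit + db.IO.blk_read)",
        "P2": "db.IO.blk_read_time",
        "P3": "db.Concurrency.deadlocks",
        "P4": "db.Transactions.xact_commit",
        "P5": "db.SQL.tup_fetched / db.SQL.tup_returned",
    },
}
```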

  3. CPU utilisation above 40-60% on a database is an incident precursor, not a healthy saturation target. "Typical database workloads are bound to memory or storage, high CPU is an anomaly that requires further investigation. Our past experience advises us that CPU utilisation over 40% - 60% on database instances eventually leads to incidents." This is a surprising operational claim relative to general-purpose utilisation targets (70-80% typical for stateless services) — and its justification is the database-specific shape of the workload: if CPU is hot, the query mix has diverged from the memory/IO-bound baseline and something has gone wrong. Zalando also names a companion threshold: CPU await above 5-10% indicates an IO-bandwidth-bound instance. See concepts/cpu-utilisation-ceiling-database (Source: this article).

  4. Cache hit ratio < 80% means working-set-exceeds-RAM, not a tuning knob issue. "Any values below 80% show that databases have insufficient amount of shared buffers or physical RAM. Data required for top-called queries don't fit into memory, and the database has to read it from disk." The empirical threshold is tighter than folklore "aim for high cache hit ratio" advice — 80% is the breakpoint, not 95% or 99%. Below it, the database has drifted into disk-bound territory and the remediation is capacity (more RAM / larger instance class) rather than SQL tuning. See concepts/cache-hit-ratio-memory-pressure.
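The P1 computation is just the ratio of the two Postgres counters; a minimal sketch, with function names ours and the 80% breakpoint taken from the post:

```python
def cache_hit_ratio(blks_hit: int, blks_read: int) -> float:
    """P1: fraction of block requests served from memory rather than disk.

    Inputs correspond to db.Cache.blks_hit and db.IO.blk_read in the
    Performance Insights naming used by the post.
    """
    total = blks_hit + blks_read
    return blks_hit / total if total else 1.0  # no block activity: trivially healthy


def working_set_exceeds_ram(blks_hit: int, blks_read: int,
                            threshold: float = 0.80) -> bool:
    # Below Zalando's 80% breakpoint the hot data no longer fits in shared
    # buffers / RAM; the remediation is capacity, not SQL tuning.
    return cache_hit_ratio(blks_hit, blks_read) < threshold
```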

  5. Storage latency thresholds: <5ms healthy, 5-10ms SLO impact, >10ms incident precursor. "Our empirical observations show that storage latency above 10 ms eventually leads to incident, the latency above 5 ms impacts on applications SLOs. A typical storage latency for database systems should be less than 4 - 5 ms." The three-tier threshold gives ops teams a calibrated ladder for disk-latency SLIs rather than a single boolean alert. The signal name is os.diskIO.rdsdev.await for block-device latency and db.IO.blk_read_time as the Postgres-altitude equivalent. See concepts/storage-io-latency-sli.
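The three-tier ladder maps directly onto a tiny classifier for a disk-latency SLI — a sketch in which the tier labels are ours and the breakpoints are Zalando's:

```python
def storage_latency_tier(await_ms: float) -> str:
    """Classify os.diskIO.rdsdev.await per the post's three-tier ladder."""
    if await_ms > 10.0:
        return "incident-precursor"  # >10 ms eventually leads to incidents
    if await_ms > 5.0:
        return "slo-impact"          # 5-10 ms already impacts application SLOs
    return "healthy"                 # typical DB storage latency: under 4-5 ms
```

The same ladder applies at the Postgres altitude via db.IO.blk_read_time, which the post pairs with the block-device signal.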

  6. SQL efficiency = tup_fetched / tup_returned surfaces query-design problems the execution engine can't. "SQL efficiency shows the percentage of rows fetched by the client vs rows returned from the storage. [...] For example, if you do select count(*) from million_row_table, one million rows will be returned, but only one row will be fetched." A low ratio doesn't indicate a performance problem in the usual sense — the database is doing what it's asked to do efficiently — but it signals schema / index / query-shape malpractice: the application is asking the storage layer to read far more than it consumes. The ratio is the canonical pointer toward missing indexes, overly-broad scans, or aggregation queries that should be materialised. See concepts/sql-efficiency-ratio.
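The ratio itself is elementary; what matters is the interpretation. A sketch reproducing the post's count(*) example (function name ours):

```python
def sql_efficiency(tup_fetched: int, tup_returned: int) -> float:
    """P5: rows fetched by the client divided by rows returned from storage."""
    return tup_fetched / tup_returned if tup_returned else 1.0


# The post's example: `select count(*) from million_row_table` makes the
# storage layer return a million rows while the client fetches exactly one,
# so the ratio collapses toward zero even though each individual read is
# "efficient" — the signal points at query shape, not engine performance.
count_star_ratio = sql_efficiency(1, 1_000_000)
```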

  7. GP2 IOPS provisioning is coupled to volume size at 3 IOPS per GB (minimum 100). The D1/D2 signals (readIOsPS, writeIOsPS) have to be interpreted against the deployed storage configuration: "With the GP2 volume type, IOPS are provisioned by volume size, 3 IOPS per GB of storage with a minimum of 100 IOPS. The IO volume type has an explicit value defined at deployment time. Note that a very low value shows that the entire dataset is served from memory." The specific number matters because it's what turns a raw readIOsPS number into a utilisation fraction — and the observation that very-low IOPS means the working set is memory-resident inverts the usual "high IOPS = stress" reading. See systems/aws-rds.
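The conversion from a raw IOPS reading to a utilisation fraction is simple arithmetic over the deployed storage configuration. A sketch, assuming gp2; the 16,000-IOPS baseline cap comes from AWS documentation rather than the post:

```python
def gp2_provisioned_iops(volume_gb: int) -> int:
    """Baseline IOPS for a gp2 volume: 3 IOPS per GB, floored at 100.

    AWS also caps the gp2 baseline at 16,000 IOPS; the cap is included
    here for completeness, it is not mentioned in the post.
    """
    return min(16_000, max(100, 3 * volume_gb))


def iops_utilisation(observed_iops: float, volume_gb: int) -> float:
    # Turns a raw readIOsPS / writeIOsPS sample into a fraction of what the
    # deployed volume is provisioned for. Note the post's inversion: a very
    # low fraction suggests the working set is served from memory, not stress.
    return observed_iops / gp2_provisioned_iops(volume_gb)
```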

  8. Standardisation at both altitudes: methodology + CLI. Zalando ships the methodology as rds-health — an open-source Go CLI that is "a frontend for AWS APIs that simply automates analysis of discussed golden signals across your accounts and regions". AWS already offers CloudWatch, Performance Insights, etc. for the raw signals; the gap Zalando fills is a unified holistic report across the fleet, not per-instance deep dives. Features: show configuration of all RDS instances and clusters, check health across all deployments, conduct capacity planning. Released publicly at github.com/zalando/rds-health as the mechanism by which the methodology propagates beyond Zalando. See patterns/fleet-wide-methodology-via-cli.

Operational numbers

| Signal | Path | Zalando threshold | Interpretation |
|--------|------|-------------------|----------------|
| C1 CPU | os.cpuUtilization.total | >40-60% | Incident precursor |
| C2 CPU await | os.cpuUtilization.await | >5-10% | IO-bandwidth-bound |
| M1/M2 Swap | os.swap.in / os.swap.out | >0 | Low-memory symptom |
| D1 Read IOPS | os.diskIO.rdsdev.readIOsPS | Align with storage config | GP2 = 3 IOPS/GB |
| D2 Write IOPS | os.diskIO.rdsdev.writeIOsPS | Align with storage config | High → IO-bound |
| D3 Storage await | os.diskIO.rdsdev.await | <5 ms healthy, 5-10 ms SLO, >10 ms incident | Block-device latency ladder |
| P1 Cache hit | db.Cache.blks_hit / (hit + read) | >80% | <80% = working set > RAM |
| P2 Block read time | db.IO.blk_read_time | <10 ms | >10 ms impacts SLO |
| P3 Deadlocks | db.Concurrency.deadlocks | Ideally 0 | Non-zero → schema/IO-logic review |
| P4 Transactions | db.Transactions.xact_commit | >0 | Low → standby |
| P5 SQL efficiency | db.SQL.tup_fetched / tup_returned | Near 1.0 ideal | Low → missing index / bad query shape |
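Read together, the table mechanises into a per-instance check — the kind of evaluation a fleet tool like rds-health automates across an account. A minimal sketch: the thresholds are Zalando's calibrations from the table and should be re-calibrated per fleet, and the dict-of-samples input shape is ours, not rds-health's:

```python
def check_instance(m: dict) -> list[str]:
    """Flag anomalies in one instance's sampled signals (keys C1..P5)."""
    findings = []
    if m.get("C1", 0.0) > 60:       # CPU % above the 40-60% ceiling
        findings.append("C1: CPU hot - incident precursor on a DB workload")
    if m.get("C2", 0.0) > 10:       # CPU await %
        findings.append("C2: instance is IO-bandwidth-bound")
    if m.get("M1", 0) > 0 or m.get("M2", 0) > 0:
        findings.append("M1/M2: swapping - low-memory symptom")
    if m.get("D3", 0.0) > 10:       # block-device await, ms
        findings.append("D3: storage latency >10 ms - incident precursor")
    if m.get("P1", 1.0) < 0.80:     # cache hit ratio
        findings.append("P1: cache hit <80% - working set exceeds RAM")
    if m.get("P3", 0) > 0:          # deadlocks
        findings.append("P3: deadlocks - review schema / transaction logic")
    return findings
```

An instance sampling healthy (`{"C1": 30, "P1": 0.95}`) yields no findings; a hot, cache-starved, deadlocking one surfaces each anomaly by signal code.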

Caveats

  • Thresholds are empirical and Zalando-specific — derived from their microservices-on-RDS-Postgres posture. The direction of the signals (CPU hot = bad for DBs, cache hit < 80% = RAM pressure, storage await > 10ms = SLO violation) generalises; the exact numbers should be calibrated per-fleet.
  • The methodology is scoped to AWS RDS Postgres. Signal paths (os.diskIO.rdsdev.*, db.Cache.blks_hit) are AWS Performance Insights names; on non-RDS Postgres the equivalent metrics exist but under different paths. On MySQL/Aurora/etc. several of the P-series signals (cache hit, block read time) have different instrumentation.
  • The post does not quantify rds-health adoption across the Zalando fleet — no number of services, engineer-hours saved, or incidents caught. The open-source release is framed as a methodology-distribution mechanism, not a retrospective on internal use. Without that data the causal claim ("standardisation reduces toil") is directional rather than measured.
  • The CPU 40-60% ceiling is surprising relative to textbook utilisation advice. The post's justification is correct but brief: database workloads are memory/IO-bound by design, so CPU hotness is a diagnostic signal, not a capacity-used signal. Teams treating CPU utilisation as a headroom indicator on DBs will mis-calibrate. See concepts/cpu-utilisation-ceiling-database for the full argument.
