SYSTEM Cited by 6 sources

AWS RDS

Amazon RDS (Relational Database Service) is AWS's managed relational database offering, covering MySQL, Postgres, MariaDB, SQL Server, and Oracle. RDS takes over backups, patching, failover, and automated minor-version upgrades; capacity grows primarily through instance-class resizing, storage autoscaling, and read replicas.

Pattern of appearance

RDS comes up in scaling stories as the operationally-sane default OLTP store — and, eventually, the vertical-scale ceiling.

  • Initial choice is usually fine for 1–2 years of product traffic.
  • Doubling strategy: as storage burns, instance class goes up; so do cost and blast radius.
  • At several TB, version upgrades have to be "close to zero downtime", which dominates DBRE on-call budget.
  • Shared-instance risk: when many features share one RDS instance, downtime on that instance takes all of them down, forcing a database split before real sharding.

Canva's Creators-payment pipeline walked this arc: MySQL RDS, instance size doubling every 8–10 months, free storage falling ~500 GB (≈50%) in 6 months, a DB split to reduce shared-instance blast radius, and ultimately a move of the aggregation workload off RDS and onto Snowflake. RDS survived in the architecture as the serving tier for Snowflake-computed aggregates, fed by a rate-limited ingester, and showed up again as the tuning constraint (RDS CPU spiked when the warehouse unload ran too fast). See patterns/warehouse-unload-bridge. (Source: sources/2024-04-29-canva-scaling-to-count-billions)
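The storage figures above imply a rough runway estimate. The sketch below is only an illustration: it assumes the burn rate stays linear, and it reads "~500 GB (≈50%)" as meaning about 500 GB of free space remained after the drop. The function name `months_of_runway` is hypothetical, not something from the Canva post.

```python
def months_of_runway(free_gb_now: float, gb_consumed: float,
                     months_observed: float) -> float:
    """Months until free storage reaches zero, assuming a constant burn rate."""
    if gb_consumed <= 0 or months_observed <= 0:
        raise ValueError("need positive consumption over a positive window")
    burn_per_month = gb_consumed / months_observed
    return free_gb_now / burn_per_month


# Assumption: ~500 GB consumed in 6 months was ~50% of the free space,
# so roughly 500 GB was still free at the end of that window.
runway = months_of_runway(free_gb_now=500, gb_consumed=500, months_observed=6)
print(f"about {runway:.1f} months of free storage left at the observed rate")
```

Under those assumptions the remaining headroom is about as long as the observation window, which is consistent with instance-size doublings arriving every 8–10 months.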

Key operational realities

  • Vertical-first scaling. Storage autoscale helps the "free disk" metric; it doesn't help the CPU or I/O ceiling of a hot table.
  • Zero-downtime upgrades at TB scale are specialist work.
  • Read-replica lag matters once you're serving off replicas.
  • Ingest throughput sensitivity. RDS CPU can spike on sustained write bursts; ingestion-from-warehouse workflows need explicit rate-limit tuning — Canva documented this directly.
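The rate-limit tuning mentioned in the last bullet can be sketched with a token bucket placed in front of the writer. This is a minimal illustration, not Canva's ingester: `TokenBucket`, `ingest`, and the `write_batch` callback are hypothetical names, and the right rate for a given RDS instance has to be found empirically by watching CPU.

```python
import time
from typing import Callable, Iterable, Sequence


class TokenBucket:
    """Admits `rate` rows per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.sleep = sleep
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def _refill(self) -> None:
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

    def acquire(self, n: float) -> float:
        """Block until `n` rows may be written; return the seconds waited."""
        if n > self.capacity:
            raise ValueError("batch is larger than the bucket capacity")
        self._refill()
        wait = 0.0 if self.tokens >= n else (n - self.tokens) / self.rate
        if wait > 0:
            self.sleep(wait)
            self._refill()
        self.tokens -= n
        return wait


def ingest(batches: Iterable[Sequence], write_batch: Callable[[Sequence], None],
           bucket: TokenBucket) -> float:
    """Write each batch once the bucket admits it; return total seconds waited."""
    waited = 0.0
    for batch in batches:
        waited += bucket.acquire(len(batch))
        write_batch(batch)
    return waited
```

In use, `write_batch` would wrap the actual bulk insert into RDS, and `rate` would be lowered until the CPU spikes disappear. The clock and sleep functions are injectable so the limiter can be tested without real delays.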

Seen in

  • — Zalando's platform team canonicalises an empirically-derived 12-golden-signals methodology for RDS Postgres fleet health, grouped into four buckets (CPU, Memory, Disk, Workload) with specific AWS Performance Insights metric paths and thresholds. Three RDS-specific operational claims: (1) CPU >40-60% is an incident precursor on database workloads, not a healthy saturation target (see concepts/cpu-utilisation-ceiling-database); (2) GP2 IOPS are provisioned at 3 IOPS per GB (minimum 100) — the D1/D2 signals need to be read against the storage config; (3) block-device latency os.diskIO.rdsdev.await above 10ms eventually leads to incident, 5-10ms impacts SLOs, <5ms is healthy (see concepts/storage-io-latency-sli). Zalando packages the methodology as an open-source Go CLI (systems/rds-health) that queries AWS Performance Insights to produce fleet-wide health reports — the canonical instance of patterns/fleet-wide-methodology-via-cli. Wiki anchor for the managed-Postgres-fleet observability methodology altitude on AWS RDS.

  • sources/2026-04-21-planetscale-increase-iops-and-throughput-with-sharding — Canonicalises two RDS-specific pricing constraints as architectural constraints. (1) 3-node multi-AZ RDS only supports d-class instance types (db.m6id, db.m7gd etc., the ones with attached NVMe storage): "for 3-node multi-AZ RDS clusters, RDS only only supports the d class variants. These come with attached NVMe SSD storage, in addition to the vCPUs and RAM." Non-d-class instances (db.m6i.2xlarge) are theoretically cheaper but not available in the 3-node topology, which forces customers needing 3-node multi-AZ onto the d-class tier. (2) 8× workload forces upgrade from gp3 to io1 / io2 provisioned IOPS, producing an 11-13× cost cliff (see concepts/linear-vs-superlinear-cost-scaling) for an 8× workload ($2,136/mo small-DB → $24,197/mo for 8× on db.m6id.16xlarge + io1). The architectural consequence: an RDS customer seeing growth must either (a) pay the super-linear bill, or (b) migrate to a sharded architecture, which RDS itself doesn't directly support (must layer Vitess or application-level sharding on top, or move to a managed sharded provider like PlanetScale). Canonical complement to the 2021-09-30 Reyes comparison post: that post frames RDS vs PlanetScale on feature grounds (sharding support, connection ceiling, deployment workflow); this post frames them on cost scaling grounds at the high end.

  • sources/2024-04-29-canva-scaling-to-count-billions — MySQL RDS as v1 counting store hitting vertical-scale wall; later as serving RDS for Snowflake unload with CPU-spike tuning.

  • sources/2025-05-03-aws-postgresql-transaction-visibility-read-replicas — AWS's response to Jepsen's 2025-04-29 Multi-AZ-Postgres analysis confirming the reported transaction-visibility anomaly but clarifying it is inherent to community Postgres (pgsql-hackers since 2013), not RDS-specific. RDS for Postgres Multi-AZ cluster configurations inherit Postgres's ProcArray-based visibility model, in which the order transactions become visible (removal from ProcArray) can diverge from the order they become durable (WAL commit-record write); this admits the Long Fork anomaly (two readers on primary + replica observing concurrent non-conflicting transactions in different orders, a violation of concepts/snapshot-isolation's atomic-visibility property). Single-AZ deployments are unaffected (no cross-node divergence path). The sibling AWS offerings that sidestep the anomaly are systems/aurora-limitless and systems/aurora-dsql, which replace ProcArray-based visibility with time-based MVCC via Postgres-extension surgery (see patterns/postgres-extension-over-fork). AWS's PostgreSQL Contributors Team (formed 2022) is co-developing the proposed upstream CSN fix.
  • sources/2026-02-05-aws-convera-verified-permissions-fine-grained-authorization — RDS as the tenant-isolated data store at the bottom of Convera's multi-tenant authorization chain: "Amazon RDS is configured to accept only requests with specific tenant context and returns data specific to the requested tenant_id." This RDS-side enforcement is the last line of defense under zero-trust re-verification — even if the authorizer + backend checks both fail, the database refuses cross-tenant reads. RDS also serves as the user roles store that Convera's pre-token hook queries at login time to enrich the Cognito JWT with role claims.
  • sources/2026-04-21-figma-how-figmas-databases-team-lived-to-tell-the-scale — Figma's 2020–2022 scaling story on RDS Postgres: ~100× database stack growth, 2020 on AWS's largest instance → end-of-2022 with a dozen vertically-partitioned RDS Postgres instances. Named three RDS-instance ceilings that triggered the subsequent horizontal sharding effort: (1) vacuum reliability impact at TB-scale tables, (2) maximum IOPS supported by Amazon RDS on high-write tables growing fast enough to soon exceed the per-instance cap, (3) CPU on the hottest partitions. Canonical instance of the "RDS-as-vertical-scale-ceiling" story — with the wrinkle that Figma chose to keep RDS Postgres as the substrate even through horizontal sharding, building systems/dbproxy-figma on top rather than migrating to NewSQL (CockroachDB / TiDB / Spanner / Vitess) or NoSQL. First horizontally-sharded table shipped September 2023 with 10s partial primary availability.
  • sources/2026-03-31-aws-streamlining-access-to-dr-capabilities — RDS as the canonical data-tier DR example. Figure 1 shows the multi-destination fanout for RDS: automated backups + manual snapshots + cross-Region snapshot copy + cross-account snapshot copy + AWS Backup vault copies + read replicas; a single RDS instance can be protected along any of these axes independently. The post also names RDS as the canonical DR config-translation case: restored RDS has a new endpoint, and applications must be rebound to it (the post's named mechanism: Route 53 private hosted zone CNAME mapping old-endpoint → new-endpoint in the recovered VPC).
  • first wiki citation of the 16,000-connection ceiling on RDS MySQL. Jarod Reyes (PlanetScale, 2021-09-30, preserved as historical context) uses the figure as the scale-out-wall argument in a vendor-comparison pitch: "While RDS limits connections to 16,000, PlanetScale has been designed to scale to nearly limitless database connections per database. And while you can have up to 16,000 connections on RDS, you will have to manually upgrade and increase connection limits or create and manage your own connection pool." The figure is a 2021-era RDS MySQL datum; treat it as point-in-time context rather than a current-version claim. Canonical wiki instance of the "RDS-as-scale-out-wall" framing paired with the specific failure mode of connection-pool exhaustion beyond the 16k limit. The post also positions RDS's schema-change / staging-environment ceremony (e.g. mysqldump for staging copies; 13-step provisioning flow) as the developer-workflow gap relative to PlanetScale's non-blocking schema changes + database branching. 2021 claim "Aurora is still on MySQL 5.7" preserved as historical context only; Aurora MySQL 8.0 compatibility shipped in late 2021 / early 2022 and is not the current state.

  • — PlanetScale's 2023-10-05 comparison (unsigned) that, together with the same-day Aurora sibling, re-asserts RDS as the "manual operational overhead" counterpart to PlanetScale's managed Vitess topology: "While Amazon RDS is a managed solution, a common complaint about this service is that many users end up managing database operations manually. In contrast to PlanetScale, RDS does not automatically load balance or handle major version upgrades. It requires user input for backup windows and its managed services do not fine-tune cluster resources, create comprehensive disaster recovery plans, or design a horizontal sharding scheme unique to your data model." Canonicalises three RDS gaps relative to the Vitess control plane: (1) no platform-level load balancer per database cluster — "logic can be manually defined by the user at the application level to direct reads and writes to either instance"; (2) no native online schema changes or schema reverts; (3) connection management via RDS Proxy as the narrower counterpart to VTTablet's two-tier pool + queue. Architecturally substantive only on the VTGate/VTTablet/topo-server side (which is about PlanetScale, not RDS); this page preserves the RDS-side gap enumeration. See systems/vtgate, systems/vttablet, concepts/vitess-topo-server for the PlanetScale substrate this post draws the contrast against.
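The Zalando thresholds listed earlier in this section (gp2 baseline IOPS, block-device latency bands, CPU band) are simple enough to express as a few checks. The sketch below is not the rds-health CLI and does not query Performance Insights; the function names and band labels are hypothetical, and treating the 40-60% CPU range as a "watch" band is one interpretation of the claim.

```python
def gp2_baseline_iops(size_gb: float) -> int:
    """Baseline IOPS as stated above: 3 IOPS per GB, with a floor of 100.

    Assumption: only the rule quoted in this page is modelled; any upper
    cap that AWS applies to large volumes is deliberately left out.
    """
    if size_gb <= 0:
        raise ValueError("volume size must be positive")
    return max(100, int(3 * size_gb))


def classify_disk_latency(await_ms: float) -> str:
    """Band a block-device await value (e.g. os.diskIO.rdsdev.await) in ms."""
    if await_ms < 5:
        return "healthy"
    if await_ms <= 10:
        return "slo-impact"
    return "incident-risk"


def classify_cpu(cpu_pct: float, watch_at: float = 40.0,
                 precursor_at: float = 60.0) -> str:
    """Band CPU utilisation; the 40/60 split is an interpretive assumption."""
    if cpu_pct < watch_at:
        return "ok"
    if cpu_pct <= precursor_at:
        return "watch"
    return "incident-precursor"


# Example: a 500 GB gp2 volume observed at 7 ms await and 55% CPU.
print(gp2_baseline_iops(500), classify_disk_latency(7), classify_cpu(55))
```

Reading the disk signals against `gp2_baseline_iops` matters because the same observed IOPS figure can be comfortable on a large volume and saturating on a small one.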

Last updated · 542 distilled / 1,571 read