CANVA 2024-04-29 Tier 3


Canva — Scaling to Count Billions

Summary

Canva's Creators-payment pipeline counts billions of content-usage events per month (templates, images, videos) to pay creators. The team walked through three architectural stages: (1) MySQL + worker services doing single-threaded sequential scans with one DB round-trip per record, which exploded operationally as storage doubled every 8–10 months and required O(N) DB queries per pass; (2) a DynamoDB move that helped raw-event storage but left the per-record round-trip problem intact; (3) an OLAP + ELT redesign where service data is replicated into Snowflake, DBT transformations do end-to-end deduplication and aggregation as SQL over the whole month, and a scheduled unloader exports results via S3 + SQS back into service RDS for low-latency reads. Key numerical outcomes: billions of records aggregated in minutes, pipeline latency >1 day → <1 hour, >50% storage reduction, incidents from ≥1/month to ~1 per several months.

Key takeaways

  • OLTP-style incremental counting over billions of records doesn't scale. A per-record loop that does ≥1 DB read and ≥1 DB write is O(N) round-trips regardless of batching, because O(N/C) is still O(N) for constant batch C. Add single-threaded sequential scans and a stuck event delays everything behind it. (Source: sources/2024-04-29-canva-scaling-to-count-billions)
  • MySQL RDS vertical scaling is a finite game. Canva doubled RDS instance size every 8–10 months; at several TB the shared-instance blast radius and version-upgrade operational cost dominated. This is the classic "vertical-scale wall" that pushes teams to horizontal storage or offloading the workload. (See systems/mysql, systems/aws-rds.)
  • OLAP + ELT collapses intermediary state. By pushing deduplication and aggregation into Snowflake + DBT as SQL transformations over the source data, Canva eliminated multiple persisted intermediary tables and "thousands of lines of deduplication/aggregation code". The transformation output is a SQL view over the stage before it, materialised on demand. (See concepts/elt-vs-etl, systems/dbt.)
  • Compute–storage separation is the scalability lever. Snowflake's independent compute scaling is what makes "aggregate the whole month in minutes" feasible: most of the calculation is in memory, orders of magnitude faster than per-record DB round-trips. (See concepts/compute-storage-separation.)
  • End-to-end recompute turns most incidents into a pipeline rerun. Overcounting / undercounting / misclassification previously required engineers to SSH into a database and surgically fix intermediary tables over days. With ELT, rerunning the pipeline re-derives the result from source data; an outer-join aggregation step overwrites old aggregates with new ones (including setting obsolete rows to 0). "Fixing code is easier than fixing data." (See patterns/end-to-end-recompute.)
  • OLAP ↔ OLTP needs a bridge, not a direct link. OLAP warehouses are not built for low-latency high-QPS lookups. Canva unloads Snowflake output via a scheduled worker that lands data in S3 as intermediate durable storage, fans out through SQS for durability, and ingests into service RDS with rate-limited throughput (they show RDS CPU spikes when the rate was too high). (See patterns/warehouse-unload-bridge.)
  • DynamoDB solved raw-event storage but not processing. Moving raw usage events to DynamoDB fixed the storage scaling problem; the team deliberately stopped before rewriting the whole pipeline on DynamoDB because per-record DB round-trips would remain. The later OLAP move extracts DynamoDB JSON data and projects JSON fields into typed columns for warehouse query performance. (See systems/dynamodb.)
  • Start small and let observability gate the rework. Canva explicitly defends the initial MySQL design as correct for the first 2 years — "it served its purpose" — and argues that architectural change should be triggered by measured growth, not anticipatory redesign. This pairs with the tenet to "observe from day 1" at every pipeline stage (latency/throughput/backlog), because "it's hard to predict where a problem can happen."
  • New architecture brings new complexity. The DBT codebase is a separate deployable with its own CI/CD and release schedule; schema-change compatibility must be reasoned about across two release cadences. Observability tooling for the data platform is different from the services, adding integration cost.
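The round-trip asymmetry in the first takeaway can be sketched with SQLite standing in for both databases (schema and names are illustrative, not Canva's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage_events (event_id TEXT, template_id TEXT)")
conn.executemany(
    "INSERT INTO usage_events VALUES (?, ?)",
    [(f"e{i}", f"t{i % 3}") for i in range(9)],
)

# OLTP-style: one statement per event -> O(N) round-trips. In the real
# pipeline each iteration was also a read AND a write against MySQL.
counts = {}
for (template_id,) in conn.execute("SELECT template_id FROM usage_events"):
    counts[template_id] = counts.get(template_id, 0) + 1

# OLAP-style: one set-based query; the engine aggregates internally,
# which is what makes "billions in minutes" feasible on Snowflake.
set_based = dict(conn.execute(
    "SELECT template_id, COUNT(*) FROM usage_events GROUP BY template_id"
))

assert counts == set_based  # same answer, O(1) statements instead of O(N)
```

Batching the loop only divides the constant: O(N/C) statements for batch size C is still O(N).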

Architecture (final)

Three pipeline stages, each composed of DBT-defined SQL transformations running on Snowflake:

  1. Data collection — events from web / mobile / etc., validated and filtered, written to DynamoDB as JSON.
  2. Deduplication + classification — duplicates removed, rules applied to categorise each event into a payable type. Implemented as SQL with GROUP BY over extracted typed columns.
  3. Aggregation — per-dimension (e.g. per-brand, per-template, per-day) counts summed up. Output compared to prior aggregate via outer join, overwriting changed rows and zeroing obsolete ones.

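Stages 2 and 3 can be sketched as one SQL pass, here run via SQLite as a stand-in for Snowflake/DBT (table and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (event_id TEXT, content_type TEXT, brand TEXT)")
conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", [
    ("e1", "template", "acme"),
    ("e1", "template", "acme"),   # duplicate delivery of the same event
    ("e2", "image", "acme"),
    ("e3", "gif", "acme"),        # not a payable type in this sketch
])

# Dedup on event_id, classify into a payable type, then aggregate
# per dimension (here: per brand + payable type).
rows = conn.execute("""
    WITH deduped AS (
        SELECT event_id,
               MIN(content_type) AS content_type,
               MIN(brand)        AS brand
        FROM raw_events
        GROUP BY event_id
    ),
    classified AS (
        SELECT brand,
               CASE WHEN content_type IN ('template', 'image', 'video')
                    THEN content_type END AS payable_type
        FROM deduped
    )
    SELECT brand, payable_type, COUNT(*)
    FROM classified
    WHERE payable_type IS NOT NULL
    GROUP BY brand, payable_type
    ORDER BY payable_type
""").fetchall()
# rows -> [('acme', 'image', 1), ('acme', 'template', 1)]
```

In the real pipeline each CTE corresponds to a DBT model, materialised on demand as a view over the stage before it.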
Bridge back to operational serving:

  • Scheduled unloader reads Snowflake → writes S3 → emits SQS message → ingester reads and writes to service RDS, rate-limited to avoid RDS CPU spikes.
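A minimal sketch of the rate-limited ingester at the end of that bridge, assuming a generic batch-write callback (parameter values and names are illustrative; the real path goes through S3 and SQS first):

```python
import time

def ingest_rate_limited(rows, write_batch, batch_size=500, max_batches_per_sec=5):
    """Write rows to the serving DB in small batches, sleeping between
    batches so the RDS CPU stays flat instead of spiking.
    Parameters are illustrative, not Canva's tuning."""
    interval = 1.0 / max_batches_per_sec
    for i in range(0, len(rows), batch_size):
        write_batch(rows[i:i + batch_size])  # e.g. a bulk upsert into service RDS
        time.sleep(interval)                 # throttle: bounds write QPS on the DB

# Toy usage: "write" into a list; a high rate keeps the sketch fast.
written = []
ingest_rate_limited(list(range(1200)), written.extend,
                    batch_size=500, max_batches_per_sec=1000)
```

The sleep is the whole point: without it, the warehouse can hand the ingester data far faster than the serving database can absorb it, which is exactly the CPU-spike failure mode the talk shows.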

Aggregation-overwrite semantics (from the article's table):

| Old count | New count | Output |
|-----------|-----------|--------|
| X         | X         | X      |
| X         | Y         | Y      |
| null      | X         | X      |
| X         | null      | 0      |

This is the engine of "rerun to fix" — whatever the new aggregation says, that's the truth.
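Those overwrite semantics amount to a full outer join where the new side always wins; a sketch in plain Python rather than Canva's SQL (the dimension keys are hypothetical):

```python
def merge_aggregates(old, new):
    """Full outer join of old and new aggregates, keyed by dimension.
    The new value always wins; keys absent from the new run are zeroed."""
    return {key: new.get(key, 0) for key in old.keys() | new.keys()}

old = {("brand-a", "2024-03"): 10, ("brand-b", "2024-03"): 7}
new = {("brand-a", "2024-03"): 12, ("brand-c", "2024-03"): 3}

merged = merge_aggregates(old, new)
# brand-a -> 12 (new wins), brand-b -> 0 (obsolete), brand-c -> 3 (new row)
```

Because the merge never trusts the old side, rerunning the pipeline after a code fix is sufficient to repair overcounts, undercounts, and misclassifications alike.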

Numbers

  • Usage growth: doubles every 18 months over 3+ years.
  • Scale: billions of usage events per month.
  • Pipeline latency: >1 day → <1 hour end-to-end.
  • Aggregation runtime: billions of records in a few minutes.
  • Storage: >50% reduction vs. the MySQL intermediary-table design.
  • Incident rate: ≥1/month (worst case) → ~1 per several months.
  • RDS storage burn rate: ~500 GB consumed in 6 months (≈ 50% of free capacity) under the MySQL design.

Incident taxonomy (MySQL era vs. ELT era)

| Incident type     | MySQL era recovery                                               | ELT era recovery                          |
|-------------------|------------------------------------------------------------------|-------------------------------------------|
| Overcounting      | Identify bad event type, pause, hand-edit dedup + agg tables     | Rerun pipeline end-to-end                 |
| Undercounting     | Find window, backfill from backup (backfill can take days)       | Rerun pipeline against backfilled source  |
| Misclassification | Find bug, pause, hand-edit dedup + agg tables with cross-review  | Rerun pipeline end-to-end                 |
| Processing delay  | Debug stuck worker, drain backlog                                | Eliminated — no sequential scan           |

The recovery-cost asymmetry is what pays for the whole migration.

Caveats & open trade-offs

  • OLAP is not a serving tier. Results must be unloaded to something that can serve reads at low latency; that unload path is a first-class piece of the architecture (S3 + SQS + rate-limited RDS ingester), with its own tuning story (RDS CPU spikes).
  • Schema-change blast radius is larger. SQL transformations in DBT are coupled to the upstream tables' schemas; the DBT service has a separate release cadence, so schema evolution needs explicit compatibility reasoning.
  • DynamoDB as raw-event store keeps JSON. Warehouse queries need the JSON columns projected to typed columns as part of the E/L step for reasonable query performance.
  • Infra complexity cost. Data replication pipeline, DBT, separate observability tooling — the team calls this out as a real cost, not free.
  • Tenet: "simplicity is a key aspect of designing reliable services" — reducing code and data complexity. Fixing code is easier than fixing data. This is the OLAP/ELT move's main justification beyond raw performance.
