Canva Usage-Counting Pipeline¶
Canva's usage-counting pipeline tracks content-usage events (templates, images, videos) across the Canva Creators program to compute how much each creator gets paid. It processes billions of events per month, with usage having doubled every 18 months since the program launched in 2021.
Final architecture (2024)¶
Three logical stages. Collection writes raw events to DynamoDB; a managed replication pipeline copies them into Snowflake, where deduplication, classification, and aggregation run as DBT models; results are exported back out to RDS for serving:
- Data collection — events from web / mobile / other sources; validated and filtered; raw events persisted to DynamoDB in JSON.
- Deduplication + classification — Snowflake SQL (DBT) extracts JSON fields into typed columns, removes duplicates, and applies classification rules that determine which events are payable and at what rate.
- Aggregation — per-dimension (brand, template, day) counts computed with GROUP BY; final step outer-joins against prior aggregate output to overwrite changed rows and zero-out obsolete rows (see patterns/end-to-end-recompute).
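The extract-dedupe-classify-aggregate flow above can be sketched in miniature. This is a hypothetical illustration using Python's `sqlite3` in place of Snowflake; the event fields, classification rule, and dimension names are assumptions, not Canva's actual schema.

```python
import json
import sqlite3

# Toy raw events as they might land in DynamoDB (JSON documents).
# "e1" is duplicated; "e2" is a non-payable event kind.
raw_events = [
    {"event_id": "e1", "brand": "b1", "template": "t1", "day": "2024-04-01", "kind": "template_use"},
    {"event_id": "e1", "brand": "b1", "template": "t1", "day": "2024-04-01", "kind": "template_use"},
    {"event_id": "e2", "brand": "b1", "template": "t1", "day": "2024-04-01", "kind": "preview"},
    {"event_id": "e3", "brand": "b2", "template": "t1", "day": "2024-04-01", "kind": "template_use"},
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw(doc TEXT)")
con.executemany("INSERT INTO raw VALUES (?)", [(json.dumps(e),) for e in raw_events])

# Stage 2 analogue: extract JSON fields into typed columns, drop duplicates,
# and keep only events the (toy) classification rule deems payable.
con.execute("""
    CREATE TABLE classified AS
    SELECT DISTINCT
        json_extract(doc, '$.event_id') AS event_id,
        json_extract(doc, '$.brand')    AS brand,
        json_extract(doc, '$.template') AS template,
        json_extract(doc, '$.day')      AS day
    FROM raw
    WHERE json_extract(doc, '$.kind') = 'template_use'
""")

# Stage 3 analogue: per-dimension counts via GROUP BY.
counts = con.execute("""
    SELECT brand, template, day, COUNT(*) AS uses
    FROM classified
    GROUP BY brand, template, day
    ORDER BY brand
""").fetchall()
print(counts)  # [('b1', 't1', '2024-04-01', 1), ('b2', 't1', '2024-04-01', 1)]
```

The duplicate of `e1` is collapsed by `SELECT DISTINCT` and the `preview` event is filtered out before aggregation, so each creator-facing count reflects deduplicated, payable usage only.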
Unload path: aggregates are exported from Snowflake back to RDS for serving (see patterns/warehouse-unload-bridge).
Outcomes¶
- Pipeline latency: >1 day → <1 hour.
- Aggregation runtime: billions of records in a few minutes.
- Stored data: >50% reduction (intermediary tables eliminated).
- Lines of deduplication + aggregation code: thousands deleted (rewritten as SQL in DBT).
- Incident rate: ≥1/month → ~1/several months.
- Processing-delay incidents: eliminated.
Evolution¶
| Era | Raw store | Dedup/agg store | Processing model | Problems |
|---|---|---|---|---|
| v1 MySQL | MySQL RDS | MySQL RDS | Worker services, single-threaded sequential scan, 1+ DB round-trip per record | O(N) queries, vertical-scale wall, days-long incident recovery, shared-instance blast radius |
| v2 DynamoDB for raw | DynamoDB | MySQL RDS | Same worker model | Storage scale fixed; per-record round-trip problem remained; team decided not to rewrite further on DynamoDB |
| v3 OLAP + ELT | DynamoDB | Snowflake + DBT | End-to-end SQL recomputation; outer-join overwrite | Schema-change coupling across release cadences; bridge back to RDS needs tuning (CPU spikes) |
(Source: sources/2024-04-29-canva-scaling-to-count-billions)
Incident-recovery semantics (key property)¶
Before: each failure mode (overcounting, undercounting, misclassification) required its own forensic procedure: pausing the pipeline, hand-editing rows in the dedup and aggregation tables, and days of cross-verification by multiple engineers.
After: overcounting, undercounting, and misclassification all reduce to "fix the code, rerun the pipeline" — the outer-join aggregation step overwrites stale values and zeroes out obsolete rows. "Fixing code is generally easier than fixing broken data." (See patterns/end-to-end-recompute.)
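The overwrite-and-zero semantics can be shown with a small sketch. This is an illustrative model of the outer-join step in plain Python (dimension keys and values are made up): every key in either the fresh recompute or the previously published aggregate gets a definitive new value, so a rerun after a code fix leaves no stale rows behind.

```python
def publish(previous: dict, recomputed: dict) -> dict:
    """Full-outer-join semantics over (brand, template, day) keys:
    recomputed values win; keys absent from the recompute are zeroed."""
    all_keys = previous.keys() | recomputed.keys()
    return {k: recomputed.get(k, 0) for k in all_keys}

prior = {
    ("b1", "t1", "2024-04-01"): 7,  # overcounted; corrected by the rerun
    ("b1", "t2", "2024-04-01"): 3,  # obsolete after the code fix
}
fresh = {
    ("b1", "t1", "2024-04-01"): 5,
    ("b2", "t1", "2024-04-01"): 4,  # newly payable after the fix
}
print(publish(prior, fresh))
```

Because the published table is wholly derived from the recompute, fixing any of the three failure modes is the same operation: change the code, rerun, and let the join overwrite the output.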
Caveats¶
- OLAP warehouses are not a serving tier — explicit unload bridge to RDS, with rate limiting to avoid RDS CPU spikes.
- DBT codebase has its own release cadence; schema changes need compatibility reasoning.
- Infrastructure complexity: the replication pipeline, DBT, and separate observability tooling are a real cost.
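The rate-limited unload bridge in the first caveat can be sketched as follows. This is a hypothetical throttling shape, not Canva's implementation; the batch size, rate, and `write_batch` callback are assumptions.

```python
import time

def unload(rows, write_batch, batch_size=500, max_batches_per_sec=2.0):
    """Push warehouse output into the serving store in fixed-size batches,
    sleeping between batches so the RDS instance's CPU is not saturated."""
    interval = 1.0 / max_batches_per_sec
    for i in range(0, len(rows), batch_size):
        write_batch(rows[i:i + batch_size])  # e.g. a multi-row upsert into RDS
        time.sleep(interval)                 # throttle to smooth the load

# Usage sketch: 1200 rows at batch_size=500 arrive as batches of 500, 500, 200.
written = []
unload(list(range(1200)), written.append, batch_size=500, max_batches_per_sec=50)
print(len(written))  # 3
```

The point of the throttle is that a warehouse can emit results far faster than an OLTP serving tier can absorb them; pacing the writes trades unload latency for a bounded, predictable load on RDS.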
Seen in¶
- sources/2024-04-29-canva-scaling-to-count-billions — the only source describing this system; architectural-evolution retrospective (MySQL → DynamoDB → Snowflake+DBT).