Canva Usage-Counting Pipeline¶
Canva's usage-counting pipeline tracks content-usage events (templates, images, videos) across the Canva Creators program to compute how much each creator gets paid. It processes billions of events per month, with usage having doubled every 18 months since the program launched in 2021.
Final architecture (2024)¶
Three logical stages. Collection writes raw events to DynamoDB; a managed replication pipeline copies them into Snowflake, where deduplication, classification, and aggregation run as DBT models; results are exported back out to RDS for serving:
- Data collection — events from web / mobile / other sources; validated and filtered; raw events persisted to DynamoDB in JSON.
- Deduplication + classification — Snowflake SQL (DBT) extracts JSON fields into typed columns, removes duplicates, and applies classification rules that determine which events are payable and at what rate.
- Aggregation — per-dimension (brand, template, day) counts computed with GROUP BY; final step outer-joins against prior aggregate output to overwrite changed rows and zero-out obsolete rows (see patterns/end-to-end-recompute).
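The extract-dedupe-classify-aggregate flow above can be sketched in miniature. This is a hypothetical illustration using Python's `sqlite3` in place of Snowflake; the event fields, classification rule, and dimension names are assumptions, not Canva's actual schema.

```python
import json
import sqlite3

# Toy raw events as they might land in DynamoDB (JSON documents).
# "e1" is duplicated; "e2" is a non-payable event kind.
raw_events = [
    {"event_id": "e1", "brand": "b1", "template": "t1", "day": "2024-04-01", "kind": "template_use"},
    {"event_id": "e1", "brand": "b1", "template": "t1", "day": "2024-04-01", "kind": "template_use"},
    {"event_id": "e2", "brand": "b1", "template": "t1", "day": "2024-04-01", "kind": "preview"},
    {"event_id": "e3", "brand": "b2", "template": "t1", "day": "2024-04-01", "kind": "template_use"},
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw(doc TEXT)")
con.executemany("INSERT INTO raw VALUES (?)", [(json.dumps(e),) for e in raw_events])

# Stage 2 analogue: extract JSON fields into typed columns, drop duplicates,
# and keep only events the (toy) classification rule deems payable.
con.execute("""
    CREATE TABLE classified AS
    SELECT DISTINCT
        json_extract(doc, '$.event_id') AS event_id,
        json_extract(doc, '$.brand')    AS brand,
        json_extract(doc, '$.template') AS template,
        json_extract(doc, '$.day')      AS day
    FROM raw
    WHERE json_extract(doc, '$.kind') = 'template_use'
""")

# Stage 3 analogue: per-dimension counts via GROUP BY.
counts = con.execute("""
    SELECT brand, template, day, COUNT(*) AS uses
    FROM classified
    GROUP BY brand, template, day
    ORDER BY brand
""").fetchall()
print(counts)  # [('b1', 't1', '2024-04-01', 1), ('b2', 't1', '2024-04-01', 1)]
```

The duplicate of `e1` is collapsed by `SELECT DISTINCT` and the `preview` event is filtered out before aggregation, so each creator-facing count reflects deduplicated, payable usage only.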
Unload path: aggregates are exported from Snowflake back to RDS for serving (see patterns/warehouse-unload-bridge).
Outcomes¶
- Pipeline latency: >1 day → <1 hour.
- Aggregation runtime: billions of records in a few minutes.
- Stored data: >50% reduction (intermediary tables eliminated).
- Lines of deduplication + aggregation code: thousands deleted (rewritten as SQL in DBT).
- Incident rate: ≥1/month → ~1/several months.
- Processing-delay incidents: eliminated.
Evolution¶
| Era | Raw store | Dedup/agg store | Processing model | Problems |
|---|---|---|---|---|
| v1 MySQL | MySQL RDS | MySQL RDS | Worker services, single-threaded sequential scan, 1+ DB round-trip per record | O(N) queries, vertical-scale wall, days-long incident recovery, shared-instance blast radius |
| v2 DynamoDB for raw | DynamoDB | MySQL RDS | Same worker model | Storage scale fixed; per-record round-trip problem remained; team decided not to rewrite further on DynamoDB |
| v3 OLAP + ELT | DynamoDB | Snowflake + DBT | End-to-end SQL recomputation; outer-join overwrite | Schema-change coupling across release cadences; bridge back to RDS needs tuning (CPU spikes) |
(Source: sources/2024-04-29-canva-scaling-to-count-billions)
Incident-recovery semantics (key property)¶
Before: each failure mode (overcounting, undercounting, misclassification) required its own forensic procedure: pausing the pipeline, hand-editing rows in the dedup and aggregation tables, and days of cross-verification by multiple engineers.
After: overcounting, undercounting, and misclassification all reduce to "fix the code, rerun the pipeline" — the outer-join aggregation step overwrites stale values and zeroes out obsolete rows. "Fixing code is generally easier than fixing broken data." (See patterns/end-to-end-recompute.)
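The overwrite-and-zero semantics can be shown with a small sketch. This is an illustrative model of the outer-join step in plain Python (dimension keys and values are made up): every key in either the fresh recompute or the previously published aggregate gets a definitive new value, so a rerun after a code fix leaves no stale rows behind.

```python
def publish(previous: dict, recomputed: dict) -> dict:
    """Full-outer-join semantics over (brand, template, day) keys:
    recomputed values win; keys absent from the recompute are zeroed."""
    all_keys = previous.keys() | recomputed.keys()
    return {k: recomputed.get(k, 0) for k in all_keys}

prior = {
    ("b1", "t1", "2024-04-01"): 7,  # overcounted; corrected by the rerun
    ("b1", "t2", "2024-04-01"): 3,  # obsolete after the code fix
}
fresh = {
    ("b1", "t1", "2024-04-01"): 5,
    ("b2", "t1", "2024-04-01"): 4,  # newly payable after the fix
}
print(publish(prior, fresh))
```

Because the published table is wholly derived from the recompute, fixing any of the three failure modes is the same operation: change the code, rerun, and let the join overwrite the output.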
Caveats¶
- OLAP warehouses are not a serving tier — explicit unload bridge to RDS, with rate limiting to avoid RDS CPU spikes.
- DBT codebase has its own release cadence; schema changes need compatibility reasoning.
- Infrastructure complexity: the replication pipeline, DBT, and separate observability tooling are a real cost.
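The rate-limited unload bridge in the first caveat can be sketched as follows. This is a hypothetical throttling shape, not Canva's implementation; the batch size, rate, and `write_batch` callback are assumptions.

```python
import time

def unload(rows, write_batch, batch_size=500, max_batches_per_sec=2.0):
    """Push warehouse output into the serving store in fixed-size batches,
    sleeping between batches so the RDS instance's CPU is not saturated."""
    interval = 1.0 / max_batches_per_sec
    for i in range(0, len(rows), batch_size):
        write_batch(rows[i:i + batch_size])  # e.g. a multi-row upsert into RDS
        time.sleep(interval)                 # throttle to smooth the load

# Usage sketch: 1200 rows at batch_size=500 arrive as batches of 500, 500, 200.
written = []
unload(list(range(1200)), written.append, batch_size=500, max_batches_per_sec=50)
print(len(written))  # 3
```

The point of the throttle is that a warehouse can emit results far faster than an OLTP serving tier can absorb them; pacing the writes trades unload latency for a bounded, predictable load on RDS.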
Seen in¶
- sources/2024-04-29-canva-scaling-to-count-billions — the only source describing this system; architectural-evolution retrospective (MySQL → DynamoDB → Snowflake+DBT).