CONCEPT Cited by 2 sources
ELT vs ETL¶
ETL (Extract → Transform → Load) transforms data before loading it into the warehouse, usually in an external worker. ELT (Extract → Load → Transform) lands raw data into the warehouse first and runs transformations inside it, as SQL over warehouse compute.
The flip is powered by OLAP warehouses with separate, elastic compute: once the engine can aggregate billions of rows in minutes (see concepts/compute-storage-separation), pulling data out to transform in application code becomes the slow path.
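The contrast can be sketched with a toy in-process warehouse — here sqlite3 stands in for Snowflake, and all table and column names are invented for illustration. Raw rows are loaded as-is (E + L), and the transform is SQL executed where the data lives (T):

```python
import sqlite3

# Toy "warehouse": land raw events untransformed (the E + L of ELT)...
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_events (user_id TEXT, day TEXT, action TEXT)")
con.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [("u1", "2024-01-01", "edit"),
     ("u1", "2024-01-01", "view"),
     ("u2", "2024-01-01", "edit")],
)

# ...then transform inside the engine (the T), instead of pulling
# rows out to application code and writing aggregates back.
rows = con.execute(
    """
    SELECT user_id, day, COUNT(*) AS events
    FROM raw_events
    GROUP BY user_id, day
    ORDER BY user_id
    """
).fetchall()
print(rows)  # [('u1', '2024-01-01', 2), ('u2', '2024-01-01', 1)]
```

An ETL pipeline would instead fetch the raw rows over the wire, aggregate them in application code, and load only the result — the per-row wire traffic the next section calls out.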
Why ELT wins for aggregation-heavy pipelines¶
- Compute is where the data is. Scan + GROUP BY + JOIN stays in the warehouse engine; no per-row wire traffic to an external worker.
- Intermediate stages become SQL views, not persisted tables. Canva called this out explicitly: intermediate DBT transformation outputs are materialised as SQL views that are "simple and flexible to change", eliminating multiple persisted intermediary tables. (Source: sources/2024-04-29-canva-scaling-to-count-billions)
- Code shrinks. Canva removed "thousands of lines of deduplication and aggregation calculation code" by rewriting the logic in SQL, which was "surprisingly simpler" than the imperative equivalent.
- End-to-end recompute becomes cheap. Reruns are just re-executing the DBT graph; see patterns/end-to-end-recompute.
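The "code shrinks" point is easiest to see with deduplication: pages of imperative keep-the-latest-record logic collapse into one window function. A hypothetical sketch (sqlite3 as the engine; the schema is invented, not Canva's):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_usage (design_id TEXT, user_id TEXT, ts INTEGER)")
con.executemany(
    "INSERT INTO raw_usage VALUES (?, ?, ?)",
    [("d1", "u1", 1), ("d1", "u1", 2),   # duplicate pair; keep ts=2
     ("d1", "u2", 1)],
)

# Dedup as a single window function: keep the latest row per
# (design_id, user_id), replacing imperative dedup loops.
deduped = con.execute(
    """
    SELECT design_id, user_id, ts FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY design_id, user_id ORDER BY ts DESC
        ) AS rn
        FROM raw_usage
    ) WHERE rn = 1
    ORDER BY design_id, user_id
    """
).fetchall()
print(deduped)  # [('d1', 'u1', 2), ('d1', 'u2', 1)]
```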
Typical ELT stack¶
- Extract + Load: a managed replication pipeline from source databases into the warehouse. Canva uses its data-platform team's replication service to pull from DynamoDB and service DBs into Snowflake; raw JSON fields are projected into typed columns as part of this step for query performance.
- Transform: SQL models organised as a DAG, authored in a tool like DBT (systems/dbt). Each model selects from upstream models and feeds the next; the final models are the numbers you actually serve.
- Warehouse: OLAP engine that can scale compute for the transform runs — Snowflake, BigQuery, Redshift, Databricks SQL.
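The model DAG can be mimicked with chained views: each "model" is a SELECT over its upstream, and the final view is what gets served. A schematic sketch (names hypothetical; DBT would manage these as `ref()`-linked `.sql` files rather than inline strings):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_events (user_id TEXT, design_id TEXT)")
con.executemany("INSERT INTO raw_events VALUES (?, ?)",
                [("u1", "d1"), ("u2", "d1"), ("u1", "d2")])

# Intermediate "model" as a view — cheap to change, never persisted.
con.execute("""
    CREATE VIEW stg_events AS
    SELECT user_id, design_id FROM raw_events
""")

# Final "model" selects from its upstream, forming a small DAG;
# rerunning the whole pipeline is just re-executing the graph.
con.execute("""
    CREATE VIEW design_usage AS
    SELECT design_id, COUNT(DISTINCT user_id) AS unique_users
    FROM stg_events
    GROUP BY design_id
""")

result = con.execute(
    "SELECT design_id, unique_users FROM design_usage ORDER BY design_id"
).fetchall()
print(result)  # [('d1', 2), ('d2', 1)]
```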
Costs / caveats¶
- Separate deploy cadence. The DBT codebase at Canva is a standalone service with its own CI/CD; schema changes require compatibility reasoning across two release schedules. (Source: sources/2024-04-29-canva-scaling-to-count-billions)
- Different observability tooling. The warehouse + DBT stack has its own metrics/logs story, adding integration cost.
- Still needs a bridge to OLTP for serving. Warehouse query latency isn't millisecond-class; low-latency reads require exporting the derived data back out. See patterns/warehouse-unload-bridge.
- Upstream schema coupling. SQL models depend on the shapes of the extracted tables; source schema changes ripple through the DAG.
Seen in¶
- sources/2024-04-29-canva-scaling-to-count-billions — Canva moves dedup + aggregation into Snowflake + DBT SQL models; billions of records aggregated in minutes; >50% storage reduction.
- sources/2025-09-18-mongodb-modernizing-core-insurance-systems-breaking-the-batch-bottleneck — sibling architectural answer to the same per-record round-trip trap, solving it without moving to a warehouse: keep the transform in the application layer (Java + MongoDB Atlas) but collapse the round-trips via native `bulkWrite` + prefetch + parallel execution (patterns/bulk-write-batch-optimization). When the workload is per-record enrichment with side effects rather than large-scale aggregation, ETL-via-bulk-operations wins; when it's aggregation-heavy, ELT-in-warehouse wins.
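The round-trip contrast between the two answers can be made concrete by counting calls: per-record writes pay one round trip each, while batched writes pay one per chunk. A schematic sketch (an invented in-memory store, not the MongoDB `bulkWrite` API):

```python
# Schematic store that counts round trips (stand-ins for network calls).
class Store:
    def __init__(self):
        self.round_trips = 0
        self.rows = []

    def write_one(self, row):          # per-record write: one trip each
        self.round_trips += 1
        self.rows.append(row)

    def bulk_write(self, batch):       # batched write: one trip per chunk
        self.round_trips += 1
        self.rows.extend(batch)

records = list(range(1000))

naive = Store()
for r in records:
    naive.write_one(r)                 # 1000 round trips

batched = Store()
for i in range(0, len(records), 250):  # 4 round trips of 250 records
    batched.bulk_write(records[i:i + 250])

print(naive.round_trips, batched.round_trips)  # 1000 4
```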