

ELT vs ETL

ETL (Extract → Transform → Load) transforms data before loading it into the warehouse, usually in an external worker. ELT (Extract → Load → Transform) lands raw data into the warehouse first and runs transformations inside it, as SQL over warehouse compute.
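The two orderings can be sketched in a few lines. This is a minimal illustration using Python's stdlib `sqlite3` as a stand-in for the warehouse; all table, column, and value names are invented for the example.

```python
import sqlite3

# ELT shape: land raw rows first, then transform with SQL inside the engine.
con = sqlite3.connect(":memory:")

# 1. Extract + Load: raw data lands untouched.
con.execute("CREATE TABLE raw_events (user_id TEXT, event TEXT)")
con.executemany("INSERT INTO raw_events VALUES (?, ?)",
                [("u1", "view"), ("u1", "view"), ("u2", "click")])

# 2. Transform: the aggregation runs as SQL over engine compute,
#    not row-by-row in application code (the ETL shape).
rows = con.execute("""
    SELECT user_id, COUNT(*) AS n_events
    FROM raw_events
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()
print(rows)  # [('u1', 2), ('u2', 1)]
```

In the ETL shape, step 2 would instead happen in an external worker before the `INSERT`, with only the already-aggregated rows reaching the warehouse.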

The flip is powered by OLAP warehouses with separate, elastic compute: once the engine can aggregate billions of rows in minutes (see concepts/compute-storage-separation), pulling data out of the warehouse to transform it in application code becomes the slow path.

Why ELT wins for aggregation-heavy pipelines

  • Compute is where the data is. Scan + GROUP BY + JOIN stays in the warehouse engine; no per-row wire traffic to an external worker.
  • Intermediate stages become SQL views, not persisted tables. Canva called this out explicitly: intermediate DBT transformation outputs are materialised as SQL views that are "simple and flexible to change", eliminating multiple persisted intermediary tables. (Source: sources/2024-04-29-canva-scaling-to-count-billions)
  • Code shrinks. Canva removed "thousands of lines of deduplication and aggregation calculation code" by rewriting the logic in SQL, which was "surprisingly simpler" than the imperative equivalent.
  • End-to-end recompute becomes cheap. Reruns are just re-executing the DBT graph; see patterns/end-to-end-recompute.
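The "code shrinks" point is concrete: deduplication plus aggregation that would need explicit seen-sets and accumulators in imperative code collapses into one declarative statement. A minimal sketch with stdlib `sqlite3`; the schema and values are invented, not Canva's actual tables.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_usage (doc_id TEXT, user_id TEXT, day TEXT)")
con.executemany("INSERT INTO raw_usage VALUES (?, ?, ?)", [
    ("d1", "u1", "2024-01-01"),
    ("d1", "u1", "2024-01-01"),   # duplicate delivery from the source
    ("d1", "u2", "2024-01-01"),
    ("d2", "u1", "2024-01-02"),
])

# Dedup (inner DISTINCT) + aggregate (outer GROUP BY) in one statement,
# executed where the data already lives.
counts = con.execute("""
    SELECT doc_id, COUNT(*) AS uses FROM (
        SELECT DISTINCT doc_id, user_id, day FROM raw_usage
    ) GROUP BY doc_id ORDER BY doc_id
""").fetchall()
print(counts)  # [('d1', 2), ('d2', 1)]
```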

Typical ELT stack

  • Extract + Load: a managed replication pipeline from source databases into the warehouse. Canva uses their data-platform team's replication service to pull from DynamoDB + service DBs into Snowflake. Raw JSON fields get projected into typed columns as part of this step for query performance.
  • Transform: SQL models organised as a DAG, authored in a tool like DBT (systems/dbt). Each model selects from upstream models and emits the next; the final models are the numbers you actually serve.
  • Warehouse: OLAP engine that can scale compute for the transform runs — Snowflake, BigQuery, Redshift, Databricks SQL.
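The whole stack can be mimicked end to end: raw JSON is projected into typed columns at load time, and each transform "model" is a view selecting from the one upstream of it, forming a small DAG like a DBT project. A sketch with stdlib `sqlite3` (assumes a SQLite build with the JSON1 functions, which ships with modern Python); names like `stg_loads` and `fct_design_usage` are invented for illustration.

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")

# Extract + Load: raw JSON lands first ...
con.execute("CREATE TABLE raw_loads (payload TEXT)")
con.executemany("INSERT INTO raw_loads VALUES (?)", [
    (json.dumps({"design": "d1", "user": "u1"}),),
    (json.dumps({"design": "d1", "user": "u2"}),),
    (json.dumps({"design": "d2", "user": "u1"}),),
])
# ... and a typed projection is created as part of the load step.
con.execute("""
    CREATE VIEW stg_loads AS
    SELECT json_extract(payload, '$.design') AS design_id,
           json_extract(payload, '$.user')   AS user_id
    FROM raw_loads
""")

# Transform: the final model is another view over the staging model;
# re-running the whole graph is just re-executing these statements.
con.execute("""
    CREATE VIEW fct_design_usage AS
    SELECT design_id, COUNT(DISTINCT user_id) AS unique_users
    FROM stg_loads
    GROUP BY design_id
""")
final = con.execute(
    "SELECT * FROM fct_design_usage ORDER BY design_id").fetchall()
print(final)  # [('d1', 2), ('d2', 1)]
```

Because both stages are views, changing a definition changes all downstream results on the next run with no intermediary tables to backfill.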

Costs / caveats

  • Separate deploy cadence. The DBT codebase at Canva is a standalone service with its own CI/CD; schema changes require compatibility reasoning across two release schedules. (Source: sources/2024-04-29-canva-scaling-to-count-billions)
  • Different observability tooling. The warehouse + DBT stack has its own metrics/logs story, adding integration cost.
  • Still needs a bridge to OLTP for serving. Warehouse query latency isn't millisecond-class; low-latency reads require exporting the derived data back out. See patterns/warehouse-unload-bridge.
  • Upstream schema coupling. SQL models depend on the shapes of the extracted tables; source schema changes ripple through the DAG.
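The serving caveat above is the one step that must leave the warehouse: derived numbers get unloaded into a low-latency store for product reads. A minimal sketch, again with `sqlite3` standing in for the warehouse and a plain dict standing in for the serving store; the names and counts are made up.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fct_counts (design_id TEXT, uses INTEGER)")
warehouse.executemany("INSERT INTO fct_counts VALUES (?, ?)",
                      [("d1", 42), ("d2", 7)])

# "Unload" bridge: copy the final model out to a key-value serving layer
# so the product read path never issues warehouse queries directly.
serving = {design_id: uses
           for design_id, uses in warehouse.execute(
               "SELECT design_id, uses FROM fct_counts")}
print(serving["d1"])  # 42
```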
