SYSTEM Cited by 1 source
DBT (data build tool)¶
DBT (data build tool) is the transformation layer of modern ELT
pipelines. Engineers write SQL SELECT queries as "models"; DBT
resolves the DAG between them via {{ ref('upstream_model') }} references,
compiles and executes the queries against the warehouse, and materialises
outputs as tables, views, or incremental tables. Documentation at
https://docs.getdbt.com/docs/introduction.
Why it appears in architectures¶
- It's the T in ELT — data lands in the warehouse raw, and DBT models push it through stages. Canva's counting pipeline defines deduplication, classification, and aggregation as DBT models on Snowflake; intermediate stages materialise as SQL Views. (Source: sources/2024-04-29-canva-scaling-to-count-billions)
- Lineage + DAG reasoning comes for free via
ref; pipeline stages aren't just "whatever SQL is in cron" — they have dependencies DBT understands. - Tests and schema contracts live alongside the models; migration hazards from upstream schema changes are visible.
Example (from Canva)¶
-- aggregate daily template usage up to per-brand counts
select
day_id,
template_brand,
...
sum(usage_count) as usage_count
from {{ ref('daily_template_usages') }}
group by
day_id,
template_brand
The previous-stage model daily_template_usages is resolved by DBT;
this model is what a downstream brand-level report would ref.
Operational properties¶
- Separate deployable with its own CI/CD. Canva explicitly flags that the DBT codebase is a standalone service with a different release cadence from the main services, and schema-change compatibility has to be reasoned about across the two.
- Materialisation choice matters. Views = cheap, always fresh, recomputed at read. Tables = cached, fast to read, need refresh. Incremental = append / merge for large tables with stable keys. Canva prefers views for intermediates: "simple and flexible to change" and no intermediary persisted state.
- Rerun semantics. A full DBT run re-executes the DAG — which is what enables patterns/end-to-end-recompute when paired with an outer-join upsert at the sink.
Caveats¶
- Observability lives in DBT + warehouse tooling, not the service stack — integration cost is real.
- Source-schema coupling. Models are tied to upstream table shapes; source schema changes ripple through. Canva lists this as a challenge.
Seen in¶
- sources/2024-04-29-canva-scaling-to-count-billions — Canva's counting pipeline is DBT models on Snowflake; dedup + aggregation as SQL; intermediate stages as views; "thousands of lines" of imperative code deleted.