PATTERN Cited by 1 source

End-to-end recompute

End-to-end recompute is a pipeline-design pattern where the output is always a deterministic function of the source data, recomputed from scratch on each run — no live-maintained incremental counters, no persisted intermediary state that needs repair. When the logic is wrong, you fix the code and rerun the pipeline, not the data.

It's the counterpart to incremental-computation pipelines, and becomes practical when OLAP + ELT make re-aggregating huge inputs fast and cheap enough (see concepts/elt-vs-etl, concepts/compute-storage-separation).

Why it's powerful

Canva's Creators-payment pipeline is the textbook case. In the MySQL era, every incident type (overcounting, undercounting, misclassification) required engineers to pause the pipeline and surgically edit intermediary tables — days of work with multi-engineer cross-review to verify nothing else regressed.

After the OLAP + ELT migration, the aggregation step recomputes totals from the deduplication step's output and full-outer-joins them against the prior output, overwriting changed rows and zeroing rows that no longer appear:

Old count | New count | Output
X         | X         | X
X         | Y         | Y
null      | X         | X
X         | null      | 0
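The reconciliation rule in the table can be sketched as a small function. This is a minimal sketch, assuming counts keyed by a stable dimension in plain dicts; the shape and names are illustrative, not Canva's actual schema:

```python
def reconcile(old: dict, new: dict) -> dict:
    """Full-outer-join overwrite: the freshly recomputed count always
    wins, and keys absent from the recompute are zeroed rather than
    deleted, so downstream consumers see an explicit correction."""
    return {key: new.get(key, 0) for key in old.keys() | new.keys()}
```

Each row of the table is one branch of this rule: matching counts pass through, changed counts are overwritten, new keys appear, and obsolete keys go to 0.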

Any of overcounting / undercounting / misclassification becomes: fix the code, rerun, done — assuming source data is intact. Fixing code is "generally easier than fixing broken data." (Source: sources/2024-04-29-canva-scaling-to-count-billions)

Preconditions

  • Source-of-truth preserved. Raw events must be durable and complete — Canva keeps raw events in DynamoDB and uses a managed replication pipeline into Snowflake. If raw is corrupt, recompute propagates the corruption.
  • Recompute is fast enough. Warehouse compute has to re-aggregate the full window within a time budget that fits incident response (Canva: billions of records in a few minutes).
  • Output overwrite is safe. The aggregation step must either replace or reconcile prior output deterministically — outer-join upsert works when the output is keyed by a stable dimension set.
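Taken together, the preconditions amount to: the output is a pure function of durable raw events. A minimal end-to-end sketch under that assumption; the event fields (`event_id`, `creator_id`) and the dict-based shapes are hypothetical, not Canva's real pipeline:

```python
from collections import Counter

def recompute(raw_events: list, prior_output: dict) -> dict:
    # Dedupe on a stable event id -- deterministic, last write wins.
    deduped = {e["event_id"]: e for e in raw_events}.values()
    # Re-aggregate the full window from scratch; no incremental state.
    totals = Counter(e["creator_id"] for e in deduped)
    # Overwrite prior output, zeroing rows the new logic no longer emits.
    return {k: totals.get(k, 0) for k in totals.keys() | prior_output.keys()}
```

Because the function reads only the raw events, rerunning it after a code fix is the entire recovery procedure.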

Knock-on benefits

  • Operational complexity drops. Recovery becomes a code change + rerun, not a forensic-edit session. Canva's incident rate dropped from ≥1/month to ~1 every few months.
  • Processing-delay incidents disappear. Workers stuck on sequential scans can't occur when the pipeline is a bounded SQL DAG.
  • Intermediary state collapses. Canva deleted >50% of stored data because the incremental intermediary tables were no longer needed.

Costs / caveats

  • Compute cost per rerun. End-to-end recompute burns full compute every run; acceptable on an OLAP warehouse with elastic compute, not on your OLTP cluster.
  • Not "self-healing free". The outer-join overwrite only corrects what the new logic produces — a bug that corrupts source data (wrong raw event ingested, missed partition) is not fixed by rerunning; you first need to fix the raw inputs.
  • Partitioning the recompute window. In practice you recompute a rolling window (e.g. month-to-date), not all-time; the chosen window size bounds both recovery scope and rerun cost.
  • Schema evolution in the source ripples through the DAG; the DBT codebase moves on its own release cadence (see concepts/elt-vs-etl).
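One way to scope a month-to-date rolling window; the cutoff logic here is a hypothetical sketch, not Canva's actual partitioning:

```python
from datetime import date

def window_start(run_date: date) -> date:
    # Month-to-date window: only partitions on or after this date
    # are recomputed; earlier partitions are frozen output.
    return run_date.replace(day=1)

def in_window(event_date: date, run_date: date) -> bool:
    return window_start(run_date) <= event_date <= run_date
```

Shrinking the window cuts rerun cost, but it also caps how far back a code fix can repair output, so the window is an explicit recovery-scope decision.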

Seen in
