Faster continuous integration builds at Canva
One-paragraph summary
Canva's Developer Platform group cut the average PR-to-merge CI time from
~80 min (Apr 2022, trending toward 1–3 h) to <30 min (sometimes 15 min) over
~2 years. The core diagnosis was that their CI was a horizontally-scaled
distributed system (build graph of ~900K nodes, ~3000 jobs per check-merge
build, >1000 builds/workday, each job taking a whole EC2 instance) whose
efficiency was collapsing under its own shape: non-hermetic arbitrary
scripts-as-steps, per-job EC2 warm-up tax, and a Bazel migration whose cache
and sandbox benefits were being eaten by artifact-download I/O. They attacked
it with first-principles theoretical-limit reasoning (20 min worst case given
modern hardware), large-scale experiments (e.g. a 448-CPU / 6 TB RAM instance that
still needed 18 min due to single-core critical-path actions), and a
multi-year sequence of incremental levers: Bazel "build without the bytes"
(--remote_download_minimal) with retry-on-cache-eviction; hermetic TestContainers
replacing shared-localstack test harnesses; bazelifying frontend integration
and accessibility tests; pipeline step consolidation (BE/ML pipeline 45→16
steps, ~50% build minutes cut); moving bazel-diff hash-computation to an
out-of-band S3-published job (pipeline generation 10+ min → 2–3 min → near-zero);
switching the build agent shape from i4i.8xlarge to c6id.12xlarge; EBS-snapshot
agent warm-up (P95 wait 40 min → 10 min); and ongoing test-health enforcement.
The post frames the work through Canva's "Product Development Process" distinction
between shaping (experimentation) and building (production) phases.
Key takeaways
- CI is a distributed system and should be diagnosed like one. Canva's CI has >10^5 build-graph nodes, thousands of jobs per commit, and downstream deps on AWS, Buildkite, GitHub, NPM/Maven/PyPI mirrors. Its performance is bound by the critical path (the longest chain of dependent actions), not by aggregate compute. Averages hide the critical-path experience, echoing the queueing-theory framing from the EBS retrospective (see concepts/queueing-theory).
- First-principles theoretical-limit sets the diagnostic ceiling. "A build or test action shouldn't take more than a few minutes on modern hardware, and the critical path shouldn't have more than 2 long dependent actions." Rough math: 2 × ~10 min = ~20 min worst case. Observed: 3 hours. That ~10× gap is the opportunity surface. Experiments (e.g., `bazel build //... - //web/...` on a 448-CPU / 6 TB RAM instance taking 18 min cold) confirmed the single-core critical-path action was the real floor, not aggregate CPU. See concepts/first-principles-theoretical-limit.
- "Arbitrary commands in CI steps" compounds at scale. Allowing any script/binary as a CI step was great for authors but forced each step onto its own EC2 instance (non-hermetic = non-parallel-safe, non-cacheable, state-leaky, flake-prone, hard-to-reproduce locally). With ~3000 steps per build and >1000 builds/day, the instance-warm-up tax dominated. The fix path is hermeticity: every action declares its inputs, runs in a sandbox, and is cache-keyed by input hash. See concepts/hermetic-build, concepts/content-addressed-caching.
- Bazel promises fast+correct, but onboarding cost is real. Canva hit four concrete Bazel tax-items: slow startup (loading the 900K-node graph takes minutes); sandbox symlink-per-input overhead (painful on `node_modules` with thousands of files per action); migration effort (every input must be declared); and RBE compatibility (workers must match local execution inputs). Not a blocker, but a work-item that had to be paid down.
- Build without the bytes (BwoB) was a 2–3× lever on Bazel steps. Bazel's `--remote_download_minimal` skips downloading cached artifacts unless another action needs them locally. Canva saw hundreds of GB of per-build downloads (mostly containers) disappear. Backend builds: 10 min → 5 min (2×). ML builds: 6 min → <2 min (3.3×). The one risk, cache eviction mid-build, was mitigated with a simple retry-on-check-failure workaround, rolled out broadly. See patterns/build-without-the-bytes.
- Grouping pipeline steps beats scaling agent count. BE/ML pipeline v2 (Apr 2023) grouped work so that each instance warm-up is amortized over more steps: 45 → 16 steps, average build time 49 → 35 min, ~50% build-minutes cut. A single FE build previously spawned ~100 integration-test jobs per commit; bazelifying these into ~8 grouped jobs removed ~1.3 M jobs/month and their warm-up cost. See patterns/pipeline-step-consolidation.
- Pipeline generation belongs off the critical path. Canva's TypeScript pipeline generator with `bazel query` + `bazel-diff` added >10 min to every commit because it ran on the critical path. Fix (Jan 2024): generate pipelines statically (pre-commit), push conditional evaluation to job runtime, and publish `bazel-diff` input-hashes to S3 from dedicated instances as soon as a commit is pushed. At runtime, each job downloads the hashes to decide what to run (fallback: let Bazel decide). The Starlark-based Bazel rewrite of the generator collapsed thousands of lines of conditional TypeScript to a couple hundred. See patterns/static-pipeline-generation.
- Hermetic tests unlock caching AND reliability. Moving backend integration tests from shared `localstack` to per-test `TestContainers` sandboxes (a) made them cacheable (same inputs → skip), (b) removed flakes from overloaded shared containers, and (c) enabled extending the pattern to service-container tests and hermetic E2E environments. Each service has its own TestContainer + a launch-validation test; E2E composes these. Deployment failures shift left to CI. See systems/testcontainers.
- Instance shape matters more than you think. Switching the slow `build-all` step from generic instances to `i4i.8xlarge` (multi-SSD, I/O-optimized) took the job from 3 h → 15 min on cold cache. Later moving agents from `i4i.8xlarge` to `c6id.12xlarge` (better CPU:mem:disk balance) saved another 2–6 min across builds (Mar 2024). The experimentation methodology (`top` + `iotop` on a test instance, then scaling up to isolate single-CPU vs I/O bounds) is itself the pattern. See patterns/instance-shape-right-sizing.
- EBS-snapshot agent warm-ups cut P95 agent wait by 75%. Preloading caches into CI agents via EBS snapshots (instead of cold-fetching on boot) dropped large-agent P95 wait from 40 → 10 min and agent startup from 27 → 8 min. Non-critical-path win, but also a cost win (fewer "alive-but-not-working" minutes). See patterns/snapshot-based-warmup.
- "Shaping" vs "Building" discipline keeps exploration cheap. Canva's PDP separates shaping (exploring breadth of solutions with mocks, local containers, burner accounts, no PRs) from building (polishing for production). The anti-patterns named: treating shaping as building (wasted review cycles on code that will be thrown away) and building as shaping (half-finished prototypes slipping into prod). Applied to the CI project, it let them run the 448-CPU experiment, the `bazel build //...` OOM experiment, and the BwoB PoC without committing infrastructure first. See patterns/shaping-vs-building.
- Flaky tests silently cap the critical path. "If one test takes 20 min and flakes with 3 retries, your build is 60 min regardless of everything else." Canva enforced per-test runtime caps (10 min hard, 5 min P95 goal), manually disabled offenders, and shrank the >10-min test pool to ~3 min worst case, unblocking the post-BE/ML-pipeline-v2 critical-path gains.
Numbers & observations
- Scale of CI: build graph >10^5 nodes (Bazel reports 900K nodes); ~3000 jobs (P90) per `check-merge` build; >1000 `check-merge` builds per workday; each job = 1 EC2 instance.
- Theoretical limit (first-principles): ~20 min worst case (2 dependent actions × ~10 min each).
- Observed starting point: 80 min avg Apr 2022, often >1 h / up to 3 h Jul 2022.
- Experiment: 448-CPU / 6 TB RAM instance: 18 min cold build (excluding `/web`). Single-CPU critical-path action was the floor.
- Experiment: `bazel build //...`: OOM on the 900K-node graph; increasing JVM memory would force OS swap. Rejected.
- i4i.8xlarge `build-all`: 3 h → ~15 min.
- BwoB lever: BE builds 10 min → 5 min (2×); ML builds 6 min → <2 min (3.3×). Retry-on-cache-eviction makes it safe.
- BE/ML Pipeline v2 (Apr 2023): 45 steps → 16; 49 min avg → 35 min; ~50% build-minutes cut.
- FE integration tests bazelified: ~100 jobs/commit → 8 jobs/commit; ~1.3 M jobs/month removed.
- FE a11y tests bazelified: 67,145 jobs / 13,947 commits → ~1 job/commit. Expected 80% time cut, ~$100K/yr savings.
- Pipeline generation: >10 min → 2–3 min (drop `bazel-diff`) → near-zero (static generation + S3-cached hashes, Jan 2024).
- E2E refinement (Feb 2024): -7–10 min.
- Worker-pool reshape i4i.8xlarge → c6id.12xlarge (Mar 2024): -2–6 min.
- RBE enabled on select jobs: TS builds 200% faster; BE unit tests 25% faster; BE compile+pack: cost $0.262 → $0.21 / build (same speed).
- Agent warm-up: P95 wait 40 → 10 min (-75%); startup 27 → 8 min (-70%).
- Per-build-minute cost estimate (EOY 2022): ~$0.018.
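Several of the figures in this list can be cross-checked with simple arithmetic. The helper names below are made up for this sketch; the inputs are the numbers quoted in the list.

```python
def percent_cut(before, after):
    """Reduction relative to the starting value, in percent."""
    return (before - after) / before * 100


def speedup(before, after):
    """How many times smaller the new value is."""
    return before / after


print(round(percent_cut(40, 10)))       # agent P95 wait: 75
print(round(percent_cut(27, 8)))        # agent startup: 70
print(round(percent_cut(45, 16)))       # pipeline steps: 64
print(round(percent_cut(49, 35)))       # average build time: 29
print(round(percent_cut(0.262, 0.21)))  # BE compile+pack cost per build: 20
print(speedup(10, 5))                   # BE builds with BwoB: 2.0
print(round(67145 / 13947, 1))          # a11y jobs per commit before: 4.8
print(round(3 * 60 / 20))               # observed 3 h vs 20 min limit: 9
```

The ~29% drop in average build time and the ~50% build-minutes cut for pipeline v2 are presumably different metrics (wall-clock time per build versus aggregate agent minutes), so the two figures need not agree.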
Architectural diagram (text)
Engineer opens PR
|
v
check-merge (trigger)
|
+---+------------------------------------------------+
| | |
v v v
BE/ML pipeline FE pipeline other pipelines
| | |
| (after v2: (after v2:
| 45→16 grouped frontend integ + (omitted)
| Bazel steps) a11y in Bazel;
| page sub-pipelines
| decommissioned)
| |
+---+------------------+
|
v
All green? → merge to main
Underlying build substrate:
Bazel clients (CI agents)
|
| --remote_download_minimal (BwoB) with retry-on-cache-eviction
v
bazel-remote (on every instance) ---> S3 bucket (shared cache storage)
|
| (partial) Remote Build Execution
v
RBE worker pool (for TS builds / BE unit tests / BE compile+pack)
Pipeline-generation flow (post Jan 2024):
Commit pushed
|
+--> bazel-diff hash job (dedicated instances, ~seconds)
| |
| v
| S3 bucket (input-hash manifest per target)
|
+--> Static pipeline YAML (pre-generated, no git-checkout on critical path)
CI job starts
|
| downloads hash manifest from S3
| decides which targets to execute
| (fallback: let Bazel decide if download fails)
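The job-side decision in the pipeline-generation flow could look roughly like the following. This is a sketch under assumptions, since the source does not describe the manifest format: the manifest is modeled as a plain mapping from target label to input hash, `fetch_manifest` is a hypothetical callable standing in for the S3 download, and returning `None` signals the fallback of running everything and letting Bazel's cache skip unchanged work.

```python
from typing import Callable, Dict, List, Optional

Manifest = Dict[str, str]  # target label -> input hash (assumed format)


def changed_targets(base: Manifest, head: Manifest) -> List[str]:
    """Targets that are new or whose input hash differs from the base commit."""
    return sorted(t for t, h in head.items() if base.get(t) != h)


def targets_to_run(
    job_targets: List[str],
    fetch_manifest: Callable[[str], Manifest],
    base_commit: str,
    head_commit: str,
) -> Optional[List[str]]:
    """Return the subset of this job's targets to execute, or None to
    signal the fallback (run them all and let Bazel decide)."""
    try:
        base = fetch_manifest(base_commit)
        head = fetch_manifest(head_commit)
    except Exception:  # any download failure triggers the fallback
        return None
    changed = set(changed_targets(base, head))
    return [t for t in job_targets if t in changed]


# Usage with an in-memory stand-in for the S3 bucket (hypothetical data).
store = {
    "abc": {"//svc/a:test": "h1", "//svc/b:test": "h2"},
    "def": {"//svc/a:test": "h1", "//svc/b:test": "h3", "//svc/c:test": "h4"},
}
job = ["//svc/a:test", "//svc/b:test"]
print(targets_to_run(job, store.__getitem__, "abc", "def"))  # ['//svc/b:test']
print(targets_to_run(job, store.__getitem__, "abc", "zzz"))  # None
```

Keeping the fallback as an explicit `None` rather than an empty list matters here: an empty list means "nothing changed, skip the job", while a failed download must never be mistaken for that.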
Caveats
- "Cost of a build-minute" is a rough average ($0.018 EOY 2022); headline $1.8 M/yr savings for FE integ tests assumed 50% improvement — posted as an estimate, not a measurement.
- Bazel cache-eviction retry works because evictions are rare; in a GC-busy cache or with smaller TTLs the workaround's cost/success ratio would change.
- Proxyless/horizontal-scale CI culture is an enabling condition — the "arbitrary scripts as steps" anti-pattern is harder to enforce away at orgs without a centralized Developer Platform group.
- Bazel migration is an up-front tax, and this post doesn't quantify the eng-years spent on test/target annotation work. The wins are post-tax.
- Some Tier-3 content bias: this is a Canva-internal retrospective, generous with "our wins", quieter on regressions (briefly flagged: the Oct 2023 flakiness regression after the FE launch; broken observability deps in the BE/ML v2 rollout). Treat percentage claims as self-reported.
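The cache-eviction caveat can be framed as a break-even calculation. The model below is an assumption made for illustration, not something the source states: each BwoB attempt fails independently with probability p because of an eviction, a failed attempt costs a full attempt's time, and the build retries until it succeeds. Under that model the expected time is t / (1 - p).

```python
def expected_minutes(attempt_minutes, eviction_probability):
    """Expected build time with retry-until-success (geometric model).
    Assumes independent attempts and that a failure costs a full attempt."""
    if not 0 <= eviction_probability < 1:
        raise ValueError("eviction_probability must be in [0, 1)")
    return attempt_minutes / (1 - eviction_probability)


def break_even_probability(bwob_minutes, full_download_minutes):
    """Eviction rate at which BwoB-with-retry stops beating full downloads
    under the same model."""
    return 1 - bwob_minutes / full_download_minutes


# Backend figures quoted earlier: 5 min with BwoB vs 10 min without.
print(expected_minutes(5, 0.01))       # about 5.05
print(break_even_probability(5, 10))   # 0.5
```

With rare evictions the retry adds almost nothing, which matches the source's claim that the workaround is safe. The model is pessimistic in charging a full attempt per failure, so a real break-even rate would likely sit higher than the value it gives.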
Source
- Original: https://www.canva.dev/blog/engineering/faster-ci-builds-at-canva/
- Raw markdown: `raw/canva/2024-12-16-faster-continuous-integration-builds-at-canva-e73d5b47.md`
- HN discussion: https://news.ycombinator.com/item?id=42429601 (21 points)
Related
- systems/bazel — the hermetic build system at the center of the story.
- systems/bazel-remote — shared CI cache backed by S3.
- systems/buildkite — CI pipeline runner / UI.
- systems/testcontainers — per-test hermetic storage that made backend integ tests cacheable.
- systems/canva-ci — this wiki's page for Canva's CI system as documented.
- concepts/hermetic-build — precondition for caching, parallelism, reproducibility.
- concepts/content-addressed-caching — payoff of hermeticity.
- concepts/critical-path — the bounding metric for CI time.
- concepts/first-principles-theoretical-limit — 20-min floor vs 3-h observed framing.
- concepts/build-graph — 900K-node DAG as a first-order constraint.
- concepts/remote-build-execution — partial rollout; 200 % faster TS builds.
- patterns/build-without-the-bytes — 2–3× lever on Bazel steps with retry-on-eviction.
- patterns/pipeline-step-consolidation — 45 → 16 steps; ~50 % build-minutes cut.
- patterns/static-pipeline-generation — pipeline generation off the critical path.
- patterns/instance-shape-right-sizing — i4i.8xlarge → c6id.12xlarge worker-pool tuning.
- patterns/snapshot-based-warmup — EBS-snapshot agent warm-up; P95 wait 40 → 10 min.
- patterns/shaping-vs-building — Canva's PDP discipline separating experimentation from production.
- companies/canva — index of Canva wiki content.