CANVA 2024-12-16 Tier 3

Faster continuous integration builds at Canva

One-paragraph summary

Canva's Developer Platform group cut the average PR-to-merge CI time from ~80 min (Apr 2022, trending toward 1–3 h) to <30 min (sometimes 15 min) over ~2 years. The core diagnosis was that their CI was a horizontally-scaled distributed system (build graph ~900K nodes, ~3000 jobs per check-merge build, >1000 builds/workday, each job taking a whole EC2 instance) whose efficiency was collapsing under its own shape: non-hermetic arbitrary scripts-as-steps, a per-job EC2 warm-up tax, and a Bazel migration whose cache and sandbox benefits were being eaten by artifact-download I/O. They attacked it with first-principles theoretical-limit reasoning (20 min worst case given modern hardware), big experimentation (e.g. a 448-CPU / 6 TB RAM instance that still needed 18 min because of single-core critical-path actions), and a multi-year sequence of incremental levers: Bazel "build without the bytes" (--remote-download-minimal) with retry-on-cache-eviction; hermetic TestContainers replacing shared-localstack test harnesses; bazelifying frontend integration and accessibility tests; pipeline step consolidation (BE/ML pipeline 45→16 steps, ~50% build-minutes cut); moving bazel-diff hash computation to an out-of-band S3-published job (pipeline generation 10+ min → 2–3 min → near-zero); switching build-agent shape from i4i.8xlarge to c6id.12xlarge; EBS-snapshot agent warm-up (40 min → 10 min P95); and ongoing test-health enforcement. Framed through Canva's "Product Development Process" distinction between shaping (experimentation) and building (production) phases.

Key takeaways

  1. CI is a distributed system and should be diagnosed like one. Canva's CI has >10^5 build-graph nodes, thousands of jobs per commit, and downstream deps on AWS, Buildkite, GitHub, NPM/Maven/PyPI mirrors. Its performance is bound by the critical path — the longest chain of dependent actions — not by aggregate compute. Averages hide the critical-path experience, echoing the queueing-theory framing from the EBS retrospective (see concepts/queueing-theory).
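The critical-path claim can be sketched with a toy build graph (names and durations invented, not Canva's actual graph): wall-clock time is bounded by the longest dependency chain, not by the sum of all work.

```python
# Toy DAG of build actions: action -> (duration_minutes, dependencies).
# Even with unlimited parallelism, the build cannot finish before the
# critical path (longest chain of dependent actions) completes.
from functools import lru_cache

graph = {
    "codegen": (4, []),
    "compile": (10, ["codegen"]),
    "unit":    (3, ["compile"]),
    "integ":   (10, ["compile"]),
    "package": (2, ["compile"]),
}

@lru_cache(maxsize=None)
def finish_time(action):
    dur, deps = graph[action]
    return dur + max((finish_time(d) for d in deps), default=0)

critical_path = max(finish_time(a) for a in graph)   # longest chain
total_compute = sum(d for d, _ in graph.values())    # aggregate work

assert critical_path == 24   # codegen -> compile -> integ
assert total_compute == 29   # adding agents can't push below 24 min
```

This is why averages and aggregate compute mislead: the build above has 29 minutes of work but is pinned at 24 minutes by one chain.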

  2. A first-principles theoretical limit sets the diagnostic ceiling. "A build or test action shouldn't take more than a few minutes on modern hardware, and the critical path shouldn't have more than 2 long dependent actions." Rough math: 2 × ~10 min = ~20 min worst case. Observed: 3 hours. That ~10× gap is the opportunity surface. Experiments (e.g., bazel build -- //... -//web/... on a 448-CPU / 6 TB RAM instance taking 18 min cold) confirmed that a single-core critical-path action was the real floor, not aggregate CPU. See concepts/first-principles-theoretical-limit.

  3. "Arbitrary commands in CI steps" compounds at scale. Allowing any script/binary as a CI step was great for authors but forced each step onto its own EC2 instance (non-hermetic = non-parallel-safe, non-cacheable, state-leaky, flake-prone, hard-to-reproduce locally). With ~3000 steps per build and >1000 builds/day, the instance-warm-up tax dominated. The fix path is hermeticity: every action declares its inputs, runs in a sandbox, and is cache-keyed by input hash. See concepts/hermetic-build, concepts/content-addressed-caching.
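A minimal sketch of the content-addressed caching that hermeticity enables (illustrative, not Bazel's actual cache protocol): an action's cache key is a hash of its command plus all declared inputs, so identical inputs mean the cached output can be reused and the action skipped.

```python
# Content-addressed action cache sketch. The cache dict stands in for a
# remote store such as S3; keys are SHA-256 digests of command + inputs.
import hashlib

cache = {}

def action_key(cmd, input_blobs):
    h = hashlib.sha256()
    h.update(cmd.encode())
    for name in sorted(input_blobs):              # deterministic ordering
        h.update(name.encode())
        h.update(hashlib.sha256(input_blobs[name]).digest())
    return h.hexdigest()

def run_action(cmd, input_blobs, execute):
    key = action_key(cmd, input_blobs)
    if key in cache:
        return cache[key], True                   # cache hit: skip execution
    out = execute()
    cache[key] = out
    return out, False

out1, hit1 = run_action("compile", {"a.ts": b"x"}, lambda: "bin-v1")
out2, hit2 = run_action("compile", {"a.ts": b"x"}, lambda: "bin-v2")
assert (hit1, hit2) == (False, True)
assert out2 == "bin-v1"   # same inputs -> cached output, executor never ran
```

A non-hermetic step (undeclared inputs, leaked state) breaks this: the key no longer captures everything the action depends on, so cache hits become unsound.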

  4. Bazel promises fast + correct, but the onboarding cost is real. Canva hit four concrete Bazel tax items: slow startup (loading the 900K-node graph takes minutes); sandbox symlink-per-input overhead (painful on node_modules, with thousands of files per action); migration effort (every input must be declared); and RBE compatibility (workers must match local execution inputs). Not a blocker, but a work item that had to be paid down.

  5. Build without the bytes (BwoB) was a 2–3× lever on Bazel steps. Bazel's --remote-download-minimal skips downloading cached artifacts unless another action needs them locally. Canva measured hundreds of GB of per-build downloads (mostly containers) disappear. Backend builds: 10 min → 5 min (2×). ML builds: 6 min → <2 min (3.3×). The one risk — cache eviction mid-build — was mitigated with a simple retry-on-check-failure workaround, rolled out broadly. See patterns/build-without-the-bytes.

  6. Grouping pipeline steps beats scaling agent count. BE/ML pipeline v2 (Apr 2023) grouped work to reduce instance-warm-up amortization: 45 → 16 steps, average build time 49 → 35 min, ~50% build-minutes cut. A single FE build previously spawned ~100 integration-test jobs per commit; bazelifying these to ~8 grouped jobs removed ~1.3 M jobs/month and their warm-up cost. See patterns/pipeline-step-consolidation.
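The warm-up amortization is simple arithmetic (numbers below are illustrative, not Canva's measurements): every job pays a fixed per-instance warm-up tax, so fewer, larger jobs cut total build-minutes even when the real work is unchanged.

```python
WARMUP_MIN = 10   # per-job instance warm-up tax (illustrative)
WORK_MIN = 350    # total minutes of real work, independent of grouping

def total_build_minutes(jobs):
    # Each job pays the warm-up tax before doing its slice of the work.
    return WORK_MIN + jobs * WARMUP_MIN

saved = total_build_minutes(45) - total_build_minutes(16)
assert saved == (45 - 16) * WARMUP_MIN == 290   # build-minutes saved per build
```

With >1000 builds per workday, a fixed per-job tax multiplied by thousands of jobs is exactly the kind of cost that dominates at Canva's scale.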

  7. Pipeline generation belongs off the critical path. Canva's TypeScript pipeline generator, which ran bazel query + bazel-diff per commit, took >10 min because it sat on the critical path. Fix (Jan 2024): generate pipelines statically (pre-commit), push conditional evaluation to job runtime, and publish bazel-diff input hashes to S3 from dedicated instances as soon as a commit is pushed. At job runtime, the agent downloads the hashes to decide what to run (fallback: let Bazel do its thing). The Starlark-based Bazel rewrite of the generator collapsed thousands of lines of conditional TypeScript into a couple hundred. See patterns/static-pipeline-generation.
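The job-runtime decision can be sketched as follows (names and manifest shape invented): compare the commit's published input-hash manifest against the merge base's and run only the targets whose hashes changed, falling back to running everything when a manifest is unavailable.

```python
# Sketch of the post-Jan-2024 flow: manifests are {target: input_hash}
# dicts published to S3 per commit; here fetch_manifest is any callable
# that returns one or raises on a missing manifest.
def targets_to_run(fetch_manifest, commit, base, all_targets):
    try:
        new = fetch_manifest(commit)
        old = fetch_manifest(base)
    except (KeyError, FileNotFoundError):
        # Fallback: run everything and let Bazel's own caching skip work.
        return list(all_targets)
    return [t for t in all_targets if new.get(t) != old.get(t)]

manifests = {
    "base":   {"//a": "h1", "//b": "h2"},
    "commit": {"//a": "h1", "//b": "h3"},
}
changed = targets_to_run(manifests.__getitem__, "commit", "base", ["//a", "//b"])
assert changed == ["//b"]   # only //b's inputs changed since the base
```

The key property: nothing on the critical path waits for a git checkout or a bazel query; the expensive hashing happened out-of-band as soon as the commit was pushed.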

  8. Hermetic tests unlock caching AND reliability. Moving backend integration tests from shared-localstack to per-test TestContainers sandboxes (a) made them cacheable (same inputs → skip), (b) removed flakes from overloaded shared containers, and (c) enabled extending the pattern to service-container tests and hermetic E2E environments. Each service has its own TestContainer + a launch-validation test; E2E composes these. Deployment failures shift left to CI. See systems/testcontainers.
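The isolation property can be illustrated without Docker (invented API, not the TestContainers library itself): each test gets a fresh backing service, so its result depends only on its own inputs, which is what makes it both flake-free and cacheable.

```python
# Per-test sandbox sketch: the dict stands in for a per-test container
# (e.g. a LocalStack sandbox) created empty and torn down with the test.
import contextlib

@contextlib.contextmanager
def fresh_container():
    state = {"queue": []}
    try:
        yield state
    finally:
        state.clear()      # torn down with the test; nothing leaks

def test_a():
    with fresh_container() as svc:
        svc["queue"].append("from-a")
        assert svc["queue"] == ["from-a"]

def test_b():
    with fresh_container() as svc:
        assert svc["queue"] == []   # holds in any order: no leftovers from test_a

test_a()
test_b()
```

On a shared localstack, test_b's assertion would depend on scheduling, which is exactly the flakiness (and cache-unsoundness) the per-test sandbox removes.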

  9. Instance shape matters more than you think. Switching the slow build-all step from generic instances to i4i.8xlarge (multi-SSD, I/O-optimized) took the job from 3 h → 15 min on cold cache. Later, moving agents from i4i.8xlarge → c6id.12xlarge (better CPU:mem:disk balance) saved another 2–6 min across builds (Mar 2024). The experimentation methodology — top + iotop on a test instance, then scale up to isolate single-CPU vs I/O bounds — is itself the pattern. See patterns/instance-shape-right-sizing.

  10. EBS-snapshot agent warm-ups cut P95 startup 75%. Preloading caches into CI agents via EBS snapshots (instead of cold-fetching on boot) dropped large-agent P95 wait from 40 → 10 min and agent startup from 27 → 8 min. Non-critical-path win, but also a cost win (fewer "alive-but-not-working" minutes). See patterns/snapshot-based-warmup.

  11. "Shaping" vs "Building" discipline keeps exploration cheap. Canva's PDP separates shaping (exploring breadth of solutions with mocks, local containers, burner accounts, no PRs) from building (polishing for production). The anti-patterns named: treating shaping as building (wasted review cycles on code that will be thrown away) and building as shaping (half-finished prototypes slipping into prod). Applied to the CI project, it let them run the 448-CPU experiment, the bazel build //... OOM experiment, and the BwoB PoC without committing infrastructure first. See patterns/shaping-vs-building.

  12. Flaky tests silently cap the critical path. "If one test takes 20 min and flakes with 3 retries, your build is 60 min regardless of everything else." Canva enforced per-test runtime caps (10 min hard, 5 min P95 goal), manually disabled offenders, and shrunk the >10-min test pool to ~3 min worst case — unblocking the post-BE/ML-pipeline-v2 critical-path gains.
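The quoted arithmetic, spelled out (illustrative): a flaky test that passes only on its Nth attempt occupies the critical path N times over.

```python
def worst_case_minutes(test_minutes, attempts):
    # A flaky test on the critical path costs its full runtime per attempt.
    return test_minutes * attempts

assert worst_case_minutes(20, 3) == 60   # the quoted 60-min floor
assert worst_case_minutes(3, 3) == 9     # after capping the worst test at ~3 min
```

This is why the runtime caps mattered: retries multiply whatever the slowest test costs, so shrinking the worst offender shrinks the whole ceiling.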

Numbers & observations

  • Scale of CI: build graph >10^5 nodes (Bazel reports 900K nodes); ~3000 jobs (P90) per check-merge build; >1000 check-merge builds per workday; each job = 1 EC2 instance.
  • Theoretical limit (first-principles): ~20 min worst case (2 dependent actions × ~10 min each).
  • Observed starting point: ~80 min average (Apr 2022); often >1 h, up to 3 h (Jul 2022).
  • Experiment: 448-CPU / 6 TB RAM instance: 18 min cold build (excluding /web). Single-CPU critical-path action was the floor.
  • Experiment: bazel build //...: OOM on the 900K-node graph; increasing JVM memory would force OS swap. Rejected.
  • i4i.8xlarge build-all: 3 h → ~15 min.
  • BwoB lever: BE builds 10 min → 5 min (2×); ML builds 6 min → <2 min (3.3×). Retry-on-cache-eviction makes it safe.
  • BE/ML Pipeline v2 (Apr 2023): 45 steps → 16; 49 min avg → 35 min; ~50% build-minutes cut.
  • FE integration tests bazelified: ~100 jobs/commit → 8 jobs/commit; ~1.3 M jobs/month removed.
  • FE a11y tests bazelified: 67,145 jobs / 13,947 commits → ~1 job/commit. Expected 80% time cut, ~$100K/yr savings.
  • Pipeline generation: >10 min → 2–3 min (drop bazel-diff) → near-zero (static generation + S3-cached hashes, Jan 2024).
  • E2E refinement (Feb 2024): -7–10 min.
  • Worker-pool reshape i4i.8xlarge → c6id.12xlarge (Mar 2024): -2–6 min.
  • RBE enabled on select jobs: TS builds 200% faster; BE unit tests 25% faster; BE compile+pack: cost $0.262 → $0.21 / build (same speed).
  • Agent warm-up: P95 wait 40 → 10 min (-75%); startup 27 → 8 min (-70%).
  • Per-build-minute cost estimate (EOY 2022): ~$0.018.

Architectural diagram (text)

Engineer opens PR
      |
      v
  check-merge  (trigger)
      |
  +---+------------------------------------------------+
  |                      |                             |
  v                      v                             v
BE/ML pipeline      FE pipeline                   other pipelines
  |                      |                             |
  | (after v2:           | (after v2:                  |
  |  45→16 grouped       |  frontend integ +           | (omitted)
  |  Bazel steps)        |  a11y in Bazel;             |
  |                      |  page sub-pipelines         |
  |                      |  decommissioned)            |
  |                      |                             |
  +---+------------------+
      |
      v
  All green? → merge to main

Underlying build substrate:

Bazel clients (CI agents)
   |
   |  --remote-download-minimal (BwoB)   with retry-on-cache-eviction
   v
bazel-remote  (on every instance)  --->  S3 bucket  (shared cache storage)
   |
   |  (partial) Remote Build Execution
   v
RBE worker pool (for TS builds / BE unit tests / BE compile+pack)

Pipeline-generation flow (post Jan 2024):

Commit pushed
   |
   +--> bazel-diff hash job (dedicated instances, ~seconds)
   |      |
   |      v
   |    S3 bucket (input-hash manifest per target)
   |
   +--> Static pipeline YAML (pre-generated, no git-checkout on critical path)

CI job starts
   |
   | downloads hash manifest from S3
   | decides which targets to execute
   | (fallback: let Bazel decide if download fails)

Caveats

  • "Cost of a build-minute" is a rough average ($0.018 EOY 2022); headline $1.8 M/yr savings for FE integ tests assumed 50% improvement — posted as an estimate, not a measurement.
  • Bazel cache-eviction retry works because evictions are rare; in a GC-busy cache or with smaller TTLs the workaround's cost/success ratio would change.
  • Proxyless/horizontal-scale CI culture is an enabling condition — the "arbitrary scripts as steps" anti-pattern is harder to enforce away at orgs without a centralized Developer Platform group.
  • Bazel migration is an up-front tax, and this post doesn't quantify the eng-years spent on test/target annotation work. The wins are post-tax.
  • Some Tier-3 content bias: this is a Canva-internal retrospective, generous with "our wins", quieter on regressions (briefly flagged: the Oct 2023 flakiness regression after the FE launch; broken observability deps in the BE/ML v2 rollout). Treat percentage claims as self-reported.
