CANVA 2024-12-16 Tier 3

Faster continuous integration builds at Canva

One-paragraph summary

Canva's Developer Platform group cut the average PR-to-merge CI time from ~80 min (Apr 2022, trending toward 1–3 h) to <30 min (sometimes 15 min) over ~2 years. The core diagnosis was that their CI was a horizontally-scaled distributed system (build graph ~900K nodes, ~3000 jobs per check-merge build, >1000 builds/workday, each job taking a whole EC2 instance) whose efficiency was collapsing under its own shape: non-hermetic arbitrary scripts-as-steps, a per-job EC2 warm-up tax, and a Bazel migration whose cache and sandbox benefits were being eaten by artifact-download I/O. They attacked it with first-principles theoretical-limit reasoning (20 min worst case given modern hardware), big experimentation (e.g. a 448-CPU / 6 TB RAM instance that still needed 18 min because of single-core critical-path actions), and a multi-year sequence of incremental levers: Bazel "build without the bytes" (--remote-download-minimal) with retry-on-cache-eviction; hermetic TestContainers replacing shared-localstack test harnesses; bazelifying frontend integration and accessibility tests; pipeline step consolidation (BE/ML pipeline 45→16 steps, ~50% build-minutes cut); moving bazel-diff hash computation to an out-of-band S3-published job (pipeline generation 10+ min → 2–3 min → near-zero); switching build-agent shape from i4i.8xlarge to c6id.12xlarge; EBS-snapshot agent warm-up (40 min → 10 min P95); and ongoing test-health enforcement. Framed through Canva's "Product Development Process" distinction between shaping (experimentation) and building (production) phases.

Key takeaways

  1. CI is a distributed system and should be diagnosed like one. Canva's CI has >10^5 build-graph nodes, thousands of jobs per commit, and downstream deps on AWS, Buildkite, GitHub, NPM/Maven/PyPI mirrors. Its performance is bound by the critical path — the longest chain of dependent actions — not by aggregate compute. Averages hide the critical-path experience, echoing the queueing-theory framing from the EBS retrospective (see concepts/queueing-theory).
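The critical-path claim can be sketched with a toy build graph (names and durations invented, not Canva's actual graph): wall-clock time is bounded by the longest dependency chain, not by the sum of all work.

```python
# Toy DAG of build actions: action -> (duration_minutes, dependencies).
# Even with unlimited parallelism, the build cannot finish before the
# critical path (longest chain of dependent actions) completes.
from functools import lru_cache

graph = {
    "codegen": (4, []),
    "compile": (10, ["codegen"]),
    "unit":    (3, ["compile"]),
    "integ":   (10, ["compile"]),
    "package": (2, ["compile"]),
}

@lru_cache(maxsize=None)
def finish_time(action):
    dur, deps = graph[action]
    return dur + max((finish_time(d) for d in deps), default=0)

critical_path = max(finish_time(a) for a in graph)   # longest chain
total_compute = sum(d for d, _ in graph.values())    # aggregate work

assert critical_path == 24   # codegen -> compile -> integ
assert total_compute == 29   # adding agents can't push below 24 min
```

This is why averages and aggregate compute mislead: the build above has 29 minutes of work but is pinned at 24 minutes by one chain.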

  2. A first-principles theoretical limit sets the diagnostic ceiling. "A build or test action shouldn't take more than a few minutes on modern hardware, and the critical path shouldn't have more than 2 long dependent actions." Rough math: 2 × ~10 min = ~20 min worst case. Observed: 3 hours. That ~10× gap is the opportunity surface. Experiments (e.g., bazel build -- //... -//web/... on a 448-CPU / 6 TB RAM instance taking 18 min cold) confirmed that a single-core critical-path action was the real floor, not aggregate CPU. See concepts/first-principles-theoretical-limit.

  3. "Arbitrary commands in CI steps" compounds at scale. Allowing any script/binary as a CI step was great for authors but forced each step onto its own EC2 instance (non-hermetic = non-parallel-safe, non-cacheable, state-leaky, flake-prone, hard-to-reproduce locally). With ~3000 steps per build and >1000 builds/day, the instance-warm-up tax dominated. The fix path is hermeticity: every action declares its inputs, runs in a sandbox, and is cache-keyed by input hash. See concepts/hermetic-build, concepts/content-addressed-caching.
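A minimal sketch of the content-addressed caching that hermeticity enables (illustrative, not Bazel's actual cache protocol): an action's cache key is a hash of its command plus all declared inputs, so identical inputs mean the cached output can be reused and the action skipped.

```python
# Content-addressed action cache sketch. The cache dict stands in for a
# remote store such as S3; keys are SHA-256 digests of command + inputs.
import hashlib

cache = {}

def action_key(cmd, input_blobs):
    h = hashlib.sha256()
    h.update(cmd.encode())
    for name in sorted(input_blobs):              # deterministic ordering
        h.update(name.encode())
        h.update(hashlib.sha256(input_blobs[name]).digest())
    return h.hexdigest()

def run_action(cmd, input_blobs, execute):
    key = action_key(cmd, input_blobs)
    if key in cache:
        return cache[key], True                   # cache hit: skip execution
    out = execute()
    cache[key] = out
    return out, False

out1, hit1 = run_action("compile", {"a.ts": b"x"}, lambda: "bin-v1")
out2, hit2 = run_action("compile", {"a.ts": b"x"}, lambda: "bin-v2")
assert (hit1, hit2) == (False, True)
assert out2 == "bin-v1"   # same inputs -> cached output, executor never ran
```

A non-hermetic step (undeclared inputs, leaked state) breaks this: the key no longer captures everything the action depends on, so cache hits become unsound.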

  4. Bazel promises fast + correct, but the onboarding cost is real. Canva hit four concrete Bazel tax items: slow startup (loading the 900K-node graph takes minutes); sandbox symlink-per-input overhead (painful on node_modules, with thousands of files per action); migration effort (every input must be declared); and RBE compatibility (workers must match local execution inputs). Not a blocker, but a work item that had to be paid down.

  5. Build without the bytes (BwoB) was a 2–3× lever on Bazel steps. Bazel's --remote-download-minimal skips downloading cached artifacts unless another action needs them locally. Canva measured hundreds of GB of per-build downloads (mostly containers) disappear. Backend builds: 10 min → 5 min (2×). ML builds: 6 min → <2 min (3.3×). The one risk — cache eviction mid-build — was mitigated with a simple retry-on-check-failure workaround, rolled out broadly. See patterns/build-without-the-bytes.

  6. Grouping pipeline steps beats scaling agent count. BE/ML pipeline v2 (Apr 2023) grouped work to reduce instance-warm-up amortization: 45 → 16 steps, average build time 49 → 35 min, ~50% build-minutes cut. A single FE build previously spawned ~100 integration-test jobs per commit; bazelifying these to ~8 grouped jobs removed ~1.3 M jobs/month and their warm-up cost. See patterns/pipeline-step-consolidation.
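The warm-up amortization is simple arithmetic (numbers below are illustrative, not Canva's measurements): every job pays a fixed per-instance warm-up tax, so fewer, larger jobs cut total build-minutes even when the real work is unchanged.

```python
WARMUP_MIN = 10   # per-job instance warm-up tax (illustrative)
WORK_MIN = 350    # total minutes of real work, independent of grouping

def total_build_minutes(jobs):
    # Each job pays the warm-up tax before doing its slice of the work.
    return WORK_MIN + jobs * WARMUP_MIN

saved = total_build_minutes(45) - total_build_minutes(16)
assert saved == (45 - 16) * WARMUP_MIN == 290   # build-minutes saved per build
```

With >1000 builds per workday, a fixed per-job tax multiplied by thousands of jobs is exactly the kind of cost that dominates at Canva's scale.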

  7. Pipeline generation belongs off the critical path. Canva's TypeScript pipeline generator, which ran bazel query + bazel-diff per commit, took >10 min because it sat on the critical path. Fix (Jan 2024): generate pipelines statically (pre-commit), push conditional evaluation to job runtime, and publish bazel-diff input hashes to S3 from dedicated instances as soon as a commit is pushed. At job runtime, the agent downloads the hashes to decide what to run (fallback: let Bazel do its thing). The Starlark-based Bazel rewrite of the generator collapsed thousands of lines of conditional TypeScript into a couple hundred. See patterns/static-pipeline-generation.
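The job-runtime decision can be sketched as follows (names and manifest shape invented): compare the commit's published input-hash manifest against the merge base's and run only the targets whose hashes changed, falling back to running everything when a manifest is unavailable.

```python
# Sketch of the post-Jan-2024 flow: manifests are {target: input_hash}
# dicts published to S3 per commit; here fetch_manifest is any callable
# that returns one or raises on a missing manifest.
def targets_to_run(fetch_manifest, commit, base, all_targets):
    try:
        new = fetch_manifest(commit)
        old = fetch_manifest(base)
    except (KeyError, FileNotFoundError):
        # Fallback: run everything and let Bazel's own caching skip work.
        return list(all_targets)
    return [t for t in all_targets if new.get(t) != old.get(t)]

manifests = {
    "base":   {"//a": "h1", "//b": "h2"},
    "commit": {"//a": "h1", "//b": "h3"},
}
changed = targets_to_run(manifests.__getitem__, "commit", "base", ["//a", "//b"])
assert changed == ["//b"]   # only //b's inputs changed since the base
```

The key property: nothing on the critical path waits for a git checkout or a bazel query; the expensive hashing happened out-of-band as soon as the commit was pushed.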

  8. Hermetic tests unlock caching AND reliability. Moving backend integration tests from shared-localstack to per-test TestContainers sandboxes (a) made them cacheable (same inputs → skip), (b) removed flakes from overloaded shared containers, and (c) enabled extending the pattern to service-container tests and hermetic E2E environments. Each service has its own TestContainer + a launch-validation test; E2E composes these. Deployment failures shift left to CI. See systems/testcontainers.
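The isolation property can be illustrated without Docker (invented API, not the TestContainers library itself): each test gets a fresh backing service, so its result depends only on its own inputs, which is what makes it both flake-free and cacheable.

```python
# Per-test sandbox sketch: the dict stands in for a per-test container
# (e.g. a LocalStack sandbox) created empty and torn down with the test.
import contextlib

@contextlib.contextmanager
def fresh_container():
    state = {"queue": []}
    try:
        yield state
    finally:
        state.clear()      # torn down with the test; nothing leaks

def test_a():
    with fresh_container() as svc:
        svc["queue"].append("from-a")
        assert svc["queue"] == ["from-a"]

def test_b():
    with fresh_container() as svc:
        assert svc["queue"] == []   # holds in any order: no leftovers from test_a

test_a()
test_b()
```

On a shared localstack, test_b's assertion would depend on scheduling, which is exactly the flakiness (and cache-unsoundness) the per-test sandbox removes.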

  9. Instance shape matters more than you think. Switching the slow build-all step from generic instances to i4i.8xlarge (multi-SSD, I/O-optimized) took the job from 3 h → 15 min on cold cache. Later, moving agents from i4i.8xlarge → c6id.12xlarge (better CPU:mem:disk balance) saved another 2–6 min across builds (Mar 2024). The experimentation methodology — top + iotop on a test instance, then scale up to isolate single-CPU vs I/O bounds — is itself the pattern. See patterns/instance-shape-right-sizing.

  10. EBS-snapshot agent warm-ups cut P95 startup 75%. Preloading caches into CI agents via EBS snapshots (instead of cold-fetching on boot) dropped large-agent P95 wait from 40 → 10 min and agent startup from 27 → 8 min. Non-critical-path win, but also a cost win (fewer "alive-but-not-working" minutes). See patterns/snapshot-based-warmup.

  11. "Shaping" vs "Building" discipline keeps exploration cheap. Canva's PDP separates shaping (exploring breadth of solutions with mocks, local containers, burner accounts, no PRs) from building (polishing for production). The anti-patterns named: treating shaping as building (wasted review cycles on code that will be thrown away) and building as shaping (half-finished prototypes slipping into prod). Applied to the CI project, it let them run the 448-CPU experiment, the bazel build //... OOM experiment, and the BwoB PoC without committing infrastructure first. See patterns/shaping-vs-building.

  12. Flaky tests silently cap the critical path. "If one test takes 20 min and flakes with 3 retries, your build is 60 min regardless of everything else." Canva enforced per-test runtime caps (10 min hard, 5 min P95 goal), manually disabled offenders, and shrunk the >10-min test pool to ~3 min worst case — unblocking the post-BE/ML-pipeline-v2 critical-path gains.
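The quoted arithmetic, spelled out (illustrative): a flaky test that passes only on its Nth attempt occupies the critical path N times over.

```python
def worst_case_minutes(test_minutes, attempts):
    # A flaky test on the critical path costs its full runtime per attempt.
    return test_minutes * attempts

assert worst_case_minutes(20, 3) == 60   # the quoted 60-min floor
assert worst_case_minutes(3, 3) == 9     # after capping the worst test at ~3 min
```

This is why the runtime caps mattered: retries multiply whatever the slowest test costs, so shrinking the worst offender shrinks the whole ceiling.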

Numbers & observations

  • Scale of CI: build graph >10^5 nodes (Bazel reports 900K nodes); ~3000 jobs (P90) per check-merge build; >1000 check-merge builds per workday; each job = 1 EC2 instance.
  • Theoretical limit (first-principles): ~20 min worst case (2 dependent actions × ~10 min each).
  • Observed starting point: ~80 min average (Apr 2022); often >1 h, up to 3 h (Jul 2022).
  • Experiment: 448-CPU / 6 TB RAM instance: 18 min cold build (excluding /web). Single-CPU critical-path action was the floor.
  • Experiment: bazel build //...: OOM on the 900K-node graph; increasing JVM memory would force OS swap. Rejected.
  • i4i.8xlarge build-all: 3 h → ~15 min.
  • BwoB lever: BE builds 10 min → 5 min (2×); ML builds 6 min → <2 min (3.3×). Retry-on-cache-eviction makes it safe.
  • BE/ML Pipeline v2 (Apr 2023): 45 steps → 16; 49 min avg → 35 min; ~50% build-minutes cut.
  • FE integration tests bazelified: ~100 jobs/commit → 8 jobs/commit; ~1.3 M jobs/month removed.
  • FE a11y tests bazelified: 67,145 jobs / 13,947 commits → ~1 job/commit. Expected 80% time cut, ~$100K/yr savings.
  • Pipeline generation: >10 min → 2–3 min (drop bazel-diff) → near-zero (static generation + S3-cached hashes, Jan 2024).
  • E2E refinement (Feb 2024): -7–10 min.
  • Worker-pool reshape i4i.8xlarge → c6id.12xlarge (Mar 2024): -2–6 min.
  • RBE enabled on select jobs: TS builds 200% faster; BE unit tests 25% faster; BE compile+pack: cost $0.262 → $0.21 / build (same speed).
  • Agent warm-up: P95 wait 40 → 10 min (-75%); startup 27 → 8 min (-70%).
  • Per-build-minute cost estimate (EOY 2022): ~$0.018.

Architectural diagram (text)

Engineer opens PR
      |
      v
  check-merge  (trigger)
      |
  +---+------------------------------------------------+
  |                      |                             |
  v                      v                             v
BE/ML pipeline      FE pipeline                   other pipelines
  |                      |                             |
  | (after v2:           | (after v2:                  |
  |  45→16 grouped       |  frontend integ +           | (omitted)
  |  Bazel steps)        |  a11y in Bazel;             |
  |                      |  page sub-pipelines         |
  |                      |  decommissioned)            |
  |                      |                             |
  +---+------------------+
      |
      v
  All green? → merge to main

Underlying build substrate:

Bazel clients (CI agents)
   |
   |  --remote-download-minimal (BwoB)   with retry-on-cache-eviction
   v
bazel-remote  (on every instance)  --->  S3 bucket  (shared cache storage)
   |
   |  (partial) Remote Build Execution
   v
RBE worker pool (for TS builds / BE unit tests / BE compile+pack)

Pipeline-generation flow (post Jan 2024):

Commit pushed
   |
   +--> bazel-diff hash job (dedicated instances, ~seconds)
   |      |
   |      v
   |    S3 bucket (input-hash manifest per target)
   |
   +--> Static pipeline YAML (pre-generated, no git-checkout on critical path)

CI job starts
   |
   | downloads hash manifest from S3
   | decides which targets to execute
   | (fallback: let Bazel decide if download fails)

Caveats

  • "Cost of a build-minute" is a rough average ($0.018 EOY 2022); headline $1.8 M/yr savings for FE integ tests assumed 50% improvement — posted as an estimate, not a measurement.
  • Bazel cache-eviction retry works because evictions are rare; in a GC-busy cache or with smaller TTLs the workaround's cost/success ratio would change.
  • Proxyless/horizontal-scale CI culture is an enabling condition — the "arbitrary scripts as steps" anti-pattern is harder to enforce away at orgs without a centralized Developer Platform group.
  • Bazel migration is an up-front tax, and this post doesn't quantify the eng-years spent on test/target annotation work. The wins are post-tax.
  • Some Tier-3 content bias: this is a Canva-internal retrospective, generous with "our wins", quieter on regressions (briefly flagged: the Oct 2023 flakiness regression after the FE launch; broken observability deps in the BE/ML v2 rollout). Treat percentage claims as self-reported.
