SYSTEM Cited by 1 source

Canva CI

Canva CI is Canva's internal continuous-integration system: a Buildkite-orchestrated, Bazel-centric pipeline running on AWS EC2 worker pools. The 2024-12-16 retrospective documents how average PR-to-merge time fell from ~80 min (Apr 2022) to <30 min, and sometimes 15 min, over ~2 years (Source: sources/2024-12-16-canva-faster-ci-builds).

Scale & shape (as of the retrospective)

  • Build graph: >10⁵ nodes (Bazel reports ~900K).
  • Per-build jobs: ~3000 (P90) for each check-merge.
  • Daily check-merge builds: >1000 per workday.
  • Per-job footprint: 1 EC2 instance per job.
  • Per-build-minute cost (EOY 2022): ~$0.018.

Top-level flow

Engineer opens PR
  |
  v
check-merge  (trigger)
  |
  +----> BE/ML pipeline  (backend + ML builds / tests)
  +----> FE pipeline     (frontend builds / tests)
  +----> other pipelines
  |
  v
All green? -> merge to main

After the 2022-23 work:

  • BE/ML Pipeline v2 consolidated 45 steps into 16; average duration fell from 49 to 35 min.
  • FE Pipeline v2 decommissioned per-page sub-pipelines and moved integration and accessibility (a11y) tests into Bazel (~100 → 8 jobs per commit for integration tests).

Substrate

Bazel clients  (CI agents on EC2, per-pool right-sized)
   |
   |  --remote_download_minimal (BwoB)  + retry-on-eviction
   v
bazel-remote  (on every instance)  --->  S3 bucket (shared cache)
   |
   |  (partial) Remote Build Execution
   v
RBE worker pool  (TS builds / BE unit tests / BE compile+pack)
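
The retry-on-eviction step in the diagram above can be sketched as a thin wrapper around the Bazel invocation. This is a minimal illustration, not Canva's implementation: the function names, the `max_retries` default, and the injectable `run` parameter are hypothetical, and the sketch assumes Bazel signals evicted remote-cache blobs with exit code 39. Bazel spells the Build-without-the-Bytes flag `--remote_download_minimal`.

```python
import subprocess
from typing import Callable, Sequence

# Assumption: Bazel exits with code 39 when blobs it needs were evicted
# from the remote cache during a Build-without-the-Bytes build.
REMOTE_CACHE_EVICTED = 39


def run_bazel(args: Sequence[str]) -> int:
    """Run bazel with the given arguments and return its exit code."""
    return subprocess.run(["bazel", *args]).returncode


def build_with_eviction_retry(
    targets: Sequence[str],
    max_retries: int = 3,
    run: Callable[[Sequence[str]], int] = run_bazel,
) -> int:
    """Build with BwoB, re-running only when the failure is a cache eviction."""
    args = ["build", "--remote_download_minimal", *targets]
    code = run(args)
    retries = 0
    while code == REMOTE_CACHE_EVICTED and retries < max_retries:
        retries += 1
        code = run(args)  # any other exit code is returned as-is
    return code
```

A CI step would call `build_with_eviction_retry(["//..."])` and exit with the returned code. Passing `run` as a parameter keeps the retry policy testable without a Bazel installation.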

Pipeline generation (post Jan 2024):

Commit pushed
   |
   +--> bazel-diff hash job (dedicated instances)
   |     |
   |     v
   |   S3 bucket (input-hash manifest per target)
   |
   +--> Static pipeline YAML (pre-generated, off critical path)

CI job starts
   |
   | downloads hash manifest from S3
   | decides which targets to execute
   | (fallback: let Bazel decide if download fails)
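
The job-side decision above can be sketched as a comparison of two hash manifests. This is a minimal sketch under stated assumptions: the source says only that the job downloads a per-target input-hash manifest and decides what to execute, so the base-versus-head comparison, the manifest shape (target label mapped to hash), and the `fetch` callable standing in for the S3 download are all hypothetical.

```python
from typing import Callable, Dict, Optional, Set

Manifest = Dict[str, str]  # assumed shape: Bazel target label -> input hash


def affected_targets(base: Manifest, head: Manifest) -> Set[str]:
    """Targets that are new, or whose input hash differs from the base commit."""
    return {label for label, digest in head.items() if base.get(label) != digest}


def select_targets(
    fetch: Callable[[str], Manifest], base_key: str, head_key: str
) -> Optional[Set[str]]:
    """Return the targets to execute, or None to mean 'let Bazel decide'."""
    try:
        base = fetch(base_key)
        head = fetch(head_key)
    except Exception:
        # Fallback path: the manifest download failed, so skip the filtering.
        return None
    return affected_targets(base, head)


# In-memory stand-in for the S3 bucket (hypothetical keys and hashes).
store = {
    "base": {"//a:lib": "h1", "//b:test": "h2"},
    "head": {"//a:lib": "h1", "//b:test": "h3", "//c:new": "h4"},
}
print(sorted(select_targets(store.__getitem__, "base", "head")))
# → ['//b:test', '//c:new']
```

Returning `None` rather than an empty set keeps "nothing changed" distinct from "manifest unavailable", which is what lets the caller fall back to Bazel's own invalidation.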

Worker pools

  • i4i.8xlarge (I/O-optimized, multi-SSD, NVMe-backed): used for heavy-I/O steps such as the build-all step; cut the slowest step from 3 h to 15 min.
  • c6id.12xlarge (better CPU:mem:disk balance): adopted Mar 2024; saved a further 2 to 6 min across builds.

Agent warm-ups use EBS snapshots pre-populated with caches: P95 wait 40 → 10 min, startup 27 → 8 min (patterns/snapshot-based-warmup).

Test architecture

  • Backend integration tests run in per-test TestContainers sandboxes inside Bazel, replacing the shared localstack harness. They are cacheable and safe to run in parallel.
  • Service-container tests (a per-service TestContainer plus a launch-validation test) surface deployment failures in CI rather than at deploy time.
  • E2E tests compose service-container definitions for hermetic end-to-end validation.
  • Per-test runtime caps: 10 min hard, 5 min P95 goal.
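
The runtime caps in the last bullet could be checked with something like the sketch below. It assumes the P95 goal is evaluated per test over that test's recent run durations (the source does not say how the P95 is scoped), uses the nearest-rank percentile definition, and expects at least one sample per test. All names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

HARD_CAP_S = 10 * 60  # 10 min hard cap
P95_GOAL_S = 5 * 60   # 5 min P95 goal


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of durations."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def check_runtime(durations_by_test: Dict[str, Sequence[float]]) -> Dict[str, List[str]]:
    """Report tests that broke the hard cap or miss the P95 goal (seconds)."""
    report: Dict[str, List[str]] = {"over_hard_cap": [], "over_p95_goal": []}
    for name, samples in sorted(durations_by_test.items()):
        if max(samples) > HARD_CAP_S:
            report["over_hard_cap"].append(name)
        if p95(samples) > P95_GOAL_S:
            report["over_p95_goal"].append(name)
    return report


runs = {
    "//svc:fast_test": [100] * 20,
    "//svc:slow_test": [310] * 20,          # always above the 5 min goal
    "//svc:spiky_test": [100] * 19 + [700],  # one run above the 10 min cap
}
print(check_runtime(runs))
# → {'over_hard_cap': ['//svc:spiky_test'], 'over_p95_goal': ['//svc:slow_test']}
```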

Headline results (Apr 2022 → 2024)

  • PR-to-merge: 80 min avg → <30 min (sometimes 15 min).
  • BE builds via BwoB: 10 min → 5 min (2×).
  • ML builds via BwoB: 6 min → <2 min (3.3×).
  • BE/ML Pipeline v2: 49 → 35 min avg, ~50 % build-minutes cut.
  • Pipeline generation: >10 min → ~0 (static + S3-backed).
  • Agent warm-up: P95 40 → 10 min (-75 %); startup 27 → 8 min (-70 %).
  • RBE (partial rollout): TS builds 200 % faster; BE unit tests 25 % faster; BE compile+pack $0.262 → $0.21/build.
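
The percentages and speedup factors in this list can be rechecked from the before/after figures. The short sketch below does only that arithmetic; the helper names are illustrative.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


def speedup(before: float, after: float) -> float:
    """How many times faster `after` is than `before`."""
    return before / after


print(pct_reduction(40, 10))                  # warm-up P95 wait → 75.0
print(round(pct_reduction(27, 8), 1))         # agent startup → 70.4
print(speedup(10, 5))                         # BE builds via BwoB → 2.0
print(round(pct_reduction(49, 35), 1))        # BE/ML Pipeline v2 avg → 28.6
print(round(pct_reduction(0.262, 0.21), 1))   # BE compile+pack cost → 19.8
print(round(6 / 3.3, 2))                      # ML build time implied by 3.3× → 1.82
```

The last line shows that a 3.3× speedup from 6 min corresponds to roughly 1.8 min, consistent with the "<2 min" figure.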
