Canva CI

Canva CI is Canva's internal continuous-integration system: a Buildkite-orchestrated, Bazel-centric pipeline running on AWS EC2 worker pools. The 2024-12-16 retrospective documents its evolution from ~80 min PR-to-merge (Apr 2022) to <30 min (sometimes 15 min) over ~2 years (Source: sources/2024-12-16-canva-faster-ci-builds).

Scale & shape (as of the retrospective)

Build graph: >10⁵ nodes (Bazel reports ~900K).
Per-build jobs: ~3000 (P90) for each check-merge.
Daily check-merge builds: >1000 per workday.
Per-job footprint: 1 EC2 instance per job.
Per-build-minute cost (EOY 2022): ~$0.018.

Top-level flow

Engineer opens PR
  |
  v
check-merge  (trigger)
  |
  +----> BE/ML pipeline  (backend + ML builds / tests)
  +----> FE pipeline     (frontend builds / tests)
  +----> other pipelines
  |
  v
All green? -> merge to main
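
The merge gate at the bottom of the diagram can be sketched in a few lines of Python. This is a minimal illustration rather than Canva's implementation: the `PipelineResult` type, both functions, and the pipeline names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineResult:
    """Outcome of one sub-pipeline triggered by check-merge (hypothetical type)."""
    name: str
    passed: bool


def can_merge(results: list[PipelineResult]) -> bool:
    """Return True only when at least one pipeline ran and all of them passed."""
    return bool(results) and all(r.passed for r in results)


def failing_pipelines(results: list[PipelineResult]) -> list[str]:
    """Names of the sub-pipelines that block the merge."""
    return [r.name for r in results if not r.passed]


# Hypothetical outcome of one check-merge build.
results = [
    PipelineResult("be-ml", passed=True),
    PipelineResult("fe", passed=False),
    PipelineResult("other", passed=True),
]
print(can_merge(results))          # False
print(failing_pipelines(results))  # ['fe']
```

Treating an empty result list as "not mergeable" is a design choice in this sketch: a build that triggered no pipelines should not count as green.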

After the 2022-23 work:

BE/ML Pipeline v2 consolidated 45 steps into 16, cutting average duration from 49 to 35 min.
FE Pipeline v2 decommissioned per-page sub-pipelines and moved integration and accessibility (a11y) tests into Bazel (integration tests: ~100 → 8 jobs per commit).

Substrate

Bazel clients  (CI agents on EC2, per-pool right-sized)
   |
   |  --remote_download_minimal (BwoB)  + retry-on-eviction
   v
bazel-remote  (on every instance)  --->  S3 bucket (shared cache)
   |
   |  (partial) Remote Build Execution
   v
RBE worker pool  (TS builds / BE unit tests / BE compile+pack)
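
Retry-on-eviction can be sketched as a thin wrapper around the Bazel invocation. Bazel documents exit code 39 for builds that fail because blobs they need were evicted from the remote cache, which is the failure mode Build without the Bytes (BwoB) exposes. The wrapper below, its `run` parameter, the retry count, and the target pattern are assumptions for illustration; the source does not describe how Canva implements the retry.

```python
import subprocess
from typing import Callable, Sequence

# Bazel's documented exit code for "blobs required by the build were
# evicted from the remote cache".
REMOTE_CACHE_EVICTED = 39


def run_bazel(args: Sequence[str]) -> int:
    """Run Bazel and return its exit code (requires bazel on PATH)."""
    return subprocess.run(["bazel", *args]).returncode


def build_with_eviction_retry(
    args: Sequence[str],
    run: Callable[[Sequence[str]], int] = run_bazel,
    max_attempts: int = 3,  # assumed value, not from the source
) -> int:
    """Re-run the build while it fails only because of cache eviction."""
    code = REMOTE_CACHE_EVICTED
    for _ in range(max_attempts):
        code = run(args)
        if code != REMOTE_CACHE_EVICTED:
            break  # success or a genuine failure: do not retry
    return code


# Demo with a fake runner so the sketch runs without Bazel installed:
# the first attempt hits an eviction, the second succeeds.
fake_codes = iter([REMOTE_CACHE_EVICTED, 0])
exit_code = build_with_eviction_retry(
    ["build", "--remote_download_minimal", "//..."],
    run=lambda _args: next(fake_codes),
)
print(exit_code)  # 0
```

Only the eviction code triggers a retry, so ordinary compile or test failures still surface on the first attempt.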

Pipeline generation (post Jan 2024):

Commit pushed
   |
   +--> bazel-diff hash job (dedicated instances)
   |     |
   |     v
   |   S3 bucket (input-hash manifest per target)
   |
   +--> Static pipeline YAML (pre-generated, off critical path)

CI job starts
   |
   | downloads hash manifest from S3
   | decides which targets to execute
   | (fallback: let Bazel decide if download fails)
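
The job-start decision above can be sketched as a comparison of two hash manifests with a fallback when the download fails. This is a sketch under assumptions: the manifest shape (target label mapped to input hash), the comparison against a base commit, the `fetch` callable standing in for the S3 download, and the convention that `None` means "let Bazel decide" are all hypothetical. The source only states that jobs download the manifest, select targets, and fall back to Bazel on failure.

```python
from typing import Callable, Optional

Manifest = dict[str, str]  # target label -> input hash (assumed shape)


def changed_targets(base: Manifest, head: Manifest) -> list[str]:
    """Targets that are new at head or whose input hash differs from base."""
    return sorted(t for t, h in head.items() if base.get(t) != h)


def targets_to_run(
    fetch: Callable[[str], Manifest],
    base_commit: str,
    head_commit: str,
) -> Optional[list[str]]:
    """Return the targets to execute, or None to let Bazel decide.

    `fetch` stands in for the S3 download; any failure triggers the fallback.
    """
    try:
        base = fetch(base_commit)
        head = fetch(head_commit)
    except Exception:
        return None  # fallback: run everything and rely on Bazel's caching
    return changed_targets(base, head)


# In-memory stand-in for the S3 bucket.
store = {
    "base": {"//app:lib": "a1", "//app:test": "b2"},
    "head": {"//app:lib": "a1", "//app:test": "c3", "//app:new": "d4"},
}
print(targets_to_run(store.__getitem__, "base", "head"))
# ['//app:new', '//app:test']
print(targets_to_run(store.__getitem__, "base", "missing"))
# None
```

Because the manifest is computed ahead of time on dedicated instances, the per-job work reduces to a download and a dictionary comparison, which keeps target selection off the critical path.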

Worker pools

i4i.8xlarge (I/O-optimized, multi-SSD, NVMe-backed) serves I/O-heavy steps such as build-all; it cut the slowest step from 3 h to 15 min.
c6id.12xlarge (better CPU:mem:disk balance) was adopted in Mar 2024 and saved a further 2 to 6 min across builds.

Agent warm-ups use EBS snapshots pre-populated with caches: P95 wait 40 → 10 min, startup 27 → 8 min (patterns/snapshot-based-warmup).

Test architecture

Backend integration tests run in per-test TestContainers sandboxes inside Bazel, replacing the shared localstack harness. This makes them cacheable and parallelism-safe.
Service-container tests (a per-service TestContainer plus a launch-validation test) move detection of deployment failures into CI.
E2E tests compose service-container definitions for hermetic end-to-end validation.
Per-test runtime caps: 10 min hard, 5 min P95 goal.
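
The runtime caps in the last item can be checked with a short script. The sketch below assumes durations are available as a mapping from test label to seconds and uses a nearest-rank P95; the function names, the percentile method, and the sample data are hypothetical, and only the two thresholds come from the text.

```python
HARD_CAP_S = 10 * 60   # 10 min hard cap (from the text)
P95_GOAL_S = 5 * 60    # 5 min P95 goal (from the text)


def p95(durations: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(durations)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def check_caps(durations: dict[str, float]) -> tuple[list[str], bool]:
    """Return (tests over the hard cap, whether the P95 goal is met)."""
    over_cap = sorted(t for t, d in durations.items() if d > HARD_CAP_S)
    goal_met = p95(list(durations.values())) <= P95_GOAL_S
    return over_cap, goal_met


# Hypothetical durations in seconds.
sample = {
    "//a:test": 42.0,
    "//b:test": 130.0,
    "//c:test": 290.0,
    "//d:test": 615.0,
}
print(check_caps(sample))  # (['//d:test'], False)
```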

Key systems it depends on

systems/bazel — build system + Starlark pipeline generator.
systems/bazel-remote — shared CI cache backed by S3.
systems/buildkite — CI pipeline runner / UI.
systems/testcontainers — per-test hermetic storage.
systems/aws-ec2 — worker compute.
systems/aws-ebs — agent block storage + snapshot warm-ups.

Headline results (Apr 2022 → 2024)

PR-to-merge: 80 min avg → <30 min (sometimes 15 min).
BE builds via BwoB: 10 min → 5 min (2×).
ML builds via BwoB: 6 min → <2 min (3.3×).
BE/ML Pipeline v2: 49 → 35 min avg, ~50 % build-minutes cut.
Pipeline generation: >10 min → ~0 (static + S3-backed).
Agent warm-up: P95 40 → 10 min (-75 %); startup 27 → 8 min (-70 %).
RBE (partial rollout): TS builds 200 % faster; BE unit tests 25 % faster; BE compile+pack $0.262 → $0.21/build.
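
The percentages and multipliers above follow from the before/after figures; a quick check, using only numbers quoted in this section (the helper names are illustrative):

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction going from before to after."""
    return (before - after) / before * 100


def speedup(before: float, after: float) -> float:
    """How many times faster the after value is."""
    return before / after


print(round(reduction_pct(40, 10)))        # 75  (agent wait, P95)
print(round(reduction_pct(27, 8)))         # 70  (agent startup)
print(round(speedup(10, 5), 1))            # 2.0 (BE builds via BwoB)
print(round(reduction_pct(49, 35)))        # 29  (BE/ML Pipeline v2 duration)
print(round(reduction_pct(0.262, 0.21)))   # 20  (BE compile+pack cost per build)
```

The BE/ML Pipeline v2 duration falls by about 29 %, which is a different measure from the ~50 % cut in build-minutes quoted on the same line.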

Seen in

sources/2024-12-16-canva-faster-ci-builds — full retrospective covering the 2022-24 optimization journey.