Canva CI¶
Canva CI is Canva's internal continuous-integration system: a Buildkite-orchestrated, Bazel-centric pipeline running on AWS EC2 worker pools. The 2024-12-16 retrospective documents its evolution from ~80 min PR-to-merge (Apr 2022) to <30 min (sometimes 15 min) over ~2 years (Source: sources/2024-12-16-canva-faster-ci-builds).
Scale & shape (as of the retrospective)¶
- Build graph: >10⁵ nodes (Bazel reports ~900K).
- Per-build jobs: ~3000 (P90) for each check-merge.
- Daily check-merge builds: >1000 per workday.
- Per-job footprint: 1 EC2 instance per job.
- Per-build-minute cost (EOY 2022): ~$0.018.
Top-level flow¶
Engineer opens PR
|
v
check-merge (trigger)
|
+----> BE/ML pipeline (backend + ML builds / tests)
+----> FE pipeline (frontend builds / tests)
+----> other pipelines
|
v
All green? -> merge to main
After the 2022-23 work:
- BE/ML Pipeline v2 grouped 45 → 16 steps; 49 → 35 min avg.
- FE Pipeline v2 decommissioned per-page sub-pipelines and moved integration and accessibility (a11y) tests into Bazel (integration tests: ~100 → 8 jobs per commit).
Substrate¶
Bazel clients (CI agents on EC2, per-pool right-sized)
|
| --remote_download_minimal (BwoB) + retry-on-eviction
v
bazel-remote (on every instance) ---> S3 bucket (shared cache)
|
| (partial) Remote Build Execution
v
RBE worker pool (TS builds / BE unit tests / BE compile+pack)
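The client-side half of the diagram (build without the bytes plus retry-on-eviction) can be sketched as a thin wrapper around the Bazel invocation. This is a minimal sketch, not Canva's implementation: the function names, the `max_attempts` default, and the injectable `runner` are hypothetical, and it assumes Bazel reports evicted remote-cache blobs with exit code 39, which should be checked against the Bazel version in use.

```python
import subprocess
from typing import Callable, Sequence

# Assumption: Bazel exits with 39 when blobs required by the build were
# evicted from the remote cache. Verify for your Bazel version.
REMOTE_CACHE_EVICTED = 39


def run_bazel(argv: Sequence[str]) -> int:
    """Run one Bazel invocation and return its exit code."""
    return subprocess.run(list(argv)).returncode


def build_with_eviction_retry(
    targets: Sequence[str],
    max_attempts: int = 3,  # hypothetical default, not from the source
    runner: Callable[[Sequence[str]], int] = run_bazel,
) -> int:
    """Build without the bytes; re-run only when the cache evicted blobs."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    argv = ["bazel", "build", "--remote_download_minimal", *targets]
    code = REMOTE_CACHE_EVICTED
    for _attempt in range(max_attempts):
        code = runner(argv)
        if code != REMOTE_CACHE_EVICTED:
            return code  # success, or a failure that a retry will not fix
    return code  # still evicted after the final attempt
```

Retrying only on the eviction code keeps genuine build failures fast: a compile error returns immediately instead of being re-run.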
Pipeline generation (post Jan 2024):
Commit pushed
|
+--> bazel-diff hash job (dedicated instances)
| |
| v
| S3 bucket (input-hash manifest per target)
|
+--> Static pipeline YAML (pre-generated, off critical path)
CI job starts
|
| downloads hash manifest from S3
| decides which targets to execute
| (fallback: let Bazel decide if download fails)
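The job-side decision in the flow above might look like the following sketch. Several details are assumptions for illustration only: the manifest is taken to be a JSON object mapping target label to input hash, the job is assumed to compare the manifest for its commit against one for a base commit, and `fetch_manifest` is a hypothetical callable standing in for the S3 download. The source states only that the job downloads the manifest, decides which targets to execute, and falls back to Bazel if the download fails.

```python
import json
from typing import Callable, Dict, List, Optional


def changed_targets(base: Dict[str, str], head: Dict[str, str]) -> List[str]:
    """Targets whose input hash is new or differs from the base commit."""
    return sorted(t for t, h in head.items() if base.get(t) != h)


def plan_targets(
    fetch_manifest: Callable[[str], str],  # hypothetical S3 download hook
    base_commit: str,
    head_commit: str,
) -> Optional[List[str]]:
    """Return the targets to run, or None to let Bazel decide."""
    try:
        base = json.loads(fetch_manifest(base_commit))
        head = json.loads(fetch_manifest(head_commit))
    except (OSError, KeyError, ValueError):
        # Download or parse failed: fall back rather than block the job.
        return None
    return changed_targets(base, head)
```

Returning `None` instead of an empty list keeps "nothing changed" distinct from "manifest unavailable", so the caller can skip work in the first case and hand the full target set to Bazel in the second.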
Worker pools¶
- i4i.8xlarge (I/O-optimized, multi-SSD, NVMe-backed): for heavy-I/O steps like the build-all step. Took the slowest step from 3 h → 15 min.
- c6id.12xlarge (better CPU:mem:disk balance): adopted Mar 2024; saved another 2 to 6 min across builds.
Agent warm-ups use EBS snapshots pre-populated with caches: P95 wait 40 → 10 min, startup 27 → 8 min (patterns/snapshot-based-warmup).
Test architecture¶
- Backend integration tests run in per-test TestContainers sandboxes inside Bazel, replacing the shared localstack harness. Cacheable and parallelism-safe.
- Service-container tests (a per-service TestContainer plus a launch-validation test) shift deployment failures to CI.
- E2E tests compose service-container definitions for hermetic end-to-end validation.
- Per-test runtime caps: 10 min hard, 5 min P95 goal.
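The runtime caps in the last bullet could be checked against recorded durations roughly as follows. This is a sketch under assumptions: it reads the P95 goal as applying to each test's own run history, uses a nearest-rank percentile, and the function names and report shape are hypothetical rather than anything the retrospective describes.

```python
from typing import Dict, List

HARD_CAP_MIN = 10.0  # per-test hard cap, in minutes
P95_GOAL_MIN = 5.0   # per-test P95 goal, in minutes


def p95(durations: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of durations."""
    if not durations:
        raise ValueError("need at least one duration")
    ordered = sorted(durations)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def check_caps(runs_by_test: Dict[str, List[float]]) -> Dict[str, List[str]]:
    """List tests that broke the hard cap or miss the P95 goal."""
    over_hard = sorted(
        name for name, runs in runs_by_test.items() if max(runs) > HARD_CAP_MIN
    )
    over_goal = sorted(
        name for name, runs in runs_by_test.items() if p95(runs) > P95_GOAL_MIN
    )
    return {"over_hard_cap": over_hard, "over_p95_goal": over_goal}
```

A test with a single 11-minute outlier in twenty runs would appear under `over_hard_cap` but not `over_p95_goal`, which separates one-off timeouts from consistently slow tests.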
Key systems it depends on¶
- systems/bazel — build system + Starlark pipeline generator.
- systems/bazel-remote — shared CI cache backed by S3.
- systems/buildkite — CI pipeline runner / UI.
- systems/testcontainers — per-test hermetic storage.
- systems/aws-ec2 — worker compute.
- systems/aws-ebs — agent block storage + snapshot warm-ups.
Headline results (Apr 2022 → 2024)¶
- PR-to-merge: 80 min avg → <30 min (sometimes 15 min).
- BE builds via BwoB: 10 min → 5 min (2×).
- ML builds via BwoB: 6 min → <2 min (3.3×).
- BE/ML Pipeline v2: 49 → 35 min avg, ~50 % build-minutes cut.
- Pipeline generation: >10 min → ~0 (static + S3-backed).
- Agent warm-up: P95 40 → 10 min (-75 %); startup 27 → 8 min (-70 %).
- RBE (partial rollout): TS builds 200 % faster; BE unit tests 25 % faster; BE compile+pack $0.262 → $0.21/build.
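The parenthesized factors and percentages above follow from the before/after pairs, and the same arithmetic gives the two reductions the list leaves implicit. The helper names are hypothetical, and the derived figures are computed only from the numbers quoted in the list; the source does not report them separately.

```python
def speedup(before: float, after: float) -> float:
    """How many times faster: before / after."""
    return before / after


def reduction_pct(before: float, after: float) -> float:
    """Percentage decrease from before to after."""
    return (before - after) / before * 100


# Figures stated in the list
print(speedup(10, 5))                         # BE builds via BwoB: 2.0x
print(reduction_pct(40, 10))                  # warm-up P95 wait: 75.0 %
print(round(reduction_pct(27, 8)))            # agent startup: 70 %

# Figures derived from the list, not stated in it
print(round(reduction_pct(49, 35), 1))        # BE/ML Pipeline v2 avg: 28.6 %
print(round(reduction_pct(0.262, 0.21), 1))   # BE compile+pack cost: 19.8 %
```

The ~29 % drop in average BE/ML pipeline duration is a different quantity from the ~50 % build-minutes cut on the same line: the first is wall-clock time per build, the second is total compute consumed.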
Related¶
- companies/canva — Canva wiki index.
- systems/bazel, systems/bazel-remote, systems/buildkite, systems/testcontainers.
- concepts/hermetic-build, concepts/content-addressed-caching, concepts/critical-path, concepts/first-principles-theoretical-limit, concepts/build-graph, concepts/remote-build-execution.
- patterns/build-without-the-bytes, patterns/pipeline-step-consolidation, patterns/static-pipeline-generation, patterns/instance-shape-right-sizing, patterns/snapshot-based-warmup, patterns/shaping-vs-building.
Seen in¶
- sources/2024-12-16-canva-faster-ci-builds — full retrospective covering the 2022-24 optimization journey.