
PATTERN Cited by 1 source

Pipeline step consolidation

When "more steps" becomes a tax instead of a speedup, group related hermetic build/test actions into fewer, larger CI steps so the per-step overhead (VM warm-up, cache hydration, image pull) amortises over more work. Canva's BE/ML Pipeline v2 went from 45 → 16 steps and cut average build time 49 → 35 min with ~50 % fewer build minutes (Source: sources/2024-12-16-canva-faster-ci-builds).

Intent

Horizontal CI scale-out has a sweet spot: too few steps and you lose parallelism; too many and per-step fixed costs dominate. At Canva's scale (>1000 check-merge builds/day, ~3000 jobs per build, each job = 1 EC2 instance), the per-step VM warm-up tax was eating the parallelism gain.

The pattern: once hermetic caching (concepts/hermetic-build + concepts/content-addressed-caching) means re-running a step is cheap on cache hits, the right step granularity shifts dramatically — from "one step per target" toward "one step per meaningful grouping", because Bazel inside the step can do its own parallelism and caching.
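The sweet spot described above can be illustrated with a deliberately simple cost model. This is a sketch under stated assumptions, not Canva's analysis: the function name `estimate`, the even split of work across steps, and the example figures (600 minutes of work, 5 minutes of fixed overhead per step) are all hypothetical, and the model ignores queueing, caching, and parallelism inside a step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepCosts:
    wall_clock_min: float   # duration of one step when all steps run in parallel
    billed_min: float       # VM minutes summed over every step
    overhead_share: float   # fraction of billed minutes spent on fixed overhead


def estimate(total_work_min: float, overhead_min: float, steps: int) -> StepCosts:
    """Toy model: work splits evenly, each step pays the same fixed overhead."""
    if steps < 1:
        raise ValueError("steps must be >= 1")
    per_step = overhead_min + total_work_min / steps
    billed = per_step * steps
    return StepCosts(per_step, billed, overhead_min * steps / billed)


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the shape of the curve.
    for n in (4, 16, 45, 100):
        c = estimate(total_work_min=600, overhead_min=5, steps=n)
        print(f"{n:>3} steps: wall={c.wall_clock_min:6.1f} min  "
              f"billed={c.billed_min:7.1f} min  overhead={c.overhead_share:.0%}")
```

In this model billed minutes grow linearly with the step count while the wall-clock gain from each extra step shrinks, so past some point additional steps mostly buy overhead. Real pipelines add queueing and cache effects that the sketch leaves out.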

Mechanism

  1. Identify cache-friendly groupings. If two actions are always run together and share most inputs, one step with two Bazel targets is almost always cheaper than two steps.
  2. Preserve the critical-path distinctions. Don't merge steps that should report independently for critical-path diagnosis. Canva kept BE/ML, FE, and other pipelines separate for this reason.
  3. Amortise per-step fixed costs:
       • VM cold-start
       • Docker image pull
       • Bazel JVM warm-up (minutes with a 900K-node graph)
       • Remote cache handshake
       • Git checkout
  4. Trust the inner parallelism. Bazel parallelises across targets within a step automatically; consolidation doesn't lose parallel execution, just parallel provisioning.
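As a rough illustration of the first step in the list above, the sketch below groups targets whose input sets overlap heavily and emits one `bazel test` invocation per group. It is an assumption-laden example, not Canva's tooling: the greedy Jaccard heuristic, the 0.5 threshold, the function names, and the target labels are all hypothetical.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two input sets: 1.0 means identical, 0.0 means disjoint."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_targets(inputs_by_target: dict[str, set[str]],
                  threshold: float = 0.5) -> list[list[str]]:
    """Greedily place each target in the first group it overlaps enough with."""
    groups: list[tuple[set[str], list[str]]] = []
    for target in sorted(inputs_by_target):
        inputs = inputs_by_target[target]
        for group_inputs, members in groups:
            if jaccard(inputs, group_inputs) >= threshold:
                members.append(target)
                group_inputs |= inputs  # in-place update of the group's inputs
                break
        else:
            groups.append((set(inputs), [target]))
    return [members for _, members in groups]


def step_commands(groups: list[list[str]]) -> list[str]:
    """One CI step per group; Bazel parallelises the targets inside it."""
    return ["bazel test " + " ".join(members) for members in groups]


if __name__ == "__main__":
    # Hypothetical targets and input files.
    example = {
        "//api:test": {"api.py", "common.py", "db.py"},
        "//api:lint": {"api.py", "common.py"},
        "//web:test": {"web.ts", "ui.ts"},
    }
    for command in step_commands(group_targets(example)):
        print(command)
```

Passing several targets to a single `bazel test` invocation is what lets the inner build system schedule them in parallel inside one step. A production grouping would also need to respect the critical-path boundaries from step 2.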

Canva's measurements

BE/ML Pipeline v2 (Apr 2023):

  • Steps: 45 → 16 (-64 %)
  • Average build time: 49 min → 35 min (-29 %)
  • Build minutes: ~-50 %

FE integration tests (bazelified + consolidated, Sep 2023):

  • Previously: ~100 jobs/commit, each its own VM.
  • After: ~8 jobs/commit (grouped).
  • Effect: ~1.3 M jobs/month removed — eliminating their warm-up and scheduling tax.

FE accessibility tests (bazelified + consolidated):

  • Previously: 67,145 jobs across 13,947 commits (~4.8 jobs/commit).
  • After: ~1 job/commit.
  • Expected: ~80 % time cut and ~$100K/yr savings.
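The ratios quoted in these measurements can be re-derived from the raw figures. The snippet below is only a convenience check; the helper name `pct_change` is an assumption of this example, and the inputs are the numbers reported above.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    # BE/ML Pipeline v2
    print(f"steps:          {pct_change(45, 16):+.0f} %")
    print(f"avg build time: {pct_change(49, 35):+.0f} %")
    # FE integration tests, jobs per commit
    print(f"FE integ jobs:  {pct_change(100, 8):+.0f} %")
    # FE accessibility tests before consolidation
    print(f"a11y jobs/commit before: {67_145 / 13_947:.1f}")
```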

Preconditions

  • Hermetic actions inside the grouped step. Without hermeticity, grouping amplifies flakes and state leaks — the original reason Canva had one-VM-per-step.
  • Working remote cache. The benefit depends on cache hits within the step being common.
  • Inner build system with native parallelism. Bazel is the canonical case. Without it, consolidation loses parallel execution.

Trade-offs

  • Observability granularity drops. Per-step logs / metrics / UX show "this whole group passed / failed" instead of one signal per target. Engineers have to learn that the finer-grained results now sit inside the step.
  • Reporting UX changes. Canva called out this risk before rollout: the change broke some downstream dependencies (observability tools owned by other teams) that had been relying on per-step signals.
  • Blast radius is larger per step. A crash in a grouped step takes all its work down together.

Composes with

Seen in

  • sources/2024-12-16-canva-faster-ci-builds — BE/ML v2 (45→16 steps, 49→35 min, ~50 % build-minutes cut); FE integ test bazelification (~100→8 jobs/commit, ~1.3 M jobs/month removed); FE a11y bazelification (67K→~14K jobs).