PATTERN Cited by 1 source

Instance-shape right-sizing for CI workers

Pick the EC2 (or equivalent) instance type by workload shape instead of running everything on a general-purpose pool: I/O-dominated steps get I/O-optimized instances with multiple SSDs and high NVMe throughput; CPU-dominated steps get CPU-balanced instances. Canva applied this in two rounds: the first cut their slowest step from 3 h to ~15 min, and a later rebalance saved another 2–6 min per build (Source: sources/2024-12-16-canva-faster-ci-builds).

Intent

Generic instance pools optimize for "acceptable across any workload", which means nothing is particularly well-matched. For CI, where the critical-path step's runtime directly bounds wall-clock time, the payoff of a workload-specific pool can be an order of magnitude.

The diagnostic discipline:

  1. Identify the critical-path step (concepts/critical-path).
  2. Instrument it — top, iotop, IOPS, throughput, CPU saturation — on the current pool.
  3. Pick an instance family whose resource balance matches the observed bottleneck.
  4. Measure. Repeat from step 1.
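Steps 2–3 reduce to a simple classification: sample CPU saturation and disk utilization while the step runs, then name the dominant bottleneck. A minimal sketch, assuming averaged 0–1 utilization fractions as inputs; the `classify_bottleneck` helper and its threshold are illustrative, not from the source:

```python
def classify_bottleneck(cpu_saturation: float, disk_utilization: float,
                        threshold: float = 0.8) -> str:
    """Name the dominant bottleneck from metrics sampled during the step.

    Inputs are 0.0-1.0 fractions, e.g. averaged from top / iotop samples
    taken while the critical-path step runs on the current pool.
    The threshold is illustrative; calibrate against your own fleet.
    """
    io_bound = disk_utilization >= threshold
    cpu_bound = cpu_saturation >= threshold
    if io_bound and not cpu_bound:
        return "io"        # candidate: I/O-optimized instances
    if cpu_bound and not io_bound:
        return "cpu"       # candidate: compute-optimized instances
    return "balanced"      # neither (or both) dominates: general-purpose

# Canva's round-1 observation: disk maxed out, CPU not the limiter.
print(classify_bottleneck(cpu_saturation=0.35, disk_utilization=0.97))  # io
```

When both resources saturate, the classifier punts to "balanced"; in practice that case means the step needs a bigger instance of the same shape, or splitting.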

Canva's two rounds

Round 1 — the slow build-all step (Bazel, heavy I/O):

  • Experiment: giant instance (448 CPUs, 6 TB RAM) to bound the theoretical floor — still took 18 min cold, confirming a single-CPU critical-path action.
  • Observed that disk IOPS + throughput were the bottleneck for the practical workload.
  • Switched the build-all pool to i4i.8xlarge — multi-SSD, NVMe-backed, I/O-optimized.
  • Result: 3 h → ~15 min on cold cache.

Round 2 — later agent-pool rebalance:

  • Shape mismatch: i4i.8xlarge is I/O-heavy; later Canva workloads wanted more balanced CPU/mem.
  • Switched i4i.8xlarge → c6id.12xlarge for the affected pools.
  • Result: -2 to -6 min across builds (Mar 2024).

Why it works

  • The bottleneck moves. Once I/O is no longer the bottleneck, an I/O-optimized instance is just more expensive than needed. Continuous right-sizing follows the concepts/critical-path as it shifts.
  • Instance families encode hardware balance. AWS exposes the trade-off deliberately: i family (IOPS), c family (compute), m family (balanced), r family (memory), g/p (GPU). The prefix is the relevant knob.
  • Pool-level choice, not per-step. Mixing instance types per step adds scheduling complexity. Mapping "pipeline group → pool" gives the benefit with minimal orchestration tax.
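The family-prefix knob from the second bullet can be written down as a lookup table. This restates the AWS prefixes listed above; the helper itself is an illustrative sketch:

```python
# AWS instance-family prefixes and the resource balance they encode,
# as listed above; other clouds expose equivalent families.
FAMILY_FOR_BOTTLENECK = {
    "io": "i",        # IOPS / NVMe throughput (e.g. i4i.8xlarge)
    "cpu": "c",       # compute-optimized
    "balanced": "m",  # general-purpose CPU/memory balance
    "memory": "r",    # memory-optimized
    "gpu": "g",       # GPU (also the p family for larger accelerators)
}

def family_for(bottleneck: str) -> str:
    """Return the AWS family prefix matching an observed bottleneck."""
    return FAMILY_FOR_BOTTLENECK[bottleneck]

print(family_for("io"))  # prints "i", the prefix Canva's build-all pool moved to
```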

Mechanics

  • Separate worker pools per dominant-workload class (I/O, CPU, balanced).
  • Route CI steps by label or tag to the appropriate pool.
  • Use warm-up / capacity plans per pool so demand spikes don't queue on the wrong shape.
  • Revisit every time the critical-path step migrates (which, per concepts/critical-path, it does repeatedly).
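The first two mechanics (per-class pools, label-based routing) amount to a small "pipeline group → pool" table. A sketch, in which the pool and label names are hypothetical:

```python
# Hypothetical mapping from CI step labels to worker pools; each pool
# runs one instance shape matched to its dominant workload class.
POOL_FOR_LABEL = {
    "build-all": "io-pool",     # I/O-optimized instances
    "compile": "cpu-pool",      # compute-optimized instances
}
DEFAULT_POOL = "balanced-pool"  # general-purpose fallback

def route_step(labels: list[str]) -> str:
    """Route a CI step to the pool of its first matching label."""
    for label in labels:
        if label in POOL_FOR_LABEL:
            return POOL_FOR_LABEL[label]
    return DEFAULT_POOL

print(route_step(["build-all"]))  # io-pool
print(route_step(["lint"]))       # balanced-pool
```

Keeping the table small is the point: one row per pipeline group, not per step, so the orchestration tax stays minimal.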

Methodology that matters

Canva's methodology — not just the outcome — is the pattern:

"We first tried executing bazel build //... to see how long it'd take to build. The Bazel JVM ran out of memory because the graph was too big. … We ran that tens of times in test instances, monitoring the execution with tools like top and iotop to see where the bottlenecks were. Observing that disk IOPS and throughput looked like the main bottlenecks, we tested with larger instances to see if there was an opportunity to use more RAM and less disk…"

Shape experiments on a test instance, then scale to production. Pairs naturally with shaping-vs-building.

Seen in
