PATTERN Cited by 1 source
Snapshot-based warm-up (EBS snapshots for CI agents)¶
Preload CI agent caches into an EBS snapshot. New agents boot from the snapshot with caches already populated, so the first build or test action doesn't pay a cold-cache startup tax. Canva's rollout cut P95 large-agent wait from 40 min to 10 min (-75 %) and agent startup from 27 min to 8 min (-70 %) (Source: sources/2024-12-16-canva-faster-ci-builds).
Intent¶
CI agents that rely on large local caches (Bazel disk cache,
Docker image cache, toolchain binaries, node_modules) have a
cold-start problem. On scale-up events a fresh VM must fetch
gigabytes before it can do useful work. The user-visible cost is
agent wait time; the dollar cost is minutes of billed-but-idle
compute.
Snapshot-based warm-up moves the cache-hydration work out of the critical path by baking a warm cache into a snapshot the new agent restores from at boot.
Mechanism¶
- A "golden agent" builds a representative set of targets to populate caches (Bazel disk cache, container images, toolchain, etc.).
- Snapshot its EBS volume. Freeze a point-in-time image.
- Publish the snapshot. Give the scale-up auto-scaling group / launch template a reference to the latest snapshot.
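- New agents boot from the snapshot. EBS lazy-loads blocks on first access, so the volume is usable immediately and the cached working set is present without a separate hydration step; each block pays a one-time fetch latency the first time it is read.
- Refresh the snapshot regularly (nightly / on-change) so the snapshotted cache does not drift far from the current state of the repository.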
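The "publish" step above can be sketched in plain Python. This is a minimal illustration, not Canva's implementation: the record fields (`SnapshotId`, `State`, `StartTime`) and the block-device-mapping layout follow the shapes used by the EC2 API, while the snapshot IDs, the device name `/dev/sdf`, the `gp3` volume type, and both function names are hypothetical.

```python
from datetime import datetime, timezone


def latest_completed_snapshot(snapshots):
    """Return the newest record whose State is 'completed', or None if there is none."""
    done = [s for s in snapshots if s["State"] == "completed"]
    return max(done, key=lambda s: s["StartTime"], default=None)


def block_device_mapping(snapshot_id, device_name="/dev/sdf"):
    """Build a launch-template style mapping that restores the cache volume from a snapshot."""
    return {
        "DeviceName": device_name,  # hypothetical device name for the cache volume
        "Ebs": {
            "SnapshotId": snapshot_id,
            "VolumeType": "gp3",  # assumption; choose whatever the pool needs
            "DeleteOnTermination": True,
        },
    }


# Hypothetical snapshot records, as a golden-agent pipeline might list them.
snapshots = [
    {"SnapshotId": "snap-hypothetical-a", "State": "completed",
     "StartTime": datetime(2024, 12, 1, 2, 0, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-hypothetical-b", "State": "completed",
     "StartTime": datetime(2024, 12, 2, 2, 0, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-hypothetical-c", "State": "pending",
     "StartTime": datetime(2024, 12, 3, 2, 0, tzinfo=timezone.utc)},
]

chosen = latest_completed_snapshot(snapshots)
mapping = block_device_mapping(chosen["SnapshotId"])
```

Skipping snapshots that are still `pending` means a half-finished refresh is never published; the auto-scaling group keeps launching from the previous completed snapshot until the new one is ready.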
Canva's numbers¶
- P95 wait time for all large agents: 40 min → 10 min (-75 %).
- Agent startup time: 27 min → 8 min (-70 %).
- Additional cost savings from reducing "alive but not working" minutes.
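The percentages above follow directly from the before/after figures, and the same arithmetic gives a rough estimate of idle-compute savings. In the sketch below, only the 40/10 and 27/8 minute figures come from the source; the launch count and hourly price are made-up inputs for illustration, not Canva data.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


def idle_cost_saved(launches, minutes_saved_per_launch, price_per_hour):
    """Billed-but-idle compute avoided, in the same currency as price_per_hour."""
    return launches * minutes_saved_per_launch / 60 * price_per_hour


wait_reduction = percent_reduction(40, 10)    # P95 wait: 75.0 percent
startup_reduction = percent_reduction(27, 8)  # startup: roughly 70.4 percent
minutes_saved = 27 - 8                        # 19 minutes per agent launch

# Hypothetical fleet: 1,000 agent launches per month at 1.00 per instance-hour.
monthly_saving = idle_cost_saved(launches=1000,
                                 minutes_saved_per_launch=minutes_saved,
                                 price_per_hour=1.00)
```

With these assumed inputs the saving works out to about 317 per month; the real figure depends entirely on launch volume and instance pricing.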
Why this is off-the-critical-path but still worth doing¶
Agent warm-up usually isn't on the critical path of any one build because Canva has enough warm agents that a build rarely waits. But:
- Spikes happen. Burst demand still has to scale up; cold agents during spikes are on the critical path for those builds.
- Cost is always on. Even when not user-visible, every minute of warm-up you eliminate is billed-but-idle compute saved.
- Reliability signal improves. Faster startup makes fleet-wide rollouts less risky because rollbacks can also happen faster.
Preconditions¶
- EBS (or equivalent snapshotable block storage). The pattern needs a storage primitive that supports point-in-time copies AND lazy-load on first access.
- Cache state is re-creatable. The workflow must tolerate a slightly stale snapshot: first-access cache misses are filled in normally.
- Monotonic cache structure. Bazel's content-addressed cache is ideal; randomly-evicted caches can't be snapshotted safely.
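Why a content-addressed cache tolerates a stale snapshot can be shown with a toy model. The class below is a simplified sketch, loosely inspired by how Bazel keys cache entries on a digest of their inputs; it is not Bazel's actual cache format, and every name in it is hypothetical.

```python
import hashlib


class ContentAddressedCache:
    """Toy cache: the key is a SHA-256 digest of the inputs, so an entry can never go stale."""

    def __init__(self, entries=None):
        self._blobs = dict(entries or {})

    @staticmethod
    def key_for(inputs: bytes) -> str:
        return hashlib.sha256(inputs).hexdigest()

    def get(self, inputs: bytes):
        """Return the cached output for these exact inputs, or None on a miss."""
        return self._blobs.get(self.key_for(inputs))

    def put(self, inputs: bytes, output: bytes) -> None:
        self._blobs[self.key_for(inputs)] = output

    def snapshot(self) -> "ContentAddressedCache":
        """Point-in-time copy, standing in for an EBS snapshot of the cache volume."""
        return ContentAddressedCache(self._blobs)


golden = ContentAddressedCache()
golden.put(b"src v1", b"obj v1")   # golden agent builds the current sources
restored = golden.snapshot()       # new agent boots from the snapshot

hit = restored.get(b"src v1")      # unchanged input: served from the snapshot
miss = restored.get(b"src v2")     # changed input: a plain miss, never a wrong hit
restored.put(b"src v2", b"obj v2") # the miss is filled in normally
```

Because changed inputs hash to a new key, a stale snapshot can only cost extra misses; it cannot return an output built from outdated inputs.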
Variations¶
- AMI-baked vs snapshot-attached. Baking the cache into the AMI is simpler but yields a larger image that is slower to refresh; attaching a snapshot to a small AMI is faster to refresh.
- Per-pool snapshots. Match the patterns/instance-shape-right-sizing pool split — I/O-pool and CPU-pool get different warm-ups.
- Scheduled refresh vs on-change. Canva-scale orgs typically refresh nightly; smaller teams may refresh on each main-branch merge.
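The two refresh policies can be expressed as one small decision function. This is an illustrative sketch only: the 24-hour age limit, the merge threshold, and the function name are assumptions, not values from the source.

```python
from datetime import datetime, timedelta, timezone


def should_refresh(snapshot_time, now, merges_since_snapshot,
                   max_age=timedelta(hours=24), max_merges=None):
    """Decide whether to rebuild the warm snapshot.

    Scheduled policy: refresh once the snapshot is at least `max_age` old.
    On-change policy: set `max_merges` (for example 1) to also refresh after
    that many main-branch merges.
    """
    if now - snapshot_time >= max_age:
        return True
    if max_merges is not None and merges_since_snapshot >= max_merges:
        return True
    return False


taken = datetime(2024, 12, 16, 2, 0, tzinfo=timezone.utc)
six_hours_later = taken + timedelta(hours=6)

# Nightly policy: a six-hour-old snapshot is kept even after several merges.
nightly = should_refresh(taken, six_hours_later, merges_since_snapshot=5)

# On-change policy: a single merge is enough to trigger a refresh.
on_change = should_refresh(taken, six_hours_later, merges_since_snapshot=1,
                           max_merges=1)
```

Here `nightly` evaluates to `False` and `on_change` to `True`, which mirrors the trade-off in the bullet above: the nightly policy accepts more drift in exchange for fewer golden-agent builds.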
Related¶
- systems/aws-ebs — the storage substrate.
- systems/aws-ec2 — the agent host.
- concepts/critical-path — while warm-up is usually off-path, this pattern keeps it off-path at peak demand.
Seen in¶
- sources/2024-12-16-canva-faster-ci-builds — EBS-snapshot warm-up cut P95 wait 40→10 min and startup 27→8 min.