CONCEPT Cited by 1 source

Critical path (distributed DAG)¶

Definition¶

In a DAG of dependent actions (build targets, pipeline steps, tasks), the critical path is the longest chain of dependent work, measured by duration, from a root of the graph to a sink. Total wall-clock time is bounded below by the duration of the critical path, no matter how many workers you add for the rest of the graph.

A corollary: speeding up an action that is not on the critical path does not reduce total time; at best it reduces cost. To shorten the build, identify which actions are currently on the critical path and attack those first.
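
The definition and its corollary can be sketched in a few lines of Python. This is a minimal illustration, not Canva's tooling: the function `critical_path`, the action names, and the durations are all hypothetical, and it assumes unlimited workers so that every action starts as soon as its dependencies finish.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(durations, deps):
    """Return (total_minutes, path) for the longest chain of dependent actions.

    durations: action name -> wall-clock minutes
    deps:      action name -> set of actions that must finish first
    """
    finish = {}  # earliest possible finish time of each action
    prev = {}    # the dependency that determines that finish time
    graph = {action: deps.get(action, set()) for action in durations}
    for action in TopologicalSorter(graph).static_order():
        # The action can only start once its slowest dependency is done.
        blocker = max(deps.get(action, ()), key=lambda d: finish[d], default=None)
        start = finish[blocker] if blocker is not None else 0
        finish[action] = start + durations[action]
        prev[action] = blocker
    # Walk back from the action that finishes last.
    node = max(finish, key=finish.get)
    total, path = finish[node], []
    while node is not None:
        path.append(node)
        node = prev[node]
    return total, path[::-1]


# Hypothetical pipeline, durations in minutes.
durations = {"compile": 10, "lint": 1, "unit_tests": 3,
             "integration_tests": 10, "package": 2}
deps = {"unit_tests": {"compile"},
        "integration_tests": {"compile"},
        "package": {"unit_tests", "integration_tests"}}

print(critical_path(durations, deps))
# (22, ['compile', 'integration_tests', 'package'])

# Speeding up an off-path action leaves the total unchanged ...
print(critical_path({**durations, "unit_tests": 1}, deps)[0])         # 22
# ... while speeding up an on-path action shortens it.
print(critical_path({**durations, "integration_tests": 5}, deps)[0])  # 17
```

The total is a lower bound: with fewer workers than parallel actions, the real wall-clock time can only be longer.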

Canva's framing¶

The post names it explicitly:

CI performance is bound by its longest stretch of dependent actions. Because our CI has so many dependencies, it's difficult to avoid regressions even when we improve things. One bad downstream dependency makes CI build times longer for everyone.

And on flaky tests:

If one test consistently takes 20 minutes to execute and flakes, and has some logic to retry on failure, let's say up to 3 times, it'll take up to 60 minutes. It doesn't matter if all other builds and tests execute in 30 seconds. That one slow, flaky test holds everyone's builds back for up to 1 hour.
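
The arithmetic in that quote can be written out as a small sketch. It assumes that "up to 3 times" means three attempts in total (which matches the 60-minute figure) and that all tests run in parallel, so the build waits only for the slowest one. The helper name `worst_case_minutes` and the count of 200 fast tests are hypothetical.

```python
def worst_case_minutes(run_minutes: float, max_attempts: int) -> float:
    """Upper bound on one test's duration if every attempt runs to completion."""
    return run_minutes * max_attempts


# The slow, flaky test from the quote: 20 minutes per attempt, 3 attempts.
flaky = worst_case_minutes(run_minutes=20, max_attempts=3)

# Every other test takes 30 seconds (0.5 minutes); the count is made up.
fast_tests = [0.5] * 200

# With full parallelism the build finishes when its slowest test finishes.
build_minutes = max([flaky, *fast_tests])

print(flaky)          # 60
print(build_minutes)  # 60
```

Adding more fast tests or more workers does not change `build_minutes`; only capping the runtime or the retries of the slow test does.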

Identifying the critical path in practice¶

From the Canva experience:

First-principles math. Canva's "2 dependent actions × ~10 min each → ~20 min theoretical worst-case" set the target; the observed 3 h meant the critical path was far longer than necessary (see concepts/first-principles-theoretical-limit).
Pipeline audit. Identify which steps depend on which; measure per-step wall-clock time; highlight the longest chain per build. Canva's audit surfaced the expensive integration-test step as the critical path for FE builds, and E2E tests as the critical path ~80% of the time on BE/ML builds.
Single-CPU experiments. A 448-CPU / 6 TB RAM instance still took 18 min cold, because the critical-path action was single-core bound. Throwing hardware at the problem doesn't help if the bottleneck is inside one thread of one action.
Watch for regressions. After the BE/ML pipeline v2 launch, the improvements moved the critical path elsewhere (E2E tests → FE integration tests), requiring the next round of work there.
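
A pipeline audit of the kind described above could be sketched as follows. This is an illustration under stated assumptions, not Canva's audit: the step names, the dependency map, and the five build records are invented, and `longest_chain` and `critical_share` are hypothetical helpers. The idea is to compute the longest chain for each recorded build and report how often a given step sits on it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def longest_chain(minutes, deps):
    """Return the steps on the longest dependent chain of one build."""
    finish, prev = {}, {}
    graph = {step: deps.get(step, set()) for step in minutes}
    for step in TopologicalSorter(graph).static_order():
        blocker = max(deps.get(step, ()), key=lambda d: finish[d], default=None)
        start = finish[blocker] if blocker is not None else 0
        finish[step] = start + minutes[step]
        prev[step] = blocker
    chain, node = [], max(finish, key=finish.get)
    while node is not None:
        chain.append(node)
        node = prev[node]
    return chain[::-1]


def critical_share(builds, deps, step):
    """Fraction of builds in which `step` is on the longest chain."""
    hits = sum(step in longest_chain(build, deps) for build in builds)
    return hits / len(builds)


# Hypothetical step graph: three test steps fan out from one build step.
deps = {"unit": {"build"}, "integration": {"build"}, "e2e": {"build"}}

# Invented per-step wall-clock minutes for five builds.
builds = [
    {"build": 8, "unit": 4, "integration": 12, "e2e": 25},
    {"build": 9, "unit": 4, "integration": 11, "e2e": 22},
    {"build": 8, "unit": 5, "integration": 30, "e2e": 24},
    {"build": 7, "unit": 4, "integration": 12, "e2e": 27},
    {"build": 8, "unit": 4, "integration": 13, "e2e": 21},
]

print(critical_share(builds, deps, "e2e"))          # 0.8
print(critical_share(builds, deps, "integration"))  # 0.2
print(critical_share(builds, deps, "unit"))         # 0.0
```

A step with a high share is the one to attack first; re-running the same report after each change is one way to catch the critical path moving elsewhere.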

The moving-target property¶

Fixing the current critical path uncovers the next one. Canva's trajectory illustrates:

v1: non-hermetic integration tests dominate (50+ min) → TestContainers.
v2: agent warm-up + instance-count tax dominates → i4i.8xlarge, step grouping.
v3: pipeline generation dominates (>10 min) → static generation + S3 hash cache.
v4: E2E tests dominate → CPU-requirement tuning + dedicated pool.
v5: FE integration tests dominate → bazelify + grouping.
v6: test health (slow/flaky tests) dominates → per-test runtime cap + disable-and-fix.
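
The same pattern shows up in a toy simulation. The sketch below assumes a deliberately simplified model: a few independent chains run in parallel, wall-clock time is the longest of them, and each round speeds up only the currently longest chain by a fixed factor. The chain names echo the list above, but the minute values, the 5x factor, and the function `optimization_rounds` are invented for illustration.

```python
def optimization_rounds(chain_minutes, speedup=5.0, rounds=3):
    """Repeatedly speed up the longest chain; report what dominates each round.

    Returns a list of (chain_fixed, minutes_before_fix, wall_clock_after_fix).
    """
    chains = dict(chain_minutes)  # copy so the caller's data is untouched
    history = []
    for _ in range(rounds):
        critical = max(chains, key=chains.get)  # current critical path
        before = chains[critical]
        chains[critical] = before / speedup     # fix only that chain
        history.append((critical, before, max(chains.values())))
    return history


# Invented durations in minutes for three parallel chains.
chains = {"integration_tests": 50, "e2e_tests": 30, "pipeline_generation": 12}

for fixed, before, after in optimization_rounds(chains):
    print(f"fixed {fixed}: {before} min -> wall-clock now {after} min")
# fixed integration_tests: 50 min -> wall-clock now 30 min
# fixed e2e_tests: 30 min -> wall-clock now 12 min
# fixed pipeline_generation: 12 min -> wall-clock now 10.0 min
```

Each 5x local speedup yields a smaller overall gain (50 → 30 → 12 → 10 minutes), because the next-longest chain takes over as the bound as soon as the current one is fixed.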

This is the concepts/first-principles-theoretical-limit loop in action: every round closes part of the gap between the observed build time and the theoretical limit.

Related¶

concepts/first-principles-theoretical-limit — the diagnostic you pair with critical-path analysis.
concepts/tail-latency-at-scale — fanout analogue: P(some host is slow) drives the user-visible tail. CI version: P(some step regresses) drives critical-path time.
concepts/build-graph — the DAG structure the critical path runs on.
concepts/queueing-theory — queue stages along the critical path stack additively.

Seen in¶

sources/2024-12-16-canva-faster-ci-builds — critical path named as the bounding metric for CI time; every iteration targets the current critical-path step.