First-principles theoretical limit¶
Definition¶
First-principles theoretical-limit reasoning asks: given the physics of the hardware and the structure of the workload, what's the floor on wall-clock time, ignoring the current implementation? The answer sets the diagnostic ceiling — the gap between "floor" and "observed" is the opportunity surface.
Distinct from "best practice," "benchmark against peers," or "prior-version +5%." Instead: compute the floor, observe the ceiling, attack the gap.
Canva's application¶
From the Canva CI retrospective:
From first principles, we knew that:
- Modern computers are incredibly fast.
- PR changes are relatively small (a couple hundred lines on average).
- One build or test action shouldn't take more than a few minutes on a modern computer, and the critical path shouldn't have more than 2 long actions dependent on each other.
So, if we assume a few minutes = 10 and multiply that by 2 (2 actions dependent on each other, each taking 10 minutes), we have a theoretical limit of approximately 20 minutes for the worst-case build scenario. However, we had builds taking up to 3 hours! What was causing this massive difference?
That ~10× gap (20 min floor vs. 3 h observed) framed every subsequent investigation.
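The arithmetic in the quote can be made explicit. A minimal sketch, using the numbers from the retrospective (10 minutes per action, 2 dependent actions, 3-hour worst-case builds):

```python
# Numbers taken from the Canva quote; this is back-of-envelope, not a model.
ACTION_MINUTES = 10         # assumed upper bound for one build/test action
CRITICAL_PATH_DEPTH = 2     # longest chain of dependent actions

floor_minutes = ACTION_MINUTES * CRITICAL_PATH_DEPTH  # theoretical floor: 20
observed_minutes = 3 * 60                             # worst observed builds

gap = observed_minutes / floor_minutes
print(f"floor={floor_minutes} min, observed={observed_minutes} min, gap={gap:.0f}x")
```

The point is not the exact constants but that the floor is computed from the workload's structure, so any gap above it is attributable to the implementation.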
Diagnostic loop¶
- Compute the floor. What does the workload have to do? What hardware is available? What's the critical path's theoretical length?
- Measure the ceiling. How long does it actually take, at P50 / P90 / P99?
- Instrument the gap. Where does the wall-clock time go? (I/O, CPU, serialization, queueing, warm-up.) Canva used a 448-CPU / 6 TB RAM instance as a diagnostic instrument to separate "single-CPU-critical-path-bound" from "distributed-system-bound".
- Attack the biggest component. Close some of the gap. Recompute floor and ceiling — they've both probably moved.
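Step 1 of the loop is a longest-path computation over the dependency graph. A minimal sketch, with a hypothetical build graph (action names and durations are illustrative, not Canva's):

```python
from functools import lru_cache

# Hypothetical build graph: action -> (duration_min, dependencies).
graph = {
    "compile": (10, []),
    "test":    (10, ["compile"]),
    "lint":    (3,  []),
    "package": (2,  ["compile"]),
}

def critical_path_minutes(graph):
    """Length of the longest dependency chain = the theoretical floor."""
    @lru_cache(maxsize=None)
    def finish(action):
        duration, deps = graph[action]
        return duration + max((finish(d) for d in deps), default=0)
    return max(finish(a) for a in graph)

floor = critical_path_minutes(graph)   # 20: compile -> test
observed_p99 = 180                     # step 2: measured, illustrative number
print(f"floor={floor} min, P99 gap={observed_p99 / floor:.0f}x")
```

Recomputing the floor after each fix matters because removing one bottleneck often changes which chain is longest.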
Related levers the floor exposes¶
- If the floor is set by a single-thread action, scaling out doesn't help; the fix is to speed up the action itself, or parallelize inside it.
- If the floor is set by data movement, reduce it (BwoB; patterns/build-without-the-bytes).
- If the floor is set by queueing, reduce queue layers (concepts/queueing-theory).
- If the observed time is orders of magnitude above the floor, start with the cheapest structural fix (step consolidation, caching) before deep tuning.
Contrast: benchmark-against-peers¶
The floor framing is stricter than "Netflix does it in 30 min so we should too." Peer benchmarks set a ceiling based on someone else's implementation, not on physics. A team stuck inside a comfortable peer envelope can miss a 10× opportunity that first-principles would surface.
Related¶
- concepts/critical-path — what the floor math is usually computing against.
- concepts/queueing-theory — a frequent source of the observed-to-floor gap.
- concepts/hard-drive-physics — a canonical floor: 120 IOPS/HDD since 2006; S3 designs against this floor, not against peer-benchmark IOPS.
Seen in¶
- sources/2024-12-16-canva-faster-ci-builds — "20 min floor vs 3 h observed" framed the whole multi-year CI project.
- sources/2025-02-25-allthingsdistributed-building-and-operating-s3 — HDD floor (~120 IOPS) is the constraint S3 designs around.