CANVA 2024-12-10 Tier 3

Canva: The science of routing print orders¶

Canva's Print team built a configurable rule-driven routing engine that decides which supplier in the global print network produces each item in a checkout cart, and how orders are split across suppliers. The design separates the three hard parts — build the graph, decide between candidate paths, traverse the graph — into three independently evolvable subsystems with defined contracts, backed by a pre-built per-destination-region graph in Redis. Claimed results: p99 routing latency 50 ms at peak (p50 ~30 ms, as low as 5 ms off peak), 99.999% availability, deterministic "same input → same route" decisions, per-order routing logs capturing every decision for post-hoc debugging.

Key takeaways¶

Three-subsystem split: Build+Retrieve, Decision, Traversal. Each has a narrow contract so any of them can be rewritten without touching the others. Explicit worked example: swapping the Decision engine to randomized ranking requires only a type change; no graph or traverser edits. Inspired by microservices but applied to a single in-process algorithm. (See patterns/pluggable-component-architecture.)
CQRS-shaped data architecture: relational source of truth + async graph projection. The operations team manages suppliers / product capabilities / delivery regions in a relational store ("best for small incremental updates + focused queries"). A reactive async process rebuilds a layered graph (products → production sites → delivery options) on every relevant relational write and publishes it to ElastiCache / Redis. (See patterns/async-projected-read-model, concepts/cqrs.)
Graph is sharded by destination region. One graph per destination region — i.e., the shard key is the read-side query dimension, not a tenant ID. This caps both the Redis retrieval size (6 ms in most regions, up to 20 ms for the largest) and the traversal's V + E. A single world-wide graph would be too big for either. (See concepts/geographic-sharding.)
Decision engine = ordered list of rules + a guaranteed-distinct terminal rule. Each rule independently answers "which of these two paths is better?"; if it can't decide, evaluation falls through to the next. The final rule — shortest total distance from production sites to the customer — is chosen because "it's nearly impossible for 2 locations to be the same distance away down to the centimeter", so it almost always breaks ties. This also ties into lower carbon + faster delivery. Net effect: deterministic "same input → same output" routing with no randomness. (See patterns/deterministic-rule-ordering.)
Filters vs. rules: two-phase constraint separation. Hard constraints (e.g., "supplier can't produce this product") become filters applied during graph preprocessing (edges assigned zero flow, making them invisible to traversal). Soft preferences (e.g., "prefer sites with capacity") become rules in the decision engine, generous with equality until something must differentiate. Keeps the runtime rule evaluator simple.
Min-cost-flow traversal with path comparison instead of numeric cost. Canva uses a successive shortest path-inspired min-cost-flow solver (O(V·E) time), modified to rank candidate paths via the Decision engine instead of computing numerical costs. Source + sink nodes balance the graph; capacity-1 edges enforce "traverse once"; zero-flow edges disappear. (See concepts/min-cost-flow.)
Regrets: bounded backtracking in a forward-only traverser. The traverser can make a locally-best choice that later forces a worse global outcome (Sydney can print business cards but can't also print stickers → should have picked Melbourne). To handle this, at each step the unchanged active path is also fed into the ranker; if it "wins" against the proposed forward step, the path has degraded and the second-best option is saved as a regret. Regrets then compete against every future proposed step; winning a ranking triggers regret-revisiting (new active path + recursive sibling-regret bookkeeping). Dead-end edges are remembered as zero-capacity.
Pre-rank to avoid regrets. An important optimization: at each decision point, prefer nodes with the fewest paths forward first. Example: t-shirts can only be printed in Brisbane, so the traversal should route t-shirts before business cards. Locks in the most constrained decisions early → fewer regret cycles, bounded execution time on any graph size.
Routing log = per-traversal structured log, stored in blob store with expiry, ID carried on the order. Every loop iteration writes an action object (state, candidates, decision engine results, action taken); the full run is packaged into a routing log, stamped with an ID, and stored async in KV blob storage with auto-expiry. The log ID is attached to the order, so any "why was this order sent there?" question is answered by replaying a specific log. (See patterns/explainability-log.)
Measured outcomes (user-visible SLOs).
- p99 routing time: 50 ms at peak, 5 ms off peak.
- p50 routing time: ≤30 ms.
- Availability of routing data (ElastiCache / Redis + read replicas): 99.999%.
- "Always produces results, fast enough to display before the user clicks Pay" is explicitly called out as the user-visible success metric.

Architectural numbers¶

Dimension	Value
Graph retrieval (most regions, Redis)	~6 ms
Graph retrieval (largest shards)	~20 ms
Routing latency p50	~30 ms
Routing latency p99 (peak)	~50 ms
Routing latency p99 (off-peak)	~5 ms
Availability	99.999%
Traversal algorithm complexity	O(V·E) — successive shortest path

Caveats¶

The post doesn't give actual graph sizes (nodes / edges per regional shard) or supplier counts, so "graph is large" is qualitative.
The O(V·E) claim is for the successive-shortest-path-inspired inner loop — worst-case with many regrets isn't characterized explicitly.
The Decision engine is described as ordered rules + a terminal tie-breaker, but the actual rule count and priority order per region are not disclosed.
This is a single source describing a system Canva built itself; no cross-references to other companies yet. Treat claims like "99.999% availability" and "50 ms p99" as self-reported.
Article frames "regret" terminology as their own humanization; min-cost-flow literature would call this something like reduced-cost backtracking or path cancellation. Keep the link but don't assume standard vocabulary.

Raw¶

Raw file: raw/canva/2024-12-10-the-science-of-routing-print-orders-279f79a2.md
Original URL: https://www.canva.dev/blog/engineering/the-science-of-routing-print-orders/