SYSTEM Cited by 1 source
Canva Print Routing Engine¶
Canva Print Routing Engine decides, for every Canva Print checkout, which suppliers in Canva's global print network produce which items in the cart and how shipments are grouped. Design goals: deterministic results (same input → same route), fast enough to run pre-checkout (p99 ~50 ms peak), explainable after-the-fact ("why did this order ship from Melbourne?"), and independently evolvable subsystems so routing logic, data model, or traversal algorithm can change without rewriting the others.
Subsystem split¶
Three subsystems, strict contracts between them (see patterns/pluggable-component-architecture):
- Build and Retrieve — owns the graph. Listens for relevant writes in the operational relational store; reactively rebuilds a layered graph (products → production sites → delivery options); publishes the graph to ElastiCache/Redis, keyed by destination-region shard.
- Decision — owns ranking. Two operations:
better(pathA, pathB)andrank([paths]). Internally a list of ordered rules + a distance-based terminal tie-breaker (see patterns/deterministic-rule-ordering). - Traversal — owns pathfinding. Modified successive-shortest-path min-cost-flow over the retrieved graph, using the Decision engine for all comparisons (see concepts/min-cost-flow).
Each subsystem can be replaced in isolation. A/B testing a new Decision engine or a new Traversal algorithm is a type swap.
Data architecture — CQRS-shaped¶
Operations team needs: small-and-frequent updates over a complex supplier / product / region entity model → relational store fits. Routing needs: large-read, rarely-changing, forward-directed graph → pre-built graph fits.
Canva uses both (see patterns/async-projected-read-model, concepts/cqrs):
Relational DB (operations — source of truth)
│
│ async reactive pipeline (rebuild on relevant writes)
▼
Graph (per destination region)
│
▼
ElastiCache / Redis ←── routing service reads here
Source-of-truth benefit: if the graph is lost or corrupted, rebuild from the relational store.
Graph structure¶
Layered / feedforward:
- Source / sink nodes balance the graph so traversal is a simple forward march ending at the sink.
- Edges have flow limits. Undesired edges get flow = 0 and become invisible (this is how filters are applied — filters run during graph preprocessing, before the Decision engine ever sees the graph). Traverse-once edges get flow = 1. Unlimited edges get a large constant.
- A synthetic sink→source edge with flow =
maxFlow − 1signals "traversal done" when no forward move exists and you're at the sink.
Sharding — by destination region¶
Shard key = order destination region. - Retrieval: ~6 ms in most regions, ~20 ms for largest shards. - Smaller regional graphs = smaller V and E = faster traversal with fewer backtracks and dead ends. (See concepts/geographic-sharding.)
Traversal¶
Modified successive-shortest-path min-cost-flow solver,
O(V·E) time:
- At each step the traverser has an active path and a list of candidate forward extensions.
- Decision engine ranks all candidates; top-ranked becomes the new active path.
- Regrets. The unchanged active path is also fed to the ranker. If it outranks the best forward option, the active path has degraded — save the best alternative as a regret. Regrets compete against forward moves at every subsequent step; if a regret wins, it becomes the new active path and sibling regrets are recomputed recursively.
- Dead ends are marked zero-capacity so they're never revisited.
- Pre-rank by fewest-paths-forward first — locks in constrained decisions early and minimizes regret cycles. Example: t-shirts only producible in Brisbane → route t-shirts before business cards.
Explainability — routing logs¶
(See patterns/explainability-log.)
- Every main-loop iteration writes an
actionobject (candidates, decision results, action taken, current state). - Full traversal output is packaged into a routing log with traversal context + visited nodes.
- Stamped with an ID, stored async in KV blob storage with automatic expiry.
- Log ID is attached to the resulting order so any investigation starts by loading one log.
Outcomes¶
| Metric | Value |
|---|---|
| Graph retrieval | 6 ms typical, 20 ms max |
| Routing p50 | ~30 ms |
| Routing p99 peak | ~50 ms |
| Routing p99 off-peak | ~5 ms |
| Availability | 99.999% (ElastiCache + Redis + read replicas) |
| Algorithm | O(V·E) min-cost-flow (successive shortest path inspired) |
Caveats¶
- Self-reported single-source data; no external replication.
- No disclosed numbers for graph size (V, E) per regional shard or supplier count.
- Worst-case regret cost isn't characterized in the post.
- "Regret" / "regret-revisiting" is Canva's terminology; conceptually related to reduced-cost backtracking in standard min-cost-flow literature.
Seen in¶
- sources/2024-12-10-canva-routing-print-orders — architectural retrospective, the only source describing this system.