
PATTERN Cited by 1 source

Centralized routing proxy for ML serving

Centralized routing proxy for ML serving is the bootstrap-era pattern where one mandatory proxy fronts all model-inference traffic and bundles routing, model selection, A/B-test allocation, canary deployment, shadow mode, and lifecycle management into a single service that every client must call. Netflix's Switchboard is the canonical wiki example.

The pattern is the ML-serving analogue of a classical in-path API gateway, with one additional responsibility: integrating with the experimentation platform to resolve "for this userId, which model variant should run for this Objective?" — something a generic API gateway (AWS API Gateway, Kong, standalone service mesh) is not designed for.
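
As a rough illustration of that extra responsibility, the resolution step might look something like the sketch below. The experimentation-platform call, rule shapes, and names are assumptions for illustration, not Switchboard's actual interfaces.

```python
# Hypothetical sketch: the one responsibility a generic gateway lacks is
# resolving "for this userId, which model variant should run for this
# Objective?". Names and rule shapes below are illustrative assumptions.
import hashlib
from dataclasses import dataclass

@dataclass
class Variant:
    model_name: str
    model_version: str

def allocate_cell(user_id: str, test_name: str) -> str:
    """Stand-in for the experimentation platform: userId -> A/B cell."""
    h = int(hashlib.sha256(f"{test_name}:{user_id}".encode()).hexdigest(), 16)
    return "cell_b" if h % 2 else "cell_a"

# Proxy-owned routing rules: (Objective, A/B cell) -> model variant.
RULES = {
    ("rank-homepage-rows", "cell_a"): Variant("row_ranker", "v41"),
    ("rank-homepage-rows", "cell_b"): Variant("row_ranker", "v42"),
}

def resolve(objective: str, user_id: str) -> Variant:
    cell = allocate_cell(user_id, test_name=f"{objective}-ab")
    return RULES[(objective, cell)]

print(resolve("rank-homepage-rows", user_id="user-123"))
```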

When this pattern is the right call

  • Bootstrap phase. Early on, iteration velocity for researcher configuration and cross-cutting routing logic (A/B, canary, shadow, rollback) matters more than the last millisecond of latency. Centralizing it in one service keeps the interaction simple for both clients (single integration point) and model owners (single place to express rules).
  • Opaque model iteration is the primary goal. Clients don't want to know which model ran; they want the Objective answered. A proxy that owns Objective → model resolution enforces this separation (see the rule sketch after this list).
  • First-class experimentation integration is required. Off-the-shelf API gateways don't resolve A/B-cell → model. A custom proxy does.
  • gRPC with rich domain context. The post is explicit: "standard out-of-the-box API Gateway solutions did not meet all our requirements. In particular, we needed … the ability to expose gRPC endpoints to clients, and the ability to use rich domain-specific context for routing customizations, which generic proxies were not designed to handle."
  • Lifecycle stages are load-bearing. Shadow mode, canaries, rollbacks all need model-aware routing semantics that generic proxies don't model.
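
To make the bullets above concrete, a rule a model owner might publish to the proxy could look roughly like this; field names and structure are illustrative assumptions, not the actual Switchboard Rules schema.

```python
# Illustrative only: one declarative rule per Objective, owned by the proxy,
# bundling default routing, canary, shadow, and A/B-test overrides.
RULE = {
    "objective": "rank-homepage-rows",
    "default": {"model": "row_ranker", "version": "v41"},
    # Canary: shift a small, adjustable percentage of live traffic.
    "canary": {"model": "row_ranker", "version": "v42", "percent": 5},
    # Shadow: also send the request here, but never return its response.
    "shadow": [{"model": "row_ranker", "version": "v43"}],
    # Experimentation: A/B cells resolved per userId override the default.
    "ab_test": {
        "test_name": "row-ranker-holdback",
        "cells": {"cell_a": {"version": "v41"}, "cell_b": {"version": "v42"}},
    },
    # Rollback is a config change: republish with "canary" removed or percent 0.
}
```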

Typical capabilities

From the 2026-05-01 post (Switchboard's four named capabilities):

  1. Common client abstraction — single point of contact for all clients; central rate limits; central concurrency limits.
  2. Context-aware routing — device, locale, surface type, A/B cell, domain context.
  3. Dynamic traffic splitting — canary rollouts + experimentation; real-time percent-of-traffic shifts.
  4. Model versioning and lifecycle management — shadow mode (route traffic to a new version without affecting UX), instant rollback, version coexistence (see the sketch after this list).
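
A minimal sketch of how capabilities 3 and 4 might combine inside the proxy, assuming a stable hash bucket for splits and a fire-and-forget shadow fan-out; the names and numbers are illustrative, not Switchboard internals.

```python
import hashlib

CONFIG = {
    "stable": "row_ranker:v41",
    "canary": "row_ranker:v42",
    "canary_percent": 5,          # adjustable in real time via a config push
    "shadow": "row_ranker:v43",   # receives traffic but never affects UX
}

def bucket(request_id: str) -> int:
    """Stable 0-99 bucket so a given request always lands on the same side of a split."""
    return int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100

def route(request_id: str, cfg: dict) -> dict:
    primary = cfg["canary"] if bucket(request_id) < cfg["canary_percent"] else cfg["stable"]
    return {
        "primary": primary,                                       # response returned to the client
        "shadow": [cfg["shadow"]] if cfg.get("shadow") else [],   # fire-and-forget copies
    }

print(route("req-0001", CONFIG))
```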

Known scaling pains

The same post catalogs three pains at 1M req/sec across 30+ client services:

  1. Single point of failure — a proxy outage cascades to all clients.
  2. Serialization tax — 10–20 ms per request for Switchboard, because the proxy must deserialize and re-serialize every payload; the extra hop also amplifies tail latency (see the toy sketch after this list).
  3. Weak tenant isolation — all tenants share one routing cluster, so failures can cascade across tenants, and separating real from artificial traffic in logs becomes a costly retrofit (MLOps overhead).
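
A toy illustration of the serialization tax, with JSON standing in for the gRPC/protobuf payloads: because routing context lives inside the body, an in-path proxy pays a full decode plus re-encode on every request. The 10–20 ms figure is Netflix's measurement at their payload sizes, not something this sketch reproduces.

```python
import json, time

# ~50k ints makes the decode/encode cost visible; real payloads are protobuf.
payload = json.dumps({"features": list(range(50_000)), "context": {"locale": "en-US"}})

start = time.perf_counter()
for _ in range(100):
    body = json.loads(payload)        # decode just to read routing context
    _locale = body["context"]["locale"]
    rewrapped = json.dumps(body)      # re-encode before forwarding upstream
elapsed_ms = (time.perf_counter() - start) / 100 * 1000
print(f"~{elapsed_ms:.2f} ms of pure (de)serialization per request for this toy payload")
```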

When to graduate away

The successor pattern on this wiki is patterns/separate-routing-from-model-selection, which splits the proxy's responsibilities: a metadata resolver (Lightbulb-class) produces a routingKey plus config, and a data-plane proxy (Envoy-class) does the actual routing (sketched after the list below). Graduate when:

  • Proxy latency becomes material to client SLAs.
  • Proxy outages are a recognized blast-radius risk.
  • Payload sizes grow to the point where deserialization is the hot path.
  • Client services proliferate beyond what one shared proxy can safely serve.
  • A service-mesh data plane (Envoy, Istio) already covers all hops, making "piggyback routing on it" cheaper than operating a dedicated in-path proxy fleet.
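
For contrast, a minimal sketch of the successor shape: a metadata resolver hands back a cheap routingKey, and the data plane routes on that key alone, forwarding the payload opaquely. All names, the table shape, and the transport stub are assumptions.

```python
# Hypothetical table maintained by the control plane: routingKey -> VIP.
ROUTING_TABLE = {"rank-homepage-rows/row_ranker/v42": "vip://row-ranker-v42"}

def resolve_routing_key(objective: str, user_id: str) -> str:
    # Metadata resolver (Lightbulb-class): experimentation and lifecycle rules
    # collapse into a cheap key; the experimentation lookup is elided here.
    return f"{objective}/row_ranker/v42"

def send(upstream: str, raw_body: bytes) -> bytes:
    return b"ok from " + upstream.encode()  # stub for the actual network hop

def data_plane_forward(routing_key: str, raw_body: bytes) -> bytes:
    # Data-plane proxy (Envoy-class): picks an upstream from the key alone and
    # forwards the body opaquely, so there is no serialization tax.
    return send(ROUTING_TABLE[routing_key], raw_body)

key = resolve_routing_key("rank-homepage-rows", "user-123")
print(data_plane_forward(key, b"<opaque protobuf bytes>"))
```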

Substrate requirements

  • Custom proxy (typically gRPC), since off-the-shelf gateways don't meet the requirements.
  • Experimentation platform callable per request (or per-session cached) to resolve A/B cell allocation.
  • Pub/sub config substrate (e.g. Netflix's Gutenberg) for rule publication; see patterns/config-separated-from-code-via-pubsub.
  • Model-to-shard control plane that tracks which model lives on which VIP and refreshes the proxy's routing table as mappings change (see the sketch after this list).
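
A sketch of the last two requirements wired together, assuming a generic publish/subscribe callback rather than Gutenberg's actual API: the proxy holds an in-memory routing table and swaps it atomically whenever a new model-to-VIP mapping is published.

```python
import threading

class RoutingTable:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._table: dict[str, str] = {}

    def swap(self, new_table: dict[str, str]) -> None:
        with self._lock:                      # atomic replace, no partial reads
            self._table = dict(new_table)

    def lookup(self, model_key: str) -> str:
        with self._lock:
            return self._table[model_key]

table = RoutingTable()

def on_config_published(payload: dict) -> None:
    """Callback registered with the pub/sub client for the routing-rules topic."""
    table.swap(payload["model_to_vip"])

# Simulate a publish from the control plane:
on_config_published({"model_to_vip": {"row_ranker:v42": "vip://row-ranker-v42"}})
print(table.lookup("row_ranker:v42"))
```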

Caveats

  • Treat the proxy as a Tier-0 dependency from day 1. Its blast radius is the full list of client services that depend on it.
  • Budget for the tax. 10–20ms is Netflix's number at scale; your mileage varies with payload size. Measure.
  • Make sure tenant logging is first-class from day 1. Netflix's retrofit problem — real-vs-artificial traffic separation — was introduced by the shared routing cluster (see the tagging sketch after this list).
  • Plan the graduation path. The proxy has value in the bootstrap phase, but it will hit scaling limits. Design the config substrate (Switchboard Rules) so it survives the architecture evolution.
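
A sketch of what first-class tenant and traffic-type logging could look like, so real and artificial (shadow/canary) traffic stays separable without a retrofit; field names are assumptions.

```python
import json, logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("routing-proxy")

def log_request(tenant: str, objective: str, traffic_type: str, model: str) -> None:
    # traffic_type in {"live", "shadow", "canary"} keeps artificial traffic
    # distinguishable in logs without post-hoc joins.
    log.info(json.dumps({
        "tenant": tenant,
        "objective": objective,
        "traffic_type": traffic_type,
        "model": model,
    }))

log_request("homepage-service", "rank-homepage-rows", "shadow", "row_ranker:v43")
```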

Seen in

  • sources/2026-05-01-netflix-state-of-routing-in-model-serving — canonical wiki instance (Switchboard). 1M req/sec, 30+ client services, Objective-addressable routing, A/B + canary + shadow + rollback bundled. The post also documents the graduation to the Lightbulb + Envoy split and the three pains that motivated it, so this pattern is canonicalized alongside its successor.