Model-per-Endpoint Isolation Trade-off¶
Definition¶
A production-ML serving decision: give each model its own endpoint with its own dedicated instance(s) (one-endpoint-per-model) versus multiplexing many models onto shared instances (one big instance, or a multi-model endpoint). The two shapes have different cost curves and different isolation and flexibility properties; the choice is a deliberate trade-off, not a best practice.
The two shapes¶
Shared-instance serving (legacy)¶
All models are loaded into the same process/instance. To scale capacity, you add replicas of the same large instance. This is cost-efficient, since the instance is densely packed, but:
- All models must share the same framework / runtime.
- Resource contention: a flood to one model affects others on the same instance (concepts/noisy-neighbor).
- Scaling adds capacity for every model (you can't scale one model independently).
- A model rollout / rollback touches all other models' availability.
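The shared-instance shape can be sketched as one process holding every model and one shared capacity pool. This is a toy illustration, not Zalando's code; the model names, stand-in predictors, and capacity figure are hypothetical:

```python
# Toy sketch of shared-instance serving: every model lives in one process,
# sharing one runtime, one request budget, and one scaling unit.
# Model names and capacities are hypothetical illustrations.

class SharedInstance:
    def __init__(self, models, capacity_rps):
        self.models = models              # all models loaded into one process
        self.capacity_rps = capacity_rps  # one shared request budget

    def predict(self, model_name, features):
        return self.models[model_name](features)

    def scale_out(self, replicas):
        # Scaling replicates the WHOLE instance: every model gets more
        # capacity, whether it needs it or not.
        return replicas * self.capacity_rps


instance = SharedInstance(
    models={
        "ranker": lambda x: sum(x),       # stand-in for one model's predictor
        "recommender": lambda x: max(x),  # stand-in for another
    },
    capacity_rps=1000,
)

print(instance.predict("ranker", [1, 2, 3]))  # 6
print(instance.scale_out(replicas=3))         # 3000: capacity added for ALL models
```

Note that `scale_out` has no per-model argument: in this shape there is no way to add capacity for `ranker` alone, which is exactly the limitation the bullets above describe.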
One-endpoint-per-model serving¶
Each model has its own endpoint with its own dedicated instance(s). You pay for N fleets instead of 1 (more idle capacity). But:
- Each model picks its own tech stack / framework.
- Traffic isolation is hard-enforced by the OS / instance boundary.
- Each model scales independently on its own autoscaling policy.
- Rollouts are per-model; blast radius is one model.
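The independent-scaling property can be sketched as each endpoint sizing its own fleet from its own traffic, with no coupling between models. A minimal sketch; the traffic figures, per-instance capacity, and sizing rule are hypothetical, not any managed service's actual policy:

```python
import math

# Toy sketch of one-endpoint-per-model scaling: each endpoint computes its
# own desired fleet size from its own traffic, independently of every other
# model. Traffic figures and per-instance capacity are hypothetical.

def desired_instances(rps, per_instance_rps, min_instances=1):
    """Target-tracking-style sizing: enough instances for current load,
    never below a per-endpoint floor."""
    return max(min_instances, math.ceil(rps / per_instance_rps))

traffic = {"ranker": 4200, "recommender": 150, "fraud": 30}

fleet = {model: desired_instances(rps, per_instance_rps=500)
         for model, rps in traffic.items()}

print(fleet)  # {'ranker': 9, 'recommender': 1, 'fraud': 1}
```

A flood on `ranker` changes only `ranker`'s fleet; the other two entries are computed from their own traffic alone, which is the isolation property the bullets above name.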
Zalando's verbatim trade-off¶
From sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference:
"Based on our estimates the cost of serving our models will increase significantly after the migration. We anticipate the increase by up to 200%. The main reason behind it is cost efficiency of the legacy system, where all the models are served from one big instance (multiplied for scaling). In the new system every model gets a separate instance(s).
Still, this is a cost increase that we are willing to accept for the following reasons:
- Model flexibility. Having a separate instance per model means every model can use a different technology stack or framework for serving.
- Isolation. Every model's traffic is separated, meaning we can scale each model individually, and flooding one model with requests doesn't affect other models.
- Use of managed services instead of maintaining a custom solution."
The three named benefits that justify the up-to-200% cost increase:
- Framework plurality per model — XGBoost next to PyTorch next to TensorFlow, each in its own endpoint, chosen freely by each model's owner.
- Traffic isolation — the canonical concepts/noisy-neighbor prevention. Sales-event load on one model does not crowd out another model's SLO.
- Independent scaling — each model's endpoint autoscales to its own traffic pattern; shared-instance serving forces one autoscale policy for all.
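Why the per-model shape costs more falls out of simple arithmetic: a shared fleet is sized to the aggregate load, while N separate fleets each pay a minimum-capacity floor and round up independently. A toy comparison with entirely hypothetical numbers (not Zalando's), which lands in the same range as their projection:

```python
import math

# Toy cost comparison (all numbers hypothetical): shared serving packs the
# aggregate load densely; per-model endpoints each pay an idle-capacity floor.

PER_INSTANCE_RPS = 500
MIN_PER_ENDPOINT = 2   # e.g. two instances per endpoint for availability

traffic = {"ranker": 900, "recommender": 120, "fraud": 40, "pricing": 60}

# Shared instance: one fleet sized to the sum of all traffic.
shared = max(MIN_PER_ENDPOINT,
             math.ceil(sum(traffic.values()) / PER_INSTANCE_RPS))

# Per-model endpoints: each fleet sized (and floored) independently.
per_model = sum(max(MIN_PER_ENDPOINT, math.ceil(rps / PER_INSTANCE_RPS))
                for rps in traffic.values())

print(shared, per_model)  # 3 8
print(f"increase: {100 * (per_model - shared) / shared:.0f}%")  # increase: 167%
```

The gap comes from two places: per-endpoint floors (low-traffic models like `fraud` still hold their minimum fleet) and per-endpoint rounding (each fleet rounds its load up separately instead of pooling the fractions).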
When to pick which¶
- Pick shared-instance when: models are cheap, homogeneous (same framework), low-criticality, low-throughput each, cost pressure outweighs isolation needs.
- Pick one-endpoint-per-model when: models diverge in framework, traffic is bursty / asymmetric across models, per-model SLOs exist, per-model release cadence matters, regulatory boundary needs hard isolation.
Third shape: Multi-Model Endpoints (post-2020)¶
SageMaker Multi-Model Endpoints (MME) are a middle ground — one endpoint, many models multiplexed by artefact load-on-demand on the same instance fleet. They get shared-instance cost efficiency with some of the per-model benefits of separate endpoints (independent artefact lifecycle). They still require homogeneous framework containers and share the instance's compute/memory budget. Zalando's 2021 post predates wide MME adoption and doesn't discuss them — a newer migration would likely consider MME as a third option.
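The load-on-demand mechanic of a multi-model endpoint can be sketched as an LRU cache of model artefacts on one shared instance: invoking a model loads it if absent, evicting the least-recently-used artefact when the memory budget is full. A toy sketch of the mechanic only; the class, loader, and artefact names are hypothetical stand-ins, not the SageMaker implementation:

```python
from collections import OrderedDict

# Toy sketch of the multi-model-endpoint mechanic: many artefacts share one
# instance, loaded on demand and evicted LRU-style when the budget is full.
# The loader and artefact names are hypothetical, not SageMaker code.

class MultiModelEndpoint:
    def __init__(self, load_fn, max_loaded=2):
        self.load_fn = load_fn
        self.max_loaded = max_loaded  # shared memory budget, in model slots
        self.cache = OrderedDict()    # artefact name -> loaded model

    def invoke(self, target_model, features):
        if target_model not in self.cache:
            if len(self.cache) >= self.max_loaded:
                self.cache.popitem(last=False)  # evict least-recently-used
            self.cache[target_model] = self.load_fn(target_model)
        self.cache.move_to_end(target_model)    # mark as recently used
        return self.cache[target_model](features)


mme = MultiModelEndpoint(load_fn=lambda name: (lambda x: f"{name}:{sum(x)}"))
print(mme.invoke("ranker.tar.gz", [1, 2]))   # ranker.tar.gz:3
print(mme.invoke("fraud.tar.gz", [5]))       # fraud.tar.gz:5
print(mme.invoke("pricing.tar.gz", [0]))     # evicts ranker.tar.gz
print(list(mme.cache))                       # ['fraud.tar.gz', 'pricing.tar.gz']
```

The sketch makes the middle-ground nature visible: artefact lifecycle is per-model (each loads and unloads on its own), but compute and memory remain one shared budget, which is why a cold model pays a load latency and a hot neighbour can still push it out.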
Seen in¶
- sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference — canonical instance. Zalando explicitly projects a 200% serving cost increase moving from shared-instance to per-model endpoint, enumerates three named benefits, and accepts the cost. Also names "use of managed services instead of maintaining a custom solution" as an independent reason, making the trade-off explicit rather than implicit.