Amazon SageMaker Endpoint¶
Amazon SageMaker Endpoint is AWS's managed model-serving surface within the SageMaker AI family. Two deployment shapes:
- Serverless inference endpoints — scale-to-zero, per-request billing; no GPU support; 6 GB memory cap.
- Serverful (real-time) inference endpoints — provisioned instances from a chosen instance family (`ml.m*`, `ml.c*`, `ml.g*` GPU, `ml.p*` GPU, etc.) with auto-scaling policies.
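The two shapes differ only in the production-variant block of the endpoint config. A minimal sketch of both payloads as they would be passed to boto3's `create_endpoint_config` — model and endpoint names here are hypothetical, and the concurrency/instance values are illustrative:

```python
# Sketch: the two SageMaker endpoint-config shapes as boto3
# create_endpoint_config payloads. Names and capacities are assumptions.

def serverless_config(model_name: str) -> dict:
    # Serverless: scale-to-zero, per-request billing; memory is capped
    # at 6 GB (6144 MB) and no GPU instance types are available.
    return {
        "EndpointConfigName": f"{model_name}-serverless",
        "ProductionVariants": [{
            "ModelName": model_name,
            "VariantName": "AllTraffic",
            "ServerlessConfig": {
                "MemorySizeInMB": 6144,  # hard ceiling on Serverless
                "MaxConcurrency": 20,
            },
        }],
    }

def realtime_config(model_name: str, instance_type: str = "ml.g6.xlarge") -> dict:
    # Serverful (real-time): provisioned instances from a chosen family;
    # GPU families (ml.g*, ml.p*) are allowed and auto-scaling can attach.
    return {
        "EndpointConfigName": f"{model_name}-realtime",
        "ProductionVariants": [{
            "ModelName": model_name,
            "VariantName": "AllTraffic",
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
        }],
    }

# With live credentials, either payload would go to:
#   sm = boto3.client("sagemaker")
#   sm.create_endpoint_config(**serverless_config("my-model"))
```

Building the payloads as plain dicts keeps the shape comparison inspectable without an AWS session.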
Stub — expand as dedicated sources arrive.
Serverless-to-Serverful pivot at production scale¶
Canonical wiki instance: sources/2026-04-01-aws-automate-safety-monitoring-with-computer-vision-and-generative-ai started on SageMaker Serverless inference endpoints with approximately 50 cameras. Two hard ceilings forced the pivot at hundreds of sites:
- No GPU support on Serverless — blocks most modern CV / LLM workloads.
- 6 GB maximum memory configuration on Serverless — caused out-of-memory errors at production image volumes.
Fix: migrate to Serverful endpoints on ml.g6-family GPU instances with auto-scaling policies, paired with AWS service-team collaboration to raise Lambda concurrent-execution limits, memory-allocation and multithreading optimisations, and SQS batch-size tuning.
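Auto-scaling for a Serverful endpoint variant goes through Application Auto Scaling: register the variant's `DesiredInstanceCount` as a scalable target, then attach a target-tracking policy on invocations per instance. A sketch of the two request payloads — endpoint name, capacity bounds, and the target value are illustrative assumptions, not figures from the source:

```python
# Sketch: target-tracking auto-scaling for a real-time SageMaker endpoint
# variant, expressed as Application Auto Scaling request payloads.
# Endpoint/variant names and numeric targets are assumptions.

def scaling_requests(endpoint: str, variant: str = "AllTraffic",
                     min_instances: int = 1, max_instances: int = 8,
                     invocations_per_instance: float = 100.0):
    resource_id = f"endpoint/{endpoint}/variant/{variant}"
    register = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    policy = {
        "PolicyName": f"{endpoint}-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
        },
    }
    return register, policy

# With live credentials:
#   aas = boto3.client("application-autoscaling")
#   reg, pol = scaling_requests("safety-cv-endpoint")
#   aas.register_scalable_target(**reg)
#   aas.put_scaling_policy(**pol)
```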
This is the first architecturally detailed wiki reference for the Serverless → Serverful pivot — a general pattern for ML-serving teams whose PoCs fit Serverless but whose production traffic does not.
Seen in¶
- sources/2026-04-01-aws-automate-safety-monitoring-with-computer-vision-and-generative-ai — per-use-case Serverful endpoints on ml.g6 with auto-scaling under the patterns/serverless-driver-worker pattern.
- sources/2026-04-21-figma-the-infrastructure-behind-ai-search-in-figma — batched embedding inference role: SageMaker hosts CLIP; embedding requests are sent in batches, with a list of thumbnail URLs as input and a list of embeddings as output, one per input image. Inside the container, image download, resize, and normalise are parallelised. Batch-size sweet spot — "past some threshold we started to see latency growing linearly with batch size, instead of a sublinear batching effect" — i.e. inference throughput plateaus past an optimal batch size. Stage 2 of Figma's four-stage discrete-job pipeline.
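The sweet-spot behaviour Figma describes can be sketched with a synthetic latency model: below a knee, batching amortises fixed per-request overhead (sublinear, so per-item latency falls); past it, total latency grows linearly and per-item latency rises again. The knee and latency constants below are invented for illustration — real values come from load tests against the endpoint:

```python
# Sketch of the batch-size sweet spot: find the batch size minimising
# per-item latency under a synthetic sublinear-then-linear latency model.
# All constants (knee=32, 50 ms overhead, 4 ms/item) are assumptions.

def simulated_latency_ms(batch_size: int, knee: int = 32) -> float:
    fixed = 50.0      # per-request overhead, amortised by batching
    per_item = 4.0
    if batch_size <= knee:
        return fixed + per_item * batch_size
    # past the knee, extra items effectively serialise: linear growth
    return fixed + per_item * knee + 2 * per_item * (batch_size - knee)

def best_batch_size(candidates) -> int:
    # sweet spot = batch size with the lowest latency per item
    return min(candidates, key=lambda b: simulated_latency_ms(b) / b)

print(best_batch_size([1, 2, 4, 8, 16, 32, 64, 128]))  # → 32
```

In practice the same sweep is run against the live endpoint, plotting latency per item against batch size and picking the knee.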