Amazon SageMaker Endpoint¶
Amazon SageMaker Endpoint is AWS's managed model-serving surface within the SageMaker AI family. Two deployment shapes:
- Serverless inference endpoints — scale-to-zero, per-request billing; no GPU support; 6 GB memory cap.
- Serverful (real-time) inference endpoints — provisioned instances from a chosen instance family (`ml.m*`, `ml.c*`, `ml.g*` GPU, `ml.p*` GPU, etc.) with auto-scaling policies.
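The two shapes differ only in the production-variant block of the endpoint config. A minimal sketch of both payloads as they would be passed to boto3's `create_endpoint_config` — model and endpoint names here are hypothetical, and the concurrency/instance values are illustrative:

```python
# Sketch: the two SageMaker endpoint-config shapes as boto3
# create_endpoint_config payloads. Names and capacities are assumptions.

def serverless_config(model_name: str) -> dict:
    # Serverless: scale-to-zero, per-request billing; memory is capped
    # at 6 GB (6144 MB) and no GPU instance types are available.
    return {
        "EndpointConfigName": f"{model_name}-serverless",
        "ProductionVariants": [{
            "ModelName": model_name,
            "VariantName": "AllTraffic",
            "ServerlessConfig": {
                "MemorySizeInMB": 6144,  # hard ceiling on Serverless
                "MaxConcurrency": 20,
            },
        }],
    }

def realtime_config(model_name: str, instance_type: str = "ml.g6.xlarge") -> dict:
    # Serverful (real-time): provisioned instances from a chosen family;
    # GPU families (ml.g*, ml.p*) are allowed and auto-scaling can attach.
    return {
        "EndpointConfigName": f"{model_name}-realtime",
        "ProductionVariants": [{
            "ModelName": model_name,
            "VariantName": "AllTraffic",
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
        }],
    }

# With live credentials, either payload would go to:
#   sm = boto3.client("sagemaker")
#   sm.create_endpoint_config(**serverless_config("my-model"))
```

Building the payloads as plain dicts keeps the shape comparison inspectable without an AWS session.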
Stub — expand as dedicated sources arrive.
Serverless-to-Serverful pivot at production scale¶
Canonical wiki instance: sources/2026-04-01-aws-automate-safety-monitoring-with-computer-vision-and-generative-ai started on SageMaker Serverless inference endpoints with approximately 50 cameras. Two hard ceilings forced the pivot at hundreds of sites:
- No GPU support on Serverless — blocks most modern CV / LLM workloads.
- 6 GB maximum memory configuration on Serverless — caused out-of-memory errors at production image volumes.
Fix: migrate to Serverful endpoints on ml.g6-family GPU instances with auto-scaling policies, paired with AWS service-team collaboration to raise Lambda concurrent-execution limits, memory-allocation and multithreading optimisations, and SQS batch-size tuning.
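Auto-scaling for a Serverful endpoint variant goes through Application Auto Scaling: register the variant's `DesiredInstanceCount` as a scalable target, then attach a target-tracking policy on invocations per instance. A sketch of the two request payloads — endpoint name, capacity bounds, and the target value are illustrative assumptions, not figures from the source:

```python
# Sketch: target-tracking auto-scaling for a real-time SageMaker endpoint
# variant, expressed as Application Auto Scaling request payloads.
# Endpoint/variant names and numeric targets are assumptions.

def scaling_requests(endpoint: str, variant: str = "AllTraffic",
                     min_instances: int = 1, max_instances: int = 8,
                     invocations_per_instance: float = 100.0):
    resource_id = f"endpoint/{endpoint}/variant/{variant}"
    register = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    policy = {
        "PolicyName": f"{endpoint}-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
        },
    }
    return register, policy

# With live credentials:
#   aas = boto3.client("application-autoscaling")
#   reg, pol = scaling_requests("safety-cv-endpoint")
#   aas.register_scalable_target(**reg)
#   aas.put_scaling_policy(**pol)
```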
This is the first architecturally detailed wiki reference for the Serverless → Serverful pivot — a general pattern for ML-serving teams whose PoCs fit Serverless but whose production traffic does not.
Seen in¶
- sources/2026-04-01-aws-automate-safety-monitoring-with-computer-vision-and-generative-ai — per-use-case Serverful endpoints on ml.g6 with auto-scaling under the patterns/serverless-driver-worker pattern.
- sources/2026-04-21-figma-the-infrastructure-behind-ai-search-in-figma — batched embedding inference role: SageMaker hosts CLIP; embedding requests are sent in batches, with a list of thumbnail URLs as input and a list of embeddings as output, one per input image. Inside the container, image download, resize, and normalise are parallelised. Batch-size sweet spot — "past some threshold we started to see latency growing linearly with batch size, instead of a sublinear batching effect" — i.e. inference throughput plateaus past an optimal batch size. Stage 2 of Figma's four-stage discrete-job pipeline.
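The sweet-spot behaviour Figma describes can be sketched with a synthetic latency model: below a knee, batching amortises fixed per-request overhead (sublinear, so per-item latency falls); past it, total latency grows linearly and per-item latency rises again. The knee and latency constants below are invented for illustration — real values come from load tests against the endpoint:

```python
# Sketch of the batch-size sweet spot: find the batch size minimising
# per-item latency under a synthetic sublinear-then-linear latency model.
# All constants (knee=32, 50 ms overhead, 4 ms/item) are assumptions.

def simulated_latency_ms(batch_size: int, knee: int = 32) -> float:
    fixed = 50.0      # per-request overhead, amortised by batching
    per_item = 4.0
    if batch_size <= knee:
        return fixed + per_item * batch_size
    # past the knee, extra items effectively serialise: linear growth
    return fixed + per_item * knee + 2 * per_item * (batch_size - knee)

def best_batch_size(candidates) -> int:
    # sweet spot = batch size with the lowest latency per item
    return min(candidates, key=lambda b: simulated_latency_ms(b) / b)

print(best_batch_size([1, 2, 4, 8, 16, 32, 64, 128]))  # → 32
```

In practice the same sweep is run against the live endpoint, plotting latency per item against batch size and picking the knee.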