
Amazon SageMaker Endpoint

Amazon SageMaker Endpoint is AWS's managed model-serving surface within the SageMaker AI family. It comes in two deployment shapes:

  • Serverless inference endpoints — scale-to-zero, per-request billing; no GPU support; 6 GB memory cap.
  • Serverful (real-time) inference endpoints — provisioned instances from a chosen family (ml.m*, ml.c*, ml.g* GPU, ml.p* GPU, etc.) with auto-scaling policies.
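The two shapes can be sketched as `create_endpoint_config` request payloads for the boto3 SageMaker client. This is a minimal illustration, not a full deployment: the model/endpoint names and defaults are hypothetical, and a real call would pass the returned dict to `boto3.client("sagemaker").create_endpoint_config(**cfg)`.

```python
def serverless_config(model_name: str, memory_mb: int = 4096,
                      max_concurrency: int = 20) -> dict:
    """Serverless shape: scale-to-zero, per-request billing, CPU only."""
    if memory_mb > 6144:  # hard ceiling: 6 GB memory cap on Serverless
        raise ValueError("Serverless inference caps memory at 6144 MB")
    return {
        "EndpointConfigName": f"{model_name}-serverless",
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "ServerlessConfig": {
                "MemorySizeInMB": memory_mb,
                "MaxConcurrency": max_concurrency,
            },
        }],
    }


def realtime_config(model_name: str, instance_type: str = "ml.g6.xlarge",
                    count: int = 1) -> dict:
    """Serverful (real-time) shape: provisioned instances, GPU allowed."""
    return {
        "EndpointConfigName": f"{model_name}-realtime",
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": count,
        }],
    }
```

Note the asymmetry: the serverless variant carries a `ServerlessConfig` block and no instance type, while the real-time variant names an instance family and count — which is exactly where GPU families like ml.g* become available.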

Stub — expand as dedicated sources arrive.

Serverless-to-Serverful pivot at production scale

The canonical wiki instance, sources/2026-04-01-aws-automate-safety-monitoring-with-computer-vision-and-generative-ai, started on SageMaker Serverless inference endpoints with approximately 50 cameras. Two hard ceilings forced the pivot once deployment reached hundreds of sites:

  1. No GPU support on Serverless — blocks most modern CV / LLM workloads.
  2. 6 GB maximum memory configuration on Serverless — caused out-of-memory errors at production image volumes.

Fix: migrate to Serverful endpoints on ml.g6-family GPU instances with auto-scaling policies, paired with AWS service-team collaboration to raise Lambda concurrent-execution limits, memory-allocation and multithreading optimisations, and SQS batch-size tuning.
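The auto-scaling half of that fix can be sketched as Application Auto Scaling payloads: register the real-time variant as a scalable target, then attach a target-tracking policy on the built-in invocations-per-instance metric. Endpoint/variant names, capacity bounds, and the target value are hypothetical; real calls would go through `boto3.client("application-autoscaling")`.

```python
def scaling_target(endpoint: str, variant: str = "AllTraffic",
                   min_count: int = 1, max_count: int = 8) -> dict:
    """register_scalable_target payload for a SageMaker endpoint variant."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint}/variant/{variant}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_count,
        "MaxCapacity": max_count,
    }


def scaling_policy(endpoint: str, variant: str = "AllTraffic",
                   invocations_per_instance: float = 100.0) -> dict:
    """put_scaling_policy payload: target-tracking on the predefined
    SageMakerVariantInvocationsPerInstance metric."""
    return {
        "PolicyName": f"{endpoint}-invocations-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint}/variant/{variant}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
        },
    }
```

Target-tracking keeps the fleet sized so each instance sees roughly the target invocation rate, which is the usual replacement for Serverless's implicit scale-to-zero once GPU instances are in play.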

This is the first architecturally detailed wiki reference for the Serverless → Serverful pivot — a general pattern for ML-serving teams whose PoCs fit Serverless but whose production traffic does not.
