Amazon SageMaker HyperPod¶
Amazon SageMaker HyperPod is the large-scale distributed-training / inference compute substrate under systems/aws-sagemaker-ai. It targets workloads that run across hundreds to thousands of GPUs — foundation-model training, fine-tuning, and large-scale inference. Stub page: expand as sources cite specifics.
Problem space¶
At HyperPod's scale, failures are inevitable: "hardware overheats. network connections drop. memory gets corrupted." The product's design center is how you detect and recover, not if failures occur. Three design challenges surfaced in sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development:
- concepts/grey-failure — GPU thermal throttling, NIC packet loss: partial degradation that standard up/down monitoring misses.
- concepts/monitoring-paradox — single-threaded collectors deployed to catch failures themselves cause failures by hitting CPU limits and filling disks.
- concepts/training-serving-boundary — historically training and serving ran on separate fleets; HyperPod's newer capabilities collapse that split.
Capabilities (2025)¶
- HyperPod observability. Replaces single-threaded collectors with auto-scaling ones that grow and shrink with workload (patterns/auto-scaling-telemetry-collector); automatically correlates high-cardinality metrics from "every GPU, every network interface, every storage device" using algorithms designed for large-scale time-series; detects grey failures, not just binary ones; ships zero-config dashboards that replace the multi-tool detective work (CloudWatch containers + custom GPU dashboards + network monitors). Claimed impact: "days → minutes" for root-cause detection; proactive pre-failure alerts.
- HyperPod model deployment. Train a foundation model on a cluster, deploy it for inference on that same cluster — crosses the concepts/training-serving-boundary by exploiting that modern FM training and inference both want the same GPU fabric.
- HyperPod training operator (for Kubernetes). Restarts only the affected resources on failure rather than the whole job (patterns/partial-restart-fault-recovery); monitors for stalled batches and non-numeric loss (NaN/Inf) as explicit health signals beyond pod liveness; exposes YAML-defined recovery policies so teams codify their own restart/kill/alert behavior.
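The YAML-defined recovery policies could look something like the sketch below. This is a hypothetical shape for illustration only: the `apiVersion`, `kind`, and every field name (`recoveryPolicies`, `trigger`, `restartAffected`, `maxRestarts`) are my own inventions, not the operator's published schema.

```yaml
# Hypothetical recovery-policy sketch for a HyperPod training job.
# All field names are illustrative, not the operator's actual CRD schema.
apiVersion: sagemaker.amazonaws.com/v1   # assumed group/version
kind: HyperPodTrainingJob                # assumed kind
metadata:
  name: fm-pretrain
spec:
  recoveryPolicies:
    - trigger: stalledBatch     # no forward progress within the window
      window: 10m
      action: restartAffected   # partial restart, not the whole job
    - trigger: nonNumericLoss   # NaN/Inf loss treated as a health signal
      action: stopAndAlert
  maxRestarts: 5
```

The point the sketch captures is the design stance: stalled batches and non-numeric loss are first-class health signals with team-codified responses, rather than conditions a pod-liveness probe would ever see.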
HyperPod Inference Operator (2026)¶
See dedicated page: systems/sagemaker-hyperpod-inference-operator. The Kubernetes controller behind the `InferenceEndpointConfig` + `JumpStartModel` CRDs, running on a HyperPod cluster with EKS orchestration. As of 2026-04-06 it ships as a native EKS add-on (replacing the prior Helm install path — see patterns/eks-add-on-as-lifecycle-packaging); managed prerequisite scaffolding includes four IAM roles, an S3 bucket for TLS certs, VPC endpoints, and four dependency add-ons (cert-manager, S3 CSI, FSx CSI, metrics-server). Three platform features land at install:
- Multi-instance-type deployment via Kubernetes node-affinity rules — a prioritised list of GPU instance types (e.g. ["ml.p4d.24xlarge", "ml.g5.24xlarge", "ml.g5.8xlarge"]), compiled to `requiredDuringSchedulingIgnoredDuringExecution` + `preferredDuringSchedulingIgnoredDuringExecution` with descending weights. The scheduler silently falls back when preferred capacity is unavailable.
- Managed tiered KV cache with intelligent per-instance-type memory allocation; AWS claims up to 40% inference-latency reduction for long-context workloads (no methodology published). Moves KV-cache management from model-serving-library scope into platform scope.
- Intelligent routing — one of three strategies (prefix-aware / KV-aware / round-robin) picked at install to maximise KV-cache reuse across requests that share a prompt prefix.
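The instance-type fallback above can be illustrated with plain Kubernetes node affinity — a minimal sketch assuming the standard `node.kubernetes.io/instance-type` node label; the operator's actual compiled output is not documented and may differ:

```yaml
# Illustrative sketch: a prioritised instance-type list compiled to
# Kubernetes node affinity. "Required" admits any listed type;
# "preferred" weights express the descending priority order.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values: ["ml.p4d.24xlarge", "ml.g5.24xlarge", "ml.g5.8xlarge"]
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100          # first choice
        preference:
          matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values: ["ml.p4d.24xlarge"]
      - weight: 50           # second choice
        preference:
          matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values: ["ml.g5.24xlarge"]
      - weight: 10           # last resort
        preference:
          matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values: ["ml.g5.8xlarge"]
```

Because preferred terms only bias scoring and the required term admits all three types, the scheduler lands pods on a lower-priority GPU type whenever higher-priority capacity is exhausted — which is exactly the "silent fallback" behaviour described above.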
Bundled dependencies include systems/keda (IRSA-installed autoscaler) and an ALB Controller. Observability surfaces TTFT / latency / GPU utilisation into Amazon Managed Grafana dashboards.
Seen in¶
- sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development — Vogels' friction-removal survey; introduces HyperPod observability, model deployment, and training operator.
- sources/2026-04-06-aws-unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod — EKS add-on launch for the HyperPod Inference Operator; canonical wiki reference for the CRDs, the Helm→add-on packaging transition, and the KV-cache + prefix-aware-routing + instance-type-fallback feature envelope.
Related¶
- systems/aws-sagemaker-ai
- systems/sagemaker-hyperpod-inference-operator — the inference-side controller.
- systems/aws-eks — the K8s control plane under HyperPod inference.
- systems/kubernetes — substrate under the training + inference operators.
- systems/helm — the packaging primitive the 2026 inference operator migrates away from.
- systems/keda — bundled autoscaler dependency.
- concepts/grey-failure, concepts/monitoring-paradox, concepts/training-serving-boundary
- concepts/instance-type-fallback, concepts/kv-cache, concepts/prefix-aware-routing
- patterns/auto-scaling-telemetry-collector, patterns/partial-restart-fault-recovery, patterns/eks-add-on-as-lifecycle-packaging