Amazon SageMaker HyperPod¶
Amazon SageMaker HyperPod is the large-scale distributed-training / inference compute substrate under systems/aws-sagemaker-ai. It targets workloads that run across hundreds to thousands of GPUs — foundation-model training, fine-tuning, and large-scale inference. Stub page: expand as sources cite specifics.
Problem space¶
At HyperPod's scale, failures are inevitable: "hardware overheats. network connections drop. memory gets corrupted." The product's design center is how you detect and recover, not if failures occur. Three design challenges surfaced in sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development:
- concepts/grey-failure — GPU thermal throttling, NIC packet loss: partial degradation that standard up/down monitoring misses.
- concepts/monitoring-paradox — single-threaded collectors deployed to catch failures themselves cause failures by hitting CPU limits and filling disks.
- concepts/training-serving-boundary — historically training and serving ran on separate fleets; HyperPod's newer capabilities collapse that split.
Capabilities (2025)¶
- HyperPod observability. Replaces single-threaded collectors with auto-scaling ones that grow and shrink with workload (patterns/auto-scaling-telemetry-collector); automatically correlates high-cardinality metrics from "every GPU, every network interface, every storage device" using algorithms designed for large-scale time-series; detects grey failures, not just binary ones; ships zero-config dashboards that replace the multi-tool detective work (CloudWatch containers + custom GPU dashboards + network monitors). Claimed impact: "days → minutes" for root-cause detection; proactive pre-failure alerts.
- HyperPod model deployment. Train a foundation model on a cluster, deploy it for inference on that same cluster — crosses the concepts/training-serving-boundary by exploiting that modern FM training and inference both want the same GPU fabric.
- HyperPod training operator (for Kubernetes). Restarts only the affected resources on failure rather than the whole job (patterns/partial-restart-fault-recovery); monitors for stalled batches and non-numeric loss (NaN/Inf) as explicit health signals beyond pod liveness; exposes YAML-defined recovery policies so teams codify their own restart/kill/alert behavior.
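The YAML-defined recovery policies could look something like the sketch below. This is a hypothetical shape for illustration only: the `apiVersion`, `kind`, and every field name (`recoveryPolicies`, `trigger`, `restartAffected`, `maxRestarts`) are my own inventions, not the operator's published schema.

```yaml
# Hypothetical recovery-policy sketch for a HyperPod training job.
# All field names are illustrative, not the operator's actual CRD schema.
apiVersion: sagemaker.amazonaws.com/v1   # assumed group/version
kind: HyperPodTrainingJob                # assumed kind
metadata:
  name: fm-pretrain
spec:
  recoveryPolicies:
    - trigger: stalledBatch     # no forward progress within the window
      window: 10m
      action: restartAffected   # partial restart, not the whole job
    - trigger: nonNumericLoss   # NaN/Inf loss treated as a health signal
      action: stopAndAlert
  maxRestarts: 5
```

The point the sketch captures is the design stance: stalled batches and non-numeric loss are first-class health signals with team-codified responses, rather than conditions a pod-liveness probe would ever see.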
HyperPod Inference Operator (2026)¶
See dedicated page: systems/sagemaker-hyperpod-inference-operator. The Kubernetes controller behind the `InferenceEndpointConfig` + `JumpStartModel` CRDs, running on a HyperPod cluster with EKS orchestration. As of 2026-04-06 it ships as a native EKS add-on (replacing the prior Helm install path — see patterns/eks-add-on-as-lifecycle-packaging); managed prerequisite scaffolding includes four IAM roles, an S3 bucket for TLS certs, VPC endpoints, and four dependency add-ons (cert-manager, S3 CSI, FSx CSI, metrics-server). Three platform features land at install:
- Multi-instance-type deployment via Kubernetes node-affinity rules — a prioritised list of GPU instance types (e.g. ["ml.p4d.24xlarge", "ml.g5.24xlarge", "ml.g5.8xlarge"]), compiled to `requiredDuringSchedulingIgnoredDuringExecution` + `preferredDuringSchedulingIgnoredDuringExecution` with descending weights. The scheduler silently falls back when preferred capacity is unavailable.
- Managed tiered KV cache with intelligent per-instance-type memory allocation; AWS claims up to 40% inference-latency reduction for long-context workloads (no methodology published). Moves KV-cache management from model-serving-library scope into platform scope.
- Intelligent routing — one of three strategies (prefix-aware / KV-aware / round-robin) picked at install to maximise KV-cache reuse across requests that share a prompt prefix.
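The instance-type fallback above can be illustrated with plain Kubernetes node affinity — a minimal sketch assuming the standard `node.kubernetes.io/instance-type` node label; the operator's actual compiled output is not documented and may differ:

```yaml
# Illustrative sketch: a prioritised instance-type list compiled to
# Kubernetes node affinity. "Required" admits any listed type;
# "preferred" weights express the descending priority order.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values: ["ml.p4d.24xlarge", "ml.g5.24xlarge", "ml.g5.8xlarge"]
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100          # first choice
        preference:
          matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values: ["ml.p4d.24xlarge"]
      - weight: 50           # second choice
        preference:
          matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values: ["ml.g5.24xlarge"]
      - weight: 10           # last resort
        preference:
          matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values: ["ml.g5.8xlarge"]
```

Because preferred terms only bias scoring and the required term admits all three types, the scheduler lands pods on a lower-priority GPU type whenever higher-priority capacity is exhausted — which is exactly the "silent fallback" behaviour described above.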
Bundled dependencies include systems/keda (IRSA-installed autoscaler) and an ALB Controller. Observability surfaces TTFT / latency / GPU utilisation into Amazon Managed Grafana dashboards.
Seen in¶
- sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development — Vogels' friction-removal survey; introduces HyperPod observability, model deployment, and training operator.
- sources/2026-04-06-aws-unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod — EKS add-on launch for the HyperPod Inference Operator; canonical wiki reference for the CRDs, the Helm→add-on packaging transition, and the KV-cache + prefix-aware-routing + instance-type-fallback feature envelope.
Related¶
- systems/aws-sagemaker-ai
- systems/sagemaker-hyperpod-inference-operator — the inference-side controller.
- systems/aws-eks — the K8s control plane under HyperPod inference.
- systems/kubernetes — substrate under the training + inference operators.
- systems/helm — the packaging primitive the 2026 inference operator migrates away from.
- systems/keda — bundled autoscaler dependency.
- concepts/grey-failure, concepts/monitoring-paradox, concepts/training-serving-boundary
- concepts/instance-type-fallback, concepts/kv-cache, concepts/prefix-aware-routing
- patterns/auto-scaling-telemetry-collector, patterns/partial-restart-fault-recovery, patterns/eks-add-on-as-lifecycle-packaging