Instance-type fallback (prioritised node affinity)¶
Definition¶
Instance-type fallback is the scheduling pattern where a workload declares a prioritised list of acceptable compute instance types and the scheduler silently places the pod on the highest-priority type that currently has capacity, falling back to the next in the list as capacity goes away. The list is closed (placement is restricted to the listed set — nothing outside the list is a legal home) and ordered (higher-priority entries are strictly preferred within the set).
Why it matters¶
Accelerator capacity — especially top-tier GPU instance types
(ml.p4d.24xlarge, p5.48xlarge, H100-class) — is the
canonical scarce resource in modern ML fleets. A single-instance-
type deployment has three brittle outcomes:
- Preferred type unavailable → pod goes `Pending` indefinitely. The workload is blocked on capacity the customer has no control over.
- Over-provision to the lowest-common-denominator type. Always request `ml.g5.8xlarge` — never benefit from a `p4d.24xlarge` that's sitting available right now.
- Manual re-deployment when capacity shifts. Operator pages — swap the CRD's `instanceType`, redeploy, hope.
The fallback pattern replaces the pager with a scheduler primitive that expresses the customer's actual preference order and lets the scheduler pick the best-available slot silently.
Canonical Kubernetes realisation¶
Express the list via nodeAffinity with two clauses working
together:
nodeAffinity:
# closed set: only these types are eligible
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node.kubernetes.io/instanceType
operator: In
values: ["ml.p4d.24xlarge", "ml.g5.24xlarge", "ml.g5.8xlarge"]
# ordered priority: descending weights → descending preference
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: node.kubernetes.io/instanceType
operator: In
values: ["ml.p4d.24xlarge"]
- weight: 50
preference:
matchExpressions:
- key: node.kubernetes.io/instanceType
operator: In
values: ["ml.g5.24xlarge"]
- weight: 10
preference:
matchExpressions:
- key: node.kubernetes.io/instanceType
operator: In
values: ["ml.g5.8xlarge"]
Two rules, distinct roles:
- `required` is a filter — it restricts the placement set. Without this, the scheduler can place the pod on any node that satisfies other constraints; fallback would be meaningless because non-listed types would also be eligible.
- `preferred` with descending weights is a sort key — when multiple required-eligible nodes exist, the one with the highest preference-weight match wins. Weights are additive across matched rules, so assigning strictly decreasing values (100 / 50 / 10) enforces the priority ordering.
The IgnoredDuringExecution suffix on both rules means a
running pod is not evicted if its node later falls out of the
eligibility set — fallback applies only at scheduling time.
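The filter-then-sort interaction of the two clauses can be sketched as a toy scoring loop. This is a simplified model for illustration, not the real kube-scheduler (which scores many other dimensions); the node names and the capacity-list shape are hypothetical:

```python
# Simplified model of the two affinity clauses: `required` filters,
# `preferred` sorts by additive weight. Not the real kube-scheduler.

REQUIRED = {"ml.p4d.24xlarge", "ml.g5.24xlarge", "ml.g5.8xlarge"}
WEIGHTS = {"ml.p4d.24xlarge": 100, "ml.g5.24xlarge": 50, "ml.g5.8xlarge": 10}

def pick_node(nodes_with_capacity):
    """nodes_with_capacity: list of (node_name, instance_type) pairs."""
    # required clause: hard filter — non-listed types are never eligible
    eligible = [(n, t) for n, t in nodes_with_capacity if t in REQUIRED]
    if not eligible:
        return None  # pod stays Pending
    # preferred clause: highest matched weight wins
    return max(eligible, key=lambda nt: WEIGHTS.get(nt[1], 0))[0]

# p4d capacity gone → falls back to the g5.24xlarge node
print(pick_node([("n1", "ml.g5.8xlarge"), ("n2", "ml.g5.24xlarge")]))  # → n2
```

With a free p4d node in the list, the same call returns it instead; with no listed type available at all, the pod is unschedulable rather than spilling to an unlisted type.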
Production instance — SageMaker HyperPod Inference¶
The SageMaker HyperPod Inference Operator exposes a short-form syntax that compiles to the affinity rules above:
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
spec:
replicas: 13
modelName: Llama-3.1-8B-Instruct
instanceTypes: ["ml.p4d.24xlarge", "ml.g5.24xlarge", "ml.g5.8xlarge"]
"In the example below, when deploying a model from S3, ml.p4d.24xlarge has the highest priority and will be selected first if memory capacity is available. If ml.p4d.24xlarge is unavailable, the scheduler automatically falls back to ml.g5.24xlarge, and finally to ml.g5.8xlarge as the last resort."
The operator documents the compilation explicitly: "implemented
using Kubernetes node affinity rules with
requiredDuringSchedulingIgnoredDuringExecution to restrict
scheduling to the specified instance types, and
preferredDuringSchedulingIgnoredDuringExecution with descending
weights to enforce priority ordering."
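A minimal sketch of what that compilation could look like. The field names follow the Kubernetes affinity schema shown earlier, but the exact weight values the operator emits are an assumption: any strictly decreasing sequence enforces the same ordering, since at most one preference rule matches a given node.

```python
# Hypothetical sketch of the documented compilation: an ordered
# instanceTypes list → nodeAffinity rules. Weight values are assumed.

def compile_affinity(instance_types):
    key = "node.kubernetes.io/instanceType"
    n = len(instance_types)
    return {
        # closed set: restrict scheduling to the listed types
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [{
                "matchExpressions": [
                    {"key": key, "operator": "In",
                     "values": list(instance_types)}
                ]
            }]
        },
        # ordered priority: strictly decreasing weights, one rule per type
        "preferredDuringSchedulingIgnoredDuringExecution": [
            {
                "weight": 100 - i * (90 // max(n - 1, 1)),
                "preference": {
                    "matchExpressions": [
                        {"key": key, "operator": "In", "values": [t]}
                    ]
                },
            }
            for i, t in enumerate(instance_types)
        ],
    }
```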
For more granular scheduling (exclude Spot, prefer specific AZs,
target custom labels), the same CRD exposes raw nodeAffinity
directly — see systems/sagemaker-hyperpod-inference-operator's
worked example.
Contrast with other scheduler strategies¶
| Strategy | What the scheduler does | When to use |
|---|---|---|
| Single `instanceType` | Pod pending until that exact type is available | When the replica is pinned to a specific accelerator class (e.g. FP8 kernel only runs on H100) |
| Instance-type fallback (this) | Best-available within a closed ordered set | When multiple accelerator classes are acceptable but some are preferred |
| Karpenter just-in-time | Launch a new node of an acceptable type on demand | When capacity can be created, not just discovered |
| concepts/locality-aware-scheduling | Co-place with input data | When data-movement cost dominates placement decision |
| concepts/memory-aware-scheduling | Pack by declared memory footprint rather than CPU | When memory is the binding resource (Ray compactor) |
Instance-type fallback is orthogonal to each of these — Karpenter, locality, and memory-aware scheduling can all compose with a fallback list (pick any acceptable type, but prefer the data-local one; launch a new node of the preferred acceptable type).
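The "prefer the data-local one" composition follows from additive weights: preference scores from independent rules sum per node. A toy illustration, where the data-local zone weight of 60 is a hypothetical value chosen to show a local fallback type outscoring a remote preferred type:

```python
# Sketch of weight composition: instance-type preference and a
# hypothetical data-locality preference add up per node.

TYPE_W = {"ml.p4d.24xlarge": 100, "ml.g5.24xlarge": 50}
LOCAL_W = 60  # assumed weight on a "data-local zone" preference rule

def score(instance_type, data_local):
    # per-node score = sum of all matched preference weights
    return TYPE_W.get(instance_type, 0) + (LOCAL_W if data_local else 0)

print(score("ml.p4d.24xlarge", False))  # remote preferred type → 100
print(score("ml.g5.24xlarge", True))    # local fallback type  → 110
```

Whether locality or instance class dominates is entirely a function of the chosen weights, which is the point: the composition is expressed in one scoring pass, not two competing mechanisms.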
When it's the wrong answer¶
- Different instance types produce different model quality. If the workload's correctness or output depends on the accelerator class (numerical behaviour, FP8 support, NVLink topology), fallback silently hides a quality regression.
- Different instance types have wildly different per-query cost or latency. The `preferred` weight ordering should follow the business preference (cost vs latency vs quality), not an accident of author ordering.
- Capacity is the problem on every type in the list. Fallback buys nothing if all listed types are simultaneously constrained; pair with Karpenter or a just-in-time provisioner.
- Replica count doesn't fit one type. If `replicas: 13` and only 5 p4d nodes are free, the scheduler will place 5 on p4d and 8 on the fallback — a heterogeneous fleet the application tier may not expect (per-replica throughput varies).
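The split can be sketched with a toy greedy placement. The `place` helper and the capacity numbers are hypothetical; the real scheduler places one pod at a time, which yields the same outcome:

```python
# Toy greedy placement: each replica takes the highest-priority
# instance type that still has a free slot.
from collections import Counter

def place(replicas, free_slots, priority):
    """free_slots: {instance_type: free node count}; priority: ordered list."""
    placed = Counter()
    for _ in range(replicas):
        for t in priority:
            if free_slots.get(t, 0) > placed[t]:
                placed[t] += 1
                break  # replica placed; unplaceable replicas stay Pending
    return dict(placed)

print(place(13, {"ml.p4d.24xlarge": 5, "ml.g5.8xlarge": 20},
            ["ml.p4d.24xlarge", "ml.g5.8xlarge"]))
# → {'ml.p4d.24xlarge': 5, 'ml.g5.8xlarge': 8}
```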
Adjacent tradition¶
- `topologySpreadConstraints` — enforce spread across zones / racks; composes with fallback (spread across AZs and prefer instance types in an order).
- Taints + tolerations — access control rather than preference; keeps dedicated pools for dedicated workloads.
- Priority-ordered bidding in EC2 Spot Fleet / Auto Scaling Groups — the same pattern at the node-provisioning layer, below the K8s scheduler.
- Cloud-provider instance-type flexibility in Karpenter / Cluster Autoscaler NodePools — expose a list of acceptable types to the provisioner, not to the scheduler.
Seen in¶
- sources/2026-04-06-aws-unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod
— canonical documented instance. Priority list via
InferenceEndpointConfig.spec.instanceTypes; scheduler compilation explicit.
Related¶
- systems/kubernetes — the mechanism substrate.
- systems/sagemaker-hyperpod-inference-operator — the canonical production consumer.
- concepts/locality-aware-scheduling, concepts/memory-aware-scheduling — orthogonal scheduler primitives.
- concepts/workload-aware-routing — analogue at the request-routing layer (route by shape; here we schedule by preference).