
Instance-type fallback (prioritised node affinity)

Definition

Instance-type fallback is the scheduling pattern where a workload declares a prioritised list of acceptable compute instance types and the scheduler silently places the pod on the highest-priority type that currently has capacity, falling back to the next in the list as capacity goes away. The list is closed (placement is restricted to the listed set — nothing outside the list is a legal home) and ordered (higher-priority entries are strictly preferred within the set).

Why it matters

Accelerator capacity — especially top-tier GPU instance types (ml.p4d.24xlarge, p5.48xlarge, H100-class) — is the canonical scarce resource in modern ML fleets. A single-instance-type deployment has three brittle outcomes:

  • Preferred type unavailable → pod goes Pending indefinitely. The workload is blocked on capacity the customer has no control over.
  • Over-provision to the lowest-common-denominator type. The deployment always requests ml.g5.8xlarge and never benefits from a p4d.24xlarge that is sitting available right now.
  • Manual re-deployment when capacity shifts. Operator pages — swap the CRD's instanceType, redeploy, hope.

The fallback pattern replaces the pager with a scheduler primitive that expresses the customer's actual preference order and lets the scheduler pick the best-available slot silently.

Canonical Kubernetes realisation

Express the list via nodeAffinity with two clauses working together:

nodeAffinity:
  # closed set: only these types are eligible
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
    - matchExpressions:
      - key: node.kubernetes.io/instanceType
        operator: In
        values: ["ml.p4d.24xlarge", "ml.g5.24xlarge", "ml.g5.8xlarge"]

  # ordered priority: descending weights → descending preference
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    preference:
      matchExpressions:
      - key: node.kubernetes.io/instanceType
        operator: In
        values: ["ml.p4d.24xlarge"]
  - weight: 50
    preference:
      matchExpressions:
      - key: node.kubernetes.io/instanceType
        operator: In
        values: ["ml.g5.24xlarge"]
  - weight: 10
    preference:
      matchExpressions:
      - key: node.kubernetes.io/instanceType
        operator: In
        values: ["ml.g5.8xlarge"]

Two rules, distinct roles:

  • required is a filter — it restricts the placement set. Without this, the scheduler can place the pod on any node that satisfies other constraints; fallback would be meaningless because non-listed types would also be eligible.
  • preferred with descending weights is a sort key — when multiple required-eligible nodes exist, the one with the highest preference-weight match wins. Weights are additive across matched rules, so assigning strictly decreasing values (100 / 50 / 10) enforces the priority ordering.

The IgnoredDuringExecution suffix on both rules means a running pod is not evicted if its node later falls out of the eligibility set — fallback applies only at scheduling time.
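The filter-then-sort behaviour can be sketched in a few lines. This is a simplified model, not the scheduler's actual implementation — real NodeAffinity scoring also normalizes each node's weight sum against the highest sum in the scheduling cycle — but the ordering it produces matches the manifest above:

```python
# Closed set and weights mirror the manifest above.
ALLOWED = {"ml.p4d.24xlarge", "ml.g5.24xlarge", "ml.g5.8xlarge"}
PREFERRED_WEIGHTS = {"ml.p4d.24xlarge": 100, "ml.g5.24xlarge": 50, "ml.g5.8xlarge": 10}

def score(node_instance_type):
    """Return the node's preference score, or None if the required filter rejects it."""
    if node_instance_type not in ALLOWED:
        return None  # required rule: node is filtered out entirely
    return PREFERRED_WEIGHTS.get(node_instance_type, 0)  # preferred rule: sort key

# Among eligible nodes with free capacity, the highest score wins.
nodes_with_capacity = ["ml.g5.8xlarge", "ml.g5.24xlarge"]  # p4d capacity exhausted
best = max(nodes_with_capacity, key=score)
print(best)  # ml.g5.24xlarge
```

Note the asymmetry: a node outside ALLOWED is never a candidate at any score, while nodes inside it are merely ranked — exactly the required/preferred split.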

Production instance — SageMaker HyperPod Inference

(Source: sources/2026-04-06-aws-unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod)

The SageMaker HyperPod Inference Operator exposes a short-form syntax that compiles to the affinity rules above:

apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
spec:
  replicas: 13
  modelName: Llama-3.1-8B-Instruct
  instanceTypes: ["ml.p4d.24xlarge", "ml.g5.24xlarge", "ml.g5.8xlarge"]

"In the example below, when deploying a model from S3, ml.p4d.24xlarge has the highest priority and will be selected first if memory capacity is available. If ml.p4d.24xlarge is unavailable, the scheduler automatically falls back to ml.g5.24xlarge, and finally to ml.g5.8xlarge as the last resort."

The operator documents the compilation explicitly: "implemented using Kubernetes node affinity rules with requiredDuringSchedulingIgnoredDuringExecution to restrict scheduling to the specified instance types, and preferredDuringSchedulingIgnoredDuringExecution with descending weights to enforce priority ordering."

For more granular scheduling (exclude Spot, prefer specific AZs, target custom labels), the same CRD exposes raw nodeAffinity directly — see systems/sagemaker-hyperpod-inference-operator's worked example.
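A minimal sketch of the documented compilation: an ordered instanceTypes list becomes one required In-clause (the closed set) plus one preferred clause per type with descending weights (the sort key). The `compile_affinity` helper and the exact weight values it emits are assumptions for illustration — the operator's real output may differ:

```python
LABEL = "node.kubernetes.io/instanceType"

def compile_affinity(instance_types):
    """Hypothetical compiler: ordered list of types -> nodeAffinity dict."""
    n = len(instance_types)
    return {
        "nodeAffinity": {
            # Closed set: only the listed types are eligible at all.
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{
                    "matchExpressions": [
                        {"key": LABEL, "operator": "In", "values": instance_types}
                    ]
                }]
            },
            # Ordered priority: earlier in the list => higher weight.
            # Kubernetes requires weights in the range 1-100.
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": (n - i) * 10,
                    "preference": {"matchExpressions": [
                        {"key": LABEL, "operator": "In", "values": [t]}
                    ]},
                }
                for i, t in enumerate(instance_types)
            ],
        }
    }

affinity = compile_affinity(["ml.p4d.24xlarge", "ml.g5.24xlarge", "ml.g5.8xlarge"])
```

The output has the same shape as the hand-written manifest in the previous section: one required term listing all types, and one preferred term per type with strictly decreasing weights.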

Contrast with other scheduler strategies

| Strategy | What the scheduler does | When to use |
| --- | --- | --- |
| Single instanceType | Pod pending until that exact type is available | Replica is pinned to a specific accelerator class (e.g. FP8 kernel only runs on H100) |
| Instance-type fallback (this) | Best-available within a closed ordered set | Multiple accelerator classes are acceptable but some are preferred |
| Karpenter just-in-time | Launch a new node of an acceptable type on demand | Capacity can be created, not just discovered |
| concepts/locality-aware-scheduling | Co-place with input data | Data-movement cost dominates the placement decision |
| concepts/memory-aware-scheduling | Pack by declared memory footprint rather than CPU | Memory is the binding resource (Ray compactor) |

Instance-type fallback is orthogonal to each of these — Karpenter, locality, and memory-aware scheduling can all compose with a fallback list (pick any acceptable type, but prefer the data-local one; launch a new node of the preferred acceptable type).

When it's the wrong answer

  • Different instance types produce different model quality. If the workload's correctness or output depends on the accelerator class (numerical behaviour, FP8 support, NVLink topology), fallback silently hides a quality regression.
  • Different instance types have wildly different per-query cost or latency. The preferred weight ordering should follow the business preference (cost vs latency vs quality), not the accidental order in which the author happened to list the types — an easy gotcha.
  • Capacity is the problem on every type in the list. Fallback buys nothing if all listed types are simultaneously constrained; pair with Karpenter or a just-in-time provisioner.
  • Replica count doesn't fit one type. If replicas: 13 and only 5 p4d nodes are free, the scheduler will place 5 on p4d and 8 on the fallback — a heterogeneous fleet the application tier may not expect (per-replica throughput varies).
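The heterogeneous-fleet outcome in the last bullet can be sketched as a greedy fill: the highest-priority type absorbs replicas until its capacity runs out, then the remainder spills to the next type. The capacity numbers below are illustrative:

```python
from collections import Counter

def place(replicas, free_by_type):
    """Greedy placement sketch. free_by_type is ordered highest priority first
    (Python dicts preserve insertion order)."""
    placement = Counter()
    for itype, free in free_by_type.items():
        take = min(replicas, free)
        placement[itype] = take
        replicas -= take
        if replicas == 0:
            break
    return placement

# replicas: 13, but only 5 p4d slots free -> 5 land on p4d, 8 on the first fallback
result = place(13, {"ml.p4d.24xlarge": 5, "ml.g5.24xlarge": 20, "ml.g5.8xlarge": 40})
print(result)  # 5 on ml.p4d.24xlarge, 8 on ml.g5.24xlarge
```

The application tier sees two replica populations with different per-replica throughput — exactly the heterogeneity the bullet warns about.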

Adjacent tradition

  • topologySpreadConstraints — enforce spread across zones / racks; composes with fallback (spread across AZs and prefer instance types in an order).
  • Taints + tolerations — access-control rather than preference; keeps dedicated pools for dedicated workloads.
  • Priority-ordered bidding in EC2 Spot Fleet / Auto Scaling Groups — the same pattern at the node-provisioning layer, below the K8s scheduler.
  • Cloud-provider instance-type flexibility in Karpenter / Cluster Autoscaler NodePools — expose a list of acceptable types to the provisioner, not to the scheduler.
