
PATTERN Cited by 1 source

EKS add-on as lifecycle packaging

Problem

An AWS-native Kubernetes operator (or dependency controller) ships initially as a Helm chart. Customer installation requires:

  • Creating several IAM roles with specific trust relationships (execution role, controller role, IRSA roles for sub-components).
  • Creating supporting AWS resources (S3 buckets for TLS certs, VPC endpoints, KMS keys).
  • Installing a bundle of dependency charts (cert-manager, CSI drivers, metrics-server, autoscalers) — each with its own version cadence, chart values, IRSA bindings.
  • Managing Helm-release lifecycle: helm upgrade cadence, break-glass rollbacks, drift between declared and actual state.
  • Tracking a compatibility matrix across {EKS version × operator version × dependency-chart versions}.

The result: "a maze of Helm charts, IAM role configurations, dependency management, and manual upgrades — often taking hours before a single model can serve predictions" (sources/2026-04-06-aws-unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod).

The customer-operated surface is mostly undifferentiated heavy lifting — none of the Helm/IAM/dependency scaffolding is where the customer's value-add lives.

Pattern shape

Re-package the Helm chart + its dependency bundle + its IAM prerequisites as a native EKS add-on.

An EKS add-on is a first-class EKS API primitive (aws eks create-addon, aws eks update-addon, aws eks delete-addon) where AWS owns:

  • Version compatibility — the add-on carries a declared {EKS version × add-on version} compatibility envelope; the EKS API refuses incompatible combinations.
  • One-click upgrades with rollback-on-failure semantics — run update-addon against a new version; EKS handles the reconciliation and reverts on failure.
  • Dependency add-ons — the add-on manifest declares its dependencies as other add-ons (cert-manager-as-add-on, CSI-driver-as-add-on) that EKS installs / upgrades together.
  • IAM scaffolding — the add-on install creates named IAM roles with scoped trust / permissions via the configuration-values blob (executionRoleArn, IRSA role ARNs for sub-components).
  • Config blob — a JSON configuration-values schema for customer-specific knobs (S3 bucket names, cluster ARN, existing IAM role ARNs to reuse).

The customer-facing surface collapses from "install Helm chart + 4 dependency charts + create 4 IAM roles + create S3 bucket + configure VPC endpoints" to a single aws eks create-addon command.
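The compatibility-envelope and rollback bullets above can be sketched as a two-call upgrade flow. This is a dry-run sketch, not the source's procedure: `run` prints each command instead of executing it, and the Kubernetes version and target add-on version are placeholders.

```shell
# Dry-run helper: record and print each command instead of executing it;
# on a real cluster, drop the helper and run the commands directly.
run() { PLAN="$PLAN $*"; echo "+ $*"; }

CLUSTER=my-hyperpod-cluster   # placeholder cluster name

# 1. Ask EKS which add-on versions sit inside the compatibility envelope
#    for the cluster's Kubernetes version (1.32 is a placeholder).
run aws eks describe-addon-versions \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --kubernetes-version 1.32

# 2. Move to a version from that list; EKS refuses off-matrix combinations
#    and rolls the add-on back if reconciliation fails.
run aws eks update-addon \
      --cluster-name "$CLUSTER" \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --addon-version v1.1.0-eksbuild.1
```

The point of step 1 is that the supported-version query is an API call, not a compatibility-matrix wiki page the customer has to maintain.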

Canonical realisation: SageMaker HyperPod Inference Operator

(Source: sources/2026-04-06-aws-unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod)

The operator's 2026-04-06 repackaging is the canonical wiki instance. Before:

  • Helm chart HyperPodHelmChart/charts/inference-operator
  • Helm sub-charts for cert-manager, S3 Mountpoint CSI driver, FSx CSI driver, metrics-server
  • Customer-created IAM roles: Execution / JumpStart Gated Model / ALB Controller / KEDA Operator (four)
  • Customer-created S3 bucket for TLS certificates
  • Customer-created VPC endpoints for S3 access
  • Customer-managed upgrade cadence.

After:

aws eks create-addon \
  --cluster-name my-hyperpod-cluster \
  --addon-name amazon-sagemaker-hyperpod-inference \
  --addon-version v1.0.0-eksbuild.1 \
  --configuration-values '{
    "executionRoleArn": "...",
    "tlsCertificateS3Bucket": "...",
    "hyperpodClusterArn": "...",
    "alb": { "serviceAccount": {"create": true, "roleArn": "..."}},
    "keda": { "auth": { "aws": { "irsa": { "roleArn": "..."}}}}
  }' \
  --region us-west-2

Or, via the SageMaker console's Quick Install, with zero customer parameters: AWS creates the IAM roles, S3 bucket, VPC endpoints, and dependency add-ons with optimised defaults, in one click.
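The `...` values in the command above are the customer-specific knobs. A cheap local sanity check before the API call is to confirm the configuration-values blob parses as JSON; the ARNs and bucket name below are placeholders, not values from the source, and the authoritative schema is the one EKS returns from `aws eks describe-addon-configuration`.

```shell
# Placeholder configuration-values blob; real deployments substitute their
# own role ARNs, bucket name, and HyperPod cluster ARN.
CONFIG='{
  "executionRoleArn": "arn:aws:iam::111122223333:role/hp-inference-exec",
  "tlsCertificateS3Bucket": "hp-tls-certs-example",
  "hyperpodClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/example"
}'

# Local check: the blob must at least be well-formed JSON before it is
# worth passing to `aws eks create-addon --configuration-values`.
echo "$CONFIG" | python3 -m json.tool > /dev/null \
  && echo "configuration-values: valid JSON"
```

This catches the most common failure mode of hand-edited blobs (quoting and trailing-comma mistakes) before the EKS API rejects them.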

The migration-script sub-pattern

For workloads already running on the Helm chart, shipping the add-on isn't enough — an automated migration is required. SageMaker HyperPod's migration script (helm_to_addon.sh) is the canonical shape:

  1. Auto-discover — read the existing Helm deployment's configuration (IAM roles in use, S3 buckets, dependency charts, release values) and derive the new add-on's configuration-values blob.
  2. Validate prerequisites — confirm dependency CRDs / RBAC exist; create whatever the Helm release used that the add-on also needs.
  3. Tag migrated resources: apply CreatedBy: HyperPodInference to ALBs, ACM certs, and S3 objects so both paths can coexist during cutover and so cleanup is identifiable.
  4. Scale down Helm-managed deployments (ALB, KEDA, operator) before installing the add-on to avoid two controllers reconciling the same CRDs.
  5. Install the add-on with the OVERWRITE flag so it takes ownership of the CRDs / namespace resources previously owned by Helm.
  6. Clean up old Helm resources.
  7. Migrate dependency add-ons (CSI drivers, cert-manager, metrics-server) from Helm-installed to add-on-installed; --skip-dependencies-migration flag lets organisations that already operate these as their own add-ons opt out.
  8. Preserve rollback — store a backup under /tmp/hyperpod-migration-backup-<timestamp>/ so a failure mid-migration can be reverted to the Helm state.

The combination of --auto-approve (non-interactive path for automation) + stepwise-interactive path + rollback backups gives operators the three migration postures: run it in prod now, walk through it on the pre-prod cluster, and undo it if the add-on misbehaves.
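The ordering constraint at the heart of the cutover (steps 4–6) can be sketched as a dry-run skeleton. This is an illustration of the sequencing, not the actual helm_to_addon.sh: `run` prints instead of executing, and the namespace and release names are hypothetical.

```shell
# Dry-run helper: record and print each command instead of executing it.
run() { PLAN="$PLAN $*"; echo "+ $*"; }

NS=hyperpod-inference-system   # hypothetical operator namespace

# Step 4: scale down the Helm-managed controller first — two reconcilers
# must never own the same CRDs at the same time.
run kubectl -n "$NS" scale deploy hyperpod-inference-operator --replicas=0

# Step 5: install the add-on with OVERWRITE so it adopts the resources
# previously owned by the Helm release.
run aws eks create-addon \
      --cluster-name my-hyperpod-cluster \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --resolve-conflicts OVERWRITE

# Step 6: only once the add-on is reconciled, remove the old Helm release.
run helm uninstall hyperpod-inference -n "$NS"
```

The order is the invariant: scale down before install, install before uninstall. Invert any pair and you either get dueling controllers or a window with no controller at all.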

Why the pattern works

  • Fewer knobs → fewer foot-guns. The Helm surface exposed many independent knobs (chart values × dependency-chart versions × IAM-role policy JSON × S3 ACLs × VPC endpoint config); the add-on surface is one JSON config blob with vendor-validated shapes.
  • Vendor owns the compatibility matrix. The {EKS version × operator version × dependency-add-on versions} matrix is a combinatorial explosion customers can't keep in their head; making it the vendor's problem is a direct application of concepts/managed-data-plane-style responsibility-boundary shift.
  • Upgrade path is one API call. update-addon with a new version is safer than helm upgrade because EKS validates the jump is on the supported matrix and rolls back on failure.
  • Dependency add-ons compose uniformly. When cert-manager ships as an add-on, every operator that depends on it stops having to bundle its own version; the EKS cluster has one cert-manager, not N.
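The last bullet in concrete terms: where cert-manager is itself available as an EKS add-on (availability varies by region and EKS version; treat this as a sketch, with `run` printing instead of executing), the cluster installs it once and every dependent operator declares it instead of bundling its own copy.

```shell
# Dry-run helper: record and print each command instead of executing it.
run() { PLAN="$PLAN $*"; echo "+ $*"; }

# One cluster-level cert-manager, installed once...
run aws eks create-addon --cluster-name my-cluster --addon-name cert-manager

# ...and visible to every operator add-on that declares it as a dependency.
run aws eks list-addons --cluster-name my-cluster
```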

Costs / when the pattern is wrong

  • Customisation surface narrows. Helm's templating lets customers inject arbitrary annotations, labels, init-containers, volume mounts; the add-on's configuration-values schema is vendor-defined. If the customer's deployment requires non-standard tweaks, the add-on path is too rigid — fall back to Helm.
  • Vendor-controlled upgrade cadence. Once the add-on is the only shipping channel, the customer is on the vendor's upgrade clock. For some workloads (pinned compliance versions, custom forks), this is disqualifying.
  • Multi-cloud / portability. Helm charts are cloud-agnostic; EKS add-ons only work on EKS. Organisations that deploy to EKS + GKE + AKS can't share an add-on path.
  • Less ecosystem leverage. The Helm chart ecosystem has well-understood tooling (helmfile, ArgoCD Helm renderer, CI/CD integrations); add-on equivalents are EKS-specific and less mature.

Adjacent tradition

  • OS-level package managers (apt, yum) for system add-ons vs. ./configure && make install — the same packaging-ownership shift at the Linux-distro layer.
  • Managed-RDS extensions — Postgres extensions installed via CREATE EXTENSION (managed) vs. custom .so files on a self-hosted server (customer-managed).
  • concepts/managed-data-plane at the service-mesh layer — AWS-managed Envoy (Service Connect) vs. customer-managed Envoy (App Mesh). Same trade-off: lose configurability, gain lifecycle management.
  • EKS Auto Mode at the node / data-plane layer. EKS add-on packaging is the operator-layer instance of the same shared-responsibility shift Auto Mode made at the node layer.
