

Kubernetes Operator pattern

Definition

A Kubernetes Operator is a custom controller that extends the Kubernetes Control Loop (observe → diff → act) to manage a domain-specific workload — typically a stateful system like a database, message broker, or ML training job — whose operational requirements aren't covered by the built-in Kubernetes resources (Deployments, StatefulSets, DaemonSets).

An operator consists of two pieces:

  1. Custom Resource Definition (CRD) — a new Kubernetes resource type defined by the operator author. Users create instances of this resource, declaratively describing the desired state of their domain object ("give me a 3-shard Vitess cluster on 3 AZs").
  2. Custom controller / reconciler — code (usually Go) that watches CRD instances and reconciles real Kubernetes objects (pods, PVCs, Services, etc.) against the desired state. Runs the same observe-diff-act loop as built-in controllers but with domain-specific reconcile logic.
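As a concrete (if simplified) picture of piece 1, here is a sketch of the Go types that might back such a CRD, following the common kubebuilder-style Spec/Status convention. The type and field names are illustrative assumptions, not the real Vitess Operator API:

```go
package main

import "fmt"

// VitessClusterSpec is the user's desired state, taken from the custom
// resource manifest. Field names here are hypothetical, not Vitess's.
type VitessClusterSpec struct {
	Shards            int      `json:"shards"`
	ReplicasPerShard  int      `json:"replicasPerShard"`
	AvailabilityZones []string `json:"availabilityZones"`
}

// VitessClusterStatus is what the controller last observed; the
// reconciler writes it back so users can see progress.
type VitessClusterStatus struct {
	ReadyShards int    `json:"readyShards"`
	Phase       string `json:"phase"`
}

// VitessCluster pairs the two, mirroring how a CRD instance is structured.
type VitessCluster struct {
	Name   string
	Spec   VitessClusterSpec
	Status VitessClusterStatus
}

// summary renders the desired state in one line.
func summary(vc VitessCluster) string {
	return fmt.Sprintf("%s: %d shards across %d AZs",
		vc.Name, vc.Spec.Shards, len(vc.Spec.AvailabilityZones))
}

func main() {
	// "Give me a 3-shard Vitess cluster on 3 AZs" as a typed object.
	vc := VitessCluster{
		Name: "main",
		Spec: VitessClusterSpec{
			Shards:            3,
			ReplicasPerShard:  2,
			AvailabilityZones: []string{"us-east-1a", "us-east-1b", "us-east-1c"},
		},
	}
	fmt.Println(summary(vc))
}
```

In a real operator these types would carry kubebuilder markers and be registered with the API server as a CRD; the point here is only that the user's declarative request becomes a typed object the reconciler can read.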

The canonical phrasing from Brian Morrison II:

"Kubernetes offers a great deal of automation, but some workloads require a bit more logic than out-of-the-box Kubernetes is prepared to handle. Operators allow developers to extend Kubernetes by adding custom resources that add to the Control Loop." (Source: sources/2026-04-21-planetscale-scaling-hundreds-of-thousands-of-database-clusters-on-kubernetes)

Why the pattern exists

The built-in Kubernetes resources are generic. They model "keep N pods alive" (Deployment) or "give me N pods with stable names + storage in order" (StatefulSet). But a Vitess cluster — or a MongoDB replica set, or a Kafka cluster with topic/partition state, or a distributed training job with gang scheduling — has domain-specific operational concerns that don't fit these shapes:

  • Failover semantics — which replica becomes primary, how is fencing handled, when is it safe to promote?
  • Topology invariants — "always 1 primary + 2 replicas per shard, spread across 3 AZs".
  • Workflow orchestration — schema changes, backups from designated tablets, online resharding.
  • Lifecycle events — version upgrades, config changes, rolling restarts with pre/post hooks.
  • Backup/restore — application-level knowledge of what a backup is and how to validate it.
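To make the topology-invariant bullet concrete, a reconciler might verify "exactly 1 primary + 2 replicas per shard, spread across 3 AZs" before acting. This is a toy sketch under assumed names (Tablet, shardInvariantOK), not code from any real operator:

```go
package main

import "fmt"

// Tablet is a simplified stand-in for one database pod in a shard;
// the type and field names are illustrative, not the Vitess Operator's.
type Tablet struct {
	Role string // "primary" or "replica"
	AZ   string // availability zone the pod is scheduled in
}

// shardInvariantOK checks the invariant "exactly 1 primary + 2 replicas,
// spread across 3 distinct AZs" for a single shard.
func shardInvariantOK(tablets []Tablet) bool {
	primaries, replicas := 0, 0
	azs := map[string]bool{}
	for _, t := range tablets {
		switch t.Role {
		case "primary":
			primaries++
		case "replica":
			replicas++
		}
		azs[t.AZ] = true
	}
	return primaries == 1 && replicas == 2 && len(azs) == 3
}

func main() {
	healthy := []Tablet{{"primary", "az-1"}, {"replica", "az-2"}, {"replica", "az-3"}}
	skewed := []Tablet{{"primary", "az-1"}, {"replica", "az-1"}, {"replica", "az-2"}}
	fmt.Println(shardInvariantOK(healthy)) // true
	fmt.Println(shardInvariantOK(skewed))  // false: only 2 AZs covered
}
```

A violated invariant is exactly the kind of "diff" a generic Deployment controller cannot detect, because it has no notion of primaries, replicas, or AZ spread.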

None of these are naturally expressed as "keep N pods alive". The operator pattern lets the vendor codify their domain model ("a database", "a training job", "a Vitess cluster") as a Kubernetes-native resource and reuse all of Kubernetes' infrastructure (API server, scheduling, RBAC, events) underneath.

Operator reconcile loop

Same three phases as the standard Kubernetes Control Loop, applied to the CRD:

  1. Observe — watch CRD instances via the Kubernetes API (watch-based, not polling).
  2. Diff — compute the set of real Kubernetes objects (Pods, PVCs, Services, ConfigMaps) that should exist to satisfy the CRD's desired state; compare to what's actually there.
  3. Act — create, update, or delete the real objects to converge. Surface events on the CRD to tell the user how the reconcile is going.

The loop runs continuously — if a pod crashes or an AZ goes offline, the next reconcile observes the gap and acts to close it.
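The three phases can be sketched as a pure diff function over named object sets. This is a toy simulation of the diff phase — object names stand in for real Kubernetes objects (Pods, PVCs, Services), and nothing here is the real controller-runtime API:

```go
package main

import (
	"fmt"
	"sort"
)

// reconcile compares the desired object set (derived from the CRD) with
// the actual set (observed via the API server) and returns the actions
// needed to converge: create what's missing, update what drifted, delete
// what's no longer wanted. Values stand in for object specs/versions.
func reconcile(desired, actual map[string]string) []string {
	var actions []string
	for name, spec := range desired {
		cur, exists := actual[name]
		switch {
		case !exists:
			actions = append(actions, "create "+name)
		case cur != spec:
			actions = append(actions, "update "+name)
		}
	}
	for name := range actual {
		if _, wanted := desired[name]; !wanted {
			actions = append(actions, "delete "+name)
		}
	}
	sort.Strings(actions) // map iteration is unordered; sort for stable output
	return actions
}

func main() {
	// Observe: desired state from the CRD, actual state from the cluster.
	desired := map[string]string{"pod-a": "v2", "pod-b": "v2", "svc": "v1"}
	actual := map[string]string{"pod-a": "v1", "svc": "v1", "pod-old": "v1"}
	// Diff + Act: print the converging operations.
	for _, a := range reconcile(desired, actual) {
		fmt.Println(a) // create pod-b, delete pod-old, update pod-a
	}
}
```

A real reconciler issues these actions through the Kubernetes API and is re-triggered by watch events, so a crashed pod simply shows up as a missing entry on the next pass.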

When to build an operator

The pattern pays off when:

  • The workload has domain-specific operational logic that can't be expressed via built-in K8s primitives.
  • You run many instances of this workload (one per tenant, per database, per training job) — the operator amortises the engineering cost across thousands of clusters.
  • You need declarative lifecycle management — users describe what they want, not how to get there.
  • Your domain has failure modes that require application knowledge to recover from safely (e.g. fencing an old primary before promoting a new one; refusing to delete a pod whose PVC still has unreplicated writes).

It does not pay off when the workload fits a Deployment or StatefulSet cleanly — at that point you're adding unnecessary custom code.

Contrast with StatefulSet-only approach

A team could run a stateful system (like MySQL) on Kubernetes using only StatefulSets — stable pod identity, ordered startup, attached PVCs. That handles some of the problems (storage, pod-identity stability) but leaves the domain-specific operational concerns — replication handling, source-of-truth determination, backups, AZ-failure reconciliation — to external tooling or human operators.

The operator pattern subsumes the StatefulSet work into domain-aware reconcile logic. PlanetScale's explicit choice to use plain pods + PVC under the Vitess Operator rather than StatefulSets (see patterns/custom-operator-over-statefulset) is the clearest example: with the operator doing the work that StatefulSets would do for a generic workload, the StatefulSet abstraction becomes redundant.
