CONCEPT

Goal-oriented orchestrator

Definition

A goal-oriented orchestrator is a topology manager that converges a cluster to the state declared by an external source of truth (here: the Vitess topology server) rather than merely repairing whatever breakage is visible through its own observations. The operator does not need to see a failure to act — if the observed topology diverges from the declared intent, the difference itself is treated as something to fix.
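Divergence-as-failure can be sketched in a few lines. This is an illustrative toy, not the VTOrc API — the function and role names are assumptions; the point is that a mismatch between declared and observed state is itself a detected problem, with no visible breakage required:

```python
def detect_problems(declared: dict, observed: dict) -> list[str]:
    """Return one problem per server whose observed state differs from declared intent."""
    problems = []
    for server, intent in declared.items():
        actual = observed.get(server)
        if actual != intent:
            # The difference itself is the failure, even if the server "works".
            problems.append(f"{server}: declared={intent} observed={actual}")
    return problems

declared = {"db1": "PRIMARY", "db2": "REPLICA"}
observed = {"db1": "PRIMARY", "db2": "STANDALONE"}  # db2 drifted out of its role
print(detect_problems(declared, observed))
```

An observational orchestrator has only `observed` to work with; the goal-oriented version gets `declared` from the topology server and reconciles the two.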

Shlomi Noach's framing:

"This cluster awareness is a fundamental change in orchestrator's approach, and allows us to make orchestrator goal-driven. orchestrator's goal is to ensure a cluster is always in a state compatible with what Vitess expects it to be. This is accomplished by introducing new failure detection modes not possible before, and new recovery methods too opinionated otherwise." (Source: sources/2026-04-21-planetscale-orchestrator-failure-detection-and-recovery-new-beginnings)

What the "observational" orchestrator couldn't do

Pre-integration, orchestrator observed a replication topology and formed opinions about its health, but it had no notion of what the cluster should look like:

"It doesn't know if some standalone server should belong to this or that cluster; if the current primary server is indeed what's advertised to your application; if you really intended to set up a multi-primary cluster. It is generic in that it allows a variety of topology layouts, as requested and used by the greater community."

A standalone server in the data centre was just a server. A writable replica was a functional replica with an unusual flag. A multi-primary topology was a valid if unusual setup. Orchestrator had no basis on which to call these failure states.

What cluster awareness unlocks

The integrated orchestrator reads MySQL metadata directly from the Vitess topology server via vttablet — which binds every MySQL server's schema, shard, and role to topology. Now the operator knows:

  • Two servers belong to the same cluster because Vitess declares they do — not because they happen to be in a replication chain.
  • Server X is supposed to be in the PRIMARY role.
  • Server Y is supposed to be a REPLICA with read_only=1.
  • Multi-primary is not a valid layout.

This enables new recovery modes, all flowing from declared intent ≠ observed state:

| Observed | Declared (per Vitess) | Recovery |
| --- | --- | --- |
| Standalone server | REPLICA | Connect to correct cluster (after GTID validation) |
| Writable replica | Read-only REPLICA | Flip to read-only |
| Read-only primary | Writable PRIMARY | Flip to writable |
| Multi-primary topology | Single PRIMARY | Demote all but the declared primary to replicas |
| Functional cluster, wrong primary | Different PRIMARY | Graceful takeover / planned reparent |

Noach flags the last case as "possibly the most intriguing": the cluster works, writes are being accepted, reads are being served. But Vitess's declared intent is that server Y should be primary, not server X. The goal-oriented orchestrator detects this and performs a graceful reparent, even though nothing is visibly broken from a replication-topology standpoint.

Fail or converge — no partial states

The load-bearing behavioural invariant:

"It is furthermore interesting to note that orchestrator's operations will either fail or converge to the desired state." (Source: sources/2026-04-21-planetscale-orchestrator-failure-detection-and-recovery-new-beginnings)

A half-flipped topology is not an accepted intermediate state. If the operation encounters a step it cannot complete, it aborts cleanly rather than leaving the cluster split. This failure-mode invariant is what makes automated reparenting safe to run unattended — compare the pre-integration hook-script era, where a dropped event could leave a co-primary or split state that required human intervention.
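The invariant can be stated as a contract on the operation's outcome. A minimal sketch, assuming a recovery modeled as an ordered list of steps (the step names are hypothetical): the caller observes exactly two outcomes, an exception or convergence — there is no return value meaning "partially done":

```python
class RecoveryFailed(Exception):
    """Raised when a recovery aborts; the operation reports failure, not partial success."""

def run_recovery(steps) -> str:
    """Either every step succeeds (converged) or an exception propagates (failed)."""
    for i, step in enumerate(steps):
        if not step():
            # Abort at the first step that cannot complete; never report
            # a half-applied recovery as success.
            raise RecoveryFailed(f"aborted at step {i}: {step.__name__}")
    return "converged"

# Toy steps standing in for real recovery actions.
def validate_gtid(): return True
def set_read_only(): return True

print(run_recovery([validate_gtid, set_read_only]))
```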

The pre-integration split-state failure mode

The integration eliminates a specific class of bug:

"For the past few years, orchestrator was an external entity to Vitess. The two would collaborate over a few API calls. orchestrator did not have any Vitess awareness, and much of the integration was done through pre- and post- recovery hooks, shell scripts and API calls. This led to known situations where Vitess and orchestrator would compete over a failover, or make some operations unknown to each other, causing confusion. Clusters would end up in split state, or in co-primary state. The loss of a single event could cause cluster corruption." (Source: sources/2026-04-21-planetscale-orchestrator-failure-detection-and-recovery-new-beginnings)

This is a canonical case study in why hook-script integration at critical-path failure boundaries is structurally fragile: pre/post hooks make the orchestrator-Vitess contract event-driven, and a single dropped event splits shared state. Goal-oriented operation replaces event-driven coordination with declarative-intent coordination — both operators read the same authoritative state (the Vitess topology), so they cannot diverge on what it says.

Contrast with Kubernetes-style reconcilers

The pattern is not new in distributed systems — it's the Kubernetes operator pattern applied to MySQL topology. What's novel here is the specific pairing:

  • Control-plane declarative intent lives in Vitess topology server (backed by etcd / ZooKeeper / Consul).
  • Data-plane observer (orchestrator) reads that intent, observes actual MySQL topology, and reconciles.
  • Data-plane agent (vttablet) on each MySQL server exposes the per-node identity that lets orchestrator attribute observations back to declared intent.
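The three components above compose into a level-triggered reconcile loop of the Kubernetes-operator shape. A sketch with hypothetical stand-in callables — `read_desired` for the topology server, `observe_actual` for probing MySQL via vttablet, `apply_fix` for a recovery action:

```python
def reconcile_once(read_desired, observe_actual, apply_fix) -> int:
    """One pass of the loop: converge every drifted server toward declared intent."""
    desired = read_desired()           # control plane: Vitess topology server
    actual = observe_actual()          # data plane: observed MySQL topology
    fixed = 0
    for server, intent in desired.items():
        if actual.get(server) != intent:
            apply_fix(server, intent)  # recovery for this server
            fixed += 1
    return fixed

# Toy wiring: one server has drifted out of its declared role.
topo = {"db1": "PRIMARY", "db2": "REPLICA"}
world = {"db1": "PRIMARY", "db2": "STANDALONE"}
n = reconcile_once(lambda: topo, lambda: world,
                   lambda s, role: world.__setitem__(s, role))
print(n, world)
```

Because the loop is level-triggered — it compares whole states on every pass rather than reacting to individual events — a missed notification costs at most one reconcile interval, not a permanently split view.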

The fit with Kubernetes operator vocabulary is not coincidental — Vitess is a Kubernetes-native platform and VTOrc is structured as an operator controller. See systems/vitess-operator for the broader pattern at the platform level.
