PATTERN

No-downtime cluster upgrade

Definition

No-downtime cluster upgrade is the deployment discipline of upgrading one or more clusters in a fleet without breaking the client-facing endpoint — clients see no connection failures, session drops, or connection-string changes during or after the upgrade. It depends structurally on single-endpoint abstraction: if clients connect directly to a specific cluster, any upgrade that replaces that cluster's identity is client-visible.

Two common shapes behind a gateway

Both shapes below assume a gateway (or equivalent routing tier) sits in front of the clusters and owns the client-facing endpoint.

Blue/green

  • Spin up a green (new-version) cluster alongside the existing blue (old-version) cluster.
  • Flip the gateway's routing rules / routing-group membership from blue to green, atomically or in a staged shift.
  • If problems surface, flip back — blue is still warm.
  • When green is proven, drain and tear down blue.
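As a minimal sketch, the flip step above can be modeled as a single mutation of the gateway's group-to-backend mapping. The `RoutingTable` class and the cluster/group names here are illustrative, not a real gateway API:

```python
# Hypothetical model of a blue/green flip as a gateway-side config change.
# "adhoc", "trino-blue", "trino-green" are illustrative names.

class RoutingTable:
    """Maps a routing group to the backend clusters eligible to serve it."""

    def __init__(self):
        self.groups: dict[str, list[str]] = {}

    def assign(self, group: str, backends: list[str]) -> None:
        # Atomic from the client's perspective: the next lookup sees the new list.
        self.groups[group] = list(backends)

    def backends_for(self, group: str) -> list[str]:
        return self.groups[group]


table = RoutingTable()
table.assign("adhoc", ["trino-blue"])    # steady state: blue serves all traffic

table.assign("adhoc", ["trino-green"])   # the flip: green takes all new queries
assert table.backends_for("adhoc") == ["trino-green"]

table.assign("adhoc", ["trino-blue"])    # rollback is the same cheap operation
```

The point of the model: both flip and rollback are one config write on the gateway; clients never see a different endpoint.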

Canary

  • Spin up a small canary cluster at the new version.
  • Gateway routing rules send a small fraction of traffic to the canary (by user, by query shape, by time window).
  • Ramp up fraction as confidence grows.
  • When 100% of traffic is on the canary, retire the old cluster.
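A common way to implement the "small fraction by user" split is deterministic hashing rather than random choice, so each user stays pinned to one cluster throughout the ramp. This is a sketch under that assumption; the cluster names are illustrative:

```python
import hashlib

def pick_cluster(user: str, canary_fraction: float,
                 old: str = "trino-old", canary: str = "trino-canary") -> str:
    """Deterministically route a stable fraction of users to the canary.

    Hashing the user keeps each user pinned to one cluster, and a ramp
    from 1% -> 10% -> 100% only ever moves users toward the canary
    (a user's bucket never changes, only the threshold does).
    """
    bucket = int(hashlib.sha256(user.encode()).hexdigest(), 16) % 10_000
    return canary if bucket < canary_fraction * 10_000 else old

# At fraction 1.0 every user lands on the canary; at 0.0 nobody does.
assert all(pick_cluster(f"user{i}", 1.0) == "trino-canary" for i in range(100))
assert all(pick_cluster(f"user{i}", 0.0) == "trino-old" for i in range(100))
```

The same bucketing works for other split keys the section mentions (query shape, time window): hash the key, compare against the ramp threshold.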

Why a gateway makes this tractable

Without a gateway in front, doing either pattern requires client coordination — every client has to update its connection URL to point at the new cluster. At scale (thousands of scheduled jobs, BI dashboards, ad-hoc users, scripts in notebooks) this is an intractable migration; the cluster upgrade becomes a company-wide project rather than an SRE operation.

With a gateway:

  • Clients keep their existing connection URL (the gateway URL).
  • The gateway owns which backend cluster runs each query.
  • Adding / draining / substituting backend clusters is a gateway config change, not a client change.
  • The gateway's routing-rule engine (patterns/routing-rules-as-config) expresses the traffic shape during the transition.
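To make the last bullet concrete, routing rules as config can be sketched as an ordered list of predicate-to-group mappings that the gateway evaluates per request, first match wins. The header name, rule names, and groups below are illustrative assumptions, not the actual rule syntax of any particular gateway:

```python
# Hedged sketch: rules map a request predicate to a routing group.
# During a blue/green or canary transition only the group-to-cluster
# mapping changes; the rules themselves stay stable.

RULES = [
    {"name": "etl-jobs",
     "condition": lambda headers: headers.get("X-Trino-Source") == "airflow",
     "group": "etl"},
    {"name": "bi-dashboards",
     "condition": lambda headers: headers.get("X-Trino-Source") == "looker",
     "group": "bi"},
]

def routing_group(request_headers: dict[str, str], default: str = "adhoc") -> str:
    """Evaluate rules in order; the first matching rule decides the group."""
    for rule in RULES:
        if rule["condition"](request_headers):
            return rule["group"]
    return default

assert routing_group({"X-Trino-Source": "airflow"}) == "etl"
assert routing_group({}) == "adhoc"
```

Separating "which group does this query belong to" (rules) from "which cluster serves this group" (membership) is what lets an upgrade be expressed as a membership change only.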

This is why the post (sources/2026-03-24-expedia-operating-trino-at-scale-with-trino-gateway) lists "no-downtime upgrades for Trino clusters behind the gateway in a blue/green model or canary deployment model" as one of the four headline gateway advantages alongside single-URL, automatic routing, and transparent capacity changes.

Required infrastructure

  1. Gateway with substitutable backends. Clusters must be interchangeable behind the routing layer: identical data access, identical catalog connectivity, identical auth.
  2. Health-check integration. Three-state health (HEALTHY / UNHEALTHY / PENDING) — so the gateway doesn't route to a new cluster before it's ready, and automatically stops routing to a cluster that goes bad mid-upgrade.
  3. Query-level idempotency (or at least drain-safe disconnect) — a cluster being drained should not have active queries force-killed. A Trino-level drain (stop accepting new queries, let running ones finish) is the standard mechanism.
  4. Observable routing decisions. Operators need to see how traffic is actually being routed during the transition — per-query source, per-cluster load, error rates.
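The three-state health model in point 2 can be sketched as a simple routing gate; the enum and fleet names are illustrative:

```python
from enum import Enum

class Health(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    PENDING = "pending"   # e.g. freshly provisioned, not yet validated

def routable(clusters: dict[str, "Health"]) -> list[str]:
    """Only HEALTHY clusters receive new queries.

    PENDING keeps a freshly provisioned green cluster out of rotation
    until it passes its first health check; UNHEALTHY automatically
    pulls a cluster that goes bad mid-upgrade.
    """
    return [name for name, h in clusters.items() if h is Health.HEALTHY]

fleet = {"trino-blue": Health.HEALTHY, "trino-green": Health.PENDING}
assert routable(fleet) == ["trino-blue"]      # green not routed yet

fleet["trino-green"] = Health.HEALTHY         # green passes its health check
assert set(routable(fleet)) == {"trino-blue", "trino-green"}
```

The asymmetry matters: a new cluster must earn its way in (PENDING is not routable), while an existing cluster is ejected the moment it fails.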
Related patterns

  • patterns/blue-green-service-mesh-migration — AWS App Mesh → ECS Service Connect discontinuation uses the same blue/green discipline but at the service-mesh layer.
  • patterns/weighted-dns-traffic-shifting — Figma ECS → EKS cutover uses DNS-level blue/green between fleets; a weaker form than gateway-based shifting (TTL-governed lag).
  • Kubernetes rolling update — handles in-cluster upgrades but not whole-cluster substitution.
