CONCEPT Cited by 1 source
Cluster health check¶
Definition¶
A cluster health check is a liveness / readiness probe that a router (load balancer, gateway, proxy) issues against a backend cluster to decide whether to send it traffic. The router maintains a per-cluster health state that is distinct from operator intent (active / inactive toggle) — it reflects whether the cluster can actually serve requests right now.
The three-state model¶
The Trino Gateway cluster-health UI (sources/2026-03-24-expedia-operating-trino-at-scale-with-trino-gateway) frames this as a three-state model that the RoutingManager consumes:
- HEALTHY: health checks report the cluster is ready. RoutingManager routes requests to it.
- UNHEALTHY: health checks report the cluster is not ready. RoutingManager does not route requests to it.
- PENDING: the cluster is still starting up. Treated as unhealthy (no routing) until it crosses to HEALTHY.
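The routing consequence of the three states can be sketched in a few lines. This is a minimal illustration, not Trino Gateway's actual code: the `Cluster` type, the `eligible_for_routing` function, and the cluster names are hypothetical; only the three state names come from the source.

```python
from dataclasses import dataclass
from enum import Enum


class HealthState(Enum):
    HEALTHY = "HEALTHY"
    UNHEALTHY = "UNHEALTHY"
    PENDING = "PENDING"


@dataclass(frozen=True)
class Cluster:
    name: str           # hypothetical cluster identifier
    state: HealthState  # what the router currently observes


def eligible_for_routing(clusters: list[Cluster]) -> list[Cluster]:
    # Only HEALTHY clusters receive traffic; PENDING is treated like UNHEALTHY.
    return [c for c in clusters if c.state is HealthState.HEALTHY]


clusters = [
    Cluster("adhoc-1", HealthState.HEALTHY),
    Cluster("adhoc-2", HealthState.PENDING),
    Cluster("etl-1", HealthState.UNHEALTHY),
]
print([c.name for c in eligible_for_routing(clusters)])  # ['adhoc-1']
```

For routing purposes PENDING and UNHEALTHY behave identically here; the distinction only matters to whatever consumes the state for alerting or display.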
Why PENDING is load-bearing¶
A two-state model (HEALTHY / UNHEALTHY) collapses two operationally distinct cases into one:
- A cluster starting up (brief, expected, usually benign).
- A cluster that broke (alert-worthy, may require intervention).
Both yield "do not route", but they call for different alerts, resolve on different timescales, and need different operator responses. A three-state model surfaces the difference to both the router (so it can apply different backoff / alerting policies) and the operator (so the UI doesn't flag every deployment as a cluster-down incident).
This is the same structural concern that leads Kubernetes to keep startupProbe, readinessProbe, and livenessProbe distinct: "am I still starting", "can I take traffic", and "am I broken" are different questions with different remediation paths.
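One way a consumer of the health state might apply different alerting policies per state is sketched below. The policy itself is an assumption made for illustration: the source does not specify alert rules, and `should_alert` and `STARTUP_GRACE_SECONDS` are hypothetical names and values.

```python
from enum import Enum


class HealthState(Enum):
    HEALTHY = "HEALTHY"
    UNHEALTHY = "UNHEALTHY"
    PENDING = "PENDING"


# Assumed value: how long a cluster may stay PENDING before it looks stuck.
STARTUP_GRACE_SECONDS = 600


def should_alert(state: HealthState, seconds_in_state: float) -> bool:
    if state is HealthState.UNHEALTHY:
        # A cluster that broke is alert-worthy right away.
        return True
    if state is HealthState.PENDING:
        # Startup is expected; alert only if it outlasts the grace period.
        return seconds_in_state > STARTUP_GRACE_SECONDS
    return False


print(should_alert(HealthState.PENDING, 30))    # False
print(should_alert(HealthState.UNHEALTHY, 30))  # True
```

With only two states, the second branch has nothing to key on, so every deployment would either page someone or every outage would wait out the grace period.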
Intent vs. status¶
A long-standing anti-pattern is surfacing only the operator's intent (active / inactive toggle) in the UI, not the cluster's status. From the post:
"Previously, the Gateway only provided a simple active/inactive switch for each cluster, offering limited insight into the actual health status. Previously, this only allowed you to activate or deactivate cluster health checks, but did not indicate whether the cluster was ready to accept queries."
The fix is displaying both — intent (what the operator asked for) and status (what the router observes) — so drift between the two becomes visible and actionable. Expedia's PR #601 added the status surface alongside the pre-existing intent toggle.
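Keeping intent and status as separate fields makes drift a simple comparison. The sketch below is illustrative only: the `ClusterView` record, its field names, and `has_drift` are hypothetical and do not reflect the Trino Gateway data model.

```python
from dataclasses import dataclass
from enum import Enum


class HealthState(Enum):
    HEALTHY = "HEALTHY"
    UNHEALTHY = "UNHEALTHY"
    PENDING = "PENDING"


@dataclass(frozen=True)
class ClusterView:
    name: str            # hypothetical cluster identifier
    active: bool         # intent: what the operator asked for
    status: HealthState  # status: what the router observes


def has_drift(view: ClusterView) -> bool:
    # Drift: the operator wants the cluster in rotation, but it is not ready.
    return view.active and view.status is not HealthState.HEALTHY


views = [
    ClusterView("adhoc-1", active=True, status=HealthState.HEALTHY),
    ClusterView("adhoc-2", active=True, status=HealthState.UNHEALTHY),
    ClusterView("etl-1", active=False, status=HealthState.UNHEALTHY),
]
print([v.name for v in views if has_drift(v)])  # ['adhoc-2']
```

An intent-only UI would show `adhoc-1` and `adhoc-2` identically as "active", which is exactly the gap the status surface closes.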
Where else this pattern shows up¶
- Envoy health checking (active checks + outlier detection).
- PID-feedback LB (Dropbox Robinhood) — uses health signals as a guard (>15% missing reports freezes the weight-update phase).
- Kubernetes readiness/liveness/startup probes — the same trichotomy at the pod level (startup → live-not-ready → ready).
Seen in¶
- sources/2026-03-24-expedia-operating-trino-at-scale-with-trino-gateway: the HEALTHY / UNHEALTHY / PENDING trichotomy as consumed by the RoutingManager; surfaced in the cluster UI via PR #601.
Related¶
- concepts/workload-aware-routing — health-check state is an input to routing decisions; unhealthy clusters are excluded from the eligible set before rules pick among the remainder.
- patterns/query-gateway — health-check integration is a core gateway responsibility.
- systems/trino-gateway — canonical three-state instance.