CONCEPT Cited by 2 sources
Cluster health check¶
Definition¶
A cluster health check is a liveness / readiness probe that a router (load balancer, gateway, proxy) issues against a backend cluster to decide whether to send it traffic. The router maintains a per-cluster health state that is distinct from operator intent (active / inactive toggle) — it reflects whether the cluster can actually serve requests right now.
The three-state model¶
The Trino Gateway cluster-health UI (sources/2026-03-24-expedia-operating-trino-at-scale-with-trino-gateway) frames this as a three-state model that the RoutingManager consumes:
- HEALTHY: health checks report the cluster is ready. RoutingManager routes requests to it.
- UNHEALTHY: health checks report the cluster is not ready. RoutingManager does not route requests to it.
- PENDING: the cluster is still starting up. Treated as unhealthy (no routing) until it crosses to HEALTHY.
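The routing rule implied by the three states can be sketched in a few lines. This is a minimal illustration, not Trino Gateway's actual code (the Gateway is a Java project); the names `ClusterHealth`, `is_routable`, and `eligible_clusters` are hypothetical.

```python
from enum import Enum


class ClusterHealth(Enum):
    """The three states described above."""
    HEALTHY = "HEALTHY"
    UNHEALTHY = "UNHEALTHY"
    PENDING = "PENDING"


def is_routable(state: ClusterHealth) -> bool:
    # Only HEALTHY clusters receive traffic; PENDING is treated like UNHEALTHY.
    return state is ClusterHealth.HEALTHY


def eligible_clusters(states: dict[str, ClusterHealth]) -> list[str]:
    # Hypothetical helper: reduce the cluster map to the names a router may pick from.
    return sorted(name for name, state in states.items() if is_routable(state))


if __name__ == "__main__":
    observed = {
        "etl": ClusterHealth.HEALTHY,
        "adhoc": ClusterHealth.PENDING,
        "bi": ClusterHealth.UNHEALTHY,
    }
    print(eligible_clusters(observed))
```

Keeping PENDING as its own enum member, rather than folding it into UNHEALTHY, costs nothing at routing time and preserves the distinction for alerting and the UI.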
Why PENDING is load-bearing¶
A two-state model (HEALTHY / UNHEALTHY) collapses two operationally distinct cases into one:
- A cluster starting up (brief, expected, usually benign).
- A cluster that broke (alert-worthy, may require intervention).
Both yield "do not route", but they call for different alerting, resolve on different timescales, and need different operator responses. A three-state model surfaces the difference to both the router (so it can apply different backoff / alerting policies) and the operator (so the UI doesn't flag every deployment as a cluster-down incident).
This is the same structural concern that leads Kubernetes to separate startupProbe, readinessProbe, and livenessProbe: "am I still starting", "can I take traffic", and "am I broken" are different questions with different remediation paths.
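One way the two "do not route" cases could diverge in alerting is sketched below. The grace period, the `Observation` type, and `should_alert` are all assumptions made for illustration; the source does not describe a specific alerting policy.

```python
from dataclasses import dataclass

# Assumed value: how long a cluster may stay PENDING before anyone is paged.
STARTUP_GRACE_SECONDS = 600.0


@dataclass(frozen=True)
class Observation:
    state: str               # "HEALTHY", "UNHEALTHY", or "PENDING"
    seconds_in_state: float  # time since the cluster entered this state


def should_alert(obs: Observation) -> bool:
    if obs.state == "UNHEALTHY":
        # A cluster that broke is alert-worthy immediately.
        return True
    if obs.state == "PENDING":
        # Startup is expected and usually benign; alert only if it drags on.
        return obs.seconds_in_state > STARTUP_GRACE_SECONDS
    return False


if __name__ == "__main__":
    print(should_alert(Observation("PENDING", 30.0)))
    print(should_alert(Observation("UNHEALTHY", 30.0)))
```

With a two-state model both observations in the usage example would look identical, so the policy would have to either page on every deployment or delay paging on real failures.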
Intent vs. status¶
A long-standing anti-pattern is surfacing only the operator's intent (active / inactive toggle) in the UI, not the cluster's status. From the post:
"Previously, the Gateway only provided a simple active/inactive switch for each cluster, offering limited insight into the actual health status. Previously, this only allowed you to activate or deactivate cluster health checks, but did not indicate whether the cluster was ready to accept queries."
The fix is displaying both — intent (what the operator asked for) and status (what the router observes) — so drift between the two becomes visible and actionable. Expedia's PR #601 added the status surface alongside the pre-existing intent toggle.
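A minimal sketch of showing intent and status side by side follows. `ClusterView`, its `drift` property, and `render_row` are hypothetical names, not taken from PR #601, and whether an active-but-PENDING cluster should count as drift is a policy choice this sketch makes arbitrarily.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterView:
    name: str
    active: bool  # operator intent: the active / inactive toggle
    status: str   # observed status: "HEALTHY", "UNHEALTHY", or "PENDING"

    @property
    def drift(self) -> bool:
        # Assumption: drift means the operator wants traffic on this cluster
        # but the router does not currently observe it as HEALTHY.
        return self.active and self.status != "HEALTHY"


def render_row(view: ClusterView) -> str:
    intent = "active" if view.active else "inactive"
    flag = " [drift]" if view.drift else ""
    return f"{view.name}: intent={intent} status={view.status}{flag}"


if __name__ == "__main__":
    for row in (
        ClusterView("bi", True, "HEALTHY"),
        ClusterView("etl", True, "UNHEALTHY"),
        ClusterView("adhoc", False, "UNHEALTHY"),
    ):
        print(render_row(row))
```

An intent-only UI would render the first two rows identically, which is exactly the blind spot the quoted passage describes.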
Where else this pattern shows up¶
- Envoy health checking (active checks + outlier detection).
- PID-feedback LB (Dropbox Robinhood) — uses health signals as a guard (>15% missing reports freezes the weight-update phase).
- Kubernetes readiness/liveness/startup probes — the same trichotomy at the pod level (startup → live-not-ready → ready).
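The missing-report guard mentioned for the PID-feedback load balancer can be sketched as follows. Only the 15% threshold comes from the text above; the function name, its parameters, and the weight representation are assumptions, and this is not Dropbox's implementation.

```python
MAX_MISSING_PERCENT = 15  # threshold from the text: more than 15% missing freezes updates


def next_weights(
    current: dict[str, float],
    proposed: dict[str, float],
    expected_reports: int,
    received_reports: int,
) -> dict[str, float]:
    """Return the weights to apply, freezing on poor health-signal coverage."""
    if expected_reports <= 0:
        # No basis for judging coverage; keep the current weights.
        return dict(current)
    missing = expected_reports - received_reports
    # Integer comparison avoids floating-point edge cases at exactly 15%.
    if missing * 100 > MAX_MISSING_PERCENT * expected_reports:
        return dict(current)  # too many reports missing: freeze
    return dict(proposed)


if __name__ == "__main__":
    print(next_weights({"a": 1.0}, {"a": 0.5}, expected_reports=100, received_reports=80))
```

The point of the guard is that a controller acting on incomplete health data can do more damage than one that temporarily stops adjusting.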
Seen in¶
- sources/2026-03-24-expedia-operating-trino-at-scale-with-trino-gateway: the HEALTHY / UNHEALTHY / PENDING trichotomy as consumed by the RoutingManager; surfaced in the cluster UI via PR #601.
- sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale: cluster-health-check appears in two roles for Meta's Presto fleet: (1) as the readiness gate at cluster turn-up, where automated cluster standup runs "a few test queries" against a new cluster and only "registers it with the Gateway" once those succeed; (2) as the coarser sibling of bad-host detection, where the probe says "cluster up" while individual hosts inside that cluster are silently failing queries ("obscure JVM bugs… a steady drip of query failures"). Bad-host detection sits above the health-check layer and catches what the probe misses.
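The turn-up readiness gate described for Meta's fleet (run a few test queries, register with the Gateway only if they succeed) can be sketched as below. The function and its callback parameters are hypothetical stand-ins; the source does not describe Meta's actual tooling or APIs.

```python
from typing import Callable, Iterable


def promote_if_ready(
    test_queries: Iterable[str],
    run_query: Callable[[str], bool],
    register: Callable[[], None],
) -> bool:
    """Register the cluster only if every test query succeeds.

    run_query and register are injected so the gate stays independent of any
    particular client library or gateway API (both are assumptions here).
    """
    for sql in test_queries:
        try:
            ok = run_query(sql)
        except Exception:
            # A query that raises counts as a failed readiness check.
            ok = False
        if not ok:
            return False  # leave the cluster unregistered
    register()
    return True


if __name__ == "__main__":
    registered = []
    promoted = promote_if_ready(
        ["SELECT 1"],
        run_query=lambda sql: True,
        register=lambda: registered.append("new-cluster"),
    )
    print(promoted, registered)
```

A gate like this checks the cluster as a whole; it says nothing about individual hosts that start failing later, which is the gap bad-host detection covers.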
Related¶
- concepts/workload-aware-routing — health-check state is an input to routing decisions; unhealthy clusters are excluded from the eligible set before rules pick among the remainder.
- concepts/bad-host-detection — complements probe-based health checks by catching functioning-but-wrong hosts.
- patterns/query-gateway — health-check integration is a core gateway responsibility.
- patterns/automated-cluster-standup-decommission — uses test-queries-as-health-check as the promotion gate before Gateway registration.
- systems/trino-gateway — canonical three-state instance.
- systems/meta-presto-gateway — Meta's internal Gateway uses cluster-health-check at standup time and bad-host-detection at runtime.