CONCEPT Cited by 2 sources
Bad-host detection¶
Definition¶
Bad-host detection is the discipline of identifying individual hosts in a fleet that are functioning badly without being unresponsive — they pass cluster-level health checks, still accept work, but produce a disproportionate share of failures, corrupt results, or slow responses. These hosts are a source of grey failure: the cluster is "up" from a probe's view, but a subset of traffic is silently failing.
This is conceptually distinct from cluster health check, which tests whether a host responds. A bad host still responds — it just responds wrong.
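The distinction can be made concrete with a minimal sketch. Everything here is hypothetical (the `BadHost` class, `health_check`, `run_query`, and the 5% silent-failure rate are all illustrative, not from either source): a host that passes its liveness probe while silently failing a share of real work.

```python
import random

class BadHost:
    """Passes the liveness probe but silently fails a share of real work."""

    def __init__(self, silent_failure_rate=0.05, seed=0):
        self.rng = random.Random(seed)
        self.failure_rate = silent_failure_rate

    def health_check(self):
        # The probe only tests responsiveness -- it always passes.
        return "OK"

    def run_query(self, query):
        # Real work fails some fraction of the time, invisibly to the probe.
        if self.rng.random() < self.failure_rate:
            return None  # silently failed / corrupt result
        return f"result:{query}"

host = BadHost()
assert host.health_check() == "OK"            # cluster health check: host looks fine
results = [host.run_query(i) for i in range(1000)]
bad = sum(r is None for r in results)         # yet there is a steady drip of failures
```

The probe sees "OK" on every check; only by inspecting the work itself does the failure share become visible, which is exactly the gap bad-host detection fills.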
Named root causes (from the Meta source)¶
sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale names two classes of bad-host root cause for the Presto fleet:
- "Hardware-level issues which hadn't yet been caught by fleet-wide monitoring systems due to lack of coverage."
- "Obscure JVM bugs which would sometimes lead to a steady drip of query failures."
In both cases, the host works well enough to accept queries but badly enough to fail them.
Detection mechanism¶
The Meta approach has three steps:
- Per-query failure attribution. Each failing query is attributed to the host that caused it, "where possible."
- Anomaly-rate alerting. Alerts fire when the attributed failure count for a specific host exceeds a threshold (implicit: a statistical anomaly vs peer hosts).
- Auto-drain. Automation drains the host from the Presto fleet (see patterns/bad-host-auto-drain), preventing further failures.
The key design choice is the attribution step: without it, all the signals at the cluster level tell you the cluster is fine.
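The three steps above can be sketched as a pipeline. This is a hedged illustration, not Meta's implementation: the function names (`attribute_failures`, `anomalous_hosts`), the dict-shaped query records, and the 3-standard-deviations threshold are all assumptions standing in for unspecified internals.

```python
from collections import Counter

def attribute_failures(failed_queries):
    """Step 1: count failures per causing host, where attribution is possible."""
    counts = Counter()
    for q in failed_queries:
        host = q.get("causing_host")  # may be None: attribution is best-effort
        if host is not None:
            counts[host] += 1
    return counts

def anomalous_hosts(counts, stddevs=3.0):
    """Step 2: flag hosts whose failure count is a statistical outlier vs peers."""
    if len(counts) < 2:
        return []
    values = list(counts.values())
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    threshold = mean + stddevs * variance ** 0.5
    return [h for h, v in counts.items() if v > threshold]

# Step 3 (auto-drain) would call the fleet's drain API for each flagged host.
# Simulated window: one host with a steady drip of failures, 50 quiet peers,
# and one failure that could not be attributed.
failures = [{"causing_host": "host-bad"}] * 40
failures += [{"causing_host": f"host-{i}"} for i in range(50)]
failures += [{"causing_host": None}]
flagged = anomalous_hosts(attribute_failures(failures))  # -> ["host-bad"]
```

Note that the whole mechanism degrades gracefully with attribution coverage: unattributable failures are simply dropped from the counts, which is why the source hedges with "where possible."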
Where this sits in the health-check hierarchy¶
| Layer | Detects | Misses |
|---|---|---|
| Cluster health check (probe) | Dead / unreachable host | Functioning-but-wrong host |
| Bad-host detection (attribution + anomaly) | Host with anomalous failure share | First-few failures (requires volume) |
Bad-host detection complements, rather than replaces, probe-based health checks. Most hyperscale operators need both.
Extended cost model: synchronised workloads¶
sources/2024-06-16-meta-maintaining-large-scale-ai-capacity-at-meta strengthens the cost argument for GPU training:
"Bad hosts are very bad: Since many jobs require all hosts to be synchronized, bad hosts that are a bit slower, have some non-fatal hardware, or have networking issues are extremely damaging."
The Presto case (above) is query-level: a bad host damages its own share of queries. The GPU-training case is synchronised-job-level: one bad host damages the whole job's progress for every participating host, because collective-communication steps stall on the slowest participant.
The cost flips from proportional (Presto: 1 bad host out of N total ≈ 1/N of queries failing) to superlinear (training: 1 bad host makes all N hosts wait or fail). Bad-host detection is therefore even more load-bearing in synchronised-workload fleets than in stateless-serving fleets.
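The two cost regimes can be put side by side with illustrative arithmetic (the fleet size, failure rate, and 20% slowdown are hypothetical numbers, chosen only to make the regimes comparable):

```python
N = 1000         # hosts in the fleet
bad_hosts = 1    # one bad host
fail_rate = 1.0  # assume the bad host fails everything routed to it

# Query-level regime (Presto): damage is proportional to the bad host's
# traffic share -- roughly bad_hosts / N of all queries fail.
presto_failure_share = bad_hosts / N * fail_rate  # -> 0.001, i.e. 0.1% of queries

# Synchronised-job regime (GPU training): every collective-communication
# step waits on the slowest participant, so one host running 20% slower
# slows the entire job by 20% -- and all N hosts pay that cost.
slowdown = 1.20
job_cost_multiplier = slowdown
wasted_host_hours_per_hour = N * (slowdown - 1)   # ~200 host-hours burned per hour
```

The asymmetry is the point: in the first regime the waste scales with the bad host; in the second it scales with the fleet.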
This also couples tightly to host-consistency sliding upgrade: when the lower-layer firmware/driver is being slid through the fleet, bad-host detection is the mechanism that catches a host whose new lower-layer version turns out to be subtly incompatible with the pinned upper layer, even if the pre-return verification passed.
Seen in¶
- sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale — the originating source; Meta's Presto query fleet. Canonical query-level wiki instance.
- sources/2024-06-16-meta-maintaining-large-scale-ai-capacity-at-meta — Meta's GPU training fleet names "bad hosts are very bad" as one of the five demanding properties of GPU training; extends the cost model to synchronised workloads. Canonical synchronised-job-level wiki instance.
Related¶
- concepts/cluster-health-check — the coarser probe-based sibling.
- concepts/grey-failure — the class of failure this addresses.
- concepts/blast-radius — bad-host drain limits blast radius per host.
- concepts/fleet-patching — bad-host detection is the safety net for patch rollouts that miss compatibility issues.
- concepts/host-consistency-sliding-upgrade — bad-host detection catches failures the pre-return verification gate missed.
- patterns/bad-host-auto-drain — the remediation pattern.
- patterns/maintenance-train — the operational primitive that drains and replaces a flagged bad host.
- systems/meta-presto-gateway — the Gateway is the drain-point in the Presto variant.
- systems/opsplanner — Meta's orchestrator owns the drain path in the GPU-training variant.