
CONCEPT

Bad-host detection

Definition

Bad-host detection is the discipline of identifying individual hosts in a fleet that are functioning badly without being unresponsive — they pass cluster-level health checks, still accept work, but produce a disproportionate share of failures, corrupt results, or slow responses. These hosts are a source of grey failure: the cluster is "up" from a probe's view, but a subset of traffic is silently failing.

This is conceptually distinct from cluster health check, which tests whether a host responds. A bad host still responds — it just responds wrong.
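
This distinction can be made concrete with a small, purely illustrative sketch (all names are hypothetical): a probe-style health check sees the host as live even while its error rate climbs.

```python
# Hypothetical sketch: a probe measures liveness, not correctness.
class Host:
    def __init__(self, name, error_rate):
        self.name = name
        self.error_rate = error_rate  # fraction of queries this host fails

    def health_check(self):
        # Probe-level check: does the host respond at all?
        return True  # a "bad host" still responds

    def run_query(self, i):
        # Correctness-level behaviour: some share of work silently fails.
        return "error" if (i % 100) < self.error_rate * 100 else "ok"

bad = Host("worker-17", error_rate=0.15)
results = [bad.run_query(i) for i in range(100)]
print(bad.health_check())      # True: passes the cluster health check
print(results.count("error"))  # 15: yet fails 15% of its queries
```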

Named root causes (from the Meta source)

sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale names two classes of bad-host root cause for the Presto fleet:

  1. "Hardware-level issues which hadn't yet been caught by fleet-wide monitoring systems due to lack of coverage."
  2. "Obscure JVM bugs which would sometimes lead to a steady drip of query failures."

In both cases, the host works well enough to accept queries but badly enough to fail a noticeable share of them.

Detection mechanism

The Meta approach has three steps:

  1. Per-query failure attribution. Each failing query is attributed to the host that caused it, "where possible."
  2. Anomaly-rate alerting. Alerts fire when the attributed failure count for a specific host exceeds a threshold (implicit: a statistical anomaly vs peer hosts).
  3. Auto-drain. Automation drains the host from the Presto fleet (see patterns/bad-host-auto-drain), preventing further failures.
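
The three steps might be sketched as follows. All names, thresholds, and the peer-comparison heuristic are assumptions for illustration; the source does not describe the actual implementation.

```python
from collections import Counter
from statistics import median

def find_bad_hosts(failed_queries, window_total_per_host, min_failures=10, ratio=5.0):
    """Attribute failures to hosts, then flag statistical outliers vs peers.

    failed_queries: list of (query_id, attributed_host); attribution may be
    None when the causing host cannot be determined ("where possible").
    """
    # Step 1: per-query failure attribution.
    failures = Counter(host for _, host in failed_queries if host is not None)

    # Step 2: anomaly-rate alerting -- compare each host's failure rate
    # against the fleet median rather than a fixed absolute threshold.
    rates = {h: failures[h] / window_total_per_host[h] for h in window_total_per_host}
    baseline = median(rates.values())
    return [
        h for h in window_total_per_host
        if failures[h] >= min_failures and rates[h] > ratio * max(baseline, 1e-9)
    ]

# Step 3: auto-drain -- hand the flagged hosts to drain automation.
for host in find_bad_hosts(
    failed_queries=[(q, "worker-17") for q in range(20)],
    window_total_per_host={"worker-17": 100, "worker-2": 100, "worker-3": 100},
):
    print(f"draining {host}")
```

The peer comparison matters: an absolute threshold has to be retuned as fleet-wide load changes, while an anomaly relative to peer hosts stays meaningful.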

The key design choice is the attribution step: without it, cluster-level signals report that the cluster is fine while one host quietly fails its share of the traffic.

Where this sits in the health-check hierarchy

| Layer | Detects | Misses |
|---|---|---|
| Cluster health check (probe) | Dead / unreachable host | Functioning-but-wrong host |
| Bad-host detection (attribution + anomaly) | Host with anomalous failure share | The first few failures (detection requires volume) |

Bad-host detection complements, rather than replaces, probe-based health checks. Most hyperscale operators need both.

Extended cost model: synchronised workloads

Meta's sources/2024-06-16-meta-maintaining-large-scale-ai-capacity-at-meta strengthens the cost argument for GPU training:

"Bad hosts are very bad: Since many jobs require all hosts to be synchronized, bad hosts that are a bit slower, have some non-fatal hardware, or have networking issues are extremely damaging."

The Presto case (above) is query-level: a bad host damages its own share of queries. The GPU-training case is synchronised-job-level: one bad host damages the whole job's progress for every participating host, because collective-communication steps stall on the slowest participant.

Cost flips from proportional (Presto: 1 bad host out of N ≈ a 1/N share of queries affected) to superlinear (training: 1 bad host makes all N hosts wait or fail). Bad-host detection is therefore even more load-bearing in synchronised-workload fleets than in stateless-serving fleets.
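
A toy cost comparison makes the flip concrete. The numbers and the model (N hosts, one bad host that fails or stalls a given unit of work with probability p) are illustrative assumptions, not figures from the source:

```python
# Toy model, not from the source: blast radius of one bad host.
N = 1000  # fleet size
p = 0.10  # chance the bad host fails/stalls a given unit of work

# Stateless serving (Presto-like): the bad host only hurts its own 1/N share.
serving_failure_fraction = p * (1 / N)

# Synchronised training: every step includes all N hosts, and collective
# communication stalls on the slowest participant, so the bad host's failure
# probability applies to the whole job's step.
training_failure_fraction = p  # and each stall wastes all N hosts' time

print(f"serving: {serving_failure_fraction:.4%} of queries affected")
print(f"training: {training_failure_fraction:.0%} of steps affected, "
      f"each wasting all {N} hosts' time")
```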

This also couples tightly to host-consistency sliding upgrade: when the lower-layer firmware/driver is being slid through the fleet, bad-host detection is the mechanism that catches a host whose new lower-layer version turns out to be subtly incompatible with the pinned upper layer, even if the pre-return verification passed.
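
One way to surface such an incompatibility, sketched here with hypothetical names and data (the source does not specify the mechanism), is to bucket attributed failures by each host's lower-layer version so a subtly bad version stands out even when every individual host passed pre-return verification:

```python
from collections import defaultdict

def failure_rate_by_version(host_versions, failures, totals):
    """Aggregate attributed failures per lower-layer (firmware/driver) version."""
    by_version = defaultdict(lambda: [0, 0])  # version -> [failures, total]
    for host, version in host_versions.items():
        by_version[version][0] += failures.get(host, 0)
        by_version[version][1] += totals.get(host, 0)
    return {v: f / t for v, (f, t) in by_version.items() if t}

rates = failure_rate_by_version(
    host_versions={"h1": "fw-2.1", "h2": "fw-2.1", "h3": "fw-2.0"},
    failures={"h1": 9, "h2": 11, "h3": 1},
    totals={"h1": 100, "h2": 100, "h3": 100},
)
print(rates)  # the fw-2.1 hosts fail ~10x more often than the fw-2.0 host
```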

Seen in
