PATTERN
Bad-host auto-drain¶
Attribute each query failure to the host that caused it, alert when a single host's failure-attribution count exceeds a threshold, and auto-drain that host from the serving fleet. This sits above the standard cluster-health-check layer: it catches hosts that still respond to health probes but return incorrect or failing results for real queries.
Why not just health checks¶
A standard cluster-health-check asks the host "are you alive?" — typically a synthetic ping or lightweight probe. Several bad-host failure modes slip past it:
- Hardware issues not yet caught by fleet-wide monitoring — e.g. a disk controller returning corrupted data; the host is up, responsive, but produces wrong answers.
- JVM bugs that drip — an "obscure JVM bug" that fails a steady trickle of queries on a specific host, leaving no visible pattern in standard probes.
In both cases, the host passes liveness but fails real work. Only per-host query-failure attribution surfaces the pattern.
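The attribution step can be sketched as a per-host counter over a window of failed queries. The record shape, field names, and host names below are illustrative assumptions, not Meta's actual schema:

```python
from collections import Counter

# Hypothetical failure records: each failed query carries the set of worker
# hosts blamed for the failure. In a real distributed engine, producing this
# attribution is the hard part, since one query touches many workers.
failed_queries = [
    {"query_id": "q1", "blamed_hosts": ["worker-17"]},
    {"query_id": "q2", "blamed_hosts": ["worker-03", "worker-17"]},
    {"query_id": "q3", "blamed_hosts": ["worker-17"]},
    {"query_id": "q4", "blamed_hosts": ["worker-09"]},
]

def attribute_failures(records):
    """Count failure attributions per host over a window of failed queries."""
    counts = Counter()
    for record in records:
        for host in record["blamed_hosts"]:
            counts[host] += 1
    return counts

counts = attribute_failures(failed_queries)
print(counts.most_common(1))  # → [('worker-17', 3)]
```

Aggregated this way, a host that silently corrupts results shows up as an outlier in the failure-attribution counts even while it keeps passing liveness probes.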
Machinery required¶
- Per-query host attribution: for each failed query, record the host (or hosts) that contributed to the failure where possible. This is non-trivial in distributed query engines because a single query touches many workers.
- Alert threshold: fire when a host's attributed-failure count over a time window is statistically anomalous relative to the rest of the fleet.
- Auto-drain hook: automation that removes the flagged host from the serving fleet (in Presto's case, deregistering it from the Gateway so no new queries route to it) until investigation or replacement.
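A minimal sketch of the alert-and-drain loop, assuming a hypothetical `drain_host` hook in place of real gateway deregistration, and using a median-absolute-deviation outlier test as a stand-in for whatever anomaly check production monitoring actually applies:

```python
import statistics

drained = []  # stand-in for fleet state

def drain_host(host):
    # In a Presto-like deployment this would deregister the host from the
    # gateway so no new queries route to it; here it just records the decision.
    drained.append(host)

def check_and_drain(failure_counts, cutoff=3.5):
    """Drain hosts whose attributed-failure count is a robust outlier.

    Uses the modified z-score (median absolute deviation), which tolerates
    the outlier itself skewing the fleet statistics better than mean/stdev.
    """
    counts = list(failure_counts.values())
    if len(counts) < 3:
        return  # too few hosts to call anything anomalous
    median = statistics.median(counts)
    mad = statistics.median(abs(c - median) for c in counts) or 1.0
    for host, count in failure_counts.items():
        z = 0.6745 * (count - median) / mad
        if z > cutoff and host not in drained:
            drain_host(host)

# One host drips failures while the rest of the fleet stays quiet.
check_and_drain({"worker-17": 120, "worker-03": 4, "worker-09": 5, "worker-11": 3})
print(drained)  # → ['worker-17']
```

The robust statistic matters: with a handful of hosts, a plain mean-plus-sigma threshold is dragged upward by the bad host's own count and can fail to flag it.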
Seen in¶
- sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale — Meta's Presto fleet. Named root causes: "hardware-level issues which hadn't yet been caught by fleet-wide monitoring systems due to lack of coverage" and "obscure JVM bugs which would sometimes lead to a steady drip of query failures." Meta attributes each query failure to the host that caused it, alerts on abnormal per-host failure counts, and auto-drains the host from the Presto fleet. This closes the gap between standard cluster-health-check machinery (which the bad host still passes) and the real user-facing failure signal (which correlates with that host).