Skip to content

PATTERN Cited by 1 source

Prioritized black-box health check

Pattern

Detect silent hangs in multi-process inference engines via periodic minimal-end-to-end black-box probe requests, but assign those probes the highest scheduling priority inside the engine so they complete even under heavy load. On probe failure, trigger Kubernetes liveness probe to restart the pod.

Canonical wiki disclosure (Source: sources/2026-05-27-databricks-reliable-llm-inference-at-scale):

"We detect this with periodic black-box health checks: minimal end-to-end requests sent when no real requests have completed recently. If a health check fails, the Kubernetes liveness probe restarts the server. […] However, under high load, health checks themselves can time out, causing the liveness probe to kill servers that are actually healthy. This risks cascading failures. To solve this, we assign health check requests the highest scheduling priority, ensuring they complete even under heavy load. With prioritized health checks, the full cycle of detecting a hang, killing the unhealthy server, and recovering takes less than 5 minutes. False liveness probe failures dropped from several per week to zero."

When to use it

  • Multi-process inference engines where silent hangs can leave the front-end accepting requests while workers are stuck.
  • GPU-loaded servers where the engine is regularly near saturation; probe priority-inversion under load is a real risk.
  • Liveness probe wired to pod restart so a failed probe has a recovery action attached, not just an alert.

When NOT to use it

  • Stateless CPU services where synchronous-exception-on-failure is the dominant pattern — the probe priority is unnecessary.
  • Engines without an internal scheduler primitive that can honour request-level priority — fall back to other liveness primitives (process-level metrics, dead-mans-switch, etc.).

Structural shape

                  ┌─────────────────────────────────────┐
                  │  Probe sender (sidecar / LB /        │
                  │  monitoring agent)                   │
                  │  - emits when no real request has    │
                  │    completed recently                │
                  │  - tags request: priority=highest    │
                  └────────────────┬────────────────────┘
                                   │ HTTP/gRPC
                  ┌─────────────────────────────────────┐
                  │  Inference Engine front-end          │
                  │  - reads priority tag                │
                  │  - schedules probe ahead of normal   │
                  │    requests in worker queue          │
                  └────────────────┬────────────────────┘
                  ┌─────────────────────────────────────┐
                  │  Worker (GPU / CPU)                  │
                  │  - executes probe end-to-end         │
                  │  - returns response                  │
                  └────────────────┬────────────────────┘
                                   │ if no response within timeout
                  ┌─────────────────────────────────────┐
                  │  Kubernetes liveness probe fires →   │
                  │  pod restarted, traffic re-routed    │
                  └─────────────────────────────────────┘

The pattern requires three pieces:

  1. A probe sender that issues minimal requests when "no real requests have completed recently" — i.e. when normal traffic would have caught a hang as a latency-SLO breach but the server is too quiet for that path to fire.
  2. Priority-scheduled execution path inside the engine — the probe is tagged at admission and scheduled ahead of normal requests in the worker queue.
  3. A liveness-probe → restart action wired to probe failure — typically Kubernetes' built-in liveness probe pointing at the probe-success endpoint.

Why probe priority matters (the priority-inversion failure mode)

Without priority scheduling, the standard liveness-probe shape is:

  • Probe is a regular request.
  • Server is overloaded with real traffic.
  • Probe enters the queue, waits behind real requests.
  • Probe eventually times out (its time budget is small).
  • Liveness probe fires → pod gets restarted while it was actually healthy, just busy.
  • Restarted pod's traffic redistributes, increasing load on remaining pods.
  • More pods now look "unhealthy" by the same priority-inversion mechanism.
  • Cascading liveness-probe storm — the cluster melts.

This is a classic priority-inversion failure mode: a low-priority queue position for a high-priority correctness signal causes the correctness signal to fail under exactly the conditions where it matters most.

The fix is explicit priority elevation — the probe gets its priority bumped to highest, ahead of all customer traffic, so its time budget is reliably honoured.

What this pattern is not

  • Not a static monitoring probe — those run from outside the cluster on a fixed cadence and don't account for traffic load. This pattern's probes are triggered by silence (no real requests recently completed) and prioritised inside the engine.
  • Not a replacement for synchronous error reporting — synchronous errors should still surface normally. This pattern catches the failure mode where synchronous error reporting itself is broken (silent hang).
  • Not a stand-in for capacity planning — under-capacity will still cause real latency-SLO violations and customer impact; this pattern only protects against priority-inverted false positives, not capacity-driven true positives.

Implementation altitude question

The post does not specify where the priority is enforced:

  • At the inference engine's request scheduler — the engine has a priority queue and respects the tag. Cleanest integration but requires engine support.
  • At the Python server (uvicorn / equivalent) — request-level priority routing inside the front-end. Easier to retrofit but doesn't help if the worker process is the bottleneck.
  • At the OS scheduler (cgroup / nice / SCHED_FIFO) — process- level priority elevation. Coarse-grained; works only if the probe is a dedicated process.

The most likely answer for vLLM-class engines is engine-scheduler support — vLLM has a priority queue primitive. For other engines, the team would need to retrofit similar.

Composition

  • Above in the failure-mitigation stack: patterns/replication-restart-as-liveness-probe (replication state as liveness signal at the storage altitude — different altitude, complementary).
  • Below: Kubernetes liveness probe action, which is the standard kill-and-restart mechanism.
  • Adjacent: concepts/dead-mans-switch — the same shape applied differently (engine-emitted heartbeat, absence triggers alert).
  • Adjacent: concepts/multi-endpoint-quorum-health-check — uses quorum agreement across multiple probes instead of priority-elevation to handle the same false-positive risk.

Quantitative payoff

Disclosed: <5 minutes detect→kill→recover cycle, false liveness-probe failures: several/week → zero after prioritisation.

Both numbers are load-bearing:

  • The 5-minute cycle is short enough that hangs are mitigated before they cascade — fast enough to keep the platform's reliability SLO (4-9's, per the Superhuman-pod 2026-05-08 post, consistent with this design).
  • The zero-false-positive number proves the priority-elevation fix works as intended — the cure didn't introduce a worse problem.

Risks and mitigations

  • Probe payload too small to exercise the silent-hang code paths — a minimal probe might pass while real requests fail. Mitigation: probe shape includes the structural-output / multi- modal patterns that historically triggered hangs, parameterised per-deployment.
  • Probe priority becomes itself a denial-of-service vector if the probe is too aggressive — many high-priority probes can starve real traffic. Mitigation: probe rate-limit; only fire when no real requests have completed recently.
  • Engine doesn't honour priority — a buggy or non-priority- aware engine treats the tag as advisory. Mitigation: validate engine priority semantics in load tests before relying on this pattern.
  • Probe success masks other failure modes — the probe is a weak signal for latency SLO breaches that aren't full hangs. Mitigation: pair with separate latency monitoring; don't make probe-success the only health signal.

Open questions

  • Probe payload shape — minimal end-to-end, but the exact prompt / output budget / model invocation pattern is not specified.
  • Probe trigger threshold"when no real requests have completed recently". The "recently" threshold is not named.
  • Priority-elevation implementation altitude — see above.
  • Probe failure-detection threshold — single-failure or N-of-M trigger? Not specified.

Seen in

Last updated · 542 distilled / 1,571 read