CONCEPT Cited by 1 source
Silent hang (LLM inference server)¶
Definition¶
A silent hang is a failure mode in LLM inference engines where the server stops responding to requests but does not surface an error: no exception is logged, no health check fails, and no metric looks anomalous until the request queue fills and timeouts cascade. The engine is "alive" in the OS sense (process running, port listening) but "dead" in the application sense (no progress on in-flight or new requests).
Canonical wiki disclosure (Source: sources/2026-05-27-databricks-reliable-llm-inference-at-scale):
"One failure mode we encounter is silent hangs. Requests involving edge cases (structured output, multimodal inputs) can trigger unhandled errors in the multi-process architecture of inference engines, causing servers to stop responding without surfacing errors."
Why LLM inference servers are particularly prone to this¶
LLM inference engines (vLLM and similar substrates) typically run a multi-process architecture:
- Front-end HTTP / gRPC server process (handles request admission, formatting, streaming).
- Scheduler / batch-builder process (forms decode batches).
- Worker processes per GPU or per shard (do forward passes).
- Sub-processes for image preprocessing, tokenisation, etc.
Cross-process communication is via queues, pipes, or shared memory. The structural source of silent hangs:
- An unhandled exception in a worker process can leave the queue in an inconsistent state — the front-end keeps accepting requests, the worker no longer drains them, and there is no exception path back to the front-end to surface the error.
- Edge-case requests (the post names structured output and multimodal inputs specifically) hit code paths that are less exercised and more likely to throw.
- Internal back-pressure is harder to detect than process-death — the process is still alive, just not making progress.
Compare to classical CPU services where a failed request raises synchronously and the failure mode is visible. LLM engines' async-pipeline-with-internal-queues structure makes hangs invisible by default.
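A minimal sketch of that structure, under loud assumptions: threads and `queue.Queue` stand in for the separate processes and IPC channels of a real engine, and `FrontEnd`, `worker_loop`, and the `structured_output` trigger are hypothetical names invented for illustration, not any engine's API. It shows a worker that wedges on an edge-case request while the front-end keeps admitting work.

```python
import queue
import threading


def worker_loop(requests: queue.Queue, responses: queue.Queue) -> None:
    """Stand-in for a worker process: drain requests, emit responses."""
    while True:
        req = requests.get()
        try:
            if req["kind"] == "structured_output":
                # Hypothetical edge-case path that throws.
                raise ValueError("unsupported grammar")
            responses.put({"id": req["id"], "ok": True})
        except ValueError:
            # The bug being illustrated: nothing is reported back to the
            # front-end, and the worker wedges while remaining alive.
            threading.Event().wait()  # blocks forever


class FrontEnd:
    """Stand-in for the HTTP front-end: admits requests unconditionally."""

    def __init__(self) -> None:
        self.requests: queue.Queue = queue.Queue()
        self.responses: queue.Queue = queue.Queue()
        self.worker = threading.Thread(
            target=worker_loop,
            args=(self.requests, self.responses),
            daemon=True,  # lets the interpreter exit despite the wedged worker
        )
        self.worker.start()

    def submit(self, req_id: int, kind: str = "text") -> None:
        # Admission never fails: the front-end cannot see worker state.
        self.requests.put({"id": req_id, "kind": kind})

    def wait(self, timeout: float):
        """Return the next response, or None if nothing arrives in time."""
        try:
            return self.responses.get(timeout=timeout)
        except queue.Empty:
            return None  # the only symptom is a client-side timeout
```

After one `structured_output` request, `worker.is_alive()` still returns `True` and `submit` still succeeds, yet every later `wait` times out and the request queue only grows.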
Adjacent failure modes (not silent hangs)¶
| Failure mode | Visibility |
|---|---|
| Process crash / OOM kill | Exit code, kubelet logs — visible immediately |
| Synchronous exception | Stack trace, error metric — visible immediately |
| Slow-but-progressing | Latency-SLI degradation — visible to latency monitoring |
| Silent hang | Nothing visible until queues fill / timeouts fire |
Silent hang is specifically the "nothing visible" case. The defining property: internal observability fails to fire, so external probing becomes the only detection mechanism.
Detection: prioritised black-box health checks¶
The Databricks fix is a client-side black-box probe: "periodic black-box health checks: minimal end-to-end requests sent when no real requests have completed recently". Crucially, the probe runs with the highest scheduling priority so that it completes even when the engine is overloaded:
"Under high load, health checks themselves can time out, causing the liveness probe to kill servers that are actually healthy. This risks cascading failures. To solve this, we assign health check requests the highest scheduling priority, ensuring they complete even under heavy load."
The end-to-end shape:
- Probe sender (typically a sidecar or LB-adjacent component) issues a minimal request whenever the server hasn't completed a real request in a while.
- Engine treats the probe as a highest-priority request — it gets routed to the front of the scheduling queue.
- If the probe completes, the server is healthy.
- If the probe times out or returns an error, the Kubernetes liveness probe fails and the server is restarted.
The whole detect→kill→recover cycle takes <5 minutes. False liveness-probe failures dropped from several per week to zero.
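The two ideas in this section, probing only when the server has been idle and letting the probe jump the queue, can be sketched as follows. This is an illustration under assumptions, not the Databricks implementation: `PriorityScheduler`, `should_probe`, the priority constants, and the 30-second threshold used in the usage note are all hypothetical, and the source does not say where in the stack the priority is actually enforced.

```python
import heapq
import itertools
from typing import Optional

HEALTH_PRIORITY = 0    # assumed convention: lower number is served first
NORMAL_PRIORITY = 10


def should_probe(now: float, last_completion: float, idle_threshold_s: float) -> bool:
    """Probe only when no real request has completed recently."""
    return (now - last_completion) >= idle_threshold_s


class PriorityScheduler:
    """Toy request queue in which health checks overtake queued user traffic."""

    def __init__(self) -> None:
        self._heap = []
        self._seq = itertools.count()

    def enqueue(self, request_id: str, priority: int = NORMAL_PRIORITY) -> None:
        # The sequence number preserves FIFO order within one priority level.
        heapq.heappush(self._heap, (priority, next(self._seq), request_id))

    def next_request(self) -> Optional[str]:
        """Pop the highest-priority request, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

For example, after enqueueing 100 user requests and then one request with `priority=HEALTH_PRIORITY`, `next_request()` returns the health check first, so the probe's latency does not depend on the backlog depth. With an assumed `idle_threshold_s=30.0`, `should_probe(now=100.0, last_completion=60.0, idle_threshold_s=30.0)` is true, while a completion at `90.0` suppresses the probe.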
See patterns/prioritized-black-box-health-check.
Why "trigger an exception in a worker" → "stop responding without surfacing"¶
The canonical chain (inferred from the post and from open-source vLLM-class engine architecture):
- Front-end accepts a request, places it on the scheduler queue.
- Scheduler dispatches the request to a worker.
- Worker hits an unhandled error on, say, a structured-output constrained-decoding path.
- The error propagates to an outer catch-all handler that swallows it (at most logging it somewhere nobody alerts on) and continues, but the worker's request slot in the scheduler is never released.
- The scheduler thinks the worker is busy with the failed request and stops sending it new requests.
- With workers gradually getting stuck, the scheduler queue fills up.
- Front-end keeps accepting new requests because the front-end has no visibility into the scheduler queue depth or worker liveness.
- No external symptom until the queue depth metric hits a threshold or a real request times out — both lagging indicators.
The black-box probe short-circuits this by issuing requests out-of-band that have to traverse the whole stack and complete; if they don't, the failure is definitively localised to this server rather than to the workload.
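The slot-leak step in the chain above can be reproduced in a few lines. Like the chain itself, this is an inferred illustration rather than code from any real engine: the `Scheduler` class, its slot bookkeeping, and the `structured:` request prefix are hypothetical.

```python
from collections import deque


class Scheduler:
    """Toy scheduler that tracks which worker slots are free."""

    def __init__(self, num_workers: int) -> None:
        self.free_slots = set(range(num_workers))
        self.pending = deque()
        self.swallowed_errors = 0

    def submit(self, request: str) -> None:
        # The front-end never sees a rejection, whatever the slot state.
        self.pending.append(request)

    def step(self) -> None:
        """Dispatch pending requests while free worker slots remain."""
        while self.pending and self.free_slots:
            slot = self.free_slots.pop()
            request = self.pending.popleft()
            try:
                self._run_on_worker(request)
                self.free_slots.add(slot)  # released only on the success path
            except Exception:
                # Catch-all that continues: the slot is never given back.
                self.swallowed_errors += 1

    @staticmethod
    def _run_on_worker(request: str) -> None:
        if request.startswith("structured:"):
            # Hypothetical edge-case failure in constrained decoding.
            raise RuntimeError("constrained decoding failed")
```

With two workers, two failing `structured:` requests are enough to empty `free_slots`; every later `submit` is accepted, but `step` dispatches nothing and `pending` grows without bound. Releasing the slot in a `finally` block would remove this particular leak, which is why the black-box probe matters for the cases nobody anticipated.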
Composition with neighbouring concepts¶
- concepts/dead-mans-switch — silent hang is the failure mode that motivates dead-man's-switch / heartbeat-absence monitoring at the engine level. The Databricks design sends the probe from outside; an alternative is engine-emitted heartbeats.
- concepts/multimodal-cpu-bottleneck — multimodal CPU bottleneck is not a silent hang on its own; it manifests as spikes in error rates and timeouts. But sustained multimodal CPU-saturation can evolve into a silent hang if it pushes a worker process into an inconsistent state (the post lists multimodal inputs as a known silent-hang trigger).
- patterns/replication-restart-as-liveness-probe — the Cloudflare-side analogue at the storage layer; uses replication state as the liveness signal instead of an external probe.
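The engine-emitted heartbeat alternative mentioned under concepts/dead-mans-switch could look roughly like the sketch below. It is an assumption-laden illustration: `HeartbeatMonitor`, `max_silence_s`, and the injectable `clock` parameter are hypothetical names, and the source does not describe any such component.

```python
import time


class HeartbeatMonitor:
    """Flags an engine as hung when its heartbeats stop arriving."""

    def __init__(self, max_silence_s: float, clock=time.monotonic) -> None:
        self.max_silence_s = max_silence_s
        self._clock = clock
        self._last_beat = clock()

    def beat(self) -> None:
        """Called by the engine each time it makes forward progress."""
        self._last_beat = self._clock()

    def is_hung(self) -> bool:
        """True once the silence since the last beat exceeds the limit."""
        return (self._clock() - self._last_beat) > self.max_silence_s
```

One design point worth noting: `beat()` would need to be called from the code path that actually makes progress (for example, after each completed decode step), not from an independent timer thread, since a timer keeps firing in exactly the alive-but-stuck state this page describes. An idle engine also stops beating, so this signal has to be combined with queue depth or with the idle-triggered external probe.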
Open questions¶
- Which inference engines exhibit silent hangs? The post's answer is engine-agnostic ("Open source and proprietary in-house engines"), and it gives no specific bug references.
- Specific edge-case classes beyond structured output and multimodal — not enumerated.
- Probe frequency: "when no real requests have completed recently" is the trigger, but the threshold for "recently" is not named.
- Probe payload shape — minimal end-to-end request, but the exact prompt / output budget / model invocation pattern is not specified. Implementation-level question.
- Priority-scheduling implementation altitude — see patterns/prioritized-black-box-health-check for the open question of where the priority is enforced (engine scheduler vs Python server vs cgroup).
Seen in¶
- sources/2026-05-27-databricks-reliable-llm-inference-at-scale — first canonical wiki disclosure of silent hangs as a structural failure mode of multi-process LLM inference engines, and of prioritised black-box health checks as the production detection mechanism. <5 minute detect→kill→recover cycle. Several per week → zero false liveness-probe failures after prioritisation.
Related¶
- concepts/black-box-validation / concepts/client-side-black-box-probe — the validation family the probe belongs to.
- concepts/multimodal-cpu-bottleneck — adjacent failure mode; can evolve into silent hang.
- concepts/dead-mans-switch — the heartbeat-absence pattern silent-hang detection composes with.
- patterns/prioritized-black-box-health-check — the productionised detection pattern.
- patterns/replication-restart-as-liveness-probe: adjacent liveness-probe primitive at a different altitude (the storage layer).
- systems/databricks-axon — the router that lives upstream of the hang-prone inference runtime.
- systems/databricks-model-serving — the parent platform.
- systems/kubernetes — the substrate whose liveness probe is the recovery mechanism.