CONCEPT Cited by 1 source
Failpoint¶
Definition¶
A failpoint is a labeled, instrumentation-time hook embedded in production source code that can — under test or chaos-driven configuration — pause execution, return an error, sleep, panic, or otherwise simulate a hard-to-reproduce failure at that exact point. Failpoints provide deterministic fault-injection at code granularity rather than the coarser-grained process-kill / network-partition / disk-wipe approaches.
The canonical implementation patterns:
- Compile-time failpoints. Hooks compiled in only for chaos / test builds; production binaries do not carry the overhead.
- Runtime failpoints. Hooks always present; activation is controlled by a configuration toggle (often via an admin RPC or environment variable).
The Rust ecosystem ships failpoints as a widely-used crate; the
Postgres ecosystem ships INJECTION_POINT() macros; many
distributed systems carry equivalent in-house frameworks. The
Lakebase reliability post uses "failpoints" as a generic term
without naming a specific implementation.
Why failpoints are distinct from coarser fault injection¶
| Mechanism | Granularity | Determinism |
|---|---|---|
| Kill the process | Whole binary | Crashes anywhere — non-deterministic |
| Sever network | Whole connection | All traffic — coarse |
| Wipe disk | Whole device | All I/O fails — coarse |
| Failpoint | Single line / function call | Deterministic |
The structural advantage is bug reproduction: a hard-to-hit race in code path X can be triggered deterministically by activating a failpoint at the exact moment that produces the race, rather than hoping a coarse kill-the-process drill happens to land at the right point.
Lakebase framing¶
Verbatim (Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures):
"We use failpoints liberally in our code to inject hard-to-reproduce errors, such as a crash at the worst possible time. This is driven by an internal fault-injection framework that can target a single process or coordinate cluster-wide faults across an entire cell."
The two scopes the framework supports:
- Single-process failpoint activation — drive deterministic per-process failures.
- Cluster-wide coordinated failpoint activation — coordinate failpoints across multiple processes in a cell to exercise distributed-coordination edge cases (e.g. "crash this Pageserver at the same instant the Safekeeper acks this commit").
The cluster-wide-coordinated mode is what distinguishes sophisticated distributed-systems failpoint frameworks from naive per-process ones.
Common failpoint use cases¶
- Crash-at-worst-time — fail just after a partial write but before the durability commit. Tests crash recovery.
- Slow-this-call — inject latency at one point. Tests timeout paths.
- Return-error-once — return a transient error N times then succeed. Tests retry logic.
- Reorder-events — pause one branch and let another race past it. Tests concurrency invariants.
- Resource-exhaustion — return out-of-memory / out-of-disk at a precise point. Tests degradation paths.
Composability with the chaos regime¶
Failpoints sit one level below the coarser drills (process kill, AZ network partition). The Lakebase regime composes them: failpoints exercise per-component edge cases while running the workload that the whole-AZ partition drill exercises at the AZ scope.
Validation harness¶
Lakebase pairs failpoint-driven fault injection with correctness validation via SQLancer and SQLsmith (Source: same):
"While failure injection is running, we validate internal data consistency, that no committed transaction is lost, and that every component recovers to a consistent state on its own."
The validation harness is what closes the loop: failpoint drives the fault, SQLancer / SQLsmith / internal tools verify the system's state postcondition is still correct.
Caveats¶
- Failpoint hygiene matters. Stale failpoints accumulate in codebases and become a footgun if accidentally activated in production. Mature regimes annotate failpoints with purpose + expected resolution and prune unused ones.
- Compile-time overhead. Even no-op failpoints add instruction-cache pressure if not optimised out. Languages / compilers vary in their elision quality.
- Production-vs-test boundary. Failpoints in production binaries are a security surface — access to the failpoint admin RPC is effectively "can crash this process at this line" and must be authenticated + audited.
- Cluster-wide coordination is hard. Coordinated multi-process failpoint activation requires a control plane of its own, time synchronisation, and robust cluster-state observability. Many in-house frameworks settle for per-process activation for simplicity.
- Internal framework not detailed in the source. Lakebase says "an internal fault-injection framework" but does not name it, describe its activation model, or detail the cluster-wide-coordination mechanism.
Seen in¶
- sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures — canonical wiki framing in the systems/lakebase / systems/neon release-gate fault-injection regime. "Crash at the worst possible time" as the use case the post emphasises. Cluster-wide coordination across an entire cell is named.
Related¶
- concepts/chaos-engineering — parent discipline
- concepts/random-instance-failure-injection — coarser-grained sibling
- concepts/whole-az-network-partition-simulation — coarser AZ-scope sibling
- concepts/availability-zone-failure-drill — historical AZ-scope drill
- systems/lakebase / systems/neon — canonical instances
- systems/postgresql — has its own
INJECTION_POINTframework in upstream Postgres - patterns/continuous-fault-injection-in-production — discipline pattern