CONCEPT Cited by 1 source

Failpoint¶

Definition¶

A failpoint is a labeled, instrumentation-time hook embedded in production source code that can — under test or chaos-driven configuration — pause execution, return an error, sleep, panic, or otherwise simulate a hard-to-reproduce failure at that exact point. Failpoints provide deterministic fault-injection at code granularity rather than the coarser-grained process-kill / network-partition / disk-wipe approaches.

The canonical implementation patterns:

Compile-time failpoints. Hooks compiled in only for chaos / test builds; production binaries do not carry the overhead.
Runtime failpoints. Hooks always present; activation is controlled by a configuration toggle (often via an admin RPC or environment variable).

The Rust ecosystem ships failpoints as a widely-used crate; the Postgres ecosystem ships INJECTION_POINT() macros; many distributed systems carry equivalent in-house frameworks. The Lakebase reliability post uses "failpoints" as a generic term without naming a specific implementation.

Why failpoints are distinct from coarser fault injection¶

Mechanism	Granularity	Determinism
Kill the process	Whole binary	Crashes anywhere — non-deterministic
Sever network	Whole connection	All traffic — coarse
Wipe disk	Whole device	All I/O fails — coarse
Failpoint	Single line / function call	Deterministic

The structural advantage is bug reproduction: a hard-to-hit race in code path X can be triggered deterministically by activating a failpoint at the exact moment that produces the race, rather than hoping a coarse kill-the-process drill happens to land at the right point.

Lakebase framing¶

Verbatim (Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures):

"We use failpoints liberally in our code to inject hard-to-reproduce errors, such as a crash at the worst possible time. This is driven by an internal fault-injection framework that can target a single process or coordinate cluster-wide faults across an entire cell."

The two scopes the framework supports:

Single-process failpoint activation — drive deterministic per-process failures.
Cluster-wide coordinated failpoint activation — coordinate failpoints across multiple processes in a cell to exercise distributed-coordination edge cases (e.g. "crash this Pageserver at the same instant the Safekeeper acks this commit").

The cluster-wide-coordinated mode is what distinguishes sophisticated distributed-systems failpoint frameworks from naive per-process ones.

Common failpoint use cases¶

Crash-at-worst-time — fail just after a partial write but before the durability commit. Tests crash recovery.
Slow-this-call — inject latency at one point. Tests timeout paths.
Return-error-once — return a transient error N times then succeed. Tests retry logic.
Reorder-events — pause one branch and let another race past it. Tests concurrency invariants.
Resource-exhaustion — return out-of-memory / out-of-disk at a precise point. Tests degradation paths.

Composability with the chaos regime¶

Failpoints sit one level below the coarser drills (process kill, AZ network partition). The Lakebase regime composes them: failpoints exercise per-component edge cases while running the workload that the whole-AZ partition drill exercises at the AZ scope.

Validation harness¶

Lakebase pairs failpoint-driven fault injection with correctness validation via SQLancer and SQLsmith (Source: same):

"While failure injection is running, we validate internal data consistency, that no committed transaction is lost, and that every component recovers to a consistent state on its own."

The validation harness is what closes the loop: failpoint drives the fault, SQLancer / SQLsmith / internal tools verify the system's state postcondition is still correct.

Caveats¶

Failpoint hygiene matters. Stale failpoints accumulate in codebases and become a footgun if accidentally activated in production. Mature regimes annotate failpoints with purpose + expected resolution and prune unused ones.
Compile-time overhead. Even no-op failpoints add instruction-cache pressure if not optimised out. Languages / compilers vary in their elision quality.
Production-vs-test boundary. Failpoints in production binaries are a security surface — access to the failpoint admin RPC is effectively "can crash this process at this line" and must be authenticated + audited.
Cluster-wide coordination is hard. Coordinated multi-process failpoint activation requires a control plane of its own, time synchronisation, and robust cluster-state observability. Many in-house frameworks settle for per-process activation for simplicity.
Internal framework not detailed in the source. Lakebase says "an internal fault-injection framework" but does not name it, describe its activation model, or detail the cluster-wide-coordination mechanism.

Seen in¶

sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures — canonical wiki framing in the systems/lakebase / systems/neon release-gate fault-injection regime. "Crash at the worst possible time" as the use case the post emphasises. Cluster-wide coordination across an entire cell is named.

concepts/chaos-engineering — parent discipline
concepts/random-instance-failure-injection — coarser-grained sibling
concepts/whole-az-network-partition-simulation — coarser AZ-scope sibling
concepts/availability-zone-failure-drill — historical AZ-scope drill
systems/lakebase / systems/neon — canonical instances
systems/postgresql — has its own INJECTION_POINT framework in upstream Postgres
patterns/continuous-fault-injection-in-production — discipline pattern