Skip to content

CONCEPT Cited by 1 source

Failpoint

Definition

A failpoint is a labeled, instrumentation-time hook embedded in production source code that can — under test or chaos-driven configuration — pause execution, return an error, sleep, panic, or otherwise simulate a hard-to-reproduce failure at that exact point. Failpoints provide deterministic fault-injection at code granularity rather than the coarser-grained process-kill / network-partition / disk-wipe approaches.

The canonical implementation patterns:

  • Compile-time failpoints. Hooks compiled in only for chaos / test builds; production binaries do not carry the overhead.
  • Runtime failpoints. Hooks always present; activation is controlled by a configuration toggle (often via an admin RPC or environment variable).

The Rust ecosystem ships failpoints as a widely-used crate; the Postgres ecosystem ships INJECTION_POINT() macros; many distributed systems carry equivalent in-house frameworks. The Lakebase reliability post uses "failpoints" as a generic term without naming a specific implementation.

Why failpoints are distinct from coarser fault injection

Mechanism Granularity Determinism
Kill the process Whole binary Crashes anywhere — non-deterministic
Sever network Whole connection All traffic — coarse
Wipe disk Whole device All I/O fails — coarse
Failpoint Single line / function call Deterministic

The structural advantage is bug reproduction: a hard-to-hit race in code path X can be triggered deterministically by activating a failpoint at the exact moment that produces the race, rather than hoping a coarse kill-the-process drill happens to land at the right point.

Lakebase framing

Verbatim (Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures):

"We use failpoints liberally in our code to inject hard-to-reproduce errors, such as a crash at the worst possible time. This is driven by an internal fault-injection framework that can target a single process or coordinate cluster-wide faults across an entire cell."

The two scopes the framework supports:

  1. Single-process failpoint activation — drive deterministic per-process failures.
  2. Cluster-wide coordinated failpoint activation — coordinate failpoints across multiple processes in a cell to exercise distributed-coordination edge cases (e.g. "crash this Pageserver at the same instant the Safekeeper acks this commit").

The cluster-wide-coordinated mode is what distinguishes sophisticated distributed-systems failpoint frameworks from naive per-process ones.

Common failpoint use cases

  • Crash-at-worst-time — fail just after a partial write but before the durability commit. Tests crash recovery.
  • Slow-this-call — inject latency at one point. Tests timeout paths.
  • Return-error-once — return a transient error N times then succeed. Tests retry logic.
  • Reorder-events — pause one branch and let another race past it. Tests concurrency invariants.
  • Resource-exhaustion — return out-of-memory / out-of-disk at a precise point. Tests degradation paths.

Composability with the chaos regime

Failpoints sit one level below the coarser drills (process kill, AZ network partition). The Lakebase regime composes them: failpoints exercise per-component edge cases while running the workload that the whole-AZ partition drill exercises at the AZ scope.

Validation harness

Lakebase pairs failpoint-driven fault injection with correctness validation via SQLancer and SQLsmith (Source: same):

"While failure injection is running, we validate internal data consistency, that no committed transaction is lost, and that every component recovers to a consistent state on its own."

The validation harness is what closes the loop: failpoint drives the fault, SQLancer / SQLsmith / internal tools verify the system's state postcondition is still correct.

Caveats

  • Failpoint hygiene matters. Stale failpoints accumulate in codebases and become a footgun if accidentally activated in production. Mature regimes annotate failpoints with purpose + expected resolution and prune unused ones.
  • Compile-time overhead. Even no-op failpoints add instruction-cache pressure if not optimised out. Languages / compilers vary in their elision quality.
  • Production-vs-test boundary. Failpoints in production binaries are a security surface — access to the failpoint admin RPC is effectively "can crash this process at this line" and must be authenticated + audited.
  • Cluster-wide coordination is hard. Coordinated multi-process failpoint activation requires a control plane of its own, time synchronisation, and robust cluster-state observability. Many in-house frameworks settle for per-process activation for simplicity.
  • Internal framework not detailed in the source. Lakebase says "an internal fault-injection framework" but does not name it, describe its activation model, or detail the cluster-wide-coordination mechanism.

Seen in

Last updated · 542 distilled / 1,571 read