
CONCEPT Cited by 3 sources

Program correctness

Definition

Program correctness is the property that a program produces the intended result for every valid input, in every reachable state, under every allowed interleaving. It is distinct from — and much stronger than — the property that a program produces the intended result for the inputs that were tested. Almost every real-world bug lives in the gap between the two.

The Bloch principle: no single technique is sufficient

From the canonical wiki reference (Joshua Bloch, 2006 / republished 2025-01-11):

It is not sufficient merely to prove a program correct; you have to test it too. Moreover, to be really certain that a program is correct, you have to test it for all possible input values, but this is seldom feasible. With concurrent programs, it's even worse: You have to test for all internal states, which is, for all practical purposes, impossible.

We programmers need all the help we can get, and we should never assume otherwise. Careful design is great. Testing is great. Formal methods are great. Code reviews are great. Static analysis is great. But none of these things alone are sufficient to eliminate bugs: They will always be with us.

The binary-search bug — (low + high) / 2 overflowing for arrays of 2^30 elements or more — was in a program that had been proved correct, tested, and reviewed. It shipped in the JDK for roughly nine years; it survived in Programming Pearls (1986 / 2000) for two decades. Every individual technique passed. The bug nonetheless existed.
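The failure mode is easy to reproduce in isolation. A minimal sketch (class and method names here are illustrative, not the JDK's; only the expressions matter):

```java
class Midpoint {
    // The published form: correct over mathematical integers, wrong in
    // 32-bit arithmetic once low + high exceeds Integer.MAX_VALUE.
    static int midBuggy(int low, int high) {
        return (low + high) / 2; // wraps negative when low + high >= 2^31
    }

    // Standard fix: the difference high - low cannot overflow
    // when 0 <= low <= high.
    static int midFixed(int low, int high) {
        return low + (high - low) / 2;
    }

    // Alternative fix: the unsigned shift >>> treats the wrapped sum
    // as a 32-bit unsigned value, so halving still lands on the right answer.
    static int midShift(int low, int high) {
        return (low + high) >>> 1;
    }
}
```

With low = high = 2^30, midBuggy returns -2^30 while both fixes return 2^30.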

The two gaps

Real-world bugs exploit two gaps:

  1. Input-space gap. The test suite covers a finite set of inputs; the program is required to be correct on an infinite / astronomically larger set. The binary-search bug needed low + high > 2^31 − 1, a condition no index pair can reach until the array exceeds 2^30 elements: vanishingly rare when the code was written, inevitable at Google-scale data.
  2. State-space gap. Even on a fixed input, concurrent programs have a superpolynomial number of interleavings. You cannot test your way through them.
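The size of the second gap is concrete: two threads executing k atomic steps each admit C(2k, k) distinct interleavings. A sketch of the count (the combinatorics is standard; the class name is illustrative):

```java
class Interleavings {
    // Number of ways to interleave two sequences of k steps each:
    // the central binomial coefficient C(2k, k).
    static long count(int k) {
        long c = 1;
        // Multiplicative formula: C(2k, k) = prod_{i=1..k} (k + i) / i.
        // Each partial product equals C(k + i, i), so it stays integral.
        for (int i = 1; i <= k; i++) {
            c = c * (k + i) / i;
        }
        return c;
    }
}
```

count(10) is already 184,756, and adding threads makes the growth worse still: with t threads the count is a multinomial coefficient, so no test budget keeps up.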

Exhaustive testing is typically infeasible; directed / property-based testing (covering input classes, not examples) and deterministic-simulation testing (collapsing the interleaving axis to a reproducible seed) exist to attack these gaps structurally.
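A sketch of the property-based idea without any framework (the class and method names are hypothetical; the fixed seed plays the same reproducibility role that deterministic simulation gives to interleavings):

```java
import java.util.Arrays;
import java.util.Random;

class SearchProperty {
    // Property: for any sorted, duplicate-free array, binarySearch
    // locates every element that is present. Checked across many inputs
    // drawn from a fixed seed, so any failure replays exactly.
    static boolean holds(long seed, int trials) {
        Random rng = new Random(seed); // fixed seed => reproducible run
        for (int t = 0; t < trials; t++) {
            int[] a = rng.ints(1 + rng.nextInt(50), -100, 100)
                         .distinct().sorted().toArray();
            int key = a[rng.nextInt(a.length)];
            int idx = Arrays.binarySearch(a, key);
            if (idx < 0 || a[idx] != key) {
                return false; // a present key went unfound: property violated
            }
        }
        return true;
    }
}
```

The point is the shape of the assertion: it quantifies over an input class ("any sorted array, any present key"), not over hand-picked examples.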

Techniques — tiered, not exclusive

| Technique | What it catches | What it misses |
| --- | --- | --- |
| Types | Misuse of units, nullability, ownership (Rust) | Arithmetic overflow, logic bugs within a type |
| Unit tests | Reproducing a hand-picked scenario | Every scenario you didn't pick |
| Code review | "Does a human-visible pattern look wrong?" | Invariant-level bugs masked by surface correctness |
| Static analysis | Mechanical patterns (null-deref, unused var, some overflow shapes) | Semantic bugs with no syntactic signature |
| Property-based tests | Invariants generated across inputs (patterns/property-based-testing) | What the property omits |
| Deterministic simulation | Reproducible concurrency bugs (concepts/deterministic-simulation) | Bugs outside the simulated interface |
| Lightweight formal verification | Executable spec vs. implementation divergence (concepts/lightweight-formal-verification) | Bugs outside the spec |
| Heavyweight formal methods | Proofs (TLA+, Coq) over the model | Model vs. real-system drift |
| Automated reasoning | SMT-level invariants (concepts/automated-reasoning) | What the invariant file omits |

A mature correctness posture stacks several — guardrails, not a single silver bullet. See patterns/durability-review for the S3 framing of this as explicit process.

Invariants as the substrate

The Bloch post opens with Bentley's first CMU Algorithms lecture — "the key lesson was to carefully consider the invariants in your programs" — and ends with the observation that the binary-search bug was precisely a broken invariant: mid was claimed to be the "average of low and high, truncated down", which is true in the mathematical integers and false in the 32-bit int machine representation. See patterns/invariant-driven-programming for the discipline of naming invariants and checking them against substrate semantics, not intended semantics.
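One way to make that discipline concrete is to state the invariant as a runtime check, so it is tested against the substrate's semantics rather than the intended ones (a sketch; midChecked is a hypothetical name):

```java
class Invariant {
    // Named invariant: low <= mid <= high. It follows from the
    // mathematical definition of mid, but not from 32-bit arithmetic,
    // so the check runs against the machine's semantics.
    static int midChecked(int low, int high) {
        int mid = (low + high) / 2;
        if (mid < low || mid > high) {
            throw new AssertionError("invariant broken: mid=" + mid
                    + " not in [" + low + ", " + high + "]");
        }
        return mid;
    }
}
```

midChecked(0, 10) returns 5; midChecked(1 << 30, 1 << 30) throws, surfacing at runtime exactly the overflow that the proof over mathematical integers missed.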

The scale-makes-latent-bugs-manifest pattern

A recurring wiki shape: a bug sits dormant for years because no deployed workload triggers its precondition, then all at once becomes hot because scale moved.

The common shape: correctness under tested inputs ≠ correctness under production inputs ≠ correctness under all inputs. Scale shifts which of the last two is load-bearing.

Seen in

Last updated · 200 distilled / 1,178 read