CONCEPT Cited by 3 sources
Program correctness¶
Definition¶
Program correctness is the property that a program produces the intended result for every valid input, in every reachable state, under every allowed interleaving. It is distinct from — and much stronger than — the property that a program produces the intended result for the inputs that were tested. Almost every real-world bug lives in the gap between the two.
The Bloch principle: no single technique is sufficient¶
From the canonical wiki reference (Joshua Bloch, 2006 / republished 2025-01-11):
It is not sufficient merely to prove a program correct; you have to test it too. Moreover, to be really certain that a program is correct, you have to test it for all possible input values, but this is seldom feasible. With concurrent programs, it's even worse: You have to test for all internal states, which is, for all practical purposes, impossible.
We programmers need all the help we can get, and we should never assume otherwise. Careful design is great. Testing is great. Formal methods are great. Code reviews are great. Static analysis is great. But none of these things alone are sufficient to eliminate bugs: They will always be with us.
The binary-search bug — (low + high) / 2 overflowing for arrays of 2^30 elements or more — was in a program that had been proved correct, tested, and reviewed. It shipped in the JDK for roughly nine years and survived in Programming Pearls (1986 / 2000) for two decades. Every individual technique passed. The bug nonetheless existed.
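The overflow is easy to reproduce in isolation. A minimal Java sketch (hypothetical index values, not the JDK source) showing the wrap-around and the two repairs Bloch gives — computing the midpoint from the difference, or reading the wrapped sum back with an unsigned shift:

```java
public class MidpointOverflow {
    public static void main(String[] args) {
        // Indices that are individually valid for an array of > 2^30 elements.
        int low = 1 << 30;          // 1_073_741_824
        int high = (1 << 30) + 2;   // 1_073_741_826

        // Broken: low + high = 2^31 + 2, which wraps to a negative int,
        // so the "midpoint" is a negative, impossible array index.
        int broken = (low + high) / 2;

        // Fix 1: high - low never overflows because low <= high.
        int fixed = low + (high - low) / 2;

        // Fix 2: >>> divides the sum by 2 as if it were unsigned,
        // undoing the wrap-around.
        int alsoFixed = (low + high) >>> 1;

        System.out.println(broken);     // negative
        System.out.println(fixed);      // 1_073_741_825
        System.out.println(alsoFixed);  // 1_073_741_825
    }
}
```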
The two gaps¶
Real-world bugs exploit two gaps:
- Input-space gap. The test suite covers a finite set of inputs; the program is required to be correct on an infinite (or astronomically larger) set. The binary-search bug needed low + high > 2^31 − 1 — a measure-zero fraction of int × int pairs, but inevitable at Google-scale data.
- State-space gap. Even on a fixed input, concurrent programs have an exponentially large number of interleavings. You cannot test your way through them.
Exhaustive testing is typically infeasible; directed and property-based testing (covering input classes, not examples) and deterministic-simulation testing (collapsing the interleaving axis to a reproducible seed) exist to attack these gaps structurally.
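To make the property-based idea concrete, here is a hand-rolled sketch (no generator library assumed; real suites use one, such as jqwik, with shrinking): the contract of binary search is checked as a universal property over many randomly generated sorted arrays, with a fixed seed so any counterexample is reproducible.

```java
import java.util.Arrays;
import java.util.Random;

public class SearchProperty {
    // An overflow-safe binary search over a sorted array.
    static int search(int[] a, int key) {
        int low = 0, high = a.length - 1;
        while (low <= high) {
            int mid = low + (high - low) / 2;
            if (a[mid] < key) low = mid + 1;
            else if (a[mid] > key) high = mid - 1;
            else return mid;
        }
        return -1;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);  // fixed seed: failures replay exactly
        for (int trial = 0; trial < 10_000; trial++) {
            // Generate an input from a class (sorted int arrays),
            // not a hand-picked example.
            int[] a = rnd.ints(rnd.nextInt(50), -100, 100).sorted().toArray();
            int key = rnd.nextInt(200) - 100;
            int i = search(a, key);
            // Property: a found index holds the key; "not found" means
            // the key is genuinely absent.
            boolean present = Arrays.stream(a).anyMatch(x -> x == key);
            if (i >= 0 ? a[i] != key : present)
                throw new AssertionError(
                    "counterexample: " + Arrays.toString(a) + ", key=" + key);
        }
        System.out.println("all trials passed");
    }
}
```

The fixed seed is the same move deterministic simulation makes along the interleaving axis: a failure is a replayable artifact, not a flake.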
Techniques — tiered, not exclusive¶
| Technique | What it catches | What it misses |
|---|---|---|
| Types | Misuse of units, nullability, ownership (Rust) | Arithmetic overflow, logic bugs within a type |
| Unit tests | Reproducing a hand-picked scenario | Every scenario you didn't pick |
| Code review | "Does a human-visible pattern look wrong?" | Invariant-level bugs masked by surface correctness |
| Static analysis | Mechanical patterns (null-deref, unused var, some overflow shapes) | Semantic bugs with no syntactic signature |
| Property-based tests | Invariants generated across inputs (patterns/property-based-testing) | What the property omits |
| Deterministic simulation | Reproducible concurrency bugs (concepts/deterministic-simulation) | Bugs outside the simulated interface |
| Lightweight formal verification | Executable spec vs. implementation divergence (concepts/lightweight-formal-verification) | Bugs outside the spec |
| Heavyweight formal methods | Proofs (TLA+, Coq) over the model | Model vs. real-system drift |
| Automated reasoning | SMT-level invariants (concepts/automated-reasoning) | What the invariant file omits |
A mature correctness posture stacks several of these — layered guardrails, not a single silver bullet. See patterns/durability-review for S3's framing of this as an explicit process.
Invariants as the substrate¶
The Bloch post opens with Bentley's first CMU Algorithms lecture —
"the key lesson was to carefully consider the invariants in your
programs" — and ends with the observation that the binary-search
bug was precisely a broken invariant: mid was claimed to be the
"average of low and high, truncated down", which is true in the
mathematical integers and false in the 32-bit int machine
representation. See patterns/invariant-driven-programming for the
discipline of naming invariants and checking them against substrate
semantics, not intended semantics.
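That discipline can be made executable. A minimal sketch (illustrative, not from the source) that states the claimed midpoint invariant as a check and evaluates it over the 32-bit substrate rather than the mathematical integers:

```java
import java.util.Random;

public class InvariantCheck {
    public static void main(String[] args) {
        // Claimed invariant: for any low <= high, mid = (low + high) / 2
        // satisfies low <= mid <= high. True over the mathematical
        // integers; checked here over 32-bit int arithmetic instead.
        Random rnd = new Random(1);
        int violations = 0;
        for (int i = 0; i < 1_000_000; i++) {
            int low = rnd.nextInt(1 << 30);          // a plausible index
            int high = low + rnd.nextInt(1 << 30);   // low <= high, still a valid int
            int mid = (low + high) / 2;
            if (!(low <= mid && mid <= high)) violations++;
        }
        // Nonzero: the invariant that holds on paper fails on the machine.
        System.out.println(violations > 0);
    }
}
```

The point is the shape of the check, not the random sweep: name the invariant, then evaluate it in the semantics the program actually runs under.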
The scale-makes-latent-bugs-manifest pattern¶
A recurring wiki shape: a bug sits dormant for years because no deployed workload triggers its precondition, then all at once becomes hot because scale moved.
- Binary search overflow (Bloch, 2006) — 20 years dormant; triggered when arrays crossed 2^30 elements (big-data era).
- Canva's count-billions pipeline (Source: sources/2024-04-29-canva-scaling-to-count-billions) — instance-size doubling every 8–10 months masked an O(N) dedup until storage economics forced a rewrite.
- GitHub immutable-releases gap (Source: sources/2026-01-15-github-when-protections-outlive-their-purpose) — a defence-in-depth protection outlived its purpose and silently broke cached new-version paths as the traffic mix changed.
- Kafka Streams sub-topology partitioning (Source: sources/2025-11-11-expedia-kafka-streams-sub-topology-partition-colocation) — a latent correctness assumption violated only under specific sub-topology partition ratios.
The common shape: correctness under tested inputs ≠ correctness under production inputs ≠ correctness under all inputs. Scale shifts which of the last two is load-bearing.
Seen in¶
- sources/2025-01-11-google-nearly-all-binary-searches-and-mergesorts-are-broken-2006 — canonical: proof + test + review all passed, bug survived; binary-search overflow ~20 years dormant. Meta-lesson: "we must program carefully, defensively, and remain ever vigilant."
- sources/2025-02-25-allthingsdistributed-building-and-operating-s3 — S3's ShardStore response: lightweight formal verification as an industrialized, non-PhD-accessible, every-commit correctness discipline; durability reviews as a process-level guardrail.
- sources/2024-05-31-dropbox-testing-sync-at-dropbox-2020 — Dropbox Nucleus's response: property-based testing + deterministic simulation as the concurrent-program equivalent — tests as universal properties rather than example inputs.