CONCEPT Cited by 3 sources
Program correctness¶
Definition¶
Program correctness is the property that a program produces the intended result for every valid input, in every reachable state, under every allowed interleaving. It is distinct from — and much stronger than — the property that a program produces the intended result for the inputs that were tested. Almost every real-world bug lives in the gap between the two.
The Bloch principle: no single technique is sufficient¶
From the canonical wiki reference (Joshua Bloch, 2006 / republished 2025-01-11):
It is not sufficient merely to prove a program correct; you have to test it too. Moreover, to be really certain that a program is correct, you have to test it for all possible input values, but this is seldom feasible. With concurrent programs, it's even worse: You have to test for all internal states, which is, for all practical purposes, impossible.
We programmers need all the help we can get, and we should never assume otherwise. Careful design is great. Testing is great. Formal methods are great. Code reviews are great. Static analysis is great. But none of these things alone are sufficient to eliminate bugs: They will always be with us.
The binary-search bug — (low + high) / 2 overflowing for arrays of 2^30 elements or more — was in a program that had been proved correct, tested, and reviewed. It shipped in the JDK for roughly nine years and survived in Programming Pearls (1986 / 2000) for two decades. Every individual technique passed. The bug nonetheless existed.
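The overflow is easy to reproduce in isolation. A minimal Java sketch (hypothetical index values, not the JDK source) showing the wrap-around and the two repairs Bloch gives — computing the midpoint from the difference, or reading the wrapped sum back with an unsigned shift:

```java
public class MidpointOverflow {
    public static void main(String[] args) {
        // Indices that are individually valid for an array of > 2^30 elements.
        int low = 1 << 30;          // 1_073_741_824
        int high = (1 << 30) + 2;   // 1_073_741_826

        // Broken: low + high = 2^31 + 2, which wraps to a negative int,
        // so the "midpoint" is a negative, impossible array index.
        int broken = (low + high) / 2;

        // Fix 1: high - low never overflows because low <= high.
        int fixed = low + (high - low) / 2;

        // Fix 2: >>> divides the sum by 2 as if it were unsigned,
        // undoing the wrap-around.
        int alsoFixed = (low + high) >>> 1;

        System.out.println(broken);     // negative
        System.out.println(fixed);      // 1_073_741_825
        System.out.println(alsoFixed);  // 1_073_741_825
    }
}
```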
The two gaps¶
Real-world bugs exploit two gaps:
- Input-space gap. The test suite covers a finite set of inputs; the program is required to be correct on an infinite (or astronomically larger) set. The binary-search bug needed low + high > 2^31 − 1 — a measure-zero fraction of int × int pairs, but inevitable at Google-scale data.
- State-space gap. Even on a fixed input, concurrent programs have an exponentially large number of interleavings. You cannot test your way through them.
Exhaustive testing is typically infeasible; directed and property-based testing (covering input classes, not examples) and deterministic-simulation testing (collapsing the interleaving axis to a reproducible seed) exist to attack these gaps structurally.
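To make the property-based idea concrete, here is a hand-rolled sketch (no generator library assumed; real suites use one, such as jqwik, with shrinking): the contract of binary search is checked as a universal property over many randomly generated sorted arrays, with a fixed seed so any counterexample is reproducible.

```java
import java.util.Arrays;
import java.util.Random;

public class SearchProperty {
    // An overflow-safe binary search over a sorted array.
    static int search(int[] a, int key) {
        int low = 0, high = a.length - 1;
        while (low <= high) {
            int mid = low + (high - low) / 2;
            if (a[mid] < key) low = mid + 1;
            else if (a[mid] > key) high = mid - 1;
            else return mid;
        }
        return -1;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);  // fixed seed: failures replay exactly
        for (int trial = 0; trial < 10_000; trial++) {
            // Generate an input from a class (sorted int arrays),
            // not a hand-picked example.
            int[] a = rnd.ints(rnd.nextInt(50), -100, 100).sorted().toArray();
            int key = rnd.nextInt(200) - 100;
            int i = search(a, key);
            // Property: a found index holds the key; "not found" means
            // the key is genuinely absent.
            boolean present = Arrays.stream(a).anyMatch(x -> x == key);
            if (i >= 0 ? a[i] != key : present)
                throw new AssertionError(
                    "counterexample: " + Arrays.toString(a) + ", key=" + key);
        }
        System.out.println("all trials passed");
    }
}
```

The fixed seed is the same move deterministic simulation makes along the interleaving axis: a failure is a replayable artifact, not a flake.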
Techniques — tiered, not exclusive¶
| Technique | What it catches | What it misses |
|---|---|---|
| Types | Misuse of units, nullability, ownership (Rust) | Arithmetic overflow, logic bugs within a type |
| Unit tests | Reproducing a hand-picked scenario | Every scenario you didn't pick |
| Code review | "Does a human-visible pattern look wrong?" | Invariant-level bugs masked by surface correctness |
| Static analysis | Mechanical patterns (null-deref, unused var, some overflow shapes) | Semantic bugs with no syntactic signature |
| Property-based tests | Invariants generated across inputs (patterns/property-based-testing) | What the property omits |
| Deterministic simulation | Reproducible concurrency bugs (concepts/deterministic-simulation) | Bugs outside the simulated interface |
| Lightweight formal verification | Executable spec vs. implementation divergence (concepts/lightweight-formal-verification) | Bugs outside the spec |
| Heavyweight formal methods | Proofs (TLA+, Coq) over the model | Model vs. real-system drift |
| Automated reasoning | SMT-level invariants (concepts/automated-reasoning) | What the invariant file omits |
A mature correctness posture stacks several of these — layered guardrails, not a single silver bullet. See patterns/durability-review for S3's framing of this as an explicit process.
Invariants as the substrate¶
The Bloch post opens with Bentley's first CMU Algorithms lecture —
"the key lesson was to carefully consider the invariants in your
programs" — and ends with the observation that the binary-search
bug was precisely a broken invariant: mid was claimed to be the
"average of low and high, truncated down", which is true in the
mathematical integers and false in the 32-bit int machine
representation. See patterns/invariant-driven-programming for the
discipline of naming invariants and checking them against substrate
semantics, not intended semantics.
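That discipline can be made executable. A minimal sketch (illustrative, not from the source) that states the claimed midpoint invariant as a check and evaluates it over the 32-bit substrate rather than the mathematical integers:

```java
import java.util.Random;

public class InvariantCheck {
    public static void main(String[] args) {
        // Claimed invariant: for any low <= high, mid = (low + high) / 2
        // satisfies low <= mid <= high. True over the mathematical
        // integers; checked here over 32-bit int arithmetic instead.
        Random rnd = new Random(1);
        int violations = 0;
        for (int i = 0; i < 1_000_000; i++) {
            int low = rnd.nextInt(1 << 30);          // a plausible index
            int high = low + rnd.nextInt(1 << 30);   // low <= high, still a valid int
            int mid = (low + high) / 2;
            if (!(low <= mid && mid <= high)) violations++;
        }
        // Nonzero: the invariant that holds on paper fails on the machine.
        System.out.println(violations > 0);
    }
}
```

The point is the shape of the check, not the random sweep: name the invariant, then evaluate it in the semantics the program actually runs under.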
The scale-makes-latent-bugs-manifest pattern¶
A recurring wiki shape: a bug sits dormant for years because no deployed workload triggers its precondition, then all at once becomes hot because scale moved.
- Binary search overflow (Bloch, 2006) — 20 years dormant; triggered when arrays crossed 2^30 elements (big-data era).
- Canva's count-billions pipeline (Source: sources/2024-04-29-canva-scaling-to-count-billions) — instance-size doubling every 8–10 months masked an O(N) dedup until storage economics forced a rewrite.
- GitHub immutable-releases gap (Source: sources/2026-01-15-github-when-protections-outlive-their-purpose) — a defence-in-depth protection outlived its purpose and silently broke cached new-version paths as the traffic mix changed.
- Kafka Streams sub-topology partitioning (Source: sources/2025-11-11-expedia-kafka-streams-sub-topology-partition-colocation) — a latent correctness assumption violated only under specific sub-topology partition ratios.
The common shape: correctness under tested inputs ≠ correctness under production inputs ≠ correctness under all inputs. Scale shifts which of the last two is load-bearing.
Seen in¶
- sources/2025-01-11-google-nearly-all-binary-searches-and-mergesorts-are-broken-2006 — canonical: proof + test + review all passed, bug survived; binary-search overflow ~20 years dormant. Meta-lesson: "we must program carefully, defensively, and remain ever vigilant."
- sources/2025-02-25-allthingsdistributed-building-and-operating-s3 — S3's ShardStore response: lightweight formal verification as an industrialized, non-PhD-accessible, every-commit correctness discipline; durability reviews as a process-level guardrail.
- sources/2024-05-31-dropbox-testing-sync-at-dropbox-2020 — Dropbox Nucleus's response: property-based testing + deterministic simulation as the concurrent-program equivalent — tests as universal properties rather than example inputs.