
PATTERN Cited by 1 source

Test the ambiguous invariant

Test the ambiguous invariant is the discipline of writing automated tests for behaviour that your codebase relies on de facto, even when the relevant spec does not formally require it. The pattern applies wherever a standard is silent or advisory but downstream consumers have hardened around one reading of it, so that "what the spec allows" and "what you can safely ship" diverge.

When the pattern applies

The warning signs:

  1. The spec uses non-normative words ("preface", "typically", "normally") — see RFC normative language.
  2. A meaningful fraction of deployed consumers depend on one reading — see backward compatibility's long-tail-of-clients discipline.
  3. Your implementation has historically produced one behaviour, but the code doesn't enforce it — so a refactor can silently produce the other behaviour.
  4. You cannot reach the broken clients to update them (deployed firmware, embedded devices, old libraries).

All four conditions held for the CNAME ordering in DNS responses from Cloudflare's 1.1.1.1 resolver: RFC 1034 uses the word "preface" (non-normative); glibc's getaddrinfo and Cisco Catalyst DNSC depend on CNAME-first ordering; Cloudflare's code produced CNAME-first output but didn't enforce it; and the broken clients include Linux userspace and switch firmware that will never be updated.

The mechanism

Write a test that asserts the invariant at the boundary — the wire format, API shape, or data layout downstream consumers see — regardless of how your internal code evolves. Examples:

  • DNS resolver: assert that A/AAAA records in a response never appear before the CNAMEs that alias them.
  • HTTP server: assert that response header fields are emitted in the order that sequential HTTP/1.1 parsers have come to rely on, even though the spec treats the order of differently named fields as not significant.
  • File-format writer: assert that a deprecated magic prefix still appears at byte 0, because old parsers expect it.
  • API response: assert that a field that was once nullable is always non-null now, because older SDKs don't handle null.
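The file-format bullet above can be sketched as follows. This is a minimal illustration, not taken from any real format: `LEGACY_MAGIC`, `write_file`, and `starts_with_legacy_magic` are hypothetical names, and the four-byte prefix is an arbitrary placeholder. The point is that the assertion inspects only the bytes a consumer would read.

```rust
/// Hypothetical legacy magic prefix that old parsers expect at byte 0.
pub const LEGACY_MAGIC: &[u8] = b"OLD1";

/// Hypothetical writer: serialises a payload into the on-disk layout.
/// Internals may change freely as long as the output shape does not.
pub fn write_file(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(LEGACY_MAGIC.len() + payload.len());
    out.extend_from_slice(LEGACY_MAGIC);
    out.extend_from_slice(payload);
    out
}

/// Boundary assertion helper: looks only at the emitted bytes,
/// never at the writer's internal state.
pub fn starts_with_legacy_magic(bytes: &[u8]) -> bool {
    bytes.starts_with(LEGACY_MAGIC)
}
```

A test would call `write_file` on a representative payload and assert `starts_with_legacy_magic` on the result, so a refactor that reorders the writer's steps fails the suite instead of failing old parsers.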

The test must be at the output surface, not at an internal function — the whole point is to catch refactors that preserve internal semantics but change external shape.
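For the DNS case, a boundary check might look like the sketch below. It assumes the test harness has already sent a real query and parsed the response's answer section into a simplified shape. The `Record` and `Rdata` types and the `check_cname_first` function are hypothetical, not part of any DNS library, and names are compared as exact strings for brevity (real DNS names are case-insensitive).

```rust
/// Hypothetical, simplified view of one record in a parsed answer section.
#[derive(Debug, Clone, PartialEq)]
pub enum Rdata {
    /// Alias record; the String is the target name.
    Cname(String),
    /// Stands in for A/AAAA; the address value is irrelevant to ordering.
    Address,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Record {
    pub owner: String,
    pub rdata: Rdata,
}

/// Convenience constructors for building test fixtures.
pub fn cname(owner: &str, target: &str) -> Record {
    Record { owner: owner.to_string(), rdata: Rdata::Cname(target.to_string()) }
}

pub fn addr(owner: &str) -> Record {
    Record { owner: owner.to_string(), rdata: Rdata::Address }
}

/// Checks that no record owned by a CNAME's target appears before that CNAME.
/// On violation, returns Err((index_of_early_record, index_of_cname)).
pub fn check_cname_first(answers: &[Record]) -> Result<(), (usize, usize)> {
    for (i, rec) in answers.iter().enumerate() {
        if let Rdata::Cname(target) = &rec.rdata {
            // Any earlier record owned by this CNAME's target is out of order.
            if let Some(j) = answers[..i].iter().position(|r| &r.owner == target) {
                return Err((j, i));
            }
        }
    }
    Ok(())
}
```

Because the check runs on what the client receives rather than on cache internals, it stays valid however the resolver's internal record handling is restructured.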

Case study: 2026-01-08 1.1.1.1 incident

From Cloudflare's post-mortem:

In our case, we did originally implement the specification so that CNAMEs appear first. However, we did not have any tests asserting the behavior remains consistent due to the ambiguous language in the RFC.

The 2025-12-02 memory-optimisation refactor to PartialChain::fill_cache changed

let mut answer_rrs = Vec::with_capacity(entry.answer.len() + self.records.len());
answer_rrs.extend_from_slice(&self.records); // CNAMEs first
answer_rrs.extend_from_slice(&entry.answer);
entry.answer = answer_rrs;

to

entry.answer.extend(self.records); // CNAMEs appended — now LAST

The two are functionally equivalent under RFC 1034's "order is not significant" reading. A unit test on the cache code would have passed either version (no ordering assertion). A boundary test — "query a domain with a CNAME chain, parse the response, assert the CNAME records appear before the A records" — would have caught it. Cloudflare's stated remediation includes writing exactly this test.
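The difference between the two kinds of test can be illustrated with a reduced model of the change. This is a sketch under stated assumptions, not Cloudflare's code: `Kind`, `merge_original`, `merge_refactored`, `same_records`, and `cnames_first` are hypothetical stand-ins that mirror only the shape of the two snippets above.

```rust
/// Hypothetical record kind; real records carry names and data as well.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind {
    Cname,
    A,
}

/// Mirrors the original shape: chain records (CNAMEs) first, cached answer after.
pub fn merge_original(chain: &[Kind], cached: &[Kind]) -> Vec<Kind> {
    let mut out = Vec::with_capacity(chain.len() + cached.len());
    out.extend_from_slice(chain);
    out.extend_from_slice(cached);
    out
}

/// Mirrors the refactored shape: chain records appended to the cached answer.
pub fn merge_refactored(chain: &[Kind], mut cached: Vec<Kind>) -> Vec<Kind> {
    cached.extend_from_slice(chain);
    cached
}

fn count_cnames(records: &[Kind]) -> usize {
    records.iter().filter(|k| **k == Kind::Cname).count()
}

/// Order-insensitive comparison, the kind of check an internal unit test
/// without an ordering assertion effectively performs.
pub fn same_records(a: &[Kind], b: &[Kind]) -> bool {
    a.len() == b.len() && count_cnames(a) == count_cnames(b)
}

/// Boundary check: once the first A record appears, no CNAME may follow.
pub fn cnames_first(answer: &[Kind]) -> bool {
    match answer.iter().position(|k| *k == Kind::A) {
        Some(first_a) => answer[first_a..].iter().all(|k| *k != Kind::Cname),
        None => true,
    }
}
```

With a two-CNAME chain and one cached A record, `same_records` accepts both outputs, while `cnames_first` accepts only the output of `merge_original`.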

Where it fits in the remediation stack

  • Pre-commit: lint/static analysis rarely catches invariant violations unless the invariant is explicit in the type system (Rust's Option<T> + exhaustive match, TypeScript's branded types). An ordering invariant on a Vec<ResourceRecord> has no static representation.
  • Test suite: this pattern. The cheapest durable fix.
  • Canary / staged rollout: patterns/staged-rollout catches invariant violations only if the affected population is large enough to show up in metrics before the deploy completes. In Cloudflare's case, the broken clients were a small fraction of traffic and uncorrelated with POP selection, so every pre-90% checkpoint passed clean and the defect landed fleet-wide.
  • Runtime fail-open: fail-open handling can soften the impact of an invariant violation, but doesn't stop the client-side crash.

Companion patterns

  • patterns/fast-rollback: the post-detection recovery path. Cloudflare got from incident declaration to revert start in 8 min (18:19 → 18:27 UTC) because the change was single-commit and single-path; the test-the-ambiguous-invariant pattern is how you avoid needing fast rollback in the first place.
  • patterns/staged-rollout: the defence-in-depth partner. Invariant tests catch the correctness bar; staged rollouts catch the operational bar. Both miss the same incident if the broken population is small and correctness isn't boundary-tested.

Generalisation: "the spec permits it, but we can't ship it"

The meta-rule: every ambiguous reading of a spec that your code has settled on should get a boundary test. The act of writing the test often forces the team to explicitly name the implicit invariant — which is itself documentation for future maintainers who didn't know the convention was load-bearing.

Last updated · 200 distilled / 1,178 read