CONCEPT Cited by 1 source
Nullable-column backfill amplification¶
What it is¶
Nullable-column backfill amplification is an operational hazard in CRDT-backed distributed databases: adding a nullable column to a table forces the CRDT engine to backfill a value (even NULL) into the change log for every existing row. If the table is gossiped, every node in the fleet re-replicates that backfill at the same time.
The amplification factor can be brutal: a one-line DDL change produces a fleet-wide state-reconciliation storm orders of magnitude larger than the developer expected.
Mechanism¶
cr-sqlite (and similar CRDT layers) track row-level changes to support convergent merging. Adding a nullable column is nominally a schema change, but the CRDT layer treats it as a row-level change to every row, because each row now has a new column whose value has not yet been tracked.
To make the new column mergeable, cr-sqlite writes a change record (typically to crsql_changes) for every row. If the table has tens of millions of rows (every Fly Machine record, say), every worker in the cluster simultaneously produces tens of millions of change records, and the gossip layer broadcasts them all.
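The per-row backfill described above can be sketched with plain SQLite. This is a toy model under stated assumptions, not cr-sqlite's implementation: the `machines` table, the `change_log` table, and the `add_column_with_backfill` helper are all hypothetical names chosen for illustration. It only shows that one `ALTER TABLE` turns into one change record per existing row, even though every new value is NULL.

```python
import sqlite3


def add_column_with_backfill(conn, table, column):
    """Toy model: add a nullable column, then log one change per existing row.

    Hypothetical helper; `change_log` stands in for a CRDT layer's change
    tracking and is not cr-sqlite's real schema.
    """
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} TEXT")
    # Every existing row gets a change record for the new column,
    # even though the value being recorded is NULL.
    conn.execute(
        f"INSERT INTO change_log (tbl, pk, col, val) "
        f"SELECT ?, id, ?, NULL FROM {table}",
        (table, column),
    )
    # Count the records this column produced.
    return conn.execute(
        "SELECT COUNT(*) FROM change_log WHERE col = ?", (column,)
    ).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE machines (id INTEGER PRIMARY KEY, state TEXT)")
conn.execute("CREATE TABLE change_log (tbl TEXT, pk INTEGER, col TEXT, val TEXT)")
conn.executemany(
    "INSERT INTO machines (state) VALUES (?)", [("started",)] * 1000
)

records = add_column_with_backfill(conn, "machines", "region")
print(records)  # one change record per existing row
```

In this model a 1,000-row table yields 1,000 change records from a single DDL statement. In a gossiped deployment, each of those records is also something the rest of the cluster has to reconcile.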
"You made a trivial-seeming schema change to a CRDT table hooked up to a global gossip system. Now, when the deploy runs, thousands of high-powered servers around the world join a chorus of database reconciliation messages that melts down the entire cluster." (sources/2025-10-22-flyio-corrosion)
Why it surprises operators¶
The developer's mental model for "add nullable column" is "cheap, no data rewrite, backward-compatible." That mental model is correct for a single relational database. It is false for a CRDT-replicated table — every replica does the work, and the work propagates over the wire.
The cost scales with rows × replicas × row-size, not just rows × row-size.
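A back-of-the-envelope calculation makes the difference concrete. The figures below (10 million rows, 1,000 replicas, 100 bytes per change record) are illustrative assumptions, not numbers from the Fly.io post, and `backfill_cost` is a hypothetical helper.

```python
def backfill_cost(rows, replicas, bytes_per_record):
    """Return (single-node bytes, fleet-wide bytes) for a full-table backfill."""
    single_node = rows * bytes_per_record   # the familiar rows × row-size cost
    fleet_wide = single_node * replicas     # every replica does the same work
    return single_node, fleet_wide


# Assumed, illustrative inputs.
single, fleet = backfill_cost(
    rows=10_000_000, replicas=1_000, bytes_per_record=100
)

print(single // 10**9, "GB on one node")      # 1 GB on one node
print(fleet // 10**12, "TB across the fleet")  # 1 TB across the fleet
```

Under these assumptions, a change that looks like about a gigabyte of work on one database becomes about a terabyte of reconciliation traffic across the fleet.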
Mitigation — stage schema changes¶
There is no silver-bullet fix. The practical options reduce the blast radius of the amplification rather than eliminate it:
- Stage the rollout. Backfill the column for a small subset of rows first, and ramp across replicas slowly.
- Keep CRDT tables lean. Push large or high-cardinality data off the CRDT path.
- Time DDL changes to low-traffic windows. The reconciliation storm still happens, but at least user traffic isn't contending with it.
- Prefer compute-over-data. If a new field is derivable, compute it at read time rather than materialising it as a column.
Fly.io had not, as of 2025-10-22, published a definitive playbook; the post documents the failure mode as a lesson learned rather than a structurally solved problem.
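As one illustration of the compute-over-data item in the list above, a derived field can live in a view instead of a new column. This is a minimal sketch in plain SQLite, not Corrosion or cr-sqlite; the table, column names, and `size_label` field are hypothetical, and how views interact with a particular CRDT layer is an assumption that would need checking against that layer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE machines (id INTEGER PRIMARY KEY, cpu_count INTEGER, cpu_kind TEXT)"
)
conn.executemany(
    "INSERT INTO machines (cpu_count, cpu_kind) VALUES (?, ?)",
    [(2, "shared"), (8, "performance")],
)

# The derived field is computed at read time inside a view, so the
# underlying table gains no column and no row is rewritten.
conn.execute(
    """
    CREATE VIEW machines_with_size AS
    SELECT id, cpu_count, cpu_kind,
           cpu_kind || '-' || cpu_count || 'x' AS size_label
    FROM machines
    """
)

labels = [
    row[0]
    for row in conn.execute("SELECT size_label FROM machines_with_size ORDER BY id")
]
columns = [row[1] for row in conn.execute("PRAGMA table_info(machines)")]

print(labels)   # ['shared-2x', 'performance-8x']
print(columns)  # ['id', 'cpu_count', 'cpu_kind']
```

The base table's schema is unchanged, so in this model there is nothing new to backfill. The trade-off is that the value is recomputed on every read.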
Seen in¶
- sources/2025-10-22-flyio-corrosion — canonical primary source. Fly.io added a nullable column to a large Corrosion table; the backfill "played out as if every Fly Machine on our platform had suddenly changed state simultaneously, just to fuck us."
Related¶
- concepts/crdt
- systems/cr-sqlite — the CRDT layer whose design makes the amplification unavoidable.
- systems/corrosion-swim — the gossip layer that broadcasts the amplified writes.
- concepts/schema-evolution — the general problem this is a specific CRDT-family instance of.