CONCEPT Cited by 1 source

Nullable-column backfill amplification

What it is

Nullable-column backfill amplification is an operational hazard in CRDT-backed distributed databases: adding a nullable column to a table forces the CRDT engine to backfill a value (even NULL) into the change log for every existing row. If the table is gossiped, every node in the fleet then re-replicates that backfill at the same time.

The amplification factor can be brutal: a one-line DDL change produces a fleet-wide state-reconciliation storm orders of magnitude larger than the developer expected.

Mechanism

cr-sqlite (and similar CRDT layers) track row-level changes to support convergent merging. Adding a nullable column is a schema change, but the CRDT layer treats it as a row-level change for every row — every row now has a new column whose value has not yet been tracked.

To make the new column mergeable, cr-sqlite writes a change record (typically to crsql_changes) for every row. If the table has tens of millions of rows (every Fly Machine record, say), every worker in the cluster simultaneously produces tens of millions of change records, and the gossip layer broadcasts them all.
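The mechanism above can be illustrated with a toy model. This is not cr-sqlite's implementation or API: `ToyReplica`, `add_nullable_column`, `fleet_backfill`, and the `(pk, column, value)` record shape are hypothetical names chosen for illustration, and the sizes are made up. The sketch only shows why the number of change records tracks the row count on each replica and the replica count across the fleet.

```python
from dataclasses import dataclass, field


@dataclass
class ToyReplica:
    """Hypothetical stand-in for one node; not cr-sqlite's real data model."""
    rows: dict = field(default_factory=dict)        # primary key -> {column: value}
    change_log: list = field(default_factory=list)  # (pk, column, value) records

    def insert(self, pk, **columns):
        self.rows[pk] = dict(columns)
        # One change record per column written, so each cell can be merged later.
        for column, value in columns.items():
            self.change_log.append((pk, column, value))

    def add_nullable_column(self, column):
        """Backfill NULL for every existing row; return the records emitted."""
        before = len(self.change_log)
        for pk, row in self.rows.items():
            row[column] = None
            self.change_log.append((pk, column, None))
        return len(self.change_log) - before


def fleet_backfill(replicas, column):
    """Total change records produced when every replica applies the DDL."""
    return sum(replica.add_nullable_column(column) for replica in replicas)


# Assumed toy sizes: 3 replicas holding 1,000 rows each.
replicas = [ToyReplica() for _ in range(3)]
for replica in replicas:
    for pk in range(1000):
        replica.insert(pk, state="started")

total = fleet_backfill(replicas, "region")
print(total)  # one backfill record per row, per replica
```

In this model a single `add_nullable_column` call appends as many records as the replica has rows, and the fleet total is that count multiplied by the number of replicas, before any gossip traffic is considered.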

"You made a trivial-seeming schema change to a CRDT table hooked up to a global gossip system. Now, when the deploy runs, thousands of high-powered servers around the world join a chorus of database reconciliation messages that melts down the entire cluster." (sources/2025-10-22-flyio-corrosion)

Why it surprises operators

The developer's mental model for "add nullable column" is "cheap, no data rewrite, backward-compatible." That mental model is correct for a single relational database. It is false for a CRDT-replicated table — every replica does the work, and the work propagates over the wire.

The cost scales with rows × replicas × row-size, not just rows × row-size.
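As a back-of-the-envelope sketch of that scaling, the snippet below multiplies the three factors. The function names and every input value are hypothetical, chosen only to make the arithmetic easy to follow; they are not figures from the source.

```python
def single_db_cost(rows: int, row_size: int) -> int:
    """Bytes touched if only one database does the work: rows x row-size."""
    return rows * row_size


def replicated_cost(rows: int, replicas: int, row_size: int) -> int:
    """Bytes touched if every replica does the work: rows x replicas x row-size."""
    return rows * replicas * row_size


# Hypothetical inputs for illustration only.
rows = 10_000_000   # rows in the table
replicas = 1_000    # nodes holding the table
row_size = 100      # bytes per change record

single = single_db_cost(rows, row_size)
fleet = replicated_cost(rows, replicas, row_size)
amplification = fleet // single

print(single, fleet, amplification)
```

With these assumed inputs the single-database cost is 10^9 bytes and the replicated cost is 10^12 bytes, so the amplification factor equals the replica count.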

Mitigation — stage schema changes

There is no silver-bullet fix. The options below reduce the blast radius of the amplification rather than remove it:

  • Stage the rollout. Apply the schema change on a small subset of replicas first, then ramp across the rest slowly.
  • Keep CRDT tables lean. Push large or high-cardinality data off the CRDT path.
  • Time DDL changes to low-traffic windows. The reconciliation storm still happens, but at least user traffic isn't contending with it.
  • Prefer compute-over-data. If a new field is derivable, compute it at read time rather than materialising it as a column.
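The last bullet can be sketched with plain SQLite through Python's standard `sqlite3` module. This is not cr-sqlite or Corrosion, and the `machines` table, its columns, and the `machines_with_uptime` view are hypothetical names. Whether a view fits a given CRDT deployment is an assumption to verify; the sketch only shows a derived field being served at read time while the base table's columns stay unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE machines ("
    "id INTEGER PRIMARY KEY, started_at INTEGER, stopped_at INTEGER)"
)
conn.executemany(
    "INSERT INTO machines (id, started_at, stopped_at) VALUES (?, ?, ?)",
    [(1, 100, 160), (2, 200, None)],
)

# Derive the field at read time instead of running ALTER TABLE ... ADD COLUMN.
conn.execute(
    """
    CREATE VIEW machines_with_uptime AS
    SELECT id, started_at, stopped_at,
           stopped_at - started_at AS uptime_seconds
    FROM machines
    """
)

# The base table still has only its original columns.
table_columns = [row[1] for row in conn.execute("PRAGMA table_info(machines)")]

derived = conn.execute(
    "SELECT id, uptime_seconds FROM machines_with_uptime ORDER BY id"
).fetchall()
print(table_columns, derived)
```

Because no column is added to `machines`, there is nothing new to backfill for existing rows; the trade-off is that the derived value is recomputed on every read.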

Fly.io has not (as of 2025-10-22) published a definitive playbook; the post documents the failure mode as a lesson learned rather than a structurally solved problem.

Seen in

  • sources/2025-10-22-flyio-corrosion — canonical primary source. Fly.io added a nullable column to a large Corrosion table; the backfill "played out as if every Fly Machine on our platform had suddenly changed state simultaneously, just to fuck us."