CONCEPT Cited by 1 source
Incremental backup on commit
Definition
Incremental backup on commit is a backup discipline in which every commit of a system's durable state triggers a backup of only the files that changed since the last commit, rather than periodic archival of the whole state. It becomes practical when the underlying storage uses immutable files (append-only, never rewritten), since "changed since last backup" reduces to "not yet in the remote" — a cheap diff.
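Concretely, the "cheap diff" is just a set difference over file names. A minimal sketch (the function and file names here are illustrative, not any engine's real API):

```python
def files_to_upload(local_files, remote_files):
    """With immutable files, "changed since the last backup" reduces to
    "present locally but not yet in the remote": a set difference."""
    return sorted(set(local_files) - set(remote_files))

# Example: only the two files written since the last backup need uploading.
local = ["_0.cfs", "_1.cfs", "_2.cfs", "segments_3"]
remote = ["_0.cfs", "_1.cfs", "segments_2"]
print(files_to_upload(local, remote))  # → ['_2.cfs', 'segments_3']
```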
Canonical instance: Yelp Nrtsearch 1.0.0
Yelp Nrtsearch 1.0.0 (2025-05-08) upgraded from periodic full backup of the primary's index archive to incremental backup on every commit:
"Lucene segments are immutable, so when we perform a backup, we only need to upload the new files since the last backup. On every commit, Nrtsearch checks the files in S3, determines the missing files, and uploads them." (Source: sources/2025-05-08-yelp-nrtsearch-100-incremental-backups-lucene-10)
Added per-commit cost: "a few ms to 20 seconds depending on the size of the data," which Yelp characterises as negligible. In exchange, the primary's local disk stops being the source of truth for committed data — S3 is.
Why commit is the right boundary
A commit is the system's own durability checkpoint — clients have been told "this is safe." If the backup runs on every commit, the durability of committed data extends transitively to the remote storage. Backups on any coarser cadence (hourly, daily) leave an unbacked-up window during which a primary-disk failure is a data-loss event, forcing either a full reindex (Nrtsearch before 1.0) or replaying writes from an upstream log.
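Wiring the backup into the commit path can be sketched as below, assuming injected hooks for the engine's commit and the remote store's list/upload operations (all names are hypothetical stand-ins, not Nrtsearch's API):

```python
def commit_with_backup(commit_fn, list_local, list_remote, upload):
    """Make the backup part of the commit path: once commit_fn() returns,
    every file the remote is missing gets uploaded, so the durability
    promise made at commit extends to the remote store."""
    commit_fn()                       # local durability checkpoint first
    missing = sorted(set(list_local()) - set(list_remote()))
    for name in missing:
        upload(name)                  # only the files the remote lacks
    return missing

# Toy usage with in-memory stand-ins for the local disk and object storage.
local_disk = {"_0.cfs", "_1.cfs", "segments_2"}
remote_store = {"_0.cfs"}
uploaded = commit_with_backup(
    commit_fn=lambda: None,
    list_local=lambda: local_disk,
    list_remote=lambda: remote_store,
    upload=remote_store.add,
)
```

After the call, the remote holds every committed file, so a primary-disk failure at any later point loses nothing that was acknowledged.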
The immutability precondition
Incremental-per-commit only works if the on-disk unit of durability is immutable once written. In Lucene, that's the segment file — new data creates new segments; merges create new segments from old ones; no file is ever rewritten in place. The diff "which files exist locally that don't exist remotely" is then a trivial listing operation. Mutable-file storage engines (InnoDB page rewrites, PostgreSQL heap updates) can't use this shape directly; they must instead operate at the log level (WAL shipping) or the block level (snapshot diffs), both quite different mechanisms with different properties.
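A toy contrast shows why the name-only diff depends on immutability: once a file can be rewritten in place, its name stops identifying its content, and a digest comparison is needed instead (hypothetical helpers, not any engine's real mechanism):

```python
import hashlib

def name_diff(local_names, remote_names):
    # Sufficient when files are immutable: a name uniquely identifies its bytes.
    return set(local_names) - set(remote_names)

def hash_diff(local_bytes, remote_digests):
    # Required once files can be rewritten in place: compare content digests.
    return {name for name, data in local_bytes.items()
            if hashlib.sha256(data).hexdigest() != remote_digests.get(name)}

# A page file rewritten in place: same name, new bytes.
local = {"page_1": b"v2"}
remote_names = {"page_1"}
remote_digests = {"page_1": hashlib.sha256(b"v1").hexdigest()}
# name_diff sees nothing to upload; hash_diff catches the rewrite.
```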
Consequences
- Primary disk becomes ephemeral. No reason for the primary to have a durable disk when S3 has every committed file.
- Replica bootstrap is from object storage, not from the primary. Replicas download the segment set from S3 at their own pace with parallel download; the primary is not on the bootstrap path for bulk data.
- Full backups serve a different need. If incremental-on-commit feeds replica bootstrap, full consistent snapshots can be a less-frequent, dedicated disaster-recovery artifact (Yelp: direct S3→S3 copies of committed data).
- Ingestion-heavy workloads don't force backup frequency up. In the old archive model, ingestion-heavy clusters needed frequent full backups so bootstrapped replicas didn't spend forever catching up from the primary. Incremental backup decouples replica-catchup cost from backup cadence.
Adjacent concepts
- concepts/immutable-segment-file — the enabling precondition.
- concepts/wal-replication — a different shape of per-commit durability, operating at the log level rather than the file level.
- concepts/immutable-object-storage — the remote substrate's property.
- patterns/incremental-s3-backup-of-immutable-files — the pattern this concept embeds in.
Seen in
- sources/2025-05-08-yelp-nrtsearch-100-incremental-backups-lucene-10 — canonical wiki instance. Every Lucene commit in Nrtsearch 1.0.0 uploads only new segment files to S3. Added per-commit cost is a few ms to 20s; in exchange, the primary's local SSD stops being the source of truth for committed data, and replica bootstrap is 5× faster via parallel download.