CONCEPT Cited by 2 sources
Immutable segment file¶
Definition¶
An immutable segment file is an on-disk unit of index storage that is written once and never modified in place: it is eventually replaced (via merging) or deleted (via garbage collection). It is the unit of persistence for Apache Lucene and for the search engines built on it (Elasticsearch, OpenSearch, Nrtsearch).
Shape in Lucene¶
- A Lucene index is a directory of segments.
- Each segment is produced by flushing an in-memory buffer (indexing tail) to disk.
- Segments are periodically merged into larger segments in the background. A merge writes a new segment and leaves its inputs unchanged; the input files are deleted afterwards in a separate bookkeeping step.
- Documents are logically deleted by marking them in a separate tombstone file (Lucene's per-segment live-docs file), which leaves the segment's data files untouched; physical deletion happens at merge time.
The Lucene 10 release notes (codecs) describe a segment as a "sub-index in a Lucene index which can be searched independently".
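The lifecycle described above (flush, logical delete, merge) can be sketched as a toy in-memory model. This is an illustration only, not Lucene's API: `ToyIndex`, `Segment`, and the `_0`, `_1` naming scheme are hypothetical, and real segments are files on disk rather than Python tuples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A write-once segment: name and documents are fixed at creation."""
    name: str
    docs: tuple  # tuple of (doc_id, text) pairs


class ToyIndex:
    """Hypothetical model of a segment-based index (not Lucene's API)."""

    def __init__(self):
        self.buffer = []      # in-memory indexing buffer, not yet a segment
        self.segments = []    # segments currently referenced by the index
        self.tombstones = {}  # segment name -> frozenset of deleted doc ids
        self._counter = 0

    def _next_name(self):
        name = f"_{self._counter}"
        self._counter += 1
        return name

    def add(self, doc_id, text):
        self.buffer.append((doc_id, text))

    def flush(self):
        """Write the buffer out as a brand-new segment."""
        if self.buffer:
            self.segments.append(Segment(self._next_name(), tuple(self.buffer)))
            self.buffer = []

    def delete(self, doc_id):
        """Logical delete: record a tombstone beside the segment, never edit it."""
        for seg in self.segments:
            if any(d == doc_id for d, _ in seg.docs):
                old = self.tombstones.get(seg.name, frozenset())
                self.tombstones[seg.name] = old | {doc_id}

    def merge(self):
        """Build one new segment from all live docs, then drop the inputs."""
        live = tuple(
            (d, t)
            for seg in self.segments
            for d, t in seg.docs
            if d not in self.tombstones.get(seg.name, frozenset())
        )
        merged = Segment(self._next_name(), live)
        self.segments = [merged]  # inputs are dereferenced, not rewritten
        self.tombstones = {}
        return merged


index = ToyIndex()
index.add(1, "alpha")
index.add(2, "beta")
index.flush()              # creates segment _0
index.add(3, "gamma")
index.flush()              # creates segment _1
first = index.segments[0]
index.delete(2)            # tombstone only; _0 still holds doc 2
merged = index.merge()     # creates segment _2 without doc 2
```

After the merge, `first` still contains document 2 exactly as it was written; only the new segment `_2` omits it. Under this model, that is the sense in which deletion is physical only at merge time.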
Why immutability is load-bearing¶
Immutable segment files are a quiet but foundational architectural choice that unlocks three independent downstream capabilities:
1. Cheap segment-level replication / backup¶
Because files never change, the question "what's different from my last snapshot?" reduces to "what files exist here that weren't there before?" — a list operation. This is what makes incremental backup on commit practical.
Canonical wiki instance: Yelp Nrtsearch's incremental per-commit S3 upload (2025-05-08) — "Lucene segments are immutable, so when we perform a backup, we only need to upload the new files since the last backup."
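A minimal sketch of why the diff reduces to a list operation, assuming the backup tool can list file names on both sides. The function names and the example file names are illustrative assumptions, not Nrtsearch's actual implementation.

```python
def files_to_upload(local_files, already_backed_up):
    """Return the file names present locally but missing from the last backup.

    Because segment files are write-once, a name that was already uploaded
    never needs re-uploading, so no hashing or byte comparison is required.
    """
    return sorted(set(local_files) - set(already_backed_up))


def files_to_retire(local_files, already_backed_up):
    """Return backed-up files that the current commit no longer references."""
    return sorted(set(already_backed_up) - set(local_files))


# Hypothetical directory listings before and after a commit in which
# segment _0 was merged away and segment _2 was created.
previous_backup = {"_0.cfs", "_0.si", "_1.cfs", "_1.si", "segments_2"}
current_commit = {"_1.cfs", "_1.si", "_2.cfs", "_2.si", "segments_3"}

print(files_to_upload(current_commit, previous_backup))
# ['_2.cfs', '_2.si', 'segments_3']
print(files_to_retire(current_commit, previous_backup))
# ['_0.cfs', '_0.si', 'segments_2']
```

The comparison by name alone is only safe because of immutability: if files could be rewritten in place, an unchanged name would say nothing about unchanged content.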
2. Cross-cluster replication at segment boundaries¶
Elasticsearch's Cross Cluster Replication (CCR) replicates data at Lucene segment granularity: the follower cluster never sees in-memory buffer contents or yet-unflushed documents; it only sees whole, persisted, never-to-change segments. GitHub's 2026 GHES search rewrite exploits this. (Source: sources/2026-03-03-github-how-we-rebuilt-the-search-architecture-for-high-availability)
3. Lock-free search during writes¶
Readers can operate on a committed segment set without coordinating with the writer that is producing new segments: the read-path view is a snapshot of the segments that existed at open time. This is why Lucene's near-real-time search can serve results from a newly flushed segment as soon as a reader is reopened, without taking a lock on ongoing indexing.
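The snapshot behaviour can be illustrated with a small single-threaded sketch. `SegmentStore` and `Reader` are hypothetical names, not Lucene classes, and substring matching stands in for real term lookup; the point is only that a reader captures an immutable tuple of segments at open time and later writes build a new tuple instead of changing the old one.

```python
class SegmentStore:
    """Hypothetical writer side: publishes a new immutable tuple of segments."""

    def __init__(self):
        self._current = ()  # tuple of segments; each segment is a tuple of docs

    def publish(self, segment_docs):
        # Build a new tuple rather than mutating the old one, so any reader
        # holding the previous tuple keeps a stable view.
        self._current = self._current + (tuple(segment_docs),)

    def open_reader(self):
        return Reader(self._current)


class Reader:
    """Point-in-time view over the segments that existed when it was opened."""

    def __init__(self, segments):
        self._segments = segments

    def search(self, term):
        return [doc for seg in self._segments for doc in seg if term in doc]


store = SegmentStore()
store.publish(["red apple", "green pear"])
old_reader = store.open_reader()

store.publish(["red plum"])        # new segment appears after old_reader opened
new_reader = store.open_reader()

print(old_reader.search("red"))    # ['red apple']
print(new_reader.search("red"))    # ['red apple', 'red plum']
```

The old reader never observes the later segment, and the writer never waits for it; seeing new data requires opening a new reader, which mirrors the reopen step in near-real-time search.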
Adjacent¶
- concepts/immutable-object-storage — the remote substrate property that pairs with immutable on-disk files for S3-style backup.
- concepts/write-once-read-many — the generalised shape.
- concepts/tombstone — the companion primitive for logical deletion.
Seen in¶
- sources/2025-05-08-yelp-nrtsearch-100-incremental-backups-lucene-10 — immutability is the explicit enabler for Nrtsearch 1.0.0's incremental-backup-on-commit architecture. "Lucene segments are immutable, so when we perform a backup, we only need to upload the new files since the last backup."
- sources/2026-03-03-github-how-we-rebuilt-the-search-architecture-for-high-availability — GitHub's GHES search rewrite exploits segment-granularity CCR; segments as the durable, immutable replication boundary.