
CONCEPT Cited by 2 sources

Immutable segment file

Definition

An immutable segment file is an on-disk unit of index storage that is written once and never mutated in place; it is eventually replaced (via merging) or deleted (via garbage collection). It is the unit of persistence for Apache Lucene and every search engine built on it (Elasticsearch, OpenSearch, Nrtsearch).

Shape in Lucene

  • A Lucene index is a directory of segments.
  • Each segment is produced by flushing an in-memory buffer (indexing tail) to disk.
  • Segments are periodically merged into larger segments in the background. A merge writes a new segment and leaves its inputs unchanged; a later bookkeeping step deletes the input files once nothing references them.
  • Documents are logically deleted by marking them in a tombstone file (Lucene's per-segment live-docs file); physical deletion happens at merge time.
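
The lifecycle in the list above can be sketched as a toy in-memory model. This is a minimal illustration, not Lucene's API: `ToyIndex`, `Segment`, and the `_0`, `_1` naming are hypothetical, and a real index writes files to disk rather than holding tuples in memory.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    """A written-once segment: its documents never change after creation."""
    name: str
    docs: tuple  # (doc_id, text) pairs, fixed at flush or merge time


@dataclass
class ToyIndex:
    """Hypothetical toy index; not Lucene's IndexWriter."""
    buffer: list = field(default_factory=list)      # in-memory indexing buffer
    segments: list = field(default_factory=list)    # current segment set
    tombstones: dict = field(default_factory=dict)  # segment name -> deleted doc ids
    counter: int = 0

    def add(self, doc_id, text):
        self.buffer.append((doc_id, text))

    def _new_segment(self, docs):
        seg = Segment(f"_{self.counter}", tuple(docs))
        self.counter += 1
        return seg

    def flush(self):
        """Write the buffer out as a brand-new segment."""
        if self.buffer:
            self.segments.append(self._new_segment(self.buffer))
            self.buffer = []

    def delete(self, doc_id):
        """Logical delete: record a tombstone, leave the segment untouched."""
        for seg in self.segments:
            if any(d == doc_id for d, _ in seg.docs):
                self.tombstones.setdefault(seg.name, set()).add(doc_id)

    def merge(self):
        """Write one new segment from the live docs, then drop the inputs."""
        live = [
            (d, t)
            for seg in self.segments
            for d, t in seg.docs
            if d not in self.tombstones.get(seg.name, set())
        ]
        merged = self._new_segment(live)
        self.segments = [merged]  # bookkeeping step: inputs are no longer referenced
        self.tombstones = {}
        return merged


index = ToyIndex()
index.add(1, "alpha")
index.add(2, "beta")
index.flush()            # segment _0
index.add(3, "gamma")
index.flush()            # segment _1
first = index.segments[0]
index.delete(2)          # tombstone only; _0 still holds both docs
merged = index.merge()   # segment _2 holds docs 1 and 3
```

Note that `delete` never touches a `Segment`; the deleted document only disappears when `merge` writes a fresh segment without it.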

The Lucene 10 release notes (codecs) describe a segment as a "sub-index in a Lucene index which can be searched independently".

Why immutability is load-bearing

Immutable segment files are a quiet but foundational architectural choice that unlocks three independent downstream capabilities:

1. Cheap segment-level replication / backup

Because files never change, the question "what's different from my last snapshot?" reduces to "what files exist here that weren't there before?", which is a set difference over two file listings with no content comparison. This is what makes incremental backup on commit practical.

Canonical wiki instance: Yelp Nrtsearch's incremental per-commit S3 upload (2025-05-08) — "Lucene segments are immutable, so when we perform a backup, we only need to upload the new files since the last backup."
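
A minimal sketch of that idea, assuming a local directory stands in for the S3 bucket and that file names such as `_0.seg` are hypothetical placeholders (a real Lucene segment spans several files). It is not Nrtsearch's implementation; it only shows that comparing names is enough when files are immutable.

```python
import os
import shutil
import tempfile


def incremental_backup(index_dir, backup_dir):
    """Copy only files that the backup does not have yet.

    Assumes files are immutable: a name already present in the backup
    is byte-identical to the local one, so names alone decide what to copy.
    """
    os.makedirs(backup_dir, exist_ok=True)
    local = set(os.listdir(index_dir))
    remote = set(os.listdir(backup_dir))
    new_files = sorted(local - remote)
    for name in new_files:
        shutil.copy2(os.path.join(index_dir, name), os.path.join(backup_dir, name))
    return new_files


def write_file(directory, name, data):
    """Helper for the demo: create one written-once file."""
    with open(os.path.join(directory, name), "wb") as f:
        f.write(data)


root = tempfile.mkdtemp()
index_dir = os.path.join(root, "index")
backup_dir = os.path.join(root, "backup")
os.makedirs(index_dir)

write_file(index_dir, "_0.seg", b"first segment")
first_upload = incremental_backup(index_dir, backup_dir)   # copies _0.seg

write_file(index_dir, "_1.seg", b"second segment")
second_upload = incremental_backup(index_dir, backup_dir)  # copies only _1.seg
```

If files could be rewritten in place, the name comparison would not be safe and the backup would need checksums or modification times instead.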

2. Cross-cluster replication at segment boundaries

Elasticsearch's Cross Cluster Replication (CCR) replicates data at Lucene segment granularity: the follower cluster never sees in-memory buffer contents or yet-unflushed documents; it only sees whole, persisted, never-to-change segments. GitHub's 2026 GHES search rewrite exploits this. (Source: sources/2026-03-03-github-how-we-rebuilt-the-search-architecture-for-high-availability)

3. Lock-free search during writes

Readers can operate on a fixed segment set without coordinating with the writer that's producing new segments: the read-path view is a snapshot of "segments that existed at open time." This is why Lucene's near-real-time search can return results from a newly flushed segment as soon as the reader is reopened, without taking a lock on ongoing indexing.
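
The snapshot behaviour can be illustrated with a small single-threaded sketch. The `Writer` and `Reader` classes here are hypothetical stand-ins, not Lucene's `IndexWriter` or `DirectoryReader`; the point is only that a reader holding an immutable segment tuple is unaffected by later flushes.

```python
class Writer:
    """Hypothetical writer that publishes an immutable tuple of segments."""

    def __init__(self):
        self._segments = ()

    def flush(self, docs):
        # Build a new tuple; tuples already handed to readers are untouched.
        self._segments = self._segments + (tuple(docs),)

    def open_reader(self):
        return Reader(self._segments)


class Reader:
    """Hypothetical reader with a point-in-time view fixed at open time."""

    def __init__(self, segments):
        self._segments = segments

    def search(self, term):
        return [doc for seg in self._segments for doc in seg if term in doc]


writer = Writer()
writer.flush(["red apple", "green pear"])
old_reader = writer.open_reader()

writer.flush(["red cherry"])       # indexing continues
new_reader = writer.open_reader()  # reopen to see the new segment

old_hits = old_reader.search("red")
new_hits = new_reader.search("red")
```

Because neither the tuple nor the segments inside it are ever modified, the old reader needs no lock to stay consistent; it simply keeps answering from the segments it was opened with.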
