SYSTEM Cited by 1 source
Apache Lucene
Apache Lucene is the Java full-text search library that underpins Elasticsearch and OpenSearch, the Elasticsearch fork that AWS offers as the managed Amazon OpenSearch Service. Lucene's on-disk unit of persistence is the segment: an immutable, self-contained set of index files (the inverted index plus its supporting structures) produced by flushing an in-memory buffer. Segments are periodically merged into larger segments in the background.
This page is a stub; most wiki references to Lucene come through Elasticsearch / OpenSearch.
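The flush-then-merge lifecycle described above can be illustrated with a toy model. This is a minimal sketch in Python, not Lucene's API: `ToyIndex`, `Segment`, `build_segment`, and `merge_segments` are hypothetical names invented for illustration, and the model ignores deletes, codecs, merge policies, and near-real-time readers.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    """Toy immutable segment: a frozen inverted index (term -> doc ids)."""
    postings: tuple  # tuple of (term, (doc_id, ...)) pairs


def _freeze(index):
    # Convert a mutable {term: iterable of doc ids} map into a frozen Segment.
    return Segment(tuple(sorted((t, tuple(sorted(ids))) for t, ids in index.items())))


def build_segment(docs):
    """Invert {doc_id: text} into a Segment (whitespace tokenizer, lowercased)."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return _freeze(index)


def merge_segments(segments):
    """Combine several segments into one larger segment."""
    merged = {}
    for seg in segments:
        for term, ids in seg.postings:
            merged.setdefault(term, set()).update(ids)
    return _freeze(merged)


@dataclass
class ToyIndex:
    buffer: dict = field(default_factory=dict)    # in-memory, not yet persisted
    segments: list = field(default_factory=list)  # flushed, immutable segments

    def add(self, doc_id, text):
        self.buffer[doc_id] = text

    def flush(self):
        # Turn the buffer into a new immutable segment and clear the buffer.
        if self.buffer:
            self.segments.append(build_segment(self.buffer))
            self.buffer = {}

    def merge(self):
        # Simplification: merge everything into one segment in one step.
        if len(self.segments) > 1:
            self.segments = [merge_segments(self.segments)]

    def search(self, term):
        # Simplification: only flushed segments are searchable in this model.
        hits = set()
        for seg in self.segments:
            hits.update(dict(seg.postings).get(term.lower(), ()))
        return sorted(hits)
```

In this model a document added with `add` is invisible to `search` until `flush` runs, each `flush` appends a new segment without touching earlier ones, and `merge` replaces several small segments with one larger one while returning the same search results.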
Why segments are load-bearing in the wiki
Segment-level replication boundary for CCR
Elasticsearch's Cross Cluster Replication (see concepts/cross-cluster-replication) replicates data once it's been persisted to Lucene segments. This gives CCR a durable, immutable replication unit: the follower cluster doesn't see in-memory buffer contents or yet-unflushed documents; it only sees whole, persisted segments. GitHub's 2026 GHES search rewrite exploits this: the leader cluster's segments are the durable truth, the follower cluster replays them. (Source: sources/2026-03-03-github-how-we-rebuilt-the-search-architecture-for-high-availability)
Segment-level is coarser than per-document but finer than per-index-snapshot — it's the right grain for streaming near-real-time replication with durability guarantees.
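The visibility boundary described above, where a follower sees only whole persisted segments and never the leader's unflushed buffer, can be sketched as a toy model. This is an illustrative assumption-laden sketch, not Elasticsearch's CCR implementation or API: `Leader`, `Follower`, and `pull` are hypothetical names, and the model ignores segment merges on the leader, failures, and networking.

```python
class Leader:
    """Toy leader: documents sit in a buffer until flushed into a segment."""

    def __init__(self):
        self.buffer = []    # unflushed documents, invisible to replication
        self.segments = []  # persisted segments, each an immutable tuple of docs

    def index(self, doc):
        self.buffer.append(doc)

    def flush(self):
        # Persist the buffer as one immutable segment.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []


class Follower:
    """Toy follower: copies whole segments it has not yet seen."""

    def __init__(self):
        self.segments = []

    def pull(self, leader):
        # Assumes the leader only appends segments (no merges in this model),
        # so the follower's segment count doubles as its replication offset.
        new = leader.segments[len(self.segments):]
        self.segments.extend(new)
        return len(new)
```

Under these assumptions, calling `pull` before the leader flushes copies nothing, and calling it afterwards copies exactly the newly persisted segment; documents still in the leader's buffer never appear on the follower.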
Stub caveats
- Not covered here: Lucene's merge policy (segment compaction),
scoring-function internals (BM25, DFR, ...), codecs, index-time
vs query-time analyzer chains, the IndexWriter/IndexReader
APIs, and NRT (near-real-time) search.
Seen in
- sources/2026-03-03-github-how-we-rebuilt-the-search-architecture-for-high-availability — Lucene segments as the durable replication unit underpinning Elasticsearch CCR in GHES 3.19.1.
Related
- systems/elasticsearch — distributed search engine over Lucene.
- systems/amazon-opensearch-service — AWS's managed service for OpenSearch (the Elasticsearch fork), same Lucene core.
- concepts/cross-cluster-replication — replicates data at the Lucene-segment granularity.