
SYSTEM Cited by 1 source

Apache Lucene

Apache Lucene is the Java full-text search library that underpins Elasticsearch and OpenSearch, the Elasticsearch fork that Amazon offers as the managed Amazon OpenSearch Service. Lucene's on-disk unit of persistence is the segment: an immutable, self-contained inverted index, stored as a small set of files and produced by flushing an in-memory buffer. Segments are periodically merged into larger segments in the background.
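The flush-and-merge lifecycle can be illustrated with a toy model. This is a minimal sketch in plain Python, not Lucene's API: the names ToyIndex, Segment, build_segment and merge_segments are hypothetical, and the model ignores deletes, commits, codecs and merge policies. It only shows that documents sit in a buffer until a flush turns them into an immutable segment, and that a merge rewrites several segments into one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical immutable segment: sorted (term, doc_ids) pairs."""
    postings: tuple

    def search(self, term):
        return dict(self.postings).get(term, ())


def build_segment(docs):
    """Build an inverted index from a dict of doc_id -> text."""
    index = {}
    for doc_id, text in sorted(docs.items()):
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    return Segment(tuple(sorted((t, tuple(ids)) for t, ids in index.items())))


def merge_segments(segments):
    """Combine several segments into one new, larger segment."""
    merged = {}
    for seg in segments:
        for term, ids in seg.postings:
            merged.setdefault(term, set()).update(ids)
    return Segment(
        tuple(sorted((t, tuple(sorted(ids))) for t, ids in merged.items()))
    )


class ToyIndex:
    """Toy model only: an in-memory buffer plus a list of segments."""

    def __init__(self):
        self.buffer = {}    # documents not yet flushed
        self.segments = []  # immutable segments, oldest first

    def add(self, doc_id, text):
        self.buffer[doc_id] = text

    def flush(self):
        # Flushing turns the buffer into one new immutable segment.
        if self.buffer:
            self.segments.append(build_segment(self.buffer))
            self.buffer = {}

    def merge(self):
        # Merging replaces the existing segments with a single larger one.
        if len(self.segments) > 1:
            self.segments = [merge_segments(self.segments)]

    def search(self, term):
        # Only flushed segments are searched; the buffer is not visible.
        hits = []
        for seg in self.segments:
            hits.extend(seg.search(term))
        return sorted(hits)
```

In this model a search for a term returns nothing until flush() has run, and the result set is unchanged by merge(), which only reduces the number of segments.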

This page is a stub; most wiki references to Lucene come through Elasticsearch / OpenSearch.

Why segments are load-bearing in the wiki

Segment-level replication boundary for CCR

Elasticsearch's Cross Cluster Replication (see concepts/cross-cluster-replication) replicates data once it has been persisted to Lucene segments. This gives CCR a durable, immutable replication unit: the follower cluster never sees in-memory buffer contents or not-yet-flushed documents; it sees only whole, persisted segments. GitHub's 2026 GHES search rewrite exploits this: the leader cluster's segments are the durable truth, and the follower cluster replays them. (Source: sources/2026-03-03-github-how-we-rebuilt-the-search-architecture-for-high-availability)

Segment-level replication is coarser than per-document replication but finer than per-index snapshots, which makes it a suitable grain for streaming, near-real-time replication with durability guarantees.
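The replication boundary described above can be sketched as a toy model. This is an illustration of the property only, not the CCR protocol or any Elasticsearch API: the Leader and Follower classes and their methods are hypothetical, segments are modelled as plain tuples of documents, and merges are left out so that a segment's position in the list can serve as its identity.

```python
class Leader:
    """Hypothetical leader: a buffer plus a list of flushed segments."""

    def __init__(self):
        self.buffer = []    # unflushed documents, invisible to followers
        self.segments = []  # immutable tuples of documents, oldest first

    def index(self, doc):
        self.buffer.append(doc)

    def flush(self):
        # Persist the buffer as one immutable segment.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []


class Follower:
    """Hypothetical follower: copies whole segments, never buffer contents."""

    def __init__(self):
        self.segments = []

    def sync(self, leader):
        # Copy only the segments this follower has not replicated yet
        # and return how many were copied.
        new = leader.segments[len(self.segments):]
        self.segments.extend(new)
        return len(new)

    def docs(self):
        return [doc for seg in self.segments for doc in seg]
```

For example, if the leader indexes two documents and the follower syncs before any flush, nothing is copied. After a flush, one sync copies exactly one segment, and a document indexed after that flush stays invisible to the follower until the next flush.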

Stub caveats

  • Not covered here: Lucene's merge policy (index compaction), scoring-function internals (BM25, DFR, ...), codecs, index-time vs. query-time analyzer chains, the IndexWriter / IndexReader APIs, and NRT (near-real-time) search.

Seen in
