Apache Lucene¶
Apache Lucene is the Java full-text search library that underpins Elasticsearch and OpenSearch, the Elasticsearch fork that AWS offers as the managed Amazon OpenSearch Service. Lucene's on-disk unit of persistence is the segment: an immutable set of inverted-index files produced by flushing an in-memory buffer. Segments are periodically merged into larger segments in the background.
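The buffer, flush, and merge lifecycle described above can be pictured with a small toy model. This is a minimal sketch and not Lucene's API or file format: `ToyIndex` and its methods are hypothetical names, and a real segment holds far more than a term-to-document map.

```python
from collections import defaultdict
from types import MappingProxyType


class ToyIndex:
    """Toy model of buffer -> flush -> merge (hypothetical, not Lucene's API)."""

    def __init__(self):
        self.buffer = defaultdict(set)  # term -> doc ids; mutable, in memory
        self.segments = []              # read-only segments, oldest first

    def add(self, doc_id, text):
        """Index a document into the in-memory buffer only."""
        for term in text.lower().split():
            self.buffer[term].add(doc_id)

    def flush(self):
        """Freeze the buffer into a new read-only segment."""
        if not self.buffer:
            return
        frozen = {term: frozenset(ids) for term, ids in self.buffer.items()}
        self.segments.append(MappingProxyType(frozen))
        self.buffer = defaultdict(set)

    def merge(self):
        """Write one larger segment; old segments are replaced, never edited."""
        merged = defaultdict(set)
        for segment in self.segments:
            for term, ids in segment.items():
                merged[term] |= ids
        if merged:
            frozen = {term: frozenset(ids) for term, ids in merged.items()}
            self.segments = [MappingProxyType(frozen)]

    def search(self, term):
        """Look only at flushed segments; buffered documents are invisible."""
        hits = set()
        for segment in self.segments:
            hits |= segment.get(term.lower(), frozenset())
        return sorted(hits)


index = ToyIndex()
index.add(1, "lucene segments")
index.flush()
index.add(2, "segments merge")            # still only in the buffer
print(index.search("segments"))           # [1]
index.flush()
index.merge()
print(len(index.segments), index.search("segments"))  # 1 [1, 2]
```

The point of the sketch is that nothing ever modifies a segment after it is written: new data arrives as new segments, and merging produces a replacement rather than an in-place update.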
This page is a stub; most wiki references to Lucene come through Elasticsearch / OpenSearch.
Why segments are load-bearing in the wiki¶
Segment-level replication boundary for CCR¶
Elasticsearch's Cross Cluster Replication (see concepts/cross-cluster-replication) replicates data once it's been persisted to Lucene segments. This gives CCR a durable, immutable replication unit: the follower cluster doesn't see in-memory buffer contents or yet-unflushed documents; it only sees whole, persisted segments. GitHub's 2026 GHES search rewrite exploits this: the leader cluster's segments are the durable truth, the follower cluster replays them. (Source: sources/2026-03-03-github-how-we-rebuilt-the-search-architecture-for-high-availability)
Segment-level is coarser than per-document but finer than per-index-snapshot — it's the right grain for streaming near-real-time replication with durability guarantees.
Immutable segments as backup unit¶
Beyond replication, Lucene segment immutability is the property that makes incremental-on-commit backup to object storage cheap. Yelp Nrtsearch 1.0.0 exploits this: every commit diffs the local segment directory against S3 and uploads only the files that are not already there. Once Nrtsearch has S3 as the source of truth for committed segments, the primary's local disk can be ephemeral SSD rather than EBS. (Source: sources/2025-05-08-yelp-nrtsearch-100-incremental-backups-lucene-10)
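The diff-and-upload step can be sketched as follows. This is a minimal illustration under stated assumptions, not Nrtsearch's implementation: `incremental_backup` is a hypothetical helper, a plain dict stands in for the S3 bucket, and the file names are only Lucene-style examples.

```python
import tempfile
from pathlib import Path


def incremental_backup(local_dir: Path, remote: dict) -> list:
    """Upload files present locally but missing remotely; return their names.

    `remote` is a hypothetical stand-in for an object store (name -> bytes).
    Comparing names alone is safe only because segment files are immutable:
    a file that already exists remotely cannot have changed locally.
    """
    uploaded = []
    for path in sorted(local_dir.iterdir()):
        if path.is_file() and path.name not in remote:
            remote[path.name] = path.read_bytes()
            uploaded.append(path.name)
    return uploaded


with tempfile.TemporaryDirectory() as tmp:
    segment_dir = Path(tmp)
    bucket = {}

    # First commit: one segment file exists locally.
    (segment_dir / "_0.cfs").write_bytes(b"segment zero")
    first = incremental_backup(segment_dir, bucket)

    # Second commit: a new segment appears; the old one is untouched.
    (segment_dir / "_1.cfs").write_bytes(b"segment one")
    second = incremental_backup(segment_dir, bucket)

print(first)   # ['_0.cfs']
print(second)  # ['_1.cfs']
```

The sketch leaves out pruning remote files whose segments were merged away locally, which a real backup scheme would also need to handle.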
See concepts/immutable-segment-file for the general wiki treatment of segment immutability and its downstream architectural consequences.
Lucene 10 (2024+) — vector search and SIMD¶
Lucene added HNSW vector search in version 9 and kept expanding its vector-search capabilities through version 10; Yelp Nrtsearch 1.0.0 ships with 10.1.0. The headline capabilities available in 10.x that are relevant to the wiki:
- HNSW vector index (systems/hnsw / concepts/hnsw-index) for float and byte vectors. Multiple similarity types: cosine (with optional auto-normalisation), dot product, euclidean, maximum inner product.
- Scalar quantization (concepts/scalar-quantization): configurable memory/recall tradeoff for float vectors; maps each dimension to int8/float16.
- Vector search inside nested documents.
- Intra-merge parallelism for HNSW graph merge.
- SIMD vector-instruction acceleration via Java 21's Vector API + foreign-memory API (concepts/simd-vectorization).
- Intra-single-segment parallel search — enables a single query to parallelise across CPU cores within a single segment, which removes one of the historical motivations for virtual sharding (using multiple small clusters to get CPU parallelism that a single cluster couldn't provide).
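To make the scalar-quantization item in the list above concrete, here is a minimal sketch of generic linear quantization. It is not Lucene's implementation, which chooses bounds and rounding in its own way: `quantize` and `dequantize` are hypothetical helpers, and the bounds `lo` and `hi` are assumed to be given.

```python
def quantize(vector, lo, hi):
    """Map each float in [lo, hi] to an integer code in [0, 127].

    Values outside the bounds are clamped. Every code fits in one byte.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scale = (hi - lo) / 127
    codes = []
    for x in vector:
        clamped = min(max(x, lo), hi)
        codes.append(round((clamped - lo) / scale))
    return codes


def dequantize(codes, lo, hi):
    """Approximately recover the original floats from their codes."""
    scale = (hi - lo) / 127
    return [lo + c * scale for c in codes]


vector = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes = quantize(vector, -1.0, 1.0)
restored = dequantize(codes, -1.0, 1.0)
max_error = max(abs(a - b) for a, b in zip(vector, restored))
print(codes)
print(max_error)
```

Storing one byte per dimension instead of a 4-byte float32 cuts vector memory by roughly 4x. The cost is a per-dimension rounding error of at most half a quantization step, which is where the recall tradeoff comes from.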
The Lucene 10 feature set is the architectural enabler for Yelp Nrtsearch's move from Lucene 8.4.0 to 10.1.0 as part of the 2025-05 1.0.0 release.
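The intra-single-segment parallel search item in the list above can be pictured as splitting one segment's document-ID range into slices, collecting a top-k per slice, and merging the partial results. The sketch below shows only that structure and is not Lucene's API. `parallel_top_k` is a hypothetical helper, `scores` is assumed to be a precomputed per-document score list for a single segment, and CPython threads will not give the CPU parallelism that JVM threads do.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def top_k_in_slice(scores, start, end, k):
    """Return the k best (score, doc_id) pairs for doc ids in [start, end)."""
    return heapq.nlargest(k, ((scores[d], d) for d in range(start, end)))


def parallel_top_k(scores, k, workers=4):
    """Split one segment's doc-id range into slices and merge per-slice top-k."""
    n = len(scores)
    step = max(1, -(-n // workers))  # ceiling division, at least 1
    slices = [(s, min(s + step, n)) for s in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(
            lambda bounds: top_k_in_slice(scores, bounds[0], bounds[1], k),
            slices,
        )
        candidates = [hit for partial in partials for hit in partial]
    # Each slice contributed its own best k, so the global best k is among them.
    return heapq.nlargest(k, candidates)


scores = [0.1, 0.9, 0.4, 0.7, 0.3, 0.8, 0.2, 0.6]
print(parallel_top_k(scores, 3))  # [(0.9, 1), (0.8, 5), (0.7, 3)]
```

Because the split happens inside a segment, a large segment no longer has to be searched by one thread. That need for parallelism is the motivation for virtual sharding that the list item says this feature removes.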
Stub caveats¶
- Not covered here: Lucene's merge policy (index compaction), scoring-function internals (BM25, DFR, ...), codecs, index-time vs query-time analyzer chains, the IndexWriter/IndexReader APIs, and NRT (near-real-time) search.
Seen in¶
- sources/2026-03-03-github-how-we-rebuilt-the-search-architecture-for-high-availability — Lucene segments as the durable replication unit underpinning Elasticsearch CCR in GHES 3.19.1.
- sources/2025-05-08-yelp-nrtsearch-100-incremental-backups-lucene-10 — Lucene 10.1.0 + Java 21 as the platform for Yelp Nrtsearch 1.0.0. HNSW vector search (up to 4096 elements), scalar quantization, SIMD acceleration, and the roadmap for intra-single-segment parallel search are all Lucene-level capabilities this release surfaces to applications.
Related¶
- systems/elasticsearch — distributed search engine over Lucene.
- systems/amazon-opensearch-service — AWS's managed fork, same Lucene core.
- systems/nrtsearch — Yelp's open-source Lucene-based search engine; 2025-05 1.0.0 ships Lucene 10.1.0.
- systems/hnsw — graph-based ANN index; Lucene's native vector-search implementation.
- concepts/cross-cluster-replication — replicates data at the Lucene-segment granularity.
- concepts/immutable-segment-file — general wiki treatment of segment immutability.
- concepts/hnsw-index — Lucene's vector index shape.
- concepts/scalar-quantization — memory/recall tradeoff for Lucene 10 float-vector storage.
- concepts/simd-vectorization — Lucene 10 exploits Java 21 SIMD instructions for faster vector math.