Yelp — Nrtsearch 1.0.0: Incremental Backups, Lucene 10, and More¶
Summary¶
Four years after adopting its in-house Lucene-based search engine Nrtsearch in production, and with >90% of Elasticsearch traffic migrated to it, Yelp Engineering cut Nrtsearch 1.0.0. The release is a consolidation of three production-shaped changes: (1) an incremental-backup-on-commit architecture that backs up every commit to S3 as individual Lucene segment files, which lets the primary move from network-attached EBS to ephemeral local SSD without data loss and gives replicas a 5× bootstrap speedup via parallel S3 download; (2) an upgrade from Lucene 8.4.0 to Lucene 10.1.0 (with Java 21), unlocking HNSW vector search, SIMD vector-instruction acceleration, and future intra-single-segment parallel search; (3) a state-management overhaul that decouples cluster/index state from data commits, makes state changes durable per-request, moves state's source of truth from EBS to S3, and adds hot reload on replicas so schema changes no longer require a restart fleet-wide. The post also lands aggregations integrated with parallel search (fixing a Lucene-facets-single-threaded tail-latency issue), expanded plugin types, and more exposed Lucene query types. Roadmap: NRT replication via S3 instead of gRPC, and replacing virtual-sharding with intra-segment parallel search.
Key takeaways¶
- Every Lucene commit now uploads only the new segment files to S3 — incremental backup on commit. Lucene segments are immutable, so Nrtsearch diffs the local directory against S3 on commit and uploads only the files that aren't there yet. The cost is "generally a few ms to 20 seconds depending on the size of the data, which is trivial enough to not cause any issues." This is the load-bearing enabler for every other architectural change in the post: commit-durability now lives in S3, not on the primary's disk, so the primary no longer needs a durable disk. Canonical wiki instance of patterns/incremental-s3-backup-of-immutable-files over archive-based periodic backup.
- Primary nodes moved from network-attached EBS to ephemeral local SSD. Two prior blockers — (a) data loss between the last full backup and the most recent commit if the primary restarted; (b) slow index download on a new node — were resolved by incremental backup (a) and parallel multi-file S3 download + local SSD (b), yielding a 5× download-speed increase. The EBS drawbacks Yelp enumerates are load-bearing: EBS was the source of truth (loss = full reindex), EBS dismount/mount transitions were "not as smooth as expected," and ingestion-heavy clusters needed frequent full backups to keep replica catch-up time short. Opens Yelp's ninth distinct post-axis on the wiki (search-engine infrastructure / Lucene-platform / storage-tiering).
- Full consistent snapshots are still taken, but no longer via the primary. Yelp now takes them as direct S3-to-S3 copies of the latest committed data between locations, since the incremental-backup S3 prefix already contains every committed segment. Snapshots are now for disaster recovery only (not replica bootstrap), so they can be less frequent than before. Canonical disclosure that replica-bootstrap and disaster-recovery backups are two different needs with different frequency cadences.
- Replicas now bootstrap from S3 directly, using parallel downloads to saturate network bandwidth. Replicas download multiple segment files from S3 concurrently, combined with local SSD write, yielding the 5× speedup over serialised download. Canonical patterns/parallel-s3-download-for-bootstrap wiki instance; generalises beyond search to any cold-start-from-object-storage case where you're network-bound on a single-threaded client.
- Lucene 10 brings HNSW vector search to Nrtsearch, alongside SIMD + foreign-memory acceleration (via Java 21). Yelp exposes kNN vector search for float + byte vectors up to 4096 elements, with cosine (auto-normalized optional), dot product, euclidean, and maximum inner product similarity. Advanced features: scalar quantization (accuracy/memory tradeoff), vector search on fields in nested documents, configurable intra-merge parallelism to speed up graph-merge, and optional SIMD vector-instruction acceleration. Canonical wiki surface of Yelp's Nrtsearch as a production HNSW consumer, joining the Lucene/Elasticsearch/OpenSearch/Faiss/Vespa/Qdrant HNSW-consumer group.
- State management was redesigned around an immutable index-state representation. Before: state changes were batched into a commit, flushed to EBS, backed up to S3, and replicas restarted — a multi-step process that could lose updates on primary restart and couldn't propagate changes without a cluster restart. After: state changes are committed per-request (no batching with data commits), stored remotely on S3 as the source of truth, and hot-reloaded on replicas. Internally, changes merge into a new immutable state object that atomically replaces the old reference after the state-backend commit succeeds — clients see a consistent state for the lifetime of a single request. Two distinct problems solved: (a) decoupled state commit from data commit so backwards-compatible schema changes don't block on unrelated data commits; (b) atomic immutable state makes same-request consistency trivial.
- Cluster state is now rich enough that servers self-start the necessary indices. Before: the Kubernetes operator had to know which indices to start. After: cluster state tracks each index's "started" state + a unique per-index identifier, so servers know what to do on bootstrap without external coordination. Canonical "push responsibility out of the operator into the system" simplification.
- Aggregations are now integrated with Lucene's parallel search. Lucene facets run single-threaded after the parallel document-collection phase finishes, adding noticeable latency for large indices with complex/nested aggregations. Nrtsearch's replacement tracks aggregation state per index slice during collection, then merges slice states recursively at reduce-time, yielding parallel aggregations. Supported: term (text + numeric), filter, top-hits, min, max; nested aggregations merge recursively. Canonical wiki datum on Lucene-facets-as-tail-latency-source and the parallel-integration fix.
- More plugin types exposed. Script (custom Java scoring + Lucene JavaScript expressions for simple cases), Rescorer (custom rescoring on a candidate subset), Highlight, Hits Logger (training-data collection for ranking models), Fetch Task (result-level enrichment), Aggregation. Yelp frames plugins as the extensibility surface that lets applications ship custom scoring without forking Nrtsearch.
- Roadmap names two concrete items: NRT replication via S3 instead of gRPC; virtual sharding replaced by Lucene 10 intra-single-segment parallel search. The gRPC NRT pull currently makes the primary a bottleneck for replica scale; moving to S3 as the replication substrate would let replicas scale independently. Virtual sharding (split a large index into multiple clusters, scatter-gather at the coordinator) was a workaround for Lucene's prior lack of intra-segment parallelism — Lucene 10 now has it natively, so the in-process path should eventually replace the multi-cluster coordinator path.
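The diff-and-upload step behind incremental backup on commit can be sketched as follows. This is a minimal illustration, not Nrtsearch's actual code: `FakeStore` and its `list_keys`/`upload` methods are hypothetical stand-ins for an S3 client, and the key point is that immutable segment files already present under the prefix are never re-uploaded.

```python
def backup_commit(local_files, remote_store, prefix):
    """On commit, upload only segment files not already in the store.

    Lucene segment files are immutable, so a file whose key already
    exists under the prefix never needs re-uploading.
    """
    existing = set(remote_store.list_keys(prefix))
    uploaded = []
    for name, data in local_files.items():
        key = f"{prefix}/{name}"
        if key not in existing:
            remote_store.upload(key, data)
            uploaded.append(name)
    return uploaded


class FakeStore:
    """In-memory stand-in for S3, for illustration only."""
    def __init__(self):
        self.objects = {}
    def list_keys(self, prefix):
        return [k for k in self.objects if k.startswith(prefix + "/")]
    def upload(self, key, data):
        self.objects[key] = data


store = FakeStore()
# First commit: two files, both new, both uploaded.
backup_commit({"_0.cfs": b"seg0", "segments_1": b"meta1"}, store, "idx")
# Second commit: _0.cfs is already backed up; only the new files go up.
new = backup_commit(
    {"_0.cfs": b"seg0", "_1.cfs": b"seg1", "segments_2": b"meta2"},
    store, "idx")
```

After the second commit `new` contains only `_1.cfs` and `segments_2`, which is the whole trick: per-commit upload cost scales with the new segments, not the index size.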
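The "auto-normalized optional" note on cosine similarity has a simple rationale worth making concrete: on unit-normalized vectors, cosine similarity reduces to a plain dot product, so normalizing once at index time saves the norm computations on every comparison. A sketch of the math only, not Lucene's implementation:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale to unit length; assumes a nonzero vector.
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
# On pre-normalized vectors, dot product and cosine similarity agree.
assert abs(cosine(a, b) - dot(normalize(a), normalize(b))) < 1e-9
```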
Architecture¶
Previous architecture (pre-1.0.0)¶
indexing clients ─┐
│ index/commit/delete
▼
┌─────────────────┐ periodic full backup
│ Primary (EBS) │────────────────► S3 (archives)
│ Lucene segments │ │
│ on EBS volume │ │
└────────┬────────┘ │
│ NRT updates (gRPC) │ download
▼ ▼
┌─────────────────┐ ┌──────────────┐
│ Replicas (EBS) │◄─────────────│ On bootstrap │
│ │ pull updates│ │
└─────────────────┘ from primary└──────────────┘
Drawbacks:
· EBS = source of truth; EBS loss/corruption = full reindex
· EBS movement not as smooth as expected on primary restart
· Ingestion-heavy clusters need frequent full backups
(replicas must not spend too long catching up post-bootstrap)
Updated architecture (1.0.0)¶
indexing clients ─┐
│
▼
┌─────────────────────┐
│ Nrtsearch │ forwards index/commit/delete
│ Coordinator │──────┐
│ (also scatter- │ │
│ gathers search) │ │
└────────▲────────────┘ │
│ search responses │
│ (merged) ▼
│ ┌──────────────────────┐
│ │ Primary (LOCAL SSD) │
│ │ │─ on commit, diff+upload
│ │ │ new segments only
│ └────────┬─────────────┘
│ │
│ ▼
│ ┌──────────────────────────┐
│ │ S3 (per-segment objects) │
│ │ ← source of truth for │
│ │ committed data │
│ │ ← source of truth for │
│ │ cluster/index state │
│ └──────────┬───────────────┘
│ │ parallel download at bootstrap
│ │
│ ┌─────────▼────────────┐
└────────│ Replicas (LOCAL SSD) │
search │ │
requests │ - S3 bootstrap │
│ - NRT pull from │
│ primary (gRPC) │
│ - hot-reload state │
│ on change │
└──────────────────────┘
Properties:
· S3 is the committed-data + state source of truth
· Primary's local SSD is ephemeral (commit durability in S3)
· Replica bootstrap = 5× faster via parallel download + SSD
· Full snapshots: direct S3→S3 copy, less frequent, DR-only
· Large indices: split into multiple clusters; coordinator
scatter-gathers (virtual sharding, to be replaced by
intra-single-segment parallel search from Lucene 10)
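The parallel-bootstrap property above can be sketched with a thread pool. `fetch` here is a hypothetical callable standing in for an S3 GET; a real client would stream each object to local SSD rather than hold bytes in memory, but the shape is the same: issue many fetches concurrently so a single network-bound client stops being the bottleneck.

```python
from concurrent.futures import ThreadPoolExecutor

def bootstrap_download(keys, fetch, workers=8):
    """Fetch many segment objects concurrently.

    A serialized loop over keys rarely saturates network bandwidth;
    running fetches in parallel is what yields the bootstrap speedup.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so keys and results line up.
        return dict(zip(keys, pool.map(fetch, keys)))

# Illustration with an in-memory stand-in for the bucket.
fake_bucket = {f"idx/_{i}.cfs": bytes([i]) for i in range(10)}
files = bootstrap_download(list(fake_bucket), fake_bucket.__getitem__)
```

Worker count would be tuned to saturate the instance's network bandwidth without overwhelming the local SSD's write throughput.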
State management (new)¶
State change request
│
▼
┌────────────────────┐
│ Primary │ 1. build new immutable state = old ⊕ change
│ │ 2. commit to state backend (S3 remote, or EBS)
│ │ 3. atomic swap of state reference
│ │ 4. acknowledge request
└────────┬───────────┘
│ notify
▼
┌────────────────────┐
│ Replicas │ hot-reload new state (no restart)
└────────────────────┘
Guarantees:
· State changes durable before request returns (no loss on restart)
· Client requests see stable state for request lifetime (immutable)
· State propagation: no cluster restart needed
· Data commit and state commit are independent
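The commit-then-swap sequence in the diagram can be sketched as a small class. This is an illustration of the pattern, not Nrtsearch's actual classes: `backend_commit` is a hypothetical callable standing in for the durable S3/EBS state-backend write, and the swap happens only after it succeeds.

```python
import threading

class StateManager:
    """Immutable-state holder: writers build a new state object and
    atomically swap the reference only after the backend commit succeeds."""

    def __init__(self, backend_commit):
        self._state = {}                 # current state; never mutated in place
        self._lock = threading.Lock()    # serializes writers only
        self._backend_commit = backend_commit

    def current(self):
        # Readers grab the reference once; that object stays consistent
        # for the lifetime of their request even if a writer swaps later.
        return self._state

    def apply(self, change):
        with self._lock:
            new_state = {**self._state, **change}  # old state merged with change
            self._backend_commit(new_state)        # durable before acknowledging
            self._state = new_state                # atomic reference swap

committed = []
mgr = StateManager(committed.append)
snapshot = mgr.current()                 # a request's view of state
mgr.apply({"fields": ["review_text"]})   # durable, then visible to new requests
```

If `backend_commit` raises, the swap never happens, so readers keep seeing the last durable state; and `snapshot` still refers to the pre-change object, which is the same-request consistency property described above.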
Operational numbers¶
- >90% of Elasticsearch traffic at Yelp has been migrated to Nrtsearch (as of 2025-05).
- 4+ years in production; Nrtsearch 1.0.0 is the first major release since inception.
- 5× increase in bootstrap download speed from parallel S3 downloads + local SSD vs. the prior serialised-download / EBS-mount path.
- "A few ms to 20 seconds" — added per-commit latency from the incremental S3 upload; framed as negligible ingestion-side.
- Primary startup — "under a minute" on the previous EBS architecture (via volume re-attach); the new local-SSD path is not directly compared numerically, but is implicitly competitive given the 5× bootstrap speedup.
- Vector search ceiling — up to 4096 elements per vector for both float and byte vector types.
- Lucene version: 10.1.0 (from 8.4.0). Java version: 21 (enables SIMD vector instructions + foreign memory API).
- Aggregations supported at 1.0: term (text + numeric), filter, top-hits, min, max (no mention of percentiles, histograms, date-histograms, cardinality-distinct, or significant-terms — these likely arrive in later minor releases or via plugins).
Caveats¶
- No absolute throughput / latency numbers. The post is a release-note-shaped architecture overview. No QPS, no p99 latency, no memory footprint, no fleet-size disclosure. Yelp's pre-existing sources/2021-09-27-yelp-nrtsearch-yelps-fast-scalable-and-cost-effective-search-engine|2021 Nrtsearch launch post and sources/2023-10-01-yelp-coordinator-the-gateway-for-nrtsearch|2023 Coordinator post are the back-references for quantitative shape (not yet ingested as of this turn).
- No A/B benchmark vs. Elasticsearch. The 1.0 post doesn't re-benchmark against Elasticsearch — the 2021 post is the canonical comparison. Yelp's ongoing migration is asserted, not re-justified.
- Aggregation correctness caveat. The recursive slice-state-merge mechanism for nested aggregations is described at the pattern level, not the mathematical level. For aggregations whose slice-level state can't be cleanly merged (e.g. median, cardinality via HLL, percentiles via t-digest), integration would require the state structures to support merge operations — not directly addressed in the post. The supported list (term, filter, top-hits, min, max) happens to be the subset with trivially-mergeable state.
- Vector search freshness caveat. HNSW indexes, as wiki systems/hnsw and concepts/hnsw-index document, are RAM-bound and require periodic rebuilds rather than incremental updates. The post does not disclose how Yelp handles this — whether they rebuild periodically per segment merge, or whether Lucene 10's HNSW support has added incremental-update semantics. Lucene segments being immutable means new docs write new segments with new graphs, which composes naturally with immutable-segment backup, but the merge-time cost for large vector graphs is not quantified.
- Virtual sharding deprecation timeline unspecified. Yelp names the roadmap (replace virtual sharding with intra-single-segment parallel search) but gives no timeline and doesn't quantify how many Nrtsearch clusters currently use virtual sharding versus single-cluster deployments.
- S3-based NRT replication is future work. The 1.0 release still uses gRPC-based primary→replica NRT updates; the S3-based replacement is named as roadmap, not delivered.
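The aggregation-mergeability point can be made concrete. Min/max/count slice states merge associatively and in any order, which is exactly what the collect-per-slice-then-merge design requires; an exact median has no such small mergeable state. Illustrative sketch, not Nrtsearch code:

```python
from functools import reduce

def collect_slice(values):
    """Per-slice aggregation state gathered during document collection."""
    return {"min": min(values), "max": max(values), "count": len(values)}

def merge(a, b):
    """Reduce-time merge of two slice states. Associative and order-free,
    which is what makes these aggregations safe to parallelize."""
    return {"min": min(a["min"], b["min"]),
            "max": max(a["max"], b["max"]),
            "count": a["count"] + b["count"]}

# Three index slices collected in parallel, then merged recursively.
slices = [[3, 9, 4], [1, 7], [8, 2, 6]]
result = reduce(merge, (collect_slice(s) for s in slices))
# An exact median would need the raw values from every slice (or a
# mergeable sketch such as t-digest) — the gap noted in the caveat above.
```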
Source¶
- Original: https://engineeringblog.yelp.com/2025/05/nrtsearch-v1-release.html
- Raw markdown:
raw/yelp/2025-05-08-nrtsearch-100-incremental-backups-lucene-10-and-more-1152f3b7.md
Related¶
- companies/yelp
- systems/nrtsearch — the named system
- systems/lucene — underlying search library (Lucene 10)
- systems/elasticsearch — the predecessor Yelp is migrating off
- systems/hnsw — HNSW graph-ANN, now available via Lucene 10
- systems/aws-s3 — committed-data + state source of truth
- systems/aws-ebs — the retired network-attached block substrate
- concepts/incremental-backup-on-commit
- concepts/immutable-segment-file
- concepts/ephemeral-local-disk-vs-ebs
- concepts/immutable-index-state
- concepts/scalar-quantization
- concepts/hnsw-index
- concepts/scatter-gather-query
- concepts/simd-vectorization
- patterns/incremental-s3-backup-of-immutable-files
- patterns/parallel-s3-download-for-bootstrap
- patterns/hot-reload-over-restart-replicas
- patterns/decoupled-state-commit-from-data-commit
- patterns/coordinator-fronted-sharded-search