Yelp¶
Yelp Engineering blog (engineeringblog.yelp.com) is a Tier-3 source on the sysdesign-wiki. Yelp operates the local-business-discovery platform (reviews, ratings, search, photos) for US / Canada / parts of Europe; the platform combines a curated business-graph (millions of SKUs in the catalog sense), user-generated reviews, and a search stack that routes the raw query through a query-understanding layer before retrieval and ranking.
Per AGENTS.md Tier-3 guidance, Yelp posts are ingested selectively — only where they explicitly cover distributed-systems internals, scaling trade-offs, infrastructure architecture, production incidents, storage/networking/streaming design, or — as with the 2025-02-04 LLM post — a concrete production serving-infrastructure architecture built around LLMs (distinct from pure ML research).
Wiki anchor¶
Eleven on-scope Yelp ingests establish the wiki's Yelp coverage across eight distinct stack axes (seventh axis added with the 2026-04-07 Cassandra 4.x upgrade ingest — datastore platform / database upgrade; eighth axis added with the 2025-05-08 Nrtsearch 1.0.0 ingest — search-engine / Lucene-platform / storage-tiering):
- 2025-02-04 — search query understanding with LLMs (LLM / serving-infra axis). Yelp's first-party disclosure of the production serving architecture for three query-understanding tasks (segmentation, spell correction, review-highlight phrase expansion). Canonicalises Yelp's reusable three-phase productionisation playbook (Formulation → Proof of Concept → Scaling Up) and a three-tier serving cascade (pre-computed head cache → offline fine-tuned GPT-4o-mini for 95%+ of traffic → BERT/T5 realtime tail). As a serving architecture, it is the earliest wiki canonical instance of head-cache-plus-tail applied to LLM-driven search query understanding — pre-dating Instacart's 2025-11-13 Intent Engine canonicalisation by nine months.
- 2025-02-19 — Revenue Automation Series: Building Revenue Data Pipeline (financial-systems / data-platform axis). Yelp's second Revenue Automation Series post (after the 2024-12 billing-system modernisation) on how Yelp built a batch Revenue Data Pipeline feeding a third-party Revenue Recognition SaaS ("REVREC service") to close the books ~50% faster. Documents the methodology stack (Glossary Dictionary → Data Gap Analysis → system-design evaluation across four architectures), architectural selection (Data Lake + Spark ETL wins), and concrete Spark implementation (internal spark-etl package with feature-DAG abstraction, YAML topology-inferred DAG, checkpoint-to-scratch debugging, PySpark UDFs for complex business logic). Three other architectures (MySQL+Python batch, Warehouse+dbt, Event Streams) are explicitly rejected with load-bearing reasons.
- 2025-04-15 — Journey to Zero Trust Access (corporate-security / networking axis). Yelp's Corporate Systems + Client Platform Engineering teams retire Ivanti Pulse Secure as the employee VPN in favour of Netbird, an open-source WireGuard-based ZTA platform. Five named selection pillars: Okta/OIDC, simple UI, open source, throughput/latency, fault tolerance. Load-bearing architectural disclosures: WireGuard mesh topology with router peers provides <2s transparent failover; OIDC+device-posture access gate replaces SAML-via-Pulse flow; open source provides both response agency and realised upstream contribution ("multiple changes ... pushed upstream to Netbird's main branch from Yelpers"). Canonical instance of concepts/vpn-to-zta-migration as a motion rather than a flip — Netbird coexists with Yelp's MTLS-based Edge Gateway, with VPN utilisation "reducing ... to more granular use cases in the future". Opens Yelp's corporate-security-and-networking axis on the wiki, distinct from the 2025-02-04 search-ML axis and the 2025-02-19 financial-systems axis.
- 2025-05-27 — Revenue Automation Series: Testing an Integration with Third-Party System (financial-systems / integration-testing deepening). Third Revenue Automation Series post — documents how Yelp verified the pipeline built in 2025-02-19 rather than building anything new. Six-step strategy: (1) a parallel Staging Pipeline consuming production data but publishing to Glue catalog tables on S3 queryable immediately via Redshift Spectrum — bypassing the ~10-hour Redshift Connector latency that makes same-day verification through the production path infeasible; (2) manual test-data backport from production edge cases to dev fixtures; (3) dual-cadence integrity checks (99.99% contract-invoice match threshold for the monthly check against billing-system truth; daily lightweight SQL against staging for fast iteration); (4) Schema Validation Batch polling the REVREC mapping API before each upload to guard against partner-side schema drift; (5) SFTP standardised over REST after testing both (reliability + file-size ceiling: 500k-700k records/file SFTP vs 50k/file REST → 4-5 files/day vs 15 files/day); (6) documented escalation for third-party SFTP server / upload-job / storage failures. Deepens the 2025-02-19 axis rather than opening a new one.
- 2026-05-21 — How Partition Access Visualizations Reduced our Data Lake S3 Cost by 33% (storage / data-engineering axis deepening; sixth Yelp axis canonicalised on the wiki as data-platform / storage-cost-engineering). First-party disclosure of Yelp's partition-access-visualisation tooling, built on top of the 2025-09-26 SAL pipeline as a downstream consumer. The visualisation primitive: plot partition values (e.g. dt=yyyy-mm-dd) on the y-axis against access-event time on the x-axis, coloured by IAM role — three signature shapes emerge unaided: diagonal y=x (daily batch consumer), vertical line (backfill scanning many partitions at one moment), scatter (ad hoc inspection by internal engineering). Canonical instance of granular usage attribution as the gating observability primitive for data-platform efficiency wins ("granular usage attribution solves this problem by enabling clear insight into how data is consumed and unlocks opportunities for significant cost efficiencies") and of patterns/access-pattern-visualization-for-data-stewardship as the visualisation-as-tooling pattern that replaces stakeholder conversations / stale documentation with an objective continuously-updated artifact. The substrate is a single SQL aggregation over compacted SAL keyed by (table, partition, iam_role, event_time) and filtered to operation = 'REST.GET.OBJECT' — the load-bearing four-tuple that drives all downstream use cases. Headline outcome: 33 % S3 cost reduction on Yelp's petabyte-scale analytics data lake, decomposed qualitatively across (a) IT-by-default storage-class adoption (objects not accessed 30 days → 40 % off; 90 days → 81 % off — "approaches the cost of S3 Glacier!") and (b) the new named Default Access Retention primitive — a middle-ground retention shape between deletion and cold-tier-by-default in which data beyond an explicit access window remains in S3 on Intelligent Tiering but is gated behind a restrictive bucket IAM policy that requires a Terraform-PR + cost-acknowledgement (S3 Inventory cost dashboard estimating the projected access cost across current storage classes) workflow with tiered approval levels (patterns/iam-policy-gated-cold-tier-access).
Two named benefits of DAR: (i) accidental queries do not reset IT's tiering clock ("storage cost is guaranteed to decrease after the initial 30 day period of Intelligent Tiering"); (ii) data consumers acknowledge cost before cold-tier scans (load-bearing PB-scale Archive-Instant-Access cost-bomb disclosure: "For our largest tables, full table scans could add significant S3 costs by accessing PBs of data from cheap Intelligent Tiers like Archive Instant Access. This is not obvious to users who are writing SQL to inspect data!"). Two new comparison-axis concepts: concepts/cold-storage-minimum-duration-tax (the structural failure mode of cold-tier classes — "impose minimum storage durations and retrieval fees that can negate savings if you access data more than you expected to" — generalises beyond S3 Glacier to Azure Archive / GCS Archive); and bet-asymmetry between IT (cost moves with actual access in either direction without penalty) and Glacier (savings require correctly forecasting access; mis-forecasting eats the savings and charges retrieval fees on top). Third pillar: usage-driven Apache Iceberg migration prioritisation — the same observability substrate that drives storage-class decisions also ranks the migration backlog by active-table value ("focus our migration efforts on active tables and partitions that would add the most customer value... provide Apache Iceberg's read performance benefits to the most valuable use cases first"). Reframes the 2025-09-26 SAL-axis as a multi-purpose observability substrate amortising across at least five named use cases: (1) permission-debugging, (2) cost attribution, (3) access-based retention, (4) partition-access visualisation, (5) Iceberg migration prioritisation. Yelp's named contributors: Rishi Madan (development); Infrastructure Security team (Vincent Thibault, Quentin Long, Nurdan Almazbekov) for enabling SAL across Yelp's AWS infrastructure.
Future-work flag: "investing in other areas of our data infrastructure to further enhance lineage and granular usage attribution" — pairs lineage and granular usage attribution as complementary investments. Sixth Yelp axis — distinct from the 2025-02-04 LLM / search-serving-infra axis, the 2025-02-19 / 2025-05-27 financial-systems axis, the 2025-04-15 corporate-security axis, the 2025-07-08 SDUI / client-platform-framework axis, and the 2025-09-26 SAL pipeline axis (deepens that axis with a downstream-consumer disclosure rather than opening anew).
- 2025-09-26 — S3 server access logs at scale (storage / data-engineering axis; opens a fifth Yelp axis on the wiki). First-party retrospective on operationalising S3 Server Access Logs (SAL) at fleet scale ("TiBs of S3 server access logs per day"). Canonicalises the Yelp S3 SAL pipeline: daily Tron batch that converts raw-text SAL objects to Parquet via Athena INSERTs (patterns/raw-to-columnar-log-compaction) achieving 85 % storage + 99.99 % object-count reduction; weekly access-based table joining S3 Inventory with a week of SAL for prefix-granularity retention; tag-based lifecycle expiration via S3 Batch Operations ("the only scalable way to delete per object" — patterns/object-tagging-for-lifecycle-expiration). Load-bearing architectural disclosures: Glue partition projection with enum over managed partitions (patterns/projection-partitioning-over-managed-partitions); idempotent Athena INSERT via self-LEFT-JOIN on requestid for shared-resource retry-safety; SAL's best-effort delivery measured at < 0.001 % of logs arriving more than 2 days late. Three parsing hazards canonicalised: user-controlled log fields (unescaped quotes / SQLi / shellshock in request_uri / referrer / user_agent) with Yelp's optional non-capturing tail regex fix; URL-encoding idiosyncrasy on S3 keys (most ops double-encode; BATCH.DELETE.OBJECT / S3.EXPIRE.OBJECT single-encode). Preferred over AWS's CloudTrail Data Events on cost: "$1 per million data events — that could be orders of magnitude higher!" First Yelp storage-axis ingest; opens axis distinct from the 2025-02-04 LLM / search-serving-infra axis, the 2025-02-19 / 2025-05-27 financial-systems axis, and the 2025-04-15 corporate-security axis.
- 2025-07-08 — Exploring CHAOS: Building a Backend for
Server-Driven UI (SDUI / client-platform-framework axis; opens a fourth Yelp axis on the wiki). First-party unpacking of the CHAOS backend — the server-driven-UI framework that authors per-request view configurations (views + layouts + components + actions) that Yelp's iOS, Android, and web clients render. Canonical instance of concepts/server-driven-ui. Three architectural layers disclosed: (1) GraphQL surface — a Yelp-internal Apollo Federation subgraph implemented in Python via Strawberry, fronting multiple team-owned REST backends all conforming to a common CHAOS REST API. Canonical instance of patterns/federated-graphql-subgraph-per-domain. (2) Build pipeline — ChaosConfigBuilder → ViewBuilder → LayoutBuilder → FeatureProvider composition with a six-stage feature-provider lifecycle (registers → is_qualified_to_load → load_data → resolve → is_qualified_to_present → result_presenter) executed as a two-loop parallel async build (loop 1 fans out upstream calls, loop 2 awaits + composes — max-latency not sum-latency). (3) Advanced primitives — preloaded view flows (subsequent_views() + the chaos.open-subsequent-view.v1 action) for predictable sequential navigation without extra round-trips (customer-support FAQ menu example) and view placeholders (ViewPlaceholderV1) for lazy-loaded nested views served by different CHAOS backends (Reminders embedded in Yelp for Business home screen). Three load-bearing correctness mechanisms canonicalised: (a) Register-based client capability matching — first-match Condition(platform=[...], library=[required components and actions]) decides whether a feature is included for this client or dropped, the mechanism that keeps old app versions rendering while new components ship; (b) JSON-string parameters for schema stability — element content carried as opaque JSON inside a stable GraphQL schema so new elements / versions ship without schema churn, with backend Python dataclasses type-checking the payload; (c) error isolation per feature wrapper — an @error_decorator wraps every FeatureProvider; exceptions drop the feature (not the view), unless marked IS_ESSENTIAL_PROVIDER = True; telemetry logs feature name + owner + exception + request context for threshold-based alerting. Post flags that "the latest CHAOS backend framework introduces the next generation of builders using Python asyncio" — the two-loop iteration is a transitional design. No operational numbers (latency, RPS, cache hit rates); walkthrough not retrospective. Follow-up at 2026-04-22 (see below) unpacks Konbini + Cookbook — the cross-platform codegen layer underneath CHAOS that makes the component vocabulary consistent across four platforms.
- 2026-04-22 — How Yelp Keeps Server-Driven UI Consistent Across Four Platforms (SDUI axis deepening; does not open a new axis — extends the 2025-07-08 SDUI axis with the codegen / design-system-integration sub-layer). First-party unpacking of Konbini — Yelp's auto-generated library family that bridges CHAOS to Cookbook (Yelp's cross-platform design system). From one JSON interface definition per Cookbook component, a Jenkins pipeline generates four platform-specific libraries (Kotlin / Swift / Python / TypeScript) that stay in sync because they're derived from the same source. Canonical instance of patterns/single-json-spec-to-multi-platform-codegen. Also the wiki's canonical disclosure of two backward-compat mechanisms working at finer granularity than Register-based feature gating: (a) client spec version — each client bundles a spec file enumerating the component versions it supports (e.g. android@23.0:{"cookbook.Button": "1.0"}); every request sends the spec context; backend picks compatible component versions per request. Canonical instance of patterns/spec-version-negotiation-for-backward-compat. (b) Component-version migrate functions — on every breaking change (e.g. Button text: String → FormattedText bumping 0.8 → 1.0), Konbini auto-generates a stub migrate(V1) -> V0 method that developers must implement to backport new instances to the previous major. Generated default throws; opting out of backward-compat is explicit. Canonical instance of patterns/migrate-function-for-component-downgrade. Also canonicalises Yelp's {name, raw_value} design-token wire format — tokens curated in a separate designer-owned repo, published as JSON, serialised on the wire carrying both semantic name and platform-resolvable raw value. No operational numbers (component count, build time, fleet spec-version distribution). This ingest positions Yelp as the wiki's canonical disclosure of a design-system-aware SDUI framework — most other on-wiki SDUI descriptions (Airbnb Ghost Platform, Lyft SDUI — both off-wiki) don't document the upstream design-system coupling.
- 2026-02-02 — Back-Testing Engine for Ad Budget Allocation
(ad-tech / experimentation axis; opens a sixth Yelp axis on the wiki). First-party disclosure of the Back-Testing Engine — an eight-component hybrid back-testing + simulation system that evaluates proposed changes to the Ad Budget Allocation algorithms against historical campaign data before committing to A/B tests. Canonical instance of concepts/filter-before-ab-test (back-testing filters the candidate space; A/B validates the survivors) and concepts/hybrid-backtesting-with-ml-counterfactual (historical inputs + alternative code path + ML-predicted counterfactual outcomes). Five load-bearing architectural moves canonicalised: (1) Production code as Git Submodules (patterns/production-code-as-submodule-for-simulation) — Budgeting and Billing repos included as submodules pointing at branches under test; "blurs the line between prototyping and production"; (2) CatBoost regressors as counterfactual-outcome predictor (systems/catboost, concepts/counterfactual-outcome-prediction) — same models for all candidates so cross-candidate deltas are attributable to the algorithm, not predictor noise; (3) Poisson sampling over expected values (concepts/poisson-sampling-for-integer-outcomes) — converts the regressor's smooth averages into realistic integer counts; (4) Scikit-Optimize Bayesian search over a YAML-declared search space — 25 max_evals budget, learns from prior candidates; grid + listed search also supported but "not really an optimizer, just a wrapper that yields the next candidate"; (5) MLflow as experiment store + visualization substrate — first non-LLM MLflow Seen-in on the wiki. Overfitting-to-historical-data named explicitly as a limitation (concepts/overfitting-to-historical-data), mitigated organisationally by keeping A/B tests in the loop. Opens ad-tech / experimentation axis distinct from the five prior axes.
- 2026-04-07 — Zero downtime Upgrade: Yelp's Cassandra 4.x Upgrade Story (datastore-platform / database-upgrade axis; opens a seventh Yelp axis on the wiki). Yelp Database Reliability Engineering upgraded > 1,000 Cassandra nodes from 3.11 to 4.1 on Kubernetes with zero downtime + zero incidents + zero client-code changes. Canonical
wiki instance of in-place upgrade over new-DC migration at fleet scale (new-DC rejected on time + cost + EACH_QUORUM-preservation grounds). Five upgrade patterns canonicalised from this post: (1) version-specific images per Git branch (3.11 and 4.1 images published from dedicated branches, env-var selected at bootstrap); (2) pre-flight / flight / post-flight three-stage upgrade (schema-agreement gate + anti-entropy-repair pause → one-node-at-a-time rolling with dual-Stargate → repairs + schema-changes re-enabled); (3) dual-run version-specific proxies for Stargate around Cassandra 4.1's MigrationCoordinator schema-fetch change, with service-mesh alias so clients see one endpoint; (4) benchmark in your own environment (own-env measurement of 4% p99 + 11% mean + 11% throughput aligned with the DataStax whitepaper direction of travel, built confidence that later unlocked the 58% p99 reduction observed in production); (5) production qualification criteria upfront (six-criterion list — performance / functional / security / rollback / observability / component-health — evaluated per cluster). Three Cassandra-specific concepts canonicalised: init-container IP-gossip pre-migration sequencing IP + version changes into two distinct gossip events (CASSANDRA-19244); CDC commit-log write-point change (flush → mutation, CASSANDRA-12148) that breaks CDC consumers at the major-version boundary; schema disagreement on CDC-enabled clusters remediated by dummy multi-node schema changes. Two general-purpose concepts canonicalised: mixed-version cluster as a named operational state; transient mid-upgrade regression vs genuine regression distinction. First wiki first-party operator retrospective on a Cassandra major upgrade — every prior Cassandra Seen-in was third-party explainer or downstream user. Also surfaces Yelp's Cassandra Source Connector (CDC → Kafka) and the Stargate proxy as distinct wiki systems.
- 2025-05-08 — Nrtsearch 1.0.0: Incremental Backups, Lucene 10, and More (search-engine / Lucene-platform / storage-tiering axis; opens an eighth Yelp axis on the wiki). Four years after adopting the in-house Lucene-based search engine Nrtsearch in production (>90% of Elasticsearch traffic migrated), Yelp cuts the 1.0 release. Single load-bearing architectural change: an incremental-backup-on-commit architecture that backs up every Lucene commit to S3 as individual segment files (patterns/incremental-s3-backup-of-immutable-files) — enabled by Lucene segment immutability — and so makes S3 the source of truth for committed data.
Downstream consequences that this unlocks: (a) primary moves from EBS to ephemeral local SSD with three pre-1.0 EBS drawbacks resolved (source-of-truth fragility, mount-transition issues, ingestion-heavy backup-frequency pressure); (b) replicas bootstrap via parallel multi-file S3 download + local SSD for a 5× download-speed improvement over the previous serialised path; (c) full consistent snapshots reduce to direct S3→S3 copies, less frequent, DR-only. Upgrade to Lucene 10.1.0 + Java 21 adds HNSW vector search (4096-element max, multiple similarity types, concepts/scalar-quantization for memory/recall tradeoff, nested-document support, intra-merge parallelism), SIMD acceleration via the Java 21 Vector API + foreign-memory API, and the roadmap target of intra-single-segment parallel search (replacing concepts/virtual-sharding). State-management overhaul: immutable index state + decoupled state commit + hot reload on replicas, collapsing the prior four-step change-propagation pipeline (primary modify → commit → backup → restart replicas) to two steps (primary modify → replicas hot-reload). Aggregations now parallel via slice-level-state plus reduce-time recursive merge (fixing Lucene facets' single-threaded tail-latency issue). Load-bearing operational frames documented: commit added latency "a few ms to 20 seconds depending on the size of the data"; coordinator fan-out (patterns/coordinator-fronted-sharded-search) over virtually-sharded clusters for large indices; gRPC-based NRT pull primary→replica still in use but flagged as a roadmap replacement with S3-based NRT replication. No QPS / latency numbers — release-note shape, not retrospective. First-party wiki instance of S3-as-source-of-truth for search-index data, distinct from the S3-as-CDC-log-store and S3-as-config-fanout-bus axes elsewhere on the wiki's aws-s3 page.
Pairs with the 2026-04-07 Cassandra ingest as Yelp's two datastore-platform operator retrospectives on the wiki, and with the 2025-09-26 SAL ingest as Yelp's two S3-as-primary-store disclosures.
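The incremental-backup-on-commit idea from the Nrtsearch entry can be sketched in a few lines. This is a minimal illustration, not Nrtsearch's implementation: a plain dict stands in for S3, the segment file names are made up, and the only property relied on is the one the ingest names, that committed segment files are immutable, so a file already present remotely never needs re-uploading.

```python
from typing import Dict, Set


def incremental_backup(local_segments: Dict[str, bytes],
                       remote_store: Dict[str, bytes]) -> Set[str]:
    """Upload only the segment files the remote store does not yet hold.

    Segment files are immutable once written, so a name that already exists
    remotely is skipped without comparing contents.
    """
    uploaded: Set[str] = set()
    for name, data in local_segments.items():
        if name not in remote_store:
            remote_store[name] = data
            uploaded.add(name)
    return uploaded


# Hypothetical commits: each maps the segment files it references to their bytes.
store: Dict[str, bytes] = {}  # stand-in for an S3 prefix
commit_1 = {"_0.cfs": b"seg0", "_1.cfs": b"seg1"}
commit_2 = {"_0.cfs": b"seg0", "_1.cfs": b"seg1", "_2.cfs": b"seg2"}

first = incremental_backup(commit_1, store)   # uploads both files
second = incremental_backup(commit_2, store)  # uploads only the new segment
```

Under these assumptions the second commit transfers a single file, which is the property that makes a per-commit backup cheap enough to treat the remote store as the source of truth.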
Key systems¶
Customer Success / chatbot / RAG axis (2026-05-27)¶
- systems/yelp-cs-chatbot — Yelp's LLM-Assisted Customer Success Chatbot, replacing the legacy two-step menu-tree + fixed-phrase chatbot. Two-half architecture: (1) LLM workflow router classifies inbound queries into 5 specialised workflows (QA / Billing / Refund / Cancel / Review) bucketed by frequency × risk; only the QA workflow does free-form LLM generation, Cancel + Review return templated responses "due to high financial/legal risk", Billing returns deterministic UI, Refund guides through a form. (2) RAG pipeline for the QA workflow over ~370 Support Center articles with a metadata-only embedding strategy — (title, summary, each top header, historical-intent strings) independently embedded by text-embedding-ada-002 (1,536 dim) into an 8 MB FAISS vectorstore loaded in-memory at container start during health check (patterns/in-memory-vectorstore-loaded-at-container-start). Updates daily via AWS S3 batch CSV pipeline; index rebuilt + embeddings recomputed every container start. Retrieval: top-5 unique articles via over-fetch (k = max_items × 5) + similarity threshold + dedupe-by-article-id (whole-article retrieval via metadata segments; patterns/whole-article-retrieval-via-metadata-segments). ~94% recall@5 on Yelp's evaluation dataset. Three-axis output gate: trust & safety / valid URL / character limit; the URL check addresses LLM hyperlink hallucination via per-response allowlist validation. A/B-test outcome: doubled resolution rate vs the legacy chatbot.
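The retrieval step (over-fetch, similarity threshold, dedupe by article id) can be sketched as below. This is an illustrative sketch only: the ranked hit list stands in for a FAISS query result, the function name and the 0.8 threshold are assumptions, and only the k = max_items × 5 over-fetch factor comes from the ingest.

```python
from typing import List, Tuple


def top_unique_articles(ranked_hits: List[Tuple[str, float]],
                        max_items: int = 5,
                        min_similarity: float = 0.8) -> List[str]:
    """Return up to max_items distinct article ids from ranked segment hits.

    ranked_hits holds (article_id, similarity) pairs sorted by descending
    similarity; several metadata segments may point at the same article.
    """
    k = max_items * 5  # over-fetch so duplicates do not starve the result
    selected: List[str] = []
    seen = set()
    for article_id, similarity in ranked_hits[:k]:
        if similarity < min_similarity:  # similarity threshold (assumed value)
            continue
        if article_id in seen:           # dedupe by article id
            continue
        seen.add(article_id)
        selected.append(article_id)
        if len(selected) == max_items:
            break
    return selected


# Hypothetical hits: title and header segments of the same article both match.
hits = [("a1", 0.95), ("a1", 0.93), ("a2", 0.90),
        ("a3", 0.85), ("a2", 0.84), ("a4", 0.60)]
result = top_unique_articles(hits)
```

Because each article contributes several embedded segments, fetching exactly max_items neighbours could return fewer distinct articles than requested; the over-fetch factor is what leaves room for the dedupe step.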
LLM / search-serving-infra axis (2025-02-04)¶
- systems/yelp-query-understanding — the named LLM-powered query-understanding pipeline. Multiple tasks (segmentation + spell correction, review-highlight phrase expansion) with RAG side-inputs (business names viewed for query; top business categories from in-house predictive model) feeding into a cascaded serve path.
- systems/yelp-search — the parent production context; consumer of the query-understanding outputs (location rewrite into geobox; token-probability of the name tag into ranking).
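The three-tier serving cascade described for the 2025-02-04 ingest can be sketched as a lookup chain. This is a minimal sketch under stated assumptions: both precomputed tiers are modelled as dictionaries, the realtime model is any callable, and the function name and the lowercase/strip normalisation are hypothetical rather than taken from Yelp's post.

```python
from typing import Callable, Dict, Tuple


def make_cascade(head_cache: Dict[str, str],
                 offline_results: Dict[str, str],
                 realtime_model: Callable[[str], str]
                 ) -> Callable[[str], Tuple[str, str]]:
    """Build a serve() function that tries the cheapest tier first."""

    def serve(query: str) -> Tuple[str, str]:
        key = query.strip().lower()  # normalisation is an assumption
        if key in head_cache:        # tier 1: pre-computed head queries
            return head_cache[key], "head_cache"
        if key in offline_results:   # tier 2: offline fine-tuned model output
            return offline_results[key], "offline_llm"
        # tier 3: smaller realtime model handles the never-seen tail
        return realtime_model(key), "realtime_tail"

    return serve


# Hypothetical data for illustration only.
serve = make_cascade(
    head_cache={"pizza": "{topic} pizza"},
    offline_results={"epcot restaurants": "{location} epcot {topic} restaurants"},
    realtime_model=lambda q: "{topic} " + q,
)
answer, tier = serve("Pizza ")
```

The ordering reflects cost: a dictionary hit is effectively free, so the realtime model only pays inference latency for queries neither precomputed tier has seen.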
Financial-systems / data-platform axis (2025-02-19)¶
- systems/yelp-revenue-data-pipeline — the named batch data pipeline that feeds a third-party Revenue Recognition SaaS. Data Lake + Spark ETL architecture; daily MySQL snapshots → S3 → spark-etl-orchestrated feature DAG → REVREC-template output to S3 → REVREC service → 50% faster book-close.
- systems/yelp-spark-etl — Yelp's internal PySpark orchestration package. Feature-based DAG abstraction (web-API-shaped source + transformation sub-types), YAML topology-inferred DAG declaration, checkpoint-to-scratch debugging via --checkpoint flag + systems/jupyterhub.
- systems/yelp-billing-system — Yelp's custom order-to-cash system; stub page pending ingest of the 2024-12 billing-system modernisation post. Upstream source-of-truth for revenue contracts / invoices / fulfillment events.
- systems/yelp-staging-pipeline — the parallel verification pipeline (2025-05-27). Runs the same code path on production data, publishes to AWS Glue tables on S3, queryable immediately via Redshift Spectrum. Bypasses the ~10-hour Redshift Connector latency for same-day verification loops.
- systems/yelp-schema-validation-batch — the pre-upload guard (2025-05-27). Polls the third-party REVREC mapping API (REST) before each upload and aborts on schema mismatch on any of three axes (date format, column name, column data type).
- systems/yelp-redshift-connector — Yelp's named Data Connector that publishes from Data Pipeline streams to Redshift. The ~10-hour latency of this connector is the specific constraint that motivates the staging-pipeline + Glue + Spectrum bypass.
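The pre-upload guard performed by the Schema Validation Batch compares the partner's expected mapping against the outgoing file on three axes (date format, column name, column data type). The sketch below assumes both schemas have already been fetched into plain dictionaries; the dictionary shape, field names, and function name are hypothetical, and the call to the REVREC mapping API is deliberately left out.

```python
from typing import Dict, List

Schema = Dict[str, Dict[str, str]]  # column name -> {"type": ..., "date_format": ...}


def schema_mismatches(expected: Schema, actual: Schema) -> List[str]:
    """List every difference between the partner mapping and our upload file."""
    problems: List[str] = []
    for column in sorted(set(expected) | set(actual)):
        if column not in actual:
            problems.append(f"missing column: {column}")
        elif column not in expected:
            problems.append(f"unexpected column: {column}")
        else:
            exp, act = expected[column], actual[column]
            if exp["type"] != act["type"]:
                problems.append(
                    f"type mismatch on {column}: {exp['type']} != {act['type']}")
            if exp.get("date_format") != act.get("date_format"):
                problems.append(f"date format mismatch on {column}")
    return problems


# Hypothetical schemas: the partner changed a date format and a column type.
expected = {"invoice_date": {"type": "date", "date_format": "MM/DD/YYYY"},
            "amount": {"type": "decimal"}}
actual = {"invoice_date": {"type": "date", "date_format": "YYYY-MM-DD"},
          "amount": {"type": "string"}}
problems = schema_mismatches(expected, actual)
```

A batch job built on this would abort the upload whenever the returned list is non-empty, which is the "abort on schema mismatch" behaviour the entry describes.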
Corporate-security / networking axis (2025-04-15)¶
- systems/netbird — the open-source WireGuard-based Zero Trust Access platform Yelp chose; five named selection pillars (Okta/OIDC, simple UI, open source, high throughput, fault tolerance). Canonical instance of mesh topology with router peers for transparent <2s failover. Yelp has contributed upstream fixes.
- systems/wireguard — the data-plane protocol under Netbird; Yelp's deployment contributed a new corporate-ZTA altitude Seen-in on the wiki's WireGuard page (distinct from Fly.io's gateway-mesh altitude).
- systems/okta — Yelp's OIDC identity provider for the ZTA substrate; enforces device-posture policies.
- systems/pulse-secure — the retired predecessor VPN ("a more reliable solution" was needed). Only-instance wiki page.
Storage / data-engineering axis (2025-09-26, deepened 2026-05-21)¶
- systems/yelp-partition-access-visualization — Yelp's partition-access-visualisation tooling (canonicalised 2026-05-21). Sits on top of the SAL pipeline as a downstream consumer; aggregates (table, partition, iam_role, event_time) from the REST.GET.OBJECT slice; produces two named charts (Partitions Accessed Vs Time and Partitions Accessed). Substrate for stakeholder discovery, storage-class routing, Default Access Retention decisioning, and Apache Iceberg migration prioritisation. Drove 33 % S3 cost reduction on Yelp's petabyte-scale data lake.
- systems/aws-s3-intelligent-tiering — Yelp's default storage class for analytics datasets with unpredictable access patterns (canonicalised on the wiki via the 2026-05-21 post). Auto-tiers between Frequent / Infrequent / Archive Instant Access based on observed access age; 30 days no-access → 40 % cost reduction; 90 days → 81 % ("approaches the cost of S3 Glacier!"). Bet-symmetric — re-promotes on access without penalty.
- systems/aws-s3-glacier — comparison point for IT (canonicalised on the wiki via the 2026-05-21 post). Yelp argues against Glacier-by-default for unpredictable-access data because of the minimum-duration tax — minimum storage durations + per-retrieval fees can negate savings if access turns out to be more frequent than expected.
- systems/yelp-s3-sal-pipeline — the named Yelp system for operationalising S3 Server Access Logs (SAL) at fleet scale. Daily Tron batch compacting TiBs/day of raw-text SAL into Parquet via Athena; weekly access-based retention table; tag-based lifecycle expiration via S3 Batch Operations or direct per-object tagging. 85 % storage + 99.99 % object-count reduction from compaction. Multi-purpose observability substrate — 2026-05-21 confirms five named use cases: permission-debugging, cost attribution, access-based retention, partition-access visualisation, Iceberg migration prioritisation.
- systems/tron — Yelp's in-house batch processing system (open-source: github.com/Yelp/Tron); orchestrates the daily SAL compaction and the weekly access-based-retention build. First wiki page.
- systems/s3-batch-operations — AWS's per-bucket batch-job primitive for PutObjectTagging fanout. Flat $0.25 per bucket per job — the biggest cost contributor for low-volume buckets, driving Yelp's two-scale dispatch rule (direct tag for low-volume, Batch Ops for high-volume).
- systems/s3-inventory — the daily object-listing used as one side of the access-based retention join with a week of SAL.
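The idempotent INSERT via self-LEFT-JOIN on requestid that the SAL pipeline relies on for retry-safety can be demonstrated with a small stand-in. This sketch uses SQLite in place of Athena purely so it runs locally; the table names, the object_key column, and the sample rows are hypothetical, and only the anti-join shape is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_sal (requestid TEXT, operation TEXT, object_key TEXT);
    CREATE TABLE compacted_sal (requestid TEXT, operation TEXT, object_key TEXT);
    INSERT INTO raw_sal VALUES
        ('r1', 'REST.GET.OBJECT', 'logs/a'),
        ('r2', 'REST.PUT.OBJECT', 'logs/b');
""")

# Insert only rows whose requestid is not already in the target table:
# the LEFT JOIN finds existing rows, and the IS NULL filter keeps the rest.
IDEMPOTENT_INSERT = """
    INSERT INTO compacted_sal (requestid, operation, object_key)
    SELECT raw.requestid, raw.operation, raw.object_key
    FROM raw_sal AS raw
    LEFT JOIN compacted_sal AS done ON done.requestid = raw.requestid
    WHERE done.requestid IS NULL
"""


def compact(connection: sqlite3.Connection) -> int:
    """Run the compaction insert and return the target table's row count."""
    connection.execute(IDEMPOTENT_INSERT)
    connection.commit()
    return connection.execute("SELECT COUNT(*) FROM compacted_sal").fetchone()[0]


first_run = compact(conn)  # copies both rows
retry_run = compact(conn)  # a retry after a timeout adds nothing
```

Re-running the same statement leaves the row count unchanged, which is what makes a batch retry safe when the first attempt may have partially or fully succeeded.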
SDUI / client-platform-framework axis (2025-07-08, extended 2026-04-22)¶
- systems/yelp-chaos — the named server-driven-UI framework. Per-request composition of views + layouts + components + actions, delivered as a JSON-stable GraphQL configuration. Six-stage FeatureProvider lifecycle runs features in a two-loop parallel async build; per-feature error isolation keeps bad features from taking down the whole view. Advanced primitives: preloaded view flows for predictable navigation, view placeholders for lazy nested views served by different backends.
- systems/apollo-federation — the GraphQL substrate. Yelp's Supergraph router composes the CHAOS Subgraph with every other per-domain subgraph. Canonical wiki instance of patterns/federated-graphql-subgraph-per-domain; CHAOS adds the twist that the subgraph itself fronts multiple team-owned REST backends, not a single data store.
- systems/strawberry-graphql — Python GraphQL library Yelp picked for the CHAOS Subgraph to "leverage type-safe schema definitions and Python's type hints." First wiki instance.
- systems/yelp-konbini — the auto-generated library family (Kotlin / Swift / Python / TypeScript) that bridges CHAOS to Cookbook. Canonical wiki instance of patterns/single-json-spec-to-multi-platform-codegen — one JSON interface definition per component, four published libraries per Jenkins pipeline run, guaranteed parameter-name parity across backend and all client platforms. Also canonicalises client spec files (android@23.0 bundled into each client library) and migrate() methods for major-version component downgrades.
- systems/yelp-cookbook — the underlying cross-platform design system (buttons, forms, cards, etc.) implemented natively on iOS / Android / web. Konbini is its fourth platform — enabling backend services to instantiate Cookbook components server-side.
- systems/jenkins — triggers Konbini's codegen on every push to the component_interfaces repo.
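The per-feature error isolation described for CHAOS can be illustrated with a minimal sketch. It assumes each feature is an independent async builder; the builder names and the returned dictionary shape are hypothetical, and the six-stage FeatureProvider lifecycle is not modelled here.

```python
import asyncio
from typing import Awaitable, Callable, Dict


# Hypothetical feature builders; each returns a small component description.
async def build_header() -> dict:
    return {"component": "header", "title": "Welcome"}


async def build_broken_carousel() -> dict:
    # Simulates one feature whose backend call fails.
    raise RuntimeError("backend timeout")


async def build_view(features: Dict[str, Callable[[], Awaitable[dict]]]) -> dict:
    """Build all features concurrently and keep only the ones that succeed."""
    names = list(features)
    results = await asyncio.gather(
        *(features[name]() for name in names),
        return_exceptions=True,  # failures are returned as values, not raised
    )
    built, failed = {}, {}
    for name, result in zip(names, results):
        if isinstance(result, Exception):
            failed[name] = repr(result)  # recorded, but the view still renders
        else:
            built[name] = result
    return {"components": built, "errors": failed}


def render(features: Dict[str, Callable[[], Awaitable[dict]]]) -> dict:
    return asyncio.run(build_view(features))
```

With one healthy and one failing builder, the view still contains the healthy component and the failure is only recorded alongside it.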
Ad-tech / experimentation axis (2026-02-02)¶
- systems/yelp-back-testing-engine — the named eight-component system that simulates proposed ad-budget-allocation algorithm changes against historical campaign data. Canonicalises the hybrid back-testing + simulation methodology (historical inputs + alternative code path + ML-predicted counterfactual outcomes) and positions back-testing as the discovery phase upstream of A/B validation.
- systems/yelp-ad-budget-allocation — the parent system the Back-Testing Engine simulates; splits each campaign's monthly budget between on-platform Yelp inventory and the off-platform Yelp Ad Network, with day-by-day budget decisions that depend on previous days' outcomes (the feedback loop that makes naive aggregate-math simulation wrong).
- systems/scikit-optimize — the Bayesian-optimization library Yelp uses as the Engine's default optimizer; the other search types (grid, listed) are "just wrappers".
- systems/catboost — the gradient-boosted regressors predicting impressions / clicks / leads from budget + campaign features. Non-parametric by design to capture diminishing returns; shared across candidates for fair comparison.
- systems/mlflow — the experiment store + visualization substrate; first non-LLM-evaluation MLflow Seen-in on the wiki, reinforcing MLflow as domain-general experiment database rather than LLM-specific.
Datastore-platform / database-upgrade axis (2026-04-07)¶
- systems/apache-cassandra — the target datastore. Yelp's
1,000-node Cassandra fleet runs on Kubernetes via a Cassandra operator; this ingest is the wiki's first first-party operator retrospective on a Cassandra major-version upgrade.
- systems/stargate-cassandra-proxy — the DataStax open-source Cassandra data-proxy; runs as two version-specific fleets in parallel during Yelp's upgrade window under a single service-mesh alias.
- systems/cassandra-source-connector — Yelp's in-house CDC → Kafka bridge; two components split on rollout (DataPipeline Materializer fleet-wide pre-upgrade; CDC Publisher in-lockstep per pod).
- systems/kubernetes-init-containers — the Kubernetes primitive Yelp uses to sequence the simultaneous new IP + new Cassandra version change into two distinct gossip-observable events (CASSANDRA-19244).
- systems/spark-cassandra-connector — listed as a component that had to be made 4.1-compatible; Yelp's use documented in an earlier 2024-09 post on direct Spark→Cassandra ingestion for ML pipelines.
- systems/yelp-pushplan-automation — Yelp's declarative Cassandra schema-change system; user-initiated schema changes were disabled for the duration of each cluster upgrade.
Search-engine / Lucene-platform axis (2025-05-08)¶
- systems/nrtsearch — Yelp's open-source Lucene-based search engine; the 1.0.0 release documents the incremental-backup-on-commit architecture, ephemeral-local-SSD primary, Lucene 10 / Java 21 vector-search + SIMD enablement, state-management overhaul (immutable index state + hot reload), parallel aggregations, and Coordinator-fronted virtually-sharded deployment shape.
- systems/lucene — the underlying library; 1.0.0 upgrades to Lucene 10.1.0 from 8.4.0. Brings HNSW vector search, scalar quantization, SIMD via Java 21, and the roadmap target of intra-single-segment parallel search.
- systems/hnsw — graph-based ANN exposed via Lucene 10; Yelp is now a canonical HNSW production consumer via Nrtsearch.
- systems/elasticsearch — the predecessor Yelp is migrating off; >90% of traffic already on Nrtsearch as of 2025-05.
- systems/aws-s3 — new source of truth for committed segments and cluster/index state.
- systems/aws-ebs — retired as the primary's durable substrate. Pre-1.0 Nrtsearch's three EBS drawbacks (source-of-truth fragility, mount transitions, backup-frequency pressure) are the load-bearing motivation.
Key concepts and patterns¶
LLM / search-serving-infra axis (2025-02-04)¶
- concepts/query-frequency-power-law-caching — the infrastructure primitive that makes LLMs economically viable for query understanding; pre-compute expensive-LLM output for head queries above a frequency threshold.
- concepts/implicit-query-location-rewrite — the canonical downstream-consumer of segmentation output; rewrites the geobox sent to the search backend when location-intent confidence is high.
- concepts/review-highlight-phrase-expansion — the creative-generation task family (expand a query into semantically-adjacent phrases for review-snippet matching).
- concepts/token-probability-as-ranking-signal — the trick of retaining LLM output-token probabilities past the discrete decision as a continuous feature for downstream ranking; Yelp's segmentation name-tag probability feeds business-name matching.
- concepts/llm-segmentation-over-ner — why LLMs supplant traditional Named-Entity-Recognition models for query segmentation: flexible class customisation + no internal-taxonomy leakage into the training problem.
- patterns/three-phase-llm-productionization — Yelp's Formulation → Proof of Concept → Scaling Up playbook; the reusable shape behind any LLM-at-production-scale rollout.
- patterns/rag-side-input-for-structured-extraction — the generalised pattern behind Yelp's two RAG instances (business names as segmentation grounding; business categories as review-highlight grounding).
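A minimal sketch of the head-cache-plus-tail idea behind query-frequency power-law caching. It collapses the serving cascade into two tiers (pre-computed head, realtime tail), and the function names, the normalisation rule, and the frequency threshold are all assumptions for illustration rather than Yelp's implementation.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple


def normalise(query: str) -> str:
    # Assumed normalisation: trim and lowercase so near-identical queries share a key.
    return query.strip().lower()


def build_head_cache(
    query_log: Iterable[str],
    expensive_llm: Callable[[str], str],
    min_count: int,
) -> Dict[str, str]:
    """Offline step: pre-compute LLM output for queries seen at least min_count times."""
    counts = Counter(normalise(q) for q in query_log)
    return {q: expensive_llm(q) for q, n in counts.items() if n >= min_count}


def understand(
    query: str,
    head_cache: Dict[str, str],
    realtime_model: Callable[[str], str],
) -> Tuple[str, str]:
    """Online step: serve from the cache when possible, else fall back to the tail model."""
    key = normalise(query)
    if key in head_cache:
        return head_cache[key], "head-cache"
    return realtime_model(key), "realtime-tail"
```

Because query frequency follows a power law, a small cache built this way can absorb most traffic while the expensive model runs only offline.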
Corporate-security / networking axis (2025-04-15)¶
- concepts/zero-trust-authorization — the strategic framing Yelp adopted explicitly: ZTA as direction of travel, "reducing VPN utilization and creating more fine grained access control structures in the future, as opposed to broad, binary policies on huge subnets and network segments."
- concepts/vpn-to-zta-migration — the motion; Netbird as intermediate state, MTLS Edge Gateway as end state; new concept canonicalised on Yelp's ingest.
- concepts/wireguard-mesh-topology — the HA primitive (clients have 1-to-many binding to router peers).
- concepts/router-peer — Netbird's named egress-peer role; new concept canonicalised on Yelp's ingest.
- concepts/sso-authentication — Yelp's LDAP → SAML → OIDC auth ladder; OIDC chosen because it supports device-posture policies.
- patterns/oidc-plus-device-posture-access-gate — Yelp's Okta+Netbird integration shape; new pattern canonicalised on Yelp's ingest.
- patterns/open-source-for-security-response-agency — the OSS lever Yelp cited first: "if critical security issues ever arose, we would not be beholden to the maintainers alone." New pattern canonicalised on Yelp's ingest.
- patterns/upstream-contribution-parallel-to-in-house-integration — Yelp's realised contribution loop to Netbird main.
Financial-systems / data-platform axis (2025-02-19)¶
- concepts/revenue-recognition-automation — the domain primitive; why the whole Revenue Data Pipeline exists. Governed by ASC 606 / IFRS 15; automated via SaaS vendors (Yelp names the integration abstractly as "REVREC service"); produces real-time revenue reports + 50% book-close speed-up.
- concepts/glossary-dictionary-requirement-translation — Yelp's three-step methodology for converting accountant-voice requirements to engineering-voice: Glossary Dictionary + Purpose + Example Calculation + Engineering Rewording.
- concepts/data-gap-analysis — the two-axis output format (immediate approximation / composite data vs long-term structural fix) for reconciling Yelp's custom data model against a third-party template.
- concepts/pyspark-udf-for-complex-business-logic — when to reach for UDFs over window-function-only implementations. Canonical worked example: multi-priority discount application with 5 business rules.
- concepts/spark-etl-feature-dag — the feature abstraction (web-API-shaped source + transformation features) that structures Yelp's Spark ETL pipelines.
- concepts/checkpoint-intermediate-dataframe-debugging — materialising intermediate DataFrames to a scratch S3 path as a substitute for breakpoint debugging on distributed + lazy Spark.
- concepts/mysql-snapshot-to-s3-data-lake — the reproducibility primitive. Frozen daily snapshot → same input → same output regardless of rerun timing.
- patterns/daily-mysql-snapshot-plus-spark-etl — Yelp's chosen Revenue Data Pipeline architecture after explicitly rejecting MySQL+Python batch, Warehouse+dbt, and Event Streams.
- patterns/source-plus-transformation-feature-decomposition — the two-sub-type feature split (thin source-snapshot features + rich transformation features) that makes the Spark DAG maintainable.
- patterns/business-to-engineering-requirement-translation — the portable methodology lifted from revenue-recognition to any cross-functional requirement handoff.
- patterns/yaml-declared-feature-dag-topology-inferred — list features as YAML nodes; let the runtime topologically sort from each feature's declared sources field. DRY config; one-line feature addition.
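The topology-inferred idea in the last bullet can be sketched with the standard-library graphlib module. The feature names below are hypothetical, and a plain dictionary stands in for the parsed YAML so the example has no third-party dependency.

```python
from graphlib import TopologicalSorter
from typing import Dict, List

# Hypothetical feature declarations, shaped like a parsed YAML file.
# Each feature lists only its direct inputs under "sources".
FEATURES: Dict[str, dict] = {
    "revenue_report": {"sources": ["discounted_invoices"]},
    "discounted_invoices": {"sources": ["invoices_snapshot", "contracts_snapshot"]},
    "invoices_snapshot": {"sources": []},
    "contracts_snapshot": {"sources": []},
}


def execution_order(features: Dict[str, dict]) -> List[str]:
    """Derive a valid run order from the declared sources alone."""
    graph = {name: set(spec.get("sources", [])) for name, spec in features.items()}
    undeclared = {s for deps in graph.values() for s in deps} - graph.keys()
    if undeclared:
        raise KeyError(f"undeclared sources: {sorted(undeclared)}")
    # TopologicalSorter takes node -> predecessors and yields predecessors first.
    return list(TopologicalSorter(graph).static_order())
```

Adding a feature is then a single new entry; its position in the run order follows from its sources, and a dependency cycle surfaces as graphlib.CycleError.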
Financial-systems / integration-testing axis (2025-05-27)¶
- concepts/staging-pipeline — parallel pipeline configuration running the code-under-test on production data, publishing to a verification-friendly substrate. Canonicalised on this ingest.
- concepts/data-integrity-checker — the cadence-aware monitor that compares pipeline output against an independent truth; Yelp's four named metrics (match rate, mismatch, left orphans, duplicates) become the canonical set.
- concepts/redshift-connector-latency — the ~10-hour publication delay from Yelp's Data Pipeline streams to Redshift that motivates the Glue+Spectrum bypass; new concept canonicalised on this ingest.
- concepts/test-data-generation-for-edge-cases — the discipline of backporting production edge cases to dev fixtures; new concept canonicalised on this ingest.
- concepts/data-upload-format-validation — the runtime pre-upload schema check against a third-party mapping API; new concept canonicalised on this ingest.
- patterns/parallel-staging-pipeline-for-prod-verification — the repeatable pattern; new pattern canonicalised on this ingest.
- patterns/monthly-plus-daily-dual-cadence-integrity-check — two cadences, different substrates, different metric sets; new pattern canonicalised on this ingest.
- patterns/schema-validation-pre-upload-via-mapping-api — runtime-per-upload schema check distinct from deploy-time validation (Datadog's pattern); new pattern canonicalised on this ingest.
- patterns/sftp-for-bulk-daily-upload — SFTP over REST for bulk daily upload to a third-party, against the modern default; three named axes (reliability, file-size, setup); new pattern canonicalised on this ingest.
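A rough sketch of how the four integrity-checker metrics (match rate, mismatch, left orphans, duplicates) could be computed. The exact definitions Yelp uses are not given here, so the ones below are assumptions: rows are (key, value) pairs, and the match rate is taken over distinct pipeline keys.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple

Row = Tuple[Hashable, object]  # (record key, comparable value); hypothetical shape


def integrity_report(pipeline_rows: List[Row], truth_rows: List[Row]) -> Dict[str, object]:
    """Compare pipeline output (left side) against an independent source of truth."""
    key_counts = Counter(key for key, _ in pipeline_rows)
    duplicates = sorted(k for k, n in key_counts.items() if n > 1)

    pipeline = dict(pipeline_rows)  # for duplicated keys the last value wins
    truth = dict(truth_rows)

    left_orphans = sorted(k for k in pipeline if k not in truth)
    shared = [k for k in pipeline if k in truth]
    mismatches = sorted(k for k in shared if pipeline[k] != truth[k])
    matched = len(shared) - len(mismatches)

    return {
        "match_rate": matched / len(pipeline) if pipeline else 1.0,
        "mismatches": mismatches,
        "left_orphans": left_orphans,
        "duplicates": duplicates,
    }
```

Reporting the offending keys rather than only counts keeps the output directly usable for debugging a failed check.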
Storage / data-engineering axis (2025-09-26)¶
- concepts/s3-server-access-logs — the AWS primitive Yelp operationalises at fleet scale. Best-effort delivery, 25+-field line format, SimplePrefix vs PartitionedPrefix + EventTime delivery options. New concept canonicalised on this ingest.
- concepts/partition-projection — Glue/Athena partitioning primitive that avoids MSCK REPAIR / metastore-lookup overhead; enum vs injected types; 1M partition cap on enum if unconstrained. New concept canonicalised on this ingest.
- concepts/best-effort-log-delivery — the delivery-semantics tier Yelp accepts on SAL; measured at < 0.001 % of logs arriving more than 2 days late. Load-bearing for why prefix-granularity retention is safe. New concept canonicalised on this ingest.
- concepts/athena-shared-resource-contention — Athena's shared-cluster model with TooManyRequestsException + per-account / per-region DML concurrency quotas; retry-safe job design is mandatory. New concept canonicalised on this ingest.
- concepts/user-controlled-log-fields — the general hazard Yelp documents on SAL (request_uri, referrer, user_agent carry unescaped arbitrary bytes). New concept canonicalised on this ingest.
- concepts/url-encoding-idiosyncrasy-s3-keys — most SAL operations double-encode key; BATCH.DELETE.OBJECT / S3.EXPIRE.OBJECT single-encode. Naive url_decode(url_decode(key)) unsafe. New concept canonicalised on this ingest.
- patterns/raw-to-columnar-log-compaction — daily compact-to-Parquet pattern; Yelp's 85 % storage + 99.99 % object-count reduction is the canonical datapoint. New pattern canonicalised on this ingest.
- patterns/object-tagging-for-lifecycle-expiration — tag each object + tag-based lifecycle policy; the only scalable per-object deletion primitive at fleet scale; composes with S3 Batch Operations PutObjectTagging (Delete is not a supported action). New pattern canonicalised on this ingest.
- patterns/idempotent-athena-insertion-via-left-join — self-LEFT-JOIN on target's unique column to make INSERT INTO ... SELECT retry-safe; partition filters duplicated in ON and WHERE for planner-pruning. New pattern canonicalised on this ingest.
- patterns/projection-partitioning-over-managed-partitions — choose partition projection when prefix template is known; avoids MSCK REPAIR churn + metastore-lookup planning latency. New pattern canonicalised on this ingest.
- patterns/s3-access-based-retention — inventory ⋈ SAL at prefix granularity; equality-join beats LIKE-join (70k rows: 5 min → 2 sec); prefix granularity is what makes best-effort SAL delivery acceptable as access signal. New pattern canonicalised on this ingest.
- patterns/optional-non-capturing-tail-regex — wrap user-controlled tail fields of a log regex in (?:<rest>)? so the non-user-controlled prefix always parses; empty parsed rows are the failure signal. New pattern canonicalised on this ingest.
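The optional non-capturing tail can be sketched on a deliberately simplified log line. The field layout below is hypothetical and much shorter than the real S3 server access log format; only the wrapping technique is the point.

```python
import re
from typing import Dict, Optional

# Trusted prefix: fields that user input cannot influence.
PREFIX = r'(?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<operation>\S+) (?P<key>\S+)'
# Tail: contains user-controlled fields that may hold stray quotes or odd bytes.
TAIL = r' "(?P<request_uri>[^"]*)" (?P<status>\d{3}) "(?P<user_agent>[^"]*)"'

STRICT_RE = re.compile(PREFIX + TAIL)
# Wrapping the tail in an optional non-capturing group lets the prefix match alone.
TOLERANT_RE = re.compile(PREFIX + r'(?:' + TAIL + r')?')


def parse(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Return named fields; tail fields are None when the tail cannot be parsed."""
    match = TOLERANT_RE.match(line)
    return match.groupdict() if match else None
```

A line whose request URI contains an unescaped quote fails the strict pattern entirely, while the tolerant pattern still yields bucket, time, operation and key, leaving the tail fields empty as the signal that something was off.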
Ad-tech / experimentation axis (2026-02-02)¶
- concepts/filter-before-ab-test — the experimentation-workflow position Yelp occupies with the Back-Testing Engine: cheap pre-filter (back-testing) before expensive validation (A/B testing); A/B is preserved for final validation rather than discovery. New concept canonicalised on this ingest.
- concepts/hybrid-backtesting-with-ml-counterfactual — the methodology Yelp named explicitly as "not a pure back-testing approach, but rather a hybrid that combines elements of both simulation and back-testing". Historical inputs + alternative code path + ML-predicted counterfactual outcomes. New concept canonicalised on this ingest.
- concepts/counterfactual-outcome-prediction — the sub-concept: CatBoost regressors predict outcomes that never actually happened under the alternative treatment; non-parametric so they capture diminishing returns on budget. New concept canonicalised on this ingest.
- concepts/poisson-sampling-for-integer-outcomes — the trick that converts the regressor's smooth averages into realistic integer counts, restoring live-system stochasticity to the simulation. New concept canonicalised on this ingest.
- concepts/bayesian-optimization-over-parameter-space — sequential model-based optimization; Yelp's default via Scikit-Optimize. Grid + listed are "just wrappers that yield the next candidate". New concept canonicalised on this ingest.
- concepts/overfitting-to-historical-data — the named risk. Yelp's mitigation is organisational (keep A/B tests in the loop), not technical. New concept canonicalised on this ingest.
- patterns/production-code-as-submodule-for-simulation — the fidelity primitive. Budgeting and Billing repos as Git Submodules pointing at branches under test; "blurs the line between prototyping and production". New pattern canonicalised on this ingest.
- patterns/historical-replay-with-ml-outcome-predictor — the full simulation-loop shape (historical inputs + alternative code via submodule + ML outcome predictor + Poisson sampling); generalises to dynamic pricing, recommendation-ranking, bandit-policy domains. New pattern canonicalised on this ingest.
- patterns/yaml-declared-experiment-config — the configuration surface (date range, search space, search strategy, metric, max_evals) Yelp's Back-Testing Engine consumes. New pattern canonicalised on this ingest; sibling of the 2025-02-19 yaml-declared-feature-dag (same Yelp YAML-config discipline applied to different problem).
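To make the Poisson-sampling step concrete, here is a small standard-library sketch that turns smooth predicted averages into integer counts. The sampler is Knuth's multiplication method, chosen here for self-containment; the metric names and the idea of one draw per metric per day are illustrative assumptions, not a description of the Engine's code.

```python
import math
import random
from typing import Dict


def poisson_sample(mean: float, rng: random.Random) -> int:
    """Draw one Poisson-distributed integer (Knuth's method; suited to moderate means)."""
    if mean < 0:
        raise ValueError("mean must be non-negative")
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    # Multiply uniform draws until the product drops to the threshold or below.
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_day(predicted_means: Dict[str, float], rng: random.Random) -> Dict[str, int]:
    """Convert a regressor's expected values into integer outcomes for one simulated day."""
    return {metric: poisson_sample(mean, rng) for metric, mean in predicted_means.items()}
```

For example, simulate_day({"clicks": 3.2, "leads": 0.4}, random.Random(7)) returns whole-number clicks and leads whose long-run averages track the predictions, which is what reintroduces day-to-day variance into the replay.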
Datastore-platform / database-upgrade axis (2026-04-07)¶
- concepts/rolling-upgrade — the upgrade idiom; Yelp's Cassandra 4.x ingest extends the PlanetScale database-tier framing to a gossip-based NoSQL fleet at > 1,000 nodes.
- concepts/in-place-vs-new-dc-upgrade — the architectural choice. New concept canonicalised on this ingest with Yelp's explicit reasoning for rejecting new-DC at fleet scale (time, cost, EACH_QUORUM preservation).
- concepts/mixed-version-cluster — the upgrade state as a named operational concept. New concept canonicalised on this ingest.
- concepts/performance-regression-from-mid-upgrade-state — transient vs genuine regression distinction that Yelp's pre-migration observability dashboards made diagnosable. New concept canonicalised on this ingest.
- concepts/init-container-ip-gossip-pre-migration — the Kubernetes sequencing trick for pods-get-new-IPs deployments (CASSANDRA-19244). New concept canonicalised on this ingest.
- concepts/cassandra-cdc-commit-log — the 3.x → 4.x write-point semantics change (CASSANDRA-12148) that forced Yelp's CDC Connector rewrite. New concept canonicalised on this ingest.
- concepts/schema-disagreement — the distributed-datastore failure mode surfaced on Yelp's CDC-enabled clusters post-upgrade; Yelp's empirical remediation is dummy multi-node schema changes. New concept canonicalised on this ingest.
- concepts/anti-entropy-repair-pause — pre-flight / post-flight bookend of the Cassandra upgrade. New concept canonicalised on this ingest.
- concepts/checkpointed-automation-script — Yelp's upgrade driver runs in auto-proceed or per-step-confirmation mode; tunes risk to cluster-criticality. New concept canonicalised on this ingest.
- concepts/observability-before-migration — extended with the datastore-upgrade application: per-version dashboards caught the Stargate 2.x range-query regression in non-prod before it reached production.
- patterns/version-specific-images-per-git-branch — the core "no-hard-block" lever: 3.11 and 4.1 images published from dedicated Git branches, selected at bootstrap via env var. New pattern canonicalised on this ingest.
- patterns/pre-flight-flight-post-flight-upgrade-stages — the three-stage discipline (reversible gate → commit → reversible restore). New pattern canonicalised on this ingest.
- patterns/dual-run-version-specific-proxies — Stargate fleet split across 3.11-persistence + 4.1-persistence with service-mesh alias, keeping the last 3.11 Cassandra node deliberately on 3.11 until the 3.11 Stargate fleet is drained. New pattern canonicalised on this ingest.
- patterns/benchmark-in-own-environment-before-upgrade — don't trust upstream benchmarks alone. Yelp measured 4% p99 + 11% mean + 11% throughput in own env, production later measured 58% p99 reduction on key clusters. New pattern canonicalised on this ingest.
- patterns/production-qualification-criteria-upfront — six-criterion upfront list (perf / functional / security / rollback / observability / component-health). New pattern canonicalised on this ingest.
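The checkpointed driver and the three-stage discipline can be sketched together. The step names are drawn loosely from the bullets above (repair pause, schema-change freeze), but the function signatures and the stop-on-decline behaviour are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str, Callable[[], None]]  # (stage, step name, action)


def build_steps(log: List[str]) -> List[Step]:
    """Hypothetical upgrade plan; each action just records what it would do."""
    return [
        ("pre-flight", "pause anti-entropy repair", lambda: log.append("repair paused")),
        ("pre-flight", "disable schema changes", lambda: log.append("schema changes disabled")),
        ("flight", "rolling node upgrade", lambda: log.append("nodes upgraded")),
        ("post-flight", "enable schema changes", lambda: log.append("schema changes enabled")),
        ("post-flight", "resume anti-entropy repair", lambda: log.append("repair resumed")),
    ]


def run_upgrade(
    steps: List[Step],
    confirm: Optional[Callable[[str, str], bool]] = None,
) -> List[str]:
    """Run steps in order.

    confirm=None means auto-proceed. Otherwise confirm(stage, name) is called
    before every step, and a False answer stops the run at that checkpoint.
    """
    completed: List[str] = []
    for stage, name, action in steps:
        if confirm is not None and not confirm(stage, name):
            break
        action()
        completed.append(f"{stage}:{name}")
    return completed
```

An operator can pass an interactive prompt as confirm for a critical cluster and omit it for a low-risk one, so the same plan serves both modes.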
SDUI / client-platform-framework axis — Konbini (2026-04-22)¶
- concepts/single-source-interface-spec — one JSON file per Cookbook component in Yelp's component_interfaces repo; four platform libraries (Kotlin, Swift, Python, TypeScript) generated from that single source. New concept canonicalised on this ingest.
- concepts/client-spec-version — each client bundles a spec file listing the component versions it supports (e.g. android@23.0: {"cookbook.Button": "1.0"}); request-time spec_name@spec_version lets the backend pick compatible versions per request. Complement to Register-based matching: Registers gate which feature is served, spec-version gates which version of each component in that feature. New concept canonicalised on this ingest.
- concepts/component-version-migrate-function — on every breaking change (Button text: String → FormattedText bumping 0.8 → 1.0), Konbini auto-generates a stub migrate(V1) -> V0 method that developers must implement. Default throws; opt-out of backward-compat is explicit. New concept canonicalised on this ingest.
- concepts/breaking-change-requires-major-bump — the versioning discipline that pairs with migrate functions. New concept canonicalised on this ingest.
- concepts/design-token-as-named-reference — the {name, raw_value} wire format for design tokens (colours, icons, gradients, shadows); name stable across platforms, raw_value platform-resolvable default. Tokens curated in a separate designer-owned repo. New concept canonicalised on this ingest.
- patterns/single-json-spec-to-multi-platform-codegen — the core Konbini pattern; canonical wiki first instance. Jenkins-triggered codegen on every push to component_interfaces. Generated code never edited by hand.
- patterns/spec-version-negotiation-for-backward-compat — per-request negotiation: client sends its bundled spec version; backend picks compatible component versions; falls back to migrate() to downgrade. New pattern canonicalised on this ingest.
- patterns/migrate-function-for-component-downgrade — fail-closed default (NotSupportedError) + field-level developer-written backport. New pattern canonicalised on this ingest.
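A compact sketch of spec-version negotiation with a fail-closed migrate() default. Only android@23.0, the Button 0.8 → 1.0 change and NotSupportedError come from the text above; the class names, the android@22.0 spec, the Banner component and the major-version comparison are hypothetical stand-ins for what generated code might look like.

```python
from dataclasses import dataclass
from typing import Dict


class NotSupportedError(Exception):
    """Raised when a component cannot be served to the requesting client."""


@dataclass
class FormattedText:
    text: str
    bold: bool = False


@dataclass
class ButtonV0:  # stands for version 0.8, where text is a plain string
    text: str


class GeneratedComponent:
    MAJOR = 1

    def migrate(self):
        # Fail-closed default: no downgrade unless a developer writes one.
        raise NotSupportedError(f"{type(self).__name__} has no downgrade path")


@dataclass
class ButtonV1(GeneratedComponent):  # stands for version 1.0
    text: FormattedText

    def migrate(self) -> ButtonV0:
        # Developer-written backport: drop the formatting, keep the string.
        return ButtonV0(text=self.text.text)


@dataclass
class BannerV1(GeneratedComponent):  # hypothetical component left on the default stub
    title: str


# Client spec files keyed by spec_name@spec_version (second entry is hypothetical).
CLIENT_SPECS: Dict[str, Dict[str, str]] = {
    "android@23.0": {"cookbook.Button": "1.0", "cookbook.Banner": "1.0"},
    "android@22.0": {"cookbook.Button": "0.8", "cookbook.Banner": "0.9"},
}


def negotiate(name: str, component: GeneratedComponent, client_spec: str):
    """Return the component as-is if the client supports it, else try to downgrade."""
    supported = CLIENT_SPECS[client_spec].get(name)
    if supported is None:
        raise NotSupportedError(f"{client_spec} does not list {name}")
    if int(supported.split(".")[0]) >= component.MAJOR:
        return component
    return component.migrate()
```

An older client therefore receives a downgraded Button, while a component without a written backport is refused rather than sent in a shape the client cannot render.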
Search-engine / Lucene-platform axis (2025-05-08)¶
- concepts/incremental-backup-on-commit — the core architectural primitive. Every Lucene commit in Nrtsearch 1.0.0 lists the S3 segment prefix, computes the diff, and uploads only new segment files. New concept canonicalised on this ingest.
- concepts/immutable-segment-file — the Lucene-level precondition that makes incremental-per-commit backup work. Segments never change after flush; merges produce new segments without rewriting old ones. New concept canonicalised on this ingest (extends existing Lucene treatment with the architectural-downstream framing).
- concepts/ephemeral-local-disk-vs-ebs — the explicit tradeoff Yelp navigates. Three pre-1.0 EBS drawbacks enumerated load-bearingly; local-SSD + S3 source-of-truth resolves all three. New concept canonicalised on this ingest.
- concepts/immutable-index-state — the state-management primitive. State changes merge into a fresh immutable object, committed to the state backend, then atomically swapped in. Lock-free readers + intra-request consistency + atomic visibility fall out. New concept canonicalised on this ingest.
- concepts/scalar-quantization — Lucene 10's accuracy/memory tradeoff for HNSW-indexed float vectors. New concept canonicalised on this ingest.
- concepts/virtual-sharding — Yelp's mechanism for splitting large indices into multiple Nrtsearch clusters fronted by a Coordinator. Named by the post as a workaround for pre-Lucene-10 lack of intra-single-segment parallel search. New concept canonicalised on this ingest.
- patterns/incremental-s3-backup-of-immutable-files — the canonical pattern. Applies beyond search to any immutable-file + commit-boundary stack (Parquet writers, LSM sstables, log-segment backups). New pattern canonicalised on this ingest.
- patterns/parallel-s3-download-for-bootstrap — concurrent multi-file GET + local SSD write yields the 5× bootstrap speedup that makes the move off EBS viable. New pattern canonicalised on this ingest.
- patterns/hot-reload-over-restart-replicas — state propagation via hot reload instead of rolling restart collapses a four-step change-propagation into two steps. New pattern canonicalised on this ingest.
- patterns/decoupled-state-commit-from-data-commit — state commits run per-request (not bundled with data commits), eliminating the lost-update-on-restart failure mode and removing artificial coupling to data-commit cadence. New pattern canonicalised on this ingest.
- patterns/coordinator-fronted-sharded-search — the deployment shape for large Nrtsearch indices that need virtual sharding. New pattern canonicalised on this ingest.
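The incremental-backup-on-commit primitive (list the remote prefix, diff, upload only new files) can be sketched with an in-memory stand-in for the bucket. FakeBucket and its methods are hypothetical and are not an S3 client API; the sketch also ignores cleanup of remote files that a merge has made obsolete.

```python
from typing import Dict, List, Set


class FakeBucket:
    """In-memory stand-in for one S3 segment prefix (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}
        self.put_calls = 0

    def list_keys(self) -> Set[str]:
        return set(self.objects)

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data
        self.put_calls += 1


def backup_commit(local_segments: Dict[str, bytes], bucket: FakeBucket) -> List[str]:
    """Upload only the segment files that are not already in the bucket.

    Safe only because segment files are immutable: a name that already exists
    remotely is guaranteed to have the same content.
    """
    already_uploaded = bucket.list_keys()
    new_files = sorted(set(local_segments) - already_uploaded)
    for name in new_files:
        bucket.put(name, local_segments[name])
    return new_files
```

Each commit then costs one listing plus uploads proportional to the newly written segments, not to the size of the whole index.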
Key model zoo (named by the 2025-02-04 post)¶
- systems/gpt-4 — the formulation-phase LLM; also used to generate golden datasets for distillation.
- systems/o1-preview / systems/o1-mini — reserved for "newer and more complex tasks that require logical reasoning".
- systems/gpt-4o-mini — the fine-tuned offline student; ~100× cost reduction vs. direct GPT-4 prompt at equivalent quality on query-understanding tasks.
- systems/bert / systems/t5 — the realtime tail-query models; production serving tier for never-seen-before queries that miss the cache.
Recent articles¶
- 2026-05-27 —
Beyond the Menu Tree: How Yelp Built a Smarter Customer
Success Chatbot with AI. Customer Success / chatbot /
RAG axis (eighth Yelp wiki axis canonicalised — first
dedicated production-LLM-applied-to-customer-support
disclosure). Yelp's LLM-Assisted CS Chatbot replaces the
legacy two-step menu-tree + fixed-phrase chatbot.
Two-half architecture: (1)
LLM workflow router classifies inbound queries into 5
specialised workflows (QA / Billing / Refund / Cancel /
Review) bucketed along two axes — frequency of inbound
requests AND risk class (churn / legal / financial). Only
the QA workflow does free-form LLM generation; Cancel +
Review return templated responses "due to high
financial/legal risk"; Billing returns deterministic UI;
Refund guides through a form. The LLM is the router;
specialised handlers do the work — canonical
patterns/specialized-workflow-router-with-llm-intent-detection.
(2) RAG pipeline for the QA workflow — four-step inference
(embed query → similarity-search vectorstore → inject top-5
unique articles into LLM prompt → output validation gate). The
load-bearing architectural disclosure is the
vectorstore-construction strategy: rather than chunking
articles into paragraph-sized segments (default RAG
prescription) or embedding whole articles, Yelp embeds the
article's metadata as separate segments —
"the title, the summary, and each distinct top header etc."
— "each treated as a separate text input and individually
embedded into the vectorstore". Whole-article remains the
retrieval unit (~370 articles, not 370× chunks); only the
embedding signal is metadata-derived. Canonical
metadata-only embedding /
whole-article retrieval /
patterns/whole-article-retrieval-via-metadata-segments.
Why metadata-only: verbatim disclosure of
embedding signal
dilution — "embedding large texts, such as entire articles
or long paragraphs, can dilute the information signal […]
Concatenating too much text was observed to cause the
semantic distances between vectors to get 'farther apart' in
the embedding space because the key phrase we wanted to detect
was mixed with too many unrelated words". Worked example:
"comparing a user's query 'reset password' against a vector
representing an entire 500-word article on account management
(which only mentions 'reset password' once) yields a poor
match score because the signal is diluted." Sweet-spot
framing: chunk-too-large dilutes; chunk-too-small "resulted
in too many false candidates"; metadata-only sits between.
Operational substrate:
text-embedding-ada-002 1,536-dim unit vectors;
vectorstore total ~8 MB after FAISS quantization; loaded
in-memory at container start during health check (patterns/in-memory-vectorstore-loaded-at-container-start
— no remote vector DB; sub-millisecond similarity search;
zero cold-start latency in request path). Refresh substrate:
daily S3 batch
CSV pipeline — scheduled job fetches updated articles from
internal endpoint → markdown → CSV with metadata + headers →
AWS S3 daily; container downloads CSV at startup, builds
FAISS index + computes fresh ada-002 embeddings during health
check. Three structural choices: batch-not-stream;
data-not-index in S3 (CSV is durable artifact, FAISS index
rebuilt every restart — avoids index-format-versioning);
load-at-health-check not lazy-on-first-request.
Retrieval algorithm:
k = max_items_per_article × 5 nearest-vector over-fetch + empirical similarity threshold + dedupe-by-article-id → top-5 unique articles. Disclosed retrieval quality: ~94% recall@5 on Yelp's evaluation dataset. Hyperlink hallucination is canonicalised as "one of the most notable unexpected challenges" — sub-class of llm-hallucination specific to fabricated URLs. Mitigation: per-response allowlist validation — extract URLs from retrieved-context articles, validate every URL in LLM output against allowlist, strip/reject anything else; "genuinely originates from one of the retrieved Support Center articles and is not invented by the LLM". The hyperlink check is one of three output validation gate axes (trust & safety / valid URL / character limit). A/B-test outcome: "we doubled the chatbot resolution rate based on the A/B test result." No per-workflow breakdown, no latency / cost numbers, no QA-LLM SKU disclosure (only embedding model named). Future-work flags: dataset expansion, keyword experimentation, additional documents (chat transcripts with human support agents), retrieval-context-size sweet spot. Sibling Yelp LLM altitudes: this CS Chatbot's specialised-handler-per-task shape mirrors the 2025-02-04 search-query-understanding architecture (systems/yelp-query-understanding) — both organise the same primitive at different altitudes (search-query subtypes vs customer-support intent classes).
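The over-fetch, threshold and dedupe steps of the retrieval algorithm can be sketched as follows. A brute-force dot product over unit vectors stands in for the FAISS index, and the segment layout, parameter names and example threshold are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

# (article_id, unit-length embedding of one metadata segment); hypothetical layout.
Segment = Tuple[str, Sequence[float]]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    # For unit vectors the dot product equals cosine similarity.
    return sum(x * y for x, y in zip(a, b))


def retrieve(
    query_vec: Sequence[float],
    segments: List[Segment],
    max_items_per_article: int,
    threshold: float,
    top_n: int = 5,
) -> List[str]:
    """Return up to top_n unique article ids, best match first."""
    # Over-fetch: even if every hit for an article ranks first, k neighbours
    # still leave room for top_n distinct articles.
    k = max_items_per_article * top_n
    scored = sorted(
        ((dot(query_vec, vec), article_id) for article_id, vec in segments),
        reverse=True,
    )[:k]

    seen, results = set(), []
    for score, article_id in scored:
        if score < threshold:
            break  # everything after this is even less similar
        if article_id in seen:
            continue  # dedupe: several segments can point to one article
        seen.add(article_id)
        results.append(article_id)
        if len(results) == top_n:
            break
    return results
```

Because each article contributes several metadata segments, the dedupe step is what turns a list of nearest segments into a list of whole articles to place in the prompt.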
Created (2 systems + 4 concepts + 5 patterns + 1 hallucination sub-class concept): systems/yelp-cs-chatbot (canonical home), systems/openai-text-embedding-ada-002 (1,536-dim embedding model — sibling to existing systems/openai-text-embedding-3-large), concepts/retrieval-augmented-generation (extended with Yelp customer-support face), concepts/embedding-signal-dilution, concepts/metadata-only-embedding, concepts/whole-article-retrieval, concepts/llm-workflow-router, concepts/llm-hyperlink-hallucination, patterns/specialized-workflow-router-with-llm-intent-detection, patterns/whole-article-retrieval-via-metadata-segments, patterns/in-memory-vectorstore-loaded-at-container-start, patterns/daily-s3-vectorstore-update-pipeline, patterns/hyperlink-allowlist-validation-on-llm-output. Extended: systems/faiss (Yelp customer-support face; distinct from Meta Groups Search L2 ANN altitude and SilverTorch Faiss-GPU baseline framing), concepts/llm-hallucination (hyperlink-hallucination sub-class cross-link), concepts/retrieval-augmented-generation (Yelp customer-support face). Author acknowledgements: Yelp Customer & Sales Intelligence Team + product partners + Biz Customer Experience + Sales Infrastructure + Web Foundation.
- 2026-05-21 —
How Partition Access Visualizations Reduced our Data Lake
S3 Cost by 33%. Storage / data-engineering axis
deepening (sixth Yelp wiki axis canonicalised — data-platform /
storage-cost-engineering, structurally a
downstream consumer of the 2025-09-26 SAL pipeline rather
than a new substrate). Plot partitions accessed (y, e.g.
dt=yyyy-mm-dd) vs access-event time (x), colour by IAM role: three signature shapes emerge unaided — diagonal y=x (daily batch), vertical line (backfill), scatter (ad hoc). Canonical instance of concepts/granular-usage-attribution as the gating observability primitive for data-platform efficiency wins; canonical instance of patterns/access-pattern-visualization-for-data-stewardship and patterns/iam-role-attribution-from-s3-access-logs. Outcome: 33 % S3 cost reduction on Yelp's petabyte-scale data lake, driven by IT-by-default storage-class adoption (40 % at 30-day no-access; 81 % at 90-day — "approaches the cost of S3 Glacier!") and the new Default Access Retention primitive. DAR is the named middle ground between deletion-based retention and cold-tier-by-default: data outside the access window remains on Intelligent Tiering but is gated behind a restrictive bucket IAM policy (patterns/iam-policy-gated-cold-tier-access) that requires a Terraform-PR + cost-acknowledgement (S3 Inventory dashboard estimating projected cost) workflow with tiered approval levels. Two named DAR benefits — accidental queries don't reset IT's tiering clock; explicit cost acknowledgement before PB-scale Archive-Instant-Access reads (load-bearing disclosure: "For our largest tables, full table scans could add significant S3 costs by accessing PBs of data from cheap Intelligent Tiers like Archive Instant Access. This is not obvious to users who are writing SQL to inspect data!"). Two new comparison-axis concepts: concepts/cold-storage-minimum-duration-tax (the structural failure mode of cold-tier classes — generalises beyond systems/aws-s3-glacier to Azure Archive / GCS Archive); the bet-asymmetry frame for IT-vs-Glacier (IT is bet-symmetric; Glacier punishes forecast errors).
Third pillar: usage-driven Apache Iceberg migration prioritisation — the same observability substrate ranks the migration backlog by active-table value, "providing Apache Iceberg's read performance benefits to the most valuable use cases first." Reframes the 2025-09-26 SAL axis as a multi-purpose observability substrate amortising across at least five named use cases: permission-debugging, cost attribution, access-based retention, partition-access visualisation, Iceberg migration prioritisation. Yelp contributors (acknowledged in the post): Rishi Madan (development); Infrastructure Security team (Vincent Thibault, Quentin Long, Nurdan Almazbekov) for enabling SAL across Yelp's AWS infrastructure. Future-work flag: "investing in other areas of our data infrastructure to further enhance lineage and granular usage attribution" — pairs lineage and granular usage attribution as complementary investments.
- 2026-04-22 —
How Yelp Keeps Server-Driven UI Consistent Across Four
Platforms. SDUI axis deepening (does not open a new
axis; extends the 2025-07-08 CHAOS axis with the codegen /
design-system-integration sub-layer).
Konbini auto-generates four
platform libraries (Kotlin/Swift/Python/TypeScript) from
one JSON interface definition per
Cookbook component; Jenkins
pipeline on every push republishes the four libs.
Canonical instance of
patterns/single-json-spec-to-multi-platform-codegen.
Also canonicalises
client spec files
(android@23.0) for per-request component-version negotiation, and migrate() functions for backporting on breaking component-version bumps (0.8 → 1.0). Design tokens (colours, icons, gradients, shadows) travel the wire as {name, raw_value} records, curated in a separate designer-owned repo. Follow-up to the 2025-07-08 CHAOS backend post; positions Yelp as the wiki's canonical design-system-aware SDUI framework disclosure.
- 2025-05-08 — Nrtsearch 1.0.0: Incremental Backups, Lucene 10, and More. Search-engine / Lucene-platform axis (opens an eighth Yelp axis on the wiki). >90% of prior Elasticsearch traffic now on Yelp's in-house Lucene engine Nrtsearch. Core unlock: incremental backup on commit. Every Lucene commit uploads only new immutable segment files to S3 (patterns/incremental-s3-backup-of-immutable-files), which makes S3 the source of truth for committed data and enables the primary's move from EBS to ephemeral local SSD with 5× bootstrap-download speedup for replicas (patterns/parallel-s3-download-for-bootstrap). State-management overhaul: concepts/immutable-index-state + patterns/decoupled-state-commit-from-data-commit + patterns/hot-reload-over-restart-replicas collapse a four-step change-propagation to two steps. Lucene 10.1.0 + Java 21 bring HNSW vector search (4096-element max, multiple similarity types, concepts/scalar-quantization, nested-document support, intra-merge parallelism), SIMD via the Java 21 Vector API, and the roadmap target of intra-single-segment parallel search (replacing concepts/virtual-sharding). Aggregations are integrated with parallel search (fixing Lucene facets' single-threaded tail latency). Added per-commit cost: "a few ms to 20 seconds." Coordinator-fronted (patterns/coordinator-fronted-sharded-search) virtual sharding remains the shape for large indices. No QPS / latency numbers — release-note voice.
- 2026-04-07 —
Zero downtime Upgrade: Yelp's Cassandra 4.x Upgrade Story.
Datastore-platform / database-upgrade axis (opens a
seventh Yelp axis on the wiki). Yelp Database Reliability
Engineering upgraded > 1,000
Cassandra nodes from 3.11
to 4.1 on Kubernetes with zero
downtime / zero incidents / zero client-code changes.
In-place over new-DC on time + cost +
EACH_QUORUM-preservation grounds. Init containers sequence IP + version changes through gossip (CASSANDRA-19244). Dual-run version-specific Stargate fleets span the MigrationCoordinator schema-fetch change. CDC Source Connector split-rollout around the 4.x commit-log write-point change (CASSANDRA-12148). Own-env benchmark: 4% p99 / 11% mean / 11% throughput; production: up to 58% p99 reduction on key clusters. Six-criterion qualification list upfront; three-stage automation script (pre-flight / flight / post-flight) with auto-proceed + per-step confirmation modes. Post-upgrade schema disagreement on CDC-enabled clusters remediated by dummy multi-node schema changes. Presented at KubeCon 2025.
- 2026-03-27 — Building Biz Ask Anything: From Prototype to Product. LLM product-and-serving axis, continuing the 2025-02-04 query-understanding serving axis at a higher altitude: full business-page Q&A. Nine-month prototype → production arc for Biz Ask Anything (BAA), the business-page evolution of Yelp Assistant. Data-layer reshape: three near-real-time indices (reviews, photos + embeddings, website/menu/Ask-the-Community) with <10
min streamed freshness for reviews/photos/business-facts,
weekly batches for slower sources;
Cassandra EAV for structured facts (business_id, field_name, field_group, value, update_ts); a single content-fetching engine returning "all or selected sources" at <100 ms p95. Four pre-retrieval classifiers run in parallel via async LangChain (patterns/parallel-pre-retrieval-classifier-pipeline): Trust & Safety + Inquiry Type (both fine-tuned GPT-4.1-nano, on a few thousand and ~7K samples respectively) + Content Source Selection + Keyword Generation (patterns/split-source-selection-from-keyword-generation); T&S rejection cancels downstream work. Answer quality mechanised as multi-dimensional LLM-as-judge graders via Langfuse (Correctness / Completeness / Evidence Relevance; Style & Tone deferred as "a lot harder"). Performance shift: prototype p75 10-20 s → external target 5 s → shipped <3 s. The biggest win was streaming via FastAPI SSE (migrated off Pyramid), plus OpenAI priority tier (~20% inference speedup) and async parallel classifier chains. Cost brought to 25% of prototype via fine-tuned small models + Aho-Corasick + sliding-window snippet extraction + biz-content cleanup + dynamic prompt composition replacing the monolithic system prompt. UX: switching suggested questions from LLM-generated category-level to business-content-derived lifted engagement by ~50% and lowered inability-to-answer by ~26%. Recursive application of Yelp's 2025-02-04 three-phase productionisation playbook at a different altitude (whole-question Q&A rather than query-understanding pre-retrieval).
- 2026-02-02 — How Yelp Built a Back-Testing Engine for Safer, Smarter Ad Budget Allocation.
Ad-tech / experimentation axis (opens a fifth Yelp axis); eight-component Back-Testing Engine simulating proposed Ad Budget Allocation algorithms against historical campaign data; production code as Git Submodules for fidelity; systems/catboost regressors as counterfactual-outcome predictor with Poisson-sampled integer counts; systems/scikit-optimize Bayesian search over YAML-declared search space; systems/mlflow as experiment store — first non-LLM-evaluation MLflow Seen-in on the wiki.
- 2025-09-26 —
S3 server access logs at scale. Storage / data-
engineering axis (opens a fourth Yelp axis); TiBs/day of
SAL compacted to Parquet (85 % storage + 99.99 % object-
count reduction); daily Tron-orchestrated Athena INSERTs
with idempotent self-LEFT-JOIN; Glue partition projection
with
enum over managed partitions; tag-based lifecycle expiration via S3 Batch Operations; weekly access-based retention via inventory ⋈ SAL at prefix granularity; measured SAL best-effort delivery (< 0.001 % arriving > 2 days late).
- 2025-07-08 —
Exploring CHAOS: Building a Backend for Server-Driven UI.
SDUI / client-platform-framework axis (opens a fourth Yelp
axis); CHAOS backend unpacked —
Apollo Federation subgraph in
Python Strawberry over
multiple team-owned REST backends; six-stage
FeatureProvider
lifecycle run as a
two-loop parallel
async build;
Register-based client capability matching drops features
on unsupported clients;
JSON-string parameters keep the GraphQL schema stable;
per-feature
error wrapper drops failed features without sinking the
view (unless
IS_ESSENTIAL_PROVIDER); advanced primitives: view flows and view placeholders.
- 2025-05-27 — Revenue Automation Series: Testing an Integration with Third-Party System. Financial-systems / integration-testing axis; six-step testing strategy for the Revenue Data Pipeline; parallel staging pipeline on Glue+Spectrum bypassing ~10-hour Redshift Connector latency; dual-cadence integrity checks (99.99% contract match threshold); Schema Validation Batch for pre-upload schema drift; SFTP standardised over REST for bulk daily upload.
- 2025-04-15 — Journey to Zero Trust Access. Corporate-security axis; Netbird replaces Ivanti Pulse Secure as the employee VPN; WireGuard mesh topology + router peers for <2s transparent failover; OIDC+device-posture via Okta; upstream contributions to Netbird main.
- 2025-02-19 — Revenue Automation Series: Building Revenue Data Pipeline.
- 2025-02-04 — Search query understanding with LLMs: from ideation to production.
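
The access-pattern plot described in the S3 cost-engineering entry above (partition date on y, access time on x, one colour per IAM role) can be approximated numerically. The sketch below is illustrative only: the function name classify_access_shape, the thresholds, and the (access day, partition day) tuple layout are assumptions made for this example, not details disclosed in Yelp's post. It labels one role's accesses as a diagonal (daily batch), a vertical line (backfill), or scatter (ad hoc).

```python
from datetime import date
from typing import Iterable, Tuple

# Hypothetical record layout: (day the access happened, dt= partition that was read).
AccessEvent = Tuple[date, date]


def classify_access_shape(
    events: Iterable[AccessEvent],
    max_lag_days: int = 1,        # assumed: how far behind "today" a daily batch may read
    max_backfill_days: int = 2,   # assumed: a backfill finishes within this many access days
    majority: float = 0.9,        # assumed: share of events that must sit on the diagonal
) -> str:
    """Label the accesses of one IAM role by the shape they would draw on the plot."""
    rows = list(events)
    if not rows:
        return "no-access"

    # Diagonal (y = x): the partition read tracks the access day closely.
    on_diagonal = sum(
        1
        for access_day, partition_day in rows
        if 0 <= (access_day - partition_day).days <= max_lag_days
    )
    if on_diagonal / len(rows) >= majority:
        return "daily-batch"

    # Vertical line: many distinct partitions read within very few access days.
    access_days = {access_day for access_day, _ in rows}
    partitions = {partition_day for _, partition_day in rows}
    if len(access_days) <= max_backfill_days and len(partitions) > len(access_days):
        return "backfill"

    # Anything else is treated as scatter.
    return "ad-hoc"


if __name__ == "__main__":
    daily = [(date(2025, 1, d), date(2025, 1, d)) for d in range(1, 11)]
    backfill = [(date(2025, 3, 1), date(2025, 1, d)) for d in range(1, 21)]
    print(classify_access_shape(daily), classify_access_shape(backfill))
```

A heuristic like this would only complement the plot; the entry's point is that the shapes are recognisable by eye once accesses are attributed to a role.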
Related¶
- systems/yelp-query-understanding
- systems/yelp-search
- systems/yelp-revenue-data-pipeline
- systems/yelp-spark-etl
- systems/yelp-billing-system
- systems/netbird
- systems/wireguard
- systems/okta
- systems/pulse-secure
- concepts/query-understanding
- concepts/long-tail-query
- concepts/retrieval-augmented-generation
- concepts/llm-cascade
- concepts/revenue-recognition-automation
- concepts/glossary-dictionary-requirement-translation
- concepts/data-gap-analysis
- concepts/pyspark-udf-for-complex-business-logic
- concepts/spark-etl-feature-dag
- concepts/checkpoint-intermediate-dataframe-debugging
- concepts/mysql-snapshot-to-s3-data-lake
- concepts/zero-trust-authorization
- concepts/vpn-to-zta-migration
- concepts/wireguard-mesh-topology
- concepts/router-peer
- concepts/sso-authentication
- patterns/head-cache-plus-tail-finetuned-model
- patterns/offline-teacher-online-student-distillation
- patterns/daily-mysql-snapshot-plus-spark-etl
- patterns/source-plus-transformation-feature-decomposition
- patterns/business-to-engineering-requirement-translation
- patterns/yaml-declared-feature-dag-topology-inferred
- patterns/oidc-plus-device-posture-access-gate
- patterns/open-source-for-security-response-agency
- patterns/upstream-contribution-parallel-to-in-house-integration
- systems/yelp-staging-pipeline
- systems/yelp-schema-validation-batch
- systems/yelp-redshift-connector
- systems/aws-glue
- systems/amazon-redshift
- systems/amazon-redshift-spectrum
- concepts/staging-pipeline
- concepts/data-integrity-checker
- concepts/redshift-connector-latency
- concepts/test-data-generation-for-edge-cases
- concepts/data-upload-format-validation
- patterns/parallel-staging-pipeline-for-prod-verification
- patterns/monthly-plus-daily-dual-cadence-integrity-check
- patterns/schema-validation-pre-upload-via-mapping-api
- patterns/sftp-for-bulk-daily-upload
- systems/yelp-s3-sal-pipeline
- systems/tron
- systems/aws-s3
- systems/amazon-athena
- systems/apache-parquet
- systems/s3-batch-operations
- systems/s3-inventory
- concepts/s3-server-access-logs
- concepts/partition-projection
- concepts/best-effort-log-delivery
- concepts/athena-shared-resource-contention
- concepts/user-controlled-log-fields
- concepts/url-encoding-idiosyncrasy-s3-keys
- patterns/raw-to-columnar-log-compaction
- patterns/object-tagging-for-lifecycle-expiration
- patterns/idempotent-athena-insertion-via-left-join
- patterns/projection-partitioning-over-managed-partitions
- patterns/s3-access-based-retention
- patterns/optional-non-capturing-tail-regex
- systems/yelp-back-testing-engine
- systems/yelp-ad-budget-allocation
- systems/scikit-optimize
- systems/catboost
- systems/mlflow
- concepts/filter-before-ab-test
- concepts/hybrid-backtesting-with-ml-counterfactual
- concepts/counterfactual-outcome-prediction
- concepts/poisson-sampling-for-integer-outcomes
- concepts/bayesian-optimization-over-parameter-space
- concepts/overfitting-to-historical-data
- patterns/production-code-as-submodule-for-simulation
- patterns/historical-replay-with-ml-outcome-predictor
- patterns/yaml-declared-experiment-config
- systems/yelp-chaos
- systems/apollo-federation
- systems/strawberry-graphql
- systems/yelp-konbini
- systems/yelp-cookbook
- systems/jenkins
- concepts/server-driven-ui
- concepts/register-based-client-capability-matching
- concepts/json-string-parameters-for-schema-stability
- concepts/single-source-interface-spec
- concepts/client-spec-version
- concepts/component-version-migrate-function
- concepts/breaking-change-requires-major-bump
- concepts/design-token-as-named-reference
- patterns/federated-graphql-subgraph-per-domain
- patterns/feature-provider-lifecycle
- patterns/two-loop-parallel-async-build
- patterns/error-isolation-per-feature-wrapper
- patterns/preloaded-view-flow-for-predictable-navigation
- patterns/view-placeholder-async-embed
- patterns/single-json-spec-to-multi-platform-codegen
- patterns/spec-version-negotiation-for-backward-compat
- patterns/migrate-function-for-component-downgrade
- systems/apache-cassandra
- systems/stargate-cassandra-proxy
- systems/cassandra-source-connector
- systems/kubernetes-init-containers
- systems/kubernetes
- systems/spark-cassandra-connector
- systems/yelp-pushplan-automation
- concepts/rolling-upgrade
- concepts/mixed-version-cluster
- concepts/in-place-vs-new-dc-upgrade
- concepts/performance-regression-from-mid-upgrade-state
- concepts/init-container-ip-gossip-pre-migration
- concepts/cassandra-cdc-commit-log
- concepts/schema-disagreement
- concepts/anti-entropy-repair-pause
- concepts/checkpointed-automation-script
- concepts/observability-before-migration
- patterns/version-specific-images-per-git-branch
- patterns/pre-flight-flight-post-flight-upgrade-stages
- patterns/dual-run-version-specific-proxies
- patterns/benchmark-in-own-environment-before-upgrade
- patterns/production-qualification-criteria-upfront
- systems/nrtsearch
- systems/lucene
- systems/hnsw
- concepts/incremental-backup-on-commit
- concepts/immutable-segment-file
- concepts/ephemeral-local-disk-vs-ebs
- concepts/immutable-index-state
- concepts/scalar-quantization
- concepts/virtual-sharding
- patterns/incremental-s3-backup-of-immutable-files
- patterns/parallel-s3-download-for-bootstrap
- patterns/hot-reload-over-restart-replicas
- patterns/decoupled-state-commit-from-data-commit
- patterns/coordinator-fronted-sharded-search