Yelp¶
Yelp Engineering blog (engineeringblog.yelp.com) is a Tier-3 source on the sysdesign-wiki. Yelp operates the local-business-discovery platform (reviews, ratings, search, photos) for US / Canada / parts of Europe; the platform combines a curated business-graph (millions of SKUs in the catalog sense), user-generated reviews, and a search stack that routes the raw query through a query-understanding layer before retrieval and ranking.
Per AGENTS.md Tier-3 guidance, Yelp posts are ingested selectively — only where they explicitly cover distributed-systems internals, scaling trade-offs, infrastructure architecture, production incidents, storage/networking/streaming design, or — as with the 2025-02-04 LLM post — a concrete production serving-infrastructure architecture built around LLMs (distinct from pure ML research).
Wiki anchor¶
Eight on-scope Yelp ingests establish the wiki's Yelp coverage across seven distinct stack altitudes (the seventh axis, datastore platform / database upgrade, was added with the 2026-04-07 Cassandra 4.x upgrade ingest):
- 2025-02-04 — search query understanding with LLMs (LLM / serving-infra axis). Yelp's first-party disclosure of the production serving architecture for three query-understanding tasks (segmentation, spell correction, review-highlight phrase expansion). Canonicalises Yelp's reusable three-phase productionisation playbook (Formulation → Proof of Concept → Scaling Up) and a three-tier serving cascade (pre-computed head cache → offline fine-tuned GPT-4o-mini for 95%+ of traffic → BERT/T5 realtime tail). As a serving architecture, it is the earliest wiki canonical instance of head-cache-plus-tail applied to LLM-driven search query understanding — pre-dating Instacart's 2025-11-13 Intent Engine canonicalisation by nine months.
- 2025-02-19 — Revenue Automation Series: Building Revenue Data Pipeline (financial-systems / data-platform axis). Yelp's second Revenue Automation Series post (after the 2024-12 billing-system modernisation) on how Yelp built a batch Revenue Data Pipeline feeding a third-party Revenue Recognition SaaS ("REVREC service") to close the books ~50% faster. Documents the methodology stack (Glossary Dictionary → Data Gap Analysis → system-design evaluation across four architectures), architectural selection (Data Lake + Spark ETL wins), and concrete Spark implementation (internal spark-etl package with feature-DAG abstraction, YAML topology-inferred DAG, checkpoint-to-scratch debugging, PySpark UDFs for complex business logic). Three other architectures (MySQL+Python batch, Warehouse+dbt, Event Streams) are explicitly rejected with load-bearing reasons.
- 2025-04-15 — Journey to Zero Trust Access (corporate-security / networking axis). Yelp's Corporate Systems + Client Platform Engineering teams retire Ivanti Pulse Secure as the employee VPN in favour of Netbird, an open-source WireGuard-based ZTA platform. Five named selection pillars: Okta/OIDC, simple UI, open source, throughput/latency, fault tolerance. Load-bearing architectural disclosures: WireGuard mesh topology with router peers provides <2s transparent failover; OIDC+device-posture access gate replaces SAML-via-Pulse flow; open source provides both response agency and realised upstream contribution ("multiple changes ... pushed upstream to Netbird's main branch from Yelpers"). Canonical instance of concepts/vpn-to-zta-migration as a motion rather than a flip — Netbird coexists with Yelp's MTLS-based Edge Gateway, with VPN utilisation "reducing ... to more granular use cases in the future". Opens Yelp's corporate-security-and-networking axis on the wiki, distinct from the 2025-02-04 search-ML axis and the 2025-02-19 financial-systems axis.
- 2025-05-27 — Revenue Automation Series: Testing an Integration with Third-Party System (financial-systems / integration-testing deepening). Third Revenue Automation Series post — documents how Yelp verified the pipeline built in 2025-02-19 rather than building anything new. Six-step strategy: (1) a parallel Staging Pipeline consuming production data but publishing to Glue catalog tables on S3 queryable immediately via Redshift Spectrum — bypassing the ~10-hour Redshift Connector latency that makes same-day verification through the production path infeasible; (2) manual test-data backport from production edge cases to dev fixtures; (3) dual-cadence integrity checks (99.99% contract-invoice match threshold for the monthly against billing-system truth; daily lightweight SQL against staging for fast iteration); (4) Schema Validation Batch polling the REVREC mapping API before each upload to guard against partner-side schema drift; (5) SFTP standardised over REST after testing both (reliability + file-size ceiling: 500k-700k records/file SFTP vs 50k/file REST → 4-5 files/day vs 15 files/day); (6) documented escalation for third-party SFTP server / upload-job / storage failures. Deepens the 2025-02-19 axis rather than opening a new one.
- 2025-07-08 — Exploring CHAOS: Building a Backend for Server-Driven UI (SDUI / client-platform-framework axis; opens a fourth Yelp axis on the wiki). First-party unpacking of the CHAOS backend — the server-driven-UI framework that authors per-request view configurations (views + layouts + components + actions) that Yelp's iOS, Android, and web clients render. Canonical instance of concepts/server-driven-ui. Three architectural layers disclosed: (1) GraphQL surface — a Yelp-internal Apollo Federation subgraph implemented in Python via Strawberry, fronting multiple team-owned REST backends all conforming to a common CHAOS REST API. Canonical instance of patterns/federated-graphql-subgraph-per-domain. (2) Build pipeline — ChaosConfigBuilder → ViewBuilder → LayoutBuilder → FeatureProvider composition with a six-stage feature-provider lifecycle (registers → is_qualified_to_load → load_data → resolve → is_qualified_to_present → result_presenter) executed as a two-loop parallel async build (loop 1 fans out upstream calls, loop 2 awaits + composes — max-latency not sum-latency). (3) Advanced primitives — preloaded view flows (subsequent_views() + the chaos.open-subsequent-view.v1 action) for predictable sequential navigation without extra round-trips (customer-support FAQ menu example) and view placeholders (ViewPlaceholderV1) for lazy-loaded nested views served by different CHAOS backends (Reminders embedded in Yelp for Business home screen). Three load-bearing correctness mechanisms canonicalised: (a) Register-based client capability matching — first-match Condition(platform=[...], library=[required components and actions]) decides whether a feature is included for this client or dropped, the mechanism that keeps old app versions rendering while new components ship; (b) JSON-string parameters for schema stability — element content carried as opaque JSON inside a stable GraphQL schema so new elements / versions ship without schema churn, with backend Python dataclasses type-checking the payload; (c) error isolation per feature wrapper — an @error_decorator wraps every FeatureProvider; exceptions drop the feature (not the view), unless marked IS_ESSENTIAL_PROVIDER = True; telemetry logs feature name + owner + exception + request context for threshold-based alerting. Post flags that "the latest CHAOS backend framework introduces the next generation of builders using Python asyncio" — the two-loop iteration is a transitional design. No operational numbers (latency, RPS, cache hit rates); walkthrough not retrospective.
- 2025-09-26 — S3 server access logs at scale (storage / data-engineering axis; opens a fifth Yelp axis on the wiki). First-party retrospective on operationalising S3 Server Access Logs (SAL) at fleet scale ("TiBs of S3 server access logs per day"). Canonicalises the Yelp S3 SAL pipeline: daily Tron batch that converts raw-text SAL objects to Parquet via Athena INSERTs (patterns/raw-to-columnar-log-compaction) achieving 85 % storage + 99.99 % object-count reduction; weekly access-based table joining S3 Inventory with a week of SAL for prefix-granularity retention; tag-based lifecycle expiration via S3 Batch Operations ("the only scalable way to delete per object" — patterns/object-tagging-for-lifecycle-expiration). Load-bearing architectural disclosures: Glue partition projection with enum over managed partitions (patterns/projection-partitioning-over-managed-partitions); idempotent Athena INSERT via self-LEFT-JOIN on requestid for shared-resource retry-safety; SAL's best-effort delivery measured at < 0.001 % arriving more than 2 days late. Three parsing hazards canonicalised: user-controlled log fields (unescaped quotes / SQLi / shellshock in request_uri / referrer / user_agent) with Yelp's optional non-capturing tail regex fix; URL-encoding idiosyncrasy on S3 keys (most ops double-encode; BATCH.DELETE.OBJECT / S3.EXPIRE.OBJECT single-encode). Preferred over AWS's CloudTrail Data Events on cost: "$1 per million data events — that could be orders of magnitude higher!" First Yelp storage-axis ingest; opens axis distinct from the 2025-02-04 LLM / search-serving-infra axis, the 2025-02-19 / 2025-05-27 financial-systems axis, and the 2025-04-15 corporate-security axis.
- 2026-02-02 — Back-Testing Engine for Ad Budget Allocation (ad-tech / experimentation axis; opens a sixth Yelp axis on the wiki). First-party disclosure of the Back-Testing Engine — an eight-component hybrid back-testing + simulation system that evaluates proposed changes to the Ad Budget Allocation algorithms against historical campaign data before committing to A/B tests. Canonical instance of concepts/filter-before-ab-test (back-testing filters the candidate space; A/B validates the survivors) and concepts/hybrid-backtesting-with-ml-counterfactual (historical inputs + alternative code path + ML-predicted counterfactual outcomes). Five load-bearing architectural moves canonicalised: (1) Production code as Git Submodules (patterns/production-code-as-submodule-for-simulation) — Budgeting and Billing repos included as submodules pointing at branches under test; "blurs the line between prototyping and production"; (2) CatBoost regressors as counterfactual-outcome predictor (systems/catboost, concepts/counterfactual-outcome-prediction) — same models for all candidates so cross-candidate deltas are attributable to the algorithm, not predictor noise; (3) Poisson sampling over expected values (concepts/poisson-sampling-for-integer-outcomes) — converts the regressor's smooth averages into realistic integer counts; (4) Scikit-Optimize Bayesian search over a YAML-declared search space — 25 max_evals budget, learns from prior candidates; grid + listed search also supported but "not really an optimizer, just a wrapper that yields the next candidate"; (5) MLflow as experiment store + visualization substrate — first non-LLM MLflow Seen-in on the wiki. Overfitting-to-historical-data named explicitly as a limitation (concepts/overfitting-to-historical-data), mitigated organisationally by keeping A/B tests in the loop. Opens ad-tech / experimentation axis distinct from the five prior axes.
- 2026-04-07 — Zero downtime Upgrade: Yelp's Cassandra 4.x Upgrade Story (datastore-platform / database-upgrade axis; opens a seventh Yelp axis on the wiki). Yelp Database Reliability Engineering upgraded > 1,000 Cassandra nodes from 3.11 to 4.1 on Kubernetes with zero downtime + zero incidents + zero client-code changes. Canonical wiki instance of in-place over new-DC at fleet scale (new-DC rejected on time + cost + EACH_QUORUM-preservation grounds). Five upgrade patterns canonicalised from this post: (1) version-specific images per Git branch (3.11 and 4.1 images published from dedicated branches, env-var selected at bootstrap); (2) pre-flight / flight / post-flight three-stage upgrade (schema-agreement gate + anti-entropy-repair pause → one-node-at-a-time rolling with dual-Stargate → repairs + schema-changes re-enabled); (3) dual-run version-specific proxies for Stargate around Cassandra 4.1's MigrationCoordinator schema-fetch change, with service-mesh alias so clients see one endpoint; (4) benchmark in your own environment (own-env measurement of 4% p99 + 11% mean + 11% throughput aligned with the DataStax whitepaper direction of travel and built confidence that later unlocked the 58% p99 reduction observed in production); (5) production qualification criteria upfront (six-criterion list — performance / functional / security / rollback / observability / component-health — evaluated per cluster). Three Cassandra-specific concepts canonicalised: init-container IP-gossip pre-migration sequencing IP + version changes into two distinct gossip events (CASSANDRA-19244); CDC commit-log write-point change (flush → mutation, CASSANDRA-12148) that breaks CDC consumers at the major-version boundary; schema disagreement on CDC-enabled clusters remediated by dummy multi-node schema changes. Two general-purpose concepts canonicalised: mixed-version cluster as a named operational state; transient mid-upgrade regression vs genuine regression distinction. First wiki first-party operator retrospective on a Cassandra major upgrade — every prior Cassandra Seen-in was third-party explainer or downstream user. Also surfaces Yelp's Cassandra Source Connector (CDC → Kafka) and the Stargate proxy as distinct wiki systems.
Key systems¶
LLM / search-serving-infra axis (2025-02-04)¶
- systems/yelp-query-understanding — the named LLM-powered query-understanding pipeline. Multiple tasks (segmentation + spell correction, review-highlight phrase expansion) with RAG side-inputs (business names viewed for query; top business categories from in-house predictive model) feeding into a cascaded serve path.
- systems/yelp-search — the parent production context; consumer of the query-understanding outputs (location rewrite into geobox; token-probability of name tag into ranking).
Financial-systems / data-platform axis (2025-02-19)¶
- systems/yelp-revenue-data-pipeline — the named batch data pipeline that feeds a third-party Revenue Recognition SaaS. Data Lake + Spark ETL architecture; daily MySQL snapshots → S3 → spark-etl-orchestrated feature DAG → REVREC-template output to S3 → REVREC service → 50% faster book-close.
- systems/yelp-spark-etl — Yelp's internal PySpark orchestration package. Feature-based DAG abstraction (web-API-shaped source + transformation sub-types), YAML topology-inferred DAG declaration, checkpoint-to-scratch debugging via --checkpoint flag + systems/jupyterhub.
- systems/yelp-billing-system — Yelp's custom order-to-cash system; stub page pending ingest of the 2024-12 billing-system modernisation post. Upstream source-of-truth for revenue contracts / invoices / fulfillment events.
- systems/yelp-staging-pipeline — the parallel verification pipeline (2025-05-27). Runs the same code path on production data, publishes to AWS Glue tables on S3, queryable immediately via Redshift Spectrum. Bypasses the ~10-hour Redshift Connector latency for same-day verification loops.
- systems/yelp-schema-validation-batch — the pre-upload guard (2025-05-27). Polls the third-party REVREC mapping API (REST) before each upload and aborts on schema mismatch on any of three axes (date format, column name, column data type).
- systems/yelp-redshift-connector — Yelp's named Data Connector that publishes from Data Pipeline streams to Redshift. The ~10-hour latency of this connector is the specific constraint that motivates the staging-pipeline + Glue + Spectrum bypass.
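The pre-upload guard listed above can be pictured as a pure comparison between what a batch is about to send and what the partner currently expects, checked on the three named axes (column name, column data type, date format). This is a minimal sketch under stated assumptions, not Yelp's code: `ColumnSpec`, `validate_upload`, and the example columns are hypothetical, and the expected mapping would in practice come from the REVREC mapping API, whose response shape is not described here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical description of one column in the partner's mapping."""
    data_type: str                      # e.g. "string", "decimal", "date"
    date_format: Optional[str] = None   # strptime format, date columns only


def validate_upload(expected, actual, sample_row):
    """Return a list of mismatches; an empty list means the upload may proceed."""
    errors = []
    # Axis 1: column names.
    for name in sorted(set(expected) - set(actual)):
        errors.append(f"missing column: {name}")
    for name in sorted(set(actual) - set(expected)):
        errors.append(f"unexpected column: {name}")
    for name in sorted(set(expected) & set(actual)):
        want, have = expected[name], actual[name]
        # Axis 2: column data types.
        if want.data_type != have.data_type:
            errors.append(f"type mismatch on {name}: {have.data_type} != {want.data_type}")
            continue
        # Axis 3: date format, checked against a sample value from the batch.
        if want.data_type == "date":
            try:
                datetime.strptime(sample_row[name], want.date_format)
            except (KeyError, TypeError, ValueError):
                errors.append(f"date format mismatch on {name}: expected {want.date_format}")
    return errors


# Hypothetical columns; the partner has switched its expected date format.
ours = {"invoice_id": ColumnSpec("string"), "invoice_date": ColumnSpec("date", "%Y-%m-%d")}
theirs = {"invoice_id": ColumnSpec("string"), "invoice_date": ColumnSpec("date", "%m/%d/%Y")}
problems = validate_upload(theirs, ours, {"invoice_id": "inv-1", "invoice_date": "2025-05-27"})
if problems:
    print("aborting upload:", problems)
```

Returning a list of problems rather than raising on the first one lets a single run report all of the drift at once.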
Corporate-security / networking axis (2025-04-15)¶
- systems/netbird — the open-source WireGuard-based Zero Trust Access platform Yelp chose; five named selection pillars (Okta/OIDC, simple UI, open source, high throughput, fault tolerance). Canonical instance of mesh topology with router peers for transparent <2s failover. Yelp has contributed upstream fixes.
- systems/wireguard — the data-plane protocol under Netbird; Yelp's deployment contributed a new corporate-ZTA altitude Seen-in on the wiki's WireGuard page (distinct from Fly.io's gateway-mesh altitude).
- systems/okta — Yelp's OIDC identity provider for the ZTA substrate; enforces device-posture policies.
- systems/pulse-secure — the retired predecessor VPN ("a more reliable solution" was needed). Only-instance wiki page.
Storage / data-engineering axis (2025-09-26)¶
- systems/yelp-s3-sal-pipeline — the named Yelp system for operationalising S3 Server Access Logs (SAL) at fleet scale. Daily Tron batch compacting TiBs/day of raw-text SAL into Parquet via Athena; weekly access-based retention table; tag-based lifecycle expiration via S3 Batch Operations or direct per-object tagging. 85 % storage + 99.99 % object-count reduction from compaction.
- systems/tron — Yelp's in-house batch processing system (open-source: github.com/Yelp/Tron); orchestrates the daily SAL compaction and the weekly access-based-retention build. First wiki page.
- systems/s3-batch-operations — AWS's per-bucket batch-job primitive for PutObjectTagging fanout. Flat $0.25 per bucket per job — the biggest cost contributor for low-volume buckets, driving Yelp's two-scale dispatch rule (direct tag for low-volume, Batch Ops for high-volume).
- systems/s3-inventory — the daily object-listing used as one side of the access-based retention join with a week of SAL.
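Because the daily compaction runs on shared Athena capacity and may be retried, the insert it performs has to be safe to run twice; this page describes Yelp's approach as a self-LEFT-JOIN on requestid. The sketch below illustrates that shape only. It uses SQLite as a local stand-in for Athena, and the table names (`raw`, `compacted`) and the `day` and `line` columns are hypothetical.

```python
import sqlite3

# Anti-join insert: copy only rows whose requestid is not already in the target.
IDEMPOTENT_INSERT = """
INSERT INTO compacted (requestid, day, line)
SELECT src.requestid, src.day, src.line
FROM raw AS src
LEFT JOIN compacted AS dst
  ON dst.requestid = src.requestid
 AND dst.day = :day            -- partition filter repeated in ON
WHERE src.day = :day           -- and again in WHERE
  AND dst.requestid IS NULL    -- keep only rows not yet inserted
"""


def run_compaction(conn, day):
    """Insert one day's rows; calling this again for the same day adds nothing."""
    with conn:
        conn.execute(IDEMPOTENT_INSERT, {"day": day})


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw (requestid TEXT, day TEXT, line TEXT);
CREATE TABLE compacted (requestid TEXT, day TEXT, line TEXT);
INSERT INTO raw VALUES
  ('r1', '2025-09-26', 'GET a'),
  ('r2', '2025-09-26', 'GET b'),
  ('r3', '2025-09-27', 'GET c');
""")

run_compaction(conn, "2025-09-26")
run_compaction(conn, "2025-09-26")  # simulated retry after a throttled attempt
rows_after_retry = conn.execute("SELECT COUNT(*) FROM compacted").fetchone()[0]
```

The same query also repairs a partially completed run, since rows that did land are skipped and only the missing ones are inserted.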
SDUI / client-platform-framework axis (2025-07-08)¶
- systems/yelp-chaos — the named server-driven-UI framework. Per-request composition of views + layouts + components + actions, delivered as a JSON-stable GraphQL configuration. Six-stage FeatureProvider lifecycle runs features in a two-loop parallel async build; per-feature error isolation keeps bad features from taking down the whole view. Advanced primitives: preloaded view flows for predictable navigation, view placeholders for lazy nested views served by different backends.
- systems/apollo-federation — the GraphQL substrate. Yelp's Supergraph router composes the CHAOS Subgraph with every other per-domain subgraph. Canonical wiki instance of patterns/federated-graphql-subgraph-per-domain; CHAOS adds the twist that the subgraph itself fronts multiple team-owned REST backends, not a single data store.
- systems/strawberry-graphql — Python GraphQL library Yelp picked for the CHAOS Subgraph to "leverage type-safe schema definitions and Python's type hints." First wiki instance.
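The two-loop build and the per-feature error isolation described for CHAOS can be illustrated with a small sketch. It is an assumption-laden illustration, not the framework's code: the `FeatureProvider` base class and `build_view` are hypothetical, only three of the six lifecycle stages are modelled, and asyncio tasks stand in for whatever concurrency mechanism the framework actually uses.

```python
import asyncio


class FeatureProvider:
    """Hypothetical base class; method names echo the lifecycle stages above."""
    name = "feature"
    IS_ESSENTIAL_PROVIDER = False

    def is_qualified_to_load(self, request):
        return True

    async def load_data(self, request):
        return None

    def resolve(self, data):
        return {"feature": self.name, "data": data}


async def build_view(providers, request):
    # Loop 1: start every qualified provider's upstream call without waiting,
    # so total latency tracks the slowest call rather than the sum of all calls.
    pending = []
    for provider in providers:
        if provider.is_qualified_to_load(request):
            pending.append((provider, asyncio.create_task(provider.load_data(request))))

    # Loop 2: await each result and compose; a failure drops only that feature.
    components = []
    for provider, task in pending:
        try:
            components.append(provider.resolve(await task))
        except Exception:
            if provider.IS_ESSENTIAL_PROVIDER:
                for _, other in pending:
                    other.cancel()  # the view cannot render, so stop the rest
                raise
            # A real system would log feature name, owner, and request context here.
    return components


class Header(FeatureProvider):
    name = "header"

    async def load_data(self, request):
        await asyncio.sleep(0.01)  # stands in for an upstream call
        return "Welcome"


class Promo(FeatureProvider):
    name = "promo"

    async def load_data(self, request):
        raise RuntimeError("upstream timeout")


class AndroidBanner(FeatureProvider):
    name = "android_banner"

    def is_qualified_to_load(self, request):
        return request["platform"] == "android"


view = asyncio.run(build_view([Header(), Promo(), AndroidBanner()], {"platform": "ios"}))
```

Here the failing promo feature and the unqualified Android banner are both left out, and the header still renders.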
Ad-tech / experimentation axis (2026-02-02)¶
- systems/yelp-back-testing-engine — the named eight-component system that simulates proposed ad-budget-allocation algorithm changes against historical campaign data. Canonicalises the hybrid back-testing + simulation methodology (historical inputs + alternative code path + ML-predicted counterfactual outcomes) and positions back-testing as the discovery phase upstream of A/B validation.
- systems/yelp-ad-budget-allocation — the parent system the Back-Testing Engine simulates; splits each campaign's monthly budget between on-platform Yelp inventory and the off-platform Yelp Ad Network, with day-by-day budget decisions that depend on previous days' outcomes (the feedback loop that makes naive aggregate-math simulation wrong).
- systems/scikit-optimize — the Bayesian-optimization library Yelp uses as the Engine's default optimizer; the other search types (grid, listed) are "just wrappers".
- systems/catboost — the gradient-boosted regressors predicting impressions / clicks / leads from budget + campaign features. Non-parametric by design to capture diminishing returns; shared across candidates for fair comparison.
- systems/mlflow — the experiment store + visualization substrate; first non-LLM-evaluation MLflow Seen-in on the wiki, reinforcing MLflow as domain-general experiment database rather than LLM-specific.
Datastore-platform / database-upgrade axis (2026-04-07)¶
- systems/apache-cassandra — the target datastore. Yelp's > 1,000-node Cassandra fleet runs on Kubernetes via a Cassandra operator; this ingest is the wiki's first first-party operator retrospective on a Cassandra major-version upgrade.
- systems/stargate-cassandra-proxy — the DataStax open-source Cassandra data-proxy; runs as two version-specific fleets in parallel during Yelp's upgrade window under a single service-mesh alias.
- systems/cassandra-source-connector — Yelp's in-house CDC → Kafka bridge; two components split on rollout (DataPipeline Materializer fleet-wide pre-upgrade; CDC Publisher in-lockstep per pod).
- systems/kubernetes-init-containers — the Kubernetes primitive Yelp uses to sequence the simultaneous new IP + new Cassandra version change into two distinct gossip-observable events (CASSANDRA-19244).
- systems/spark-cassandra-connector — listed as a component that had to be made 4.1-compatible; Yelp's use documented in an earlier 2024-09 post on direct Spark→Cassandra ingestion for ML pipelines.
- systems/yelp-pushplan-automation — Yelp's declarative Cassandra schema-change system; user-initiated schema changes were disabled for the duration of each cluster upgrade.
Key concepts and patterns¶
LLM / search-serving-infra axis (2025-02-04)¶
- concepts/query-frequency-power-law-caching — the infrastructure primitive that makes LLMs economically viable for query understanding; pre-compute expensive-LLM output for head queries above a frequency threshold.
- concepts/implicit-query-location-rewrite — the canonical downstream-consumer of segmentation output; rewrites the geobox sent to the search backend when location-intent confidence is high.
- concepts/review-highlight-phrase-expansion — the creative-generation task family (expand a query into semantically-adjacent phrases for review-snippet matching).
- concepts/token-probability-as-ranking-signal — the trick of retaining LLM output-token probabilities past the discrete decision as a continuous feature for downstream ranking; Yelp's segmentation name-tag probability feeds business-name matching.
- concepts/llm-segmentation-over-ner — why LLMs supplant traditional Named-Entity-Recognition models for query segmentation: flexible class customisation + no internal-taxonomy leakage into the training problem.
- patterns/three-phase-llm-productionization — Yelp's Formulation → Proof of Concept → Scaling Up playbook; the reusable shape behind any LLM-at-production-scale rollout.
- patterns/rag-side-input-for-structured-extraction — the generalised pattern behind Yelp's two RAG instances (business names as segmentation grounding; business categories as review-highlight grounding).
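The head-cache idea behind concepts/query-frequency-power-law-caching, combined with the serving cascade summarised earlier on this page, can be sketched in a few lines. Everything named here is hypothetical (`build_head_cache`, `understand`, the threshold, and the stub models); how Yelp keys or stores each tier is not stated on this page, so treat this as one plausible shape rather than the production design.

```python
from collections import Counter


def build_head_cache(query_log, expensive_llm, min_count):
    """Pre-compute the expensive model's output for frequent (head) queries."""
    counts = Counter(query_log)
    return {q: expensive_llm(q) for q, n in counts.items() if n >= min_count}


def understand(query, head_cache, offline_table, realtime_model):
    """Serve from the cheapest tier that already has an answer."""
    if query in head_cache:
        return head_cache[query], "head-cache"
    if query in offline_table:
        return offline_table[query], "offline-batch"
    # Never-seen queries fall through to a small model that runs per request.
    return realtime_model(query), "realtime-tail"


# Stub models and a toy log, purely for illustration.
log = ["pizza near me"] * 5 + ["sushi sf"] * 2 + ["vegan brunch oakland"]
head = build_head_cache(log, expensive_llm=lambda q: f"big:{q}", min_count=3)
offline = {"sushi sf": "small:sushi sf"}
realtime = lambda q: f"tiny:{q}"

served = [understand(q, head, offline, realtime)
          for q in ("pizza near me", "sushi sf", "late night ramen")]
```

Because query frequency is heavily skewed, a modest `min_count` can cover a large share of traffic with a small cache; the right threshold is an empirical choice.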
Corporate-security / networking axis (2025-04-15)¶
- concepts/zero-trust-authorization — the strategic framing Yelp adopted explicitly: ZTA as direction of travel, "reducing VPN utilization and creating more fine grained access control structures in the future, as opposed to broad, binary policies on huge subnets and network segments."
- concepts/vpn-to-zta-migration — the motion; Netbird as intermediate state, MTLS Edge Gateway as end state; new concept canonicalised on Yelp's ingest.
- concepts/wireguard-mesh-topology — the HA primitive (clients have 1-to-many binding to router peers).
- concepts/router-peer — Netbird's named egress-peer role; new concept canonicalised on Yelp's ingest.
- concepts/sso-authentication — Yelp's LDAP → SAML → OIDC auth ladder; OIDC chosen because it supports device-posture policies.
- patterns/oidc-plus-device-posture-access-gate — Yelp's Okta+Netbird integration shape; new pattern canonicalised on Yelp's ingest.
- patterns/open-source-for-security-response-agency — the OSS lever Yelp cited first: "if critical security issues ever arose, we would not be beholden to the maintainers alone." New pattern canonicalised on Yelp's ingest.
- patterns/upstream-contribution-parallel-to-in-house-integration — Yelp's realised contribution loop to Netbird main.
Financial-systems / data-platform axis (2025-02-19)¶
- concepts/revenue-recognition-automation — the domain primitive; why the whole Revenue Data Pipeline exists. Governed by ASC 606 / IFRS 15; automated via SaaS vendors (Yelp names the integration abstractly as "REVREC service"); produces real-time revenue reports + 50% book-close speed-up.
- concepts/glossary-dictionary-requirement-translation — Yelp's three-step methodology for converting accountant-voice requirements to engineering-voice: Glossary Dictionary + Purpose + Example Calculation + Engineering Rewording.
- concepts/data-gap-analysis — the two-axis output format (immediate approximation / composite data vs long-term structural fix) for reconciling Yelp's custom data model against a third-party template.
- concepts/pyspark-udf-for-complex-business-logic — when to reach for UDFs over window-function-only implementations. Canonical worked example: multi-priority discount application with 5 business rules.
- concepts/spark-etl-feature-dag — the feature abstraction (web-API-shaped source + transformation features) that structures Yelp's Spark ETL pipelines.
- concepts/checkpoint-intermediate-dataframe-debugging — materialising intermediate DataFrames to a scratch S3 path as a substitute for breakpoint debugging on distributed + lazy Spark.
- concepts/mysql-snapshot-to-s3-data-lake — the reproducibility primitive. Frozen daily snapshot → same input → same output regardless of rerun timing.
- patterns/daily-mysql-snapshot-plus-spark-etl — Yelp's chosen Revenue Data Pipeline architecture after explicitly rejecting MySQL+Python batch, Warehouse+dbt, and Event Streams.
- patterns/source-plus-transformation-feature-decomposition — the two-sub-type feature split (thin source-snapshot features + rich transformation features) that makes the Spark DAG maintainable.
- patterns/business-to-engineering-requirement-translation — the portable methodology lifted from revenue-recognition to any cross-functional requirement handoff.
- patterns/yaml-declared-feature-dag-topology-inferred — list features as YAML nodes; let the runtime topologically sort from each feature's declared sources field. DRY config; one-line feature addition.
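Topology inference from a declared sources field amounts to a topological sort over the feature graph. The sketch below shows the idea with Python's standard library; the feature names are hypothetical, the dictionary stands in for an already-parsed YAML file, and nothing here reflects the actual spark-etl API.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical features, shaped like what a parsed YAML file might yield.
FEATURES = {
    "revrec_output": {"sources": ["discounts_applied"]},
    "discounts_applied": {"sources": ["contracts_snapshot", "invoices_snapshot"]},
    "contracts_snapshot": {"sources": []},
    "invoices_snapshot": {"sources": []},
}


def execution_order(features):
    """Return feature names ordered so every feature runs after its sources."""
    graph = {name: set(spec.get("sources", [])) for name, spec in features.items()}
    undeclared = {src for deps in graph.values() for src in deps} - set(graph)
    if undeclared:
        raise ValueError(f"undeclared sources: {sorted(undeclared)}")
    # static_order() raises graphlib.CycleError if the declarations form a cycle.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(FEATURES)
```

Adding a feature is then a one-entry change: declare it with its sources and the ordering follows, regardless of where in the file it appears.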
Financial-systems / integration-testing axis (2025-05-27)¶
- concepts/staging-pipeline — parallel pipeline configuration running the code-under-test on production data, publishing to a verification-friendly substrate. Canonicalised on this ingest.
- concepts/data-integrity-checker — the cadence-aware monitor that compares pipeline output against an independent truth; Yelp's four named metrics (match rate, mismatch, left orphans, duplicates) become the canonical set.
- concepts/redshift-connector-latency — the ~10-hour publication delay from Yelp's Data Pipeline streams to Redshift that motivates the Glue+Spectrum bypass; new concept canonicalised on this ingest.
- concepts/test-data-generation-for-edge-cases — the discipline of backporting production edge cases to dev fixtures; new concept canonicalised on this ingest.
- concepts/data-upload-format-validation — the runtime pre-upload schema check against a third-party mapping API; new concept canonicalised on this ingest.
- patterns/parallel-staging-pipeline-for-prod-verification — the repeatable pattern; new pattern canonicalised on this ingest.
- patterns/monthly-plus-daily-dual-cadence-integrity-check — two cadences, different substrates, different metric sets; new pattern canonicalised on this ingest.
- patterns/schema-validation-pre-upload-via-mapping-api — runtime-per-upload schema check distinct from deploy-time validation (Datadog's pattern); new pattern canonicalised on this ingest.
- patterns/sftp-for-bulk-daily-upload — SFTP over REST for bulk daily upload to a third-party, against the modern default; three named axes (reliability, file-size, setup); new pattern canonicalised on this ingest.
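The four integrity metrics named above (match rate, mismatch, left orphans, duplicates) can be computed with a simple keyed comparison. The precise definitions below are assumptions made for illustration, as are the function name and the sample rows; only the metric names and the 99.99% monthly threshold come from this page.

```python
from collections import Counter

MONTHLY_MATCH_THRESHOLD = 0.9999  # the 99.99% threshold cited for the monthly check


def integrity_report(pipeline_rows, truth_rows):
    """Compare (key, amount) rows from the pipeline against billing-system truth.

    Assumed definitions: a duplicate is an extra pipeline row for a key already
    seen; a left orphan is a pipeline key absent from truth; a mismatch is a
    shared key whose amounts differ; match rate is matched keys / truth keys.
    """
    key_counts = Counter(key for key, _ in pipeline_rows)
    duplicates = sum(n - 1 for n in key_counts.values() if n > 1)
    pipeline = dict(pipeline_rows)  # last value wins for duplicated keys
    truth = dict(truth_rows)
    left_orphans = [k for k in pipeline if k not in truth]
    shared = [k for k in pipeline if k in truth]
    mismatches = [k for k in shared if pipeline[k] != truth[k]]
    matched = len(shared) - len(mismatches)
    match_rate = matched / len(truth) if truth else 1.0
    return {
        "match_rate": match_rate,
        "mismatches": mismatches,
        "left_orphans": left_orphans,
        "duplicates": duplicates,
        "passes": match_rate >= MONTHLY_MATCH_THRESHOLD,
    }


# Hypothetical rows; amounts are integers (cents) to avoid float comparison.
pipeline = [("inv-1", 100), ("inv-2", 250), ("inv-2", 250), ("inv-3", 75), ("inv-9", 10)]
truth = [("inv-1", 100), ("inv-2", 250), ("inv-3", 80), ("inv-4", 40)]
report = integrity_report(pipeline, truth)
```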
Storage / data-engineering axis (2025-09-26)¶
- concepts/s3-server-access-logs — the AWS primitive Yelp operationalises at fleet scale. Best-effort delivery, 25+-field line format, SimplePrefix vs PartitionedPrefix + EventTime delivery options. New concept canonicalised on this ingest.
- concepts/partition-projection — Glue/Athena partitioning primitive that avoids MSCK REPAIR / metastore-lookup overhead; enum vs injected types; 1M partition cap on enum if unconstrained. New concept canonicalised on this ingest.
- concepts/best-effort-log-delivery — the delivery-semantics tier Yelp accepts on SAL; measured < 0.001 % > 2-day late. Load-bearing for why prefix-granularity retention is safe. New concept canonicalised on this ingest.
- concepts/athena-shared-resource-contention — Athena's shared-cluster model with TooManyRequestsException + per-account / per-region DML concurrency quotas; retry-safe job design is mandatory. New concept canonicalised on this ingest.
- concepts/user-controlled-log-fields — the general hazard Yelp documents on SAL (request_uri, referrer, user_agent carry unescaped arbitrary bytes). New concept canonicalised on this ingest.
- concepts/url-encoding-idiosyncrasy-s3-keys — most SAL operations double-encode key; BATCH.DELETE.OBJECT / S3.EXPIRE.OBJECT single-encode. Naive url_decode(url_decode(key)) unsafe. New concept canonicalised on this ingest.
- patterns/raw-to-columnar-log-compaction — daily compact-to-Parquet pattern; Yelp's 85 % storage + 99.99 % object-count reduction is the canonical datapoint. New pattern canonicalised on this ingest.
- patterns/object-tagging-for-lifecycle-expiration — tag each object + tag-based lifecycle policy; the only scalable per-object deletion primitive at fleet scale; composes with S3 Batch Operations PutObjectTagging (Delete is not a supported action). New pattern canonicalised on this ingest.
- patterns/idempotent-athena-insertion-via-left-join — self-LEFT-JOIN on target's unique column to make INSERT INTO ... SELECT retry-safe; partition filters duplicated in ON and WHERE for planner-pruning. New pattern canonicalised on this ingest.
- patterns/projection-partitioning-over-managed-partitions — choose partition projection when prefix template is known; avoids MSCK REPAIR churn + metastore-lookup planning latency. New pattern canonicalised on this ingest.
- patterns/s3-access-based-retention — inventory ⋈ SAL at prefix granularity; equality-join beats LIKE-join (70k rows: 5 min → 2 sec); prefix granularity is what makes best-effort SAL delivery acceptable as access signal. New pattern canonicalised on this ingest.
- patterns/optional-non-capturing-tail-regex — wrap user-controlled tail fields of a log regex in (?:<rest>)? so the non-user-controlled prefix always parses; empty parsed rows are the failure signal. New pattern canonicalised on this ingest.
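The optional non-capturing tail can be demonstrated on a deliberately simplified log line. The field layout below is hypothetical and much shorter than the real S3 server access log format, and the pattern is an illustration of the technique rather than Yelp's regex. In this sketch, tail fields that come back as None are the signal that a line needs a closer look.

```python
import re

# Fields the requester cannot influence (simplified, hypothetical layout).
PREFIX = (r'(?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<remote_ip>\S+) '
          r'(?P<requestid>\S+) (?P<operation>\S+) (?P<key>\S+)')
# Fields that carry requester-supplied text and may contain stray quotes.
TAIL = (r' "(?P<request_uri>[^"]*)" (?P<status>\d{3}) '
        r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"$')

STRICT = re.compile("^" + PREFIX + TAIL)
TOLERANT = re.compile("^" + PREFIX + "(?:" + TAIL + ")?")  # tail is optional

clean = ('logs-bucket [26/Sep/2025:10:00:00 +0000] 10.0.0.1 REQ1 '
         'REST.GET.OBJECT a/b.txt "GET /a/b.txt HTTP/1.1" 200 "-" "curl/8.0"')
hostile = ('logs-bucket [26/Sep/2025:10:00:01 +0000] 10.0.0.2 REQ2 '
           'REST.GET.OBJECT a/b.txt "GET /a/b.txt HTTP/1.1" 200 "-" "x" OR "1"="1"')

ok = TOLERANT.match(clean)
damaged = TOLERANT.match(hostile)

# The strict pattern loses the hostile line entirely; the tolerant one keeps
# the trustworthy prefix and leaves the tail fields unset.
lost = STRICT.match(hostile) is None
kept_requestid = damaged.group("requestid")
tail_missing = damaged.group("user_agent") is None
```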
Ad-tech / experimentation axis (2026-02-02)¶
- concepts/filter-before-ab-test — the experimentation-workflow position Yelp occupies with the Back-Testing Engine: cheap pre-filter (back-testing) before expensive validation (A/B testing); A/B is preserved for final validation rather than discovery. New concept canonicalised on this ingest.
- concepts/hybrid-backtesting-with-ml-counterfactual — the methodology Yelp named explicitly as "not a pure back-testing approach, but rather a hybrid that combines elements of both simulation and back-testing". Historical inputs + alternative code path + ML-predicted counterfactual outcomes. New concept canonicalised on this ingest.
- concepts/counterfactual-outcome-prediction — the sub-concept: CatBoost regressors predict outcomes that never actually happened under the alternative treatment; non-parametric so they capture diminishing returns on budget. New concept canonicalised on this ingest.
- concepts/poisson-sampling-for-integer-outcomes — the trick that converts the regressor's smooth averages into realistic integer counts, restoring live-system stochasticity to the simulation. New concept canonicalised on this ingest.
- concepts/bayesian-optimization-over-parameter-space — sequential model-based optimization; Yelp's default via Scikit-Optimize. Grid + listed are "just wrappers that yield the next candidate". New concept canonicalised on this ingest.
- concepts/overfitting-to-historical-data — the named risk. Yelp's mitigation is organisational (keep A/B tests in the loop), not technical. New concept canonicalised on this ingest.
- patterns/production-code-as-submodule-for-simulation — the fidelity primitive. Budgeting and Billing repos as Git Submodules pointing at branches under test; "blurs the line between prototyping and production". New pattern canonicalised on this ingest.
- patterns/historical-replay-with-ml-outcome-predictor — the full simulation-loop shape (historical inputs + alternative code via submodule + ML outcome predictor + Poisson sampling); generalises to dynamic pricing, recommendation-ranking, bandit-policy domains. New pattern canonicalised on this ingest.
- patterns/yaml-declared-experiment-config — the configuration surface (date range, search space, search strategy, metric, max_evals) Yelp's Back-Testing Engine consumes. New pattern canonicalised on this ingest; sibling of the 2025-02-19 yaml-declared-feature-dag (same Yelp YAML-config discipline applied to different problem).
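The Poisson-sampling step in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not Yelp's implementation: the predicted means are made-up stand-ins for regressor output, the helper names `poisson_sample` and `simulate_counts` are hypothetical, and Knuth's multiplication method is used only so the example needs nothing beyond the standard library.

```python
import math
import random


def poisson_sample(mean, rng):
    """Draw one Poisson-distributed integer using Knuth's multiplication method.

    Suitable for small means; exp(-mean) underflows for very large ones.
    """
    if mean < 0:
        raise ValueError("mean must be non-negative")
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_counts(predicted_means, rng):
    """Turn smooth predicted averages into integer event counts."""
    return [poisson_sample(mean, rng) for mean in predicted_means]


# Hypothetical per-campaign daily averages, standing in for regressor output.
predicted_means = [0.4, 2.3, 5.1]
rng = random.Random(42)  # seeded so a simulation run is repeatable
counts = simulate_counts(predicted_means, rng)
```

Each call returns whole-number counts whose long-run average matches the predicted mean, so repeated runs vary the way real daily outcomes would while staying centred on the prediction.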
Datastore-platform / database-upgrade axis (2026-04-07)¶
- concepts/rolling-upgrade — the upgrade idiom; Yelp's Cassandra 4.x ingest extends the PlanetScale database-tier framing to a gossip-based NoSQL fleet at > 1,000 nodes.
- concepts/in-place-vs-new-dc-upgrade — the architectural choice. New concept canonicalised on this ingest with Yelp's explicit reasoning for rejecting new-DC at fleet scale (time, cost, EACH_QUORUM preservation).
- concepts/mixed-version-cluster — the upgrade state as a named operational concept. New concept canonicalised on this ingest.
- concepts/performance-regression-from-mid-upgrade-state — transient vs genuine regression distinction that Yelp's pre-migration observability dashboards made diagnosable. New concept canonicalised on this ingest.
- concepts/init-container-ip-gossip-pre-migration — the Kubernetes sequencing trick for pods-get-new-IPs deployments (CASSANDRA-19244). New concept canonicalised on this ingest.
- concepts/cassandra-cdc-commit-log — the 3.x → 4.x write-point semantics change (CASSANDRA-12148) that forced Yelp's CDC Connector rewrite. New concept canonicalised on this ingest.
- concepts/schema-disagreement — the distributed-datastore failure mode surfaced on Yelp's CDC-enabled clusters post-upgrade; Yelp's empirical remediation is dummy multi-node schema changes. New concept canonicalised on this ingest.
- concepts/anti-entropy-repair-pause — pre-flight / post-flight bookend of the Cassandra upgrade. New concept canonicalised on this ingest.
- concepts/checkpointed-automation-script — Yelp's upgrade driver runs in auto-proceed or per-step-confirmation mode; tunes risk to cluster-criticality. New concept canonicalised on this ingest.
- concepts/observability-before-migration — extended with the datastore-upgrade application: per-version dashboards caught the Stargate 2.x range-query regression in non-prod before it reached production.
- patterns/version-specific-images-per-git-branch — the core "no-hard-block" lever: 3.11 and 4.1 images published from dedicated Git branches, selected at bootstrap via env var. New pattern canonicalised on this ingest.
- patterns/pre-flight-flight-post-flight-upgrade-stages — the three-stage discipline (reversible gate → commit → reversible restore). New pattern canonicalised on this ingest.
- patterns/dual-run-version-specific-proxies — Stargate fleet split across 3.11-persistence + 4.1-persistence with service-mesh alias, keeping the last 3.11 Cassandra node deliberately on 3.11 until the 3.11 Stargate fleet is drained. New pattern canonicalised on this ingest.
- patterns/benchmark-in-own-environment-before-upgrade — don't trust upstream benchmarks alone. Yelp measured 4% p99 / 11% mean / 11% throughput in its own environment; production later measured up to a 58% p99 reduction on key clusters. New pattern canonicalised on this ingest.
- patterns/production-qualification-criteria-upfront — six-criterion upfront list (perf / functional / security / rollback / observability / component-health). New pattern canonicalised on this ingest.
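The checkpointed, three-stage driver described in this list can be sketched roughly as below. This is an assumption-laden illustration, not Yelp's script: the `Step` type, `run_checkpointed`, and the step names are hypothetical, and the actions only record what they would do.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    stage: str                  # "pre-flight", "flight", or "post-flight"
    name: str
    action: Callable[[], None]


def run_checkpointed(steps: List[Step], auto_proceed: bool,
                     confirm: Callable[[Step], bool]) -> List[str]:
    """Run steps in order; outside auto-proceed mode, ask before each one.

    Returns the names of the steps that actually ran. A declined
    confirmation stops the run at that checkpoint.
    """
    completed = []
    for step in steps:
        if not auto_proceed and not confirm(step):
            break
        step.action()
        completed.append(step.name)
    return completed


log: List[str] = []
# Hypothetical steps; the repair pause and resume bookend the upgrade itself.
steps = [
    Step("pre-flight", "pause-repairs", lambda: log.append("repairs paused")),
    Step("flight", "upgrade-nodes", lambda: log.append("nodes upgraded")),
    Step("post-flight", "resume-repairs", lambda: log.append("repairs resumed")),
]

# Low-criticality cluster: run straight through.
ran_all = run_checkpointed(steps, auto_proceed=True, confirm=lambda s: True)

# High-criticality cluster: the operator declines at the flight stage.
ran_some = run_checkpointed(
    steps, auto_proceed=False, confirm=lambda s: s.stage != "flight"
)
```

Passing the confirmation hook as a callable keeps the same step list usable in both modes, which is one way to tune risk per cluster without forking the script.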
Key model zoo (named by the 2025-02-04 post)¶
- systems/gpt-4 — the formulation-phase LLM; also used to generate golden datasets for distillation.
- systems/o1-preview / systems/o1-mini — reserved for "newer and more complex tasks that require logical reasoning".
- systems/gpt-4o-mini — the fine-tuned offline student; ~100× cost reduction vs. direct GPT-4 prompt at equivalent quality on query-understanding tasks.
- systems/bert / systems/t5 — the realtime tail-query models; production serving tier for never-seen-before queries that miss the cache.
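How these models fit together at serving time can be sketched as a cache lookup with a realtime fallback. The sketch below is illustrative only: `normalize`, `understand_query`, the cache contents, and the stub tail model are all hypothetical, with the dictionary standing in for results pre-computed offline and the stub standing in for a BERT/T5-class realtime model.

```python
from typing import Callable, Dict, Tuple


def normalize(query: str) -> str:
    """Collapse case and whitespace so equivalent queries share a cache key."""
    return " ".join(query.lower().split())


def understand_query(
    query: str,
    head_cache: Dict[str, dict],
    tail_model: Callable[[str], dict],
) -> Tuple[dict, str]:
    """Serve from the pre-computed head cache, else fall back to the tail model."""
    key = normalize(query)
    cached = head_cache.get(key)
    if cached is not None:
        return cached, "head-cache"
    return tail_model(key), "realtime-tail"


# Hypothetical pre-computed entry, standing in for offline batch output.
head_cache = {"pizza near me": {"topic": "pizza", "location": "near me"}}


def stub_tail_model(query: str) -> dict:
    """Placeholder for a small realtime model; treats the whole query as topic."""
    return {"topic": query, "location": None}


hit = understand_query("  Pizza  near me ", head_cache, stub_tail_model)
miss = understand_query("vegan ramen open late", head_cache, stub_tail_model)
```

Returning the serving tier alongside the result makes it easy to measure what fraction of traffic each tier actually handles.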
Recent articles¶
- 2026-04-07 — Zero downtime Upgrade: Yelp's Cassandra 4.x Upgrade Story. Datastore-platform / database-upgrade axis (opens a seventh Yelp axis on the wiki). Yelp Database Reliability Engineering upgraded > 1,000 Cassandra nodes from 3.11 to 4.1 on Kubernetes with zero downtime / zero incidents / zero client-code changes. In-place over new-DC on time + cost + EACH_QUORUM-preservation grounds. Init containers sequence IP + version changes through gossip (CASSANDRA-19244). Dual-run version-specific Stargate fleets span the MigrationCoordinator schema-fetch change. CDC Source Connector split-rollout around the 4.x commit-log write-point change (CASSANDRA-12148). Own-env benchmark: 4% p99 / 11% mean / 11% throughput; production: up to 58% p99 reduction on key clusters. Six-criterion qualification list upfront; three-stage automation script (pre-flight / flight / post-flight) with auto-proceed + per-step confirmation modes. Post-upgrade schema disagreement on CDC-enabled clusters remediated by dummy multi-node schema changes. Presented at KubeCon 2025.
- 2026-02-02 — How Yelp Built a Back-Testing Engine for Safer, Smarter Ad Budget Allocation. Ad-tech / experimentation axis (opens a fifth Yelp axis); eight-component Back-Testing Engine simulating proposed Ad Budget Allocation algorithms against historical campaign data; production code as Git Submodules for fidelity; systems/catboost regressors as counterfactual-outcome predictor with Poisson-sampled integer counts; systems/scikit-optimize Bayesian search over YAML-declared search space; systems/mlflow as experiment store — first non-LLM-evaluation MLflow Seen-in on the wiki.
- 2025-09-26 — S3 server access logs at scale. Storage / data-engineering axis (opens a fourth Yelp axis); TiBs/day of SAL compacted to Parquet (85 % storage + 99.99 % object-count reduction); daily Tron-orchestrated Athena INSERTs with idempotent self-LEFT-JOIN; Glue partition projection with enum over managed partitions; tag-based lifecycle expiration via S3 Batch Operations; weekly access-based retention via inventory ⋈ SAL at prefix granularity; measured SAL best-effort delivery (< 0.001 % > 2-day late).
- 2025-07-08 — Exploring CHAOS: Building a Backend for Server-Driven UI. SDUI / client-platform-framework axis (opens a fourth Yelp axis); CHAOS backend unpacked — Apollo Federation subgraph in Python Strawberry over multiple team-owned REST backends; six-stage FeatureProvider lifecycle run as a two-loop parallel async build; Register-based client capability matching drops features on unsupported clients; JSON-string parameters keep the GraphQL schema stable; per-feature error wrapper drops failed features without sinking the view (unless IS_ESSENTIAL_PROVIDER); advanced primitives view flows and view placeholders.
- 2025-05-27 — Revenue Automation Series: Testing an Integration with Third-Party System. Financial-systems / integration-testing axis; six-step testing strategy for the Revenue Data Pipeline; parallel staging pipeline on Glue+Spectrum bypassing ~10-hour Redshift Connector latency; dual-cadence integrity checks (99.99% contract match threshold); Schema Validation Batch for pre-upload schema drift; SFTP standardised over REST for bulk daily upload.
- 2025-04-15 — Journey to Zero Trust Access. Corporate-security axis; Netbird replaces Ivanti Pulse Secure as the employee VPN; WireGuard mesh topology + router peers for <2s transparent failover; OIDC+device-posture via Okta; upstream contributions to Netbird main.
- 2025-02-19 — Revenue Automation Series: Building Revenue Data Pipeline.
- 2025-02-04 — Search query understanding with LLMs: from ideation to production.
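The idempotent self-LEFT-JOIN insert mentioned in the 2025-09-26 entry can be illustrated with a small sketch. SQLite is swapped in for Athena here purely so the example runs with the standard library; the table names, columns, and the `insert_missing` helper are hypothetical and not taken from Yelp's pipeline.

```python
import sqlite3


def insert_missing(conn: sqlite3.Connection) -> int:
    """Copy rows from raw_logs into compacted, skipping rows already present.

    The LEFT JOIN against the target table plus the IS NULL filter makes
    re-running the job a no-op. Returns the number of rows added.
    """
    before = conn.execute("SELECT COUNT(*) FROM compacted").fetchone()[0]
    conn.execute(
        """
        INSERT INTO compacted (request_id, bucket, log_date)
        SELECT r.request_id, r.bucket, r.log_date
        FROM raw_logs AS r
        LEFT JOIN compacted AS c
          ON c.request_id = r.request_id AND c.log_date = r.log_date
        WHERE c.request_id IS NULL
        """
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM compacted").fetchone()[0]
    return after - before


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_logs (request_id TEXT, bucket TEXT, log_date TEXT)")
conn.execute("CREATE TABLE compacted (request_id TEXT, bucket TEXT, log_date TEXT)")
conn.executemany(
    "INSERT INTO raw_logs VALUES (?, ?, ?)",
    [("r1", "photos", "2025-09-26"),
     ("r2", "photos", "2025-09-26"),
     ("r3", "reviews", "2025-09-26")],
)

first_run = insert_missing(conn)   # copies all three rows
second_run = insert_missing(conn)  # retry of the same day adds nothing
```

Because a retry inserts nothing, a scheduler can safely re-run a failed or partially completed daily job without producing duplicates.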
Related¶
- systems/yelp-query-understanding
- systems/yelp-search
- systems/yelp-revenue-data-pipeline
- systems/yelp-spark-etl
- systems/yelp-billing-system
- systems/netbird
- systems/wireguard
- systems/okta
- systems/pulse-secure
- concepts/query-understanding
- concepts/long-tail-query
- concepts/retrieval-augmented-generation
- concepts/llm-cascade
- concepts/revenue-recognition-automation
- concepts/glossary-dictionary-requirement-translation
- concepts/data-gap-analysis
- concepts/pyspark-udf-for-complex-business-logic
- concepts/spark-etl-feature-dag
- concepts/checkpoint-intermediate-dataframe-debugging
- concepts/mysql-snapshot-to-s3-data-lake
- concepts/zero-trust-authorization
- concepts/vpn-to-zta-migration
- concepts/wireguard-mesh-topology
- concepts/router-peer
- concepts/sso-authentication
- patterns/head-cache-plus-tail-finetuned-model
- patterns/offline-teacher-online-student-distillation
- patterns/daily-mysql-snapshot-plus-spark-etl
- patterns/source-plus-transformation-feature-decomposition
- patterns/business-to-engineering-requirement-translation
- patterns/yaml-declared-feature-dag-topology-inferred
- patterns/oidc-plus-device-posture-access-gate
- patterns/open-source-for-security-response-agency
- patterns/upstream-contribution-parallel-to-in-house-integration
- systems/yelp-staging-pipeline
- systems/yelp-schema-validation-batch
- systems/yelp-redshift-connector
- systems/aws-glue
- systems/amazon-redshift
- systems/amazon-redshift-spectrum
- concepts/staging-pipeline
- concepts/data-integrity-checker
- concepts/redshift-connector-latency
- concepts/test-data-generation-for-edge-cases
- concepts/data-upload-format-validation
- patterns/parallel-staging-pipeline-for-prod-verification
- patterns/monthly-plus-daily-dual-cadence-integrity-check
- patterns/schema-validation-pre-upload-via-mapping-api
- patterns/sftp-for-bulk-daily-upload
- systems/yelp-s3-sal-pipeline
- systems/tron
- systems/aws-s3
- systems/amazon-athena
- systems/apache-parquet
- systems/s3-batch-operations
- systems/s3-inventory
- concepts/s3-server-access-logs
- concepts/partition-projection
- concepts/best-effort-log-delivery
- concepts/athena-shared-resource-contention
- concepts/user-controlled-log-fields
- concepts/url-encoding-idiosyncrasy-s3-keys
- patterns/raw-to-columnar-log-compaction
- patterns/object-tagging-for-lifecycle-expiration
- patterns/idempotent-athena-insertion-via-left-join
- patterns/projection-partitioning-over-managed-partitions
- patterns/s3-access-based-retention
- patterns/optional-non-capturing-tail-regex
- systems/yelp-back-testing-engine
- systems/yelp-ad-budget-allocation
- systems/scikit-optimize
- systems/catboost
- systems/mlflow
- concepts/filter-before-ab-test
- concepts/hybrid-backtesting-with-ml-counterfactual
- concepts/counterfactual-outcome-prediction
- concepts/poisson-sampling-for-integer-outcomes
- concepts/bayesian-optimization-over-parameter-space
- concepts/overfitting-to-historical-data
- patterns/production-code-as-submodule-for-simulation
- patterns/historical-replay-with-ml-outcome-predictor
- patterns/yaml-declared-experiment-config
- systems/yelp-chaos
- systems/apollo-federation
- systems/strawberry-graphql
- concepts/server-driven-ui
- concepts/register-based-client-capability-matching
- concepts/json-string-parameters-for-schema-stability
- patterns/federated-graphql-subgraph-per-domain
- patterns/feature-provider-lifecycle
- patterns/two-loop-parallel-async-build
- patterns/error-isolation-per-feature-wrapper
- patterns/preloaded-view-flow-for-predictable-navigation
- patterns/view-placeholder-async-embed
- systems/apache-cassandra
- systems/stargate-cassandra-proxy
- systems/cassandra-source-connector
- systems/kubernetes-init-containers
- systems/kubernetes
- systems/spark-cassandra-connector
- systems/yelp-pushplan-automation
- concepts/rolling-upgrade
- concepts/mixed-version-cluster
- concepts/in-place-vs-new-dc-upgrade
- concepts/performance-regression-from-mid-upgrade-state
- concepts/init-container-ip-gossip-pre-migration
- concepts/cassandra-cdc-commit-log
- concepts/schema-disagreement
- concepts/anti-entropy-repair-pause
- concepts/checkpointed-automation-script
- concepts/observability-before-migration
- patterns/version-specific-images-per-git-branch
- patterns/pre-flight-flight-post-flight-upgrade-stages
- patterns/dual-run-version-specific-proxies
- patterns/benchmark-in-own-environment-before-upgrade
- patterns/production-qualification-criteria-upfront