Yelp — Revenue Automation Series: Building Revenue Data Pipeline¶
Summary¶
Yelp Engineering post (2025-02-19) by the Commerce Platform / Financial Systems team — second in the Revenue Automation Series after the 2024-12 billing-system modernisation post. Documents how Yelp built a batch Revenue Data Pipeline that feeds a third-party Revenue Recognition SaaS ("REVREC service"), enabling the Accounting team to close the books ~50% faster and unlock real-time revenue forecasting / dashboards. The post's load-bearing content is split across three methodology moves and two technical-implementation moves:
- Ambiguous-requirement translation via a Glossary Dictionary that maps accounting terms (Revenue Contract, Booking, Fair Value, Each Line, Invoice Amount) to Yelp-engineering concepts, plus an explicit Purpose + Example Calculation block per requirement to de-risk cross-functional interpretation. Worked example: "TCV captured during booking as revenue based on fair value allocation by each line will be different from invoice amount" → the engineering requirement "The REVREC service requires a purchased product's gross amount to be sent over whenever a user completes a subscription order; this amount is possible to be different from the amount we actually bill."
- Data Gap Analysis to reconcile Yelp's custom order-to-cash data model against the standard ETL template REVREC expects. Two gap classes: Data Gaps (no direct field mapping — immediate solution is approximation like "use gross billing amount as list price" + composite data like "unique monthly contract ID = product ID + revenue period"; long-term solution is a Product Catalog system) and Inconsistent Product Implementation (data scattered across databases per product — immediate solution is pre-processing into a unified schema + categorising by event type; long-term solution is unified billing data models).
- System design evaluation of four architectures for the reporting substrate, with each rejection explicitly documented:
- MySQL + Python batch (rejected — inconsistent rerun results from mutating production data + slow batch on peak volumes).
- Data Warehouse + dbt (rejected — SQL is insufficient for complex revenue-recognition logic).
- Event Streams + Stream Processing (rejected — "third-party interfaces don't support stream integration"; immediacy not needed).
- Data Lake + Spark ETL (chosen — daily MySQL snapshots → S3 → Spark ETL → REVREC; independent reproducibility, peak-time scalability, strong community).
- Spark ETL pipeline management via an internal `spark-etl` package. Building blocks are Spark Features — web-API-like units with input/transform/output schemas, in two categories: source data snapshot features (read S3 snapshot unchanged; reusable across downstream features) and transformation features (take other features as `pyspark.DataFrame` inputs, apply transformations). Dependency DAG is declared in a YAML config listing only the nodes; the runtime infers edges from each `SparkFeature.sources` field via topological sort — "no need to draw a complex diagram of dependency relationships in the yaml file." Checkpointing intermediate DataFrames to an S3 scratch path via `--checkpoint feat1, feat2, feat3` on `spark_etl_runner.py` enables interactive debugging in Jupyterhub without re-executing expensive upstream features.
- PySpark UDFs for complex business logic that SQL-like expressions cannot express cleanly. Worked example: a discount-application UDF with five business rules — (a) product eligible iff its active period covers the discount's period; (b) one discount per product; (c) Type A before Type B; (d) within type, smaller product ID wins; (e) smaller discount IDs applied first. Implementation is a pure Python function annotated `@udf(ArrayType(DISCOUNT_APPLICATION_SCHEMA))`, applied to DataFrames pre-grouped by `customer_id`, then `explode()`-ed to normalised rows. Canonical framing: "Without using UDFs, implementing this logic would require multiple window functions, which can be hard to read and maintain."
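The five rules are concrete enough to sketch. A minimal pure-Python reading of them, of the kind `@udf` would wrap (the pyspark decorator, `DISCOUNT_APPLICATION_SCHEMA`, and all field names are illustrative assumptions, not Yelp's code):

```python
# Hypothetical reading of the five discount-application rules as the pure
# Python function that @udf(ArrayType(DISCOUNT_APPLICATION_SCHEMA)) would
# wrap. Runs on plain dicts here; all field names are illustrative.

def apply_discounts(products, discounts):
    """products and discounts: dicts with id, start, end (+ type on discounts)."""
    applied = []
    discounted = set()  # rule (b): at most one discount per product
    # rules (c) + (e): Type A before Type B, then smaller discount ID first
    for disc in sorted(discounts, key=lambda d: (d["type"], d["id"])):
        # rule (a): eligible iff the product's active period covers the
        # discount's period
        eligible = [
            p for p in products
            if p["id"] not in discounted
            and p["start"] <= disc["start"]
            and disc["end"] <= p["end"]
        ]
        if eligible:
            winner = min(eligible, key=lambda p: p["id"])  # rule (d)
            discounted.add(winner["id"])
            applied.append({"product_id": winner["id"],
                            "discount_id": disc["id"]})
    return applied
```

One plausible ordering that satisfies (c) and (e) is sorting discounts by (type, discount ID) and letting the smallest eligible product ID win each one; the real UDF returns an array of structs per `customer_id` group that is then `explode()`-ed back to rows.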
Key takeaways¶
- Translating accountant requirements to engineering requirements is a first-class project phase, not an ad-hoc chat. Yelp formalised it as a three-step methodology — Glossary Dictionary mapping business ↔ engineering vocabulary, Purpose statement + Example Calculation per requirement, then an engineering rewording that downstream pipeline code can be validated against. (Source: sources/2025-02-19-yelp-revenue-automation-series-building-revenue-data-pipeline)
- Data Gap Analysis separates immediate approximations from long-term structural fixes. "Use gross billing amount as list price" is an immediate approximation that unblocks REVREC integration today; the long-term fix is a dedicated Product Catalog system for common product attributes. This dual-horizon output format is portable to any third-party-integration analysis. (Source: sources/2025-02-19-yelp-revenue-automation-series-building-revenue-data-pipeline)
- Data Lake + Spark ETL beats three alternatives for revenue-recognition reporting — all three rejections are load-bearing. MySQL+Python fails on reproducibility (production data changes mid-rerun) and peak-volume throughput. Warehouse+dbt fails on expressing complex business logic in SQL. Event Streams fails on both "immediate presentation not necessary" and "third-party interfaces don't support stream integration." The winner wins on independent reproducibility (daily snapshot → same input produces same output), peak-time scalability, and community support. (Source: sources/2025-02-19-yelp-revenue-automation-series-building-revenue-data-pipeline)
- Source-feature + transformation-feature decomposition keeps Spark DAGs maintainable. Source features are thin wrappers over `S3PublishedSource` that pass data through unchanged; transformation features consume other features via their `.sources` dict. This separation means raw snapshots are loaded once and reused across downstream features, and the DAG shape is encoded in Python class definitions rather than YAML edges. (Source: sources/2025-02-19-yelp-revenue-automation-series-building-revenue-data-pipeline)
- Declare DAG nodes in YAML; let the runtime infer the edges. Yelp's `spark-etl` reads only a flat list of feature aliases + class paths from YAML, then topologically sorts them at runtime from each feature's `SparkFeature.sources`. Edge-free YAML stays DRY — adding a new feature means adding one YAML line, not editing a graph. (Source: sources/2025-02-19-yelp-revenue-automation-series-building-revenue-data-pipeline)
- Checkpoint intermediate DataFrames to a scratch path to make lazy-evaluation Spark debuggable. `--checkpoint feat1, feat2` materialises named features to S3 scratch; subsequent reads in Jupyterhub are fast and shareable across the team. This substitutes for the interactive debugger that Spark's distributed + lazy model makes impractical. (Source: sources/2025-02-19-yelp-revenue-automation-series-building-revenue-data-pipeline)
- PySpark UDFs are the correct tool when business logic exceeds what SQL can express cleanly. Multi-priority discount application with type + ID tie-breaks is readable as a ~10-line Python function; rewriting it as window functions would be "hard to read and maintain." The cost is documented: UDFs "are expensive to run and can penalize performance and reliability of the job over time" — Yelp lists UDF removal via simplified source data models as a future-improvement axis. (Source: sources/2025-02-19-yelp-revenue-automation-series-building-revenue-data-pipeline)
- The post names its own ceiling: 50+ Spark features in a 2-staged job is a maintenance risk. Adding a new product requires changes + testing across the whole job. Future direction: feature-team-owned standardised data interfaces for offline consumption, simplified source data models that reduce UDF need, unified implementation across products to shrink input-table count. (Source: sources/2025-02-19-yelp-revenue-automation-series-building-revenue-data-pipeline)
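The node-list-plus-inferred-edges scheme in these takeaways reduces to an ordinary topological sort over each feature's declared sources. A minimal sketch with hypothetical feature aliases (not Yelp's `spark-etl` internals):

```python
# Nodes come from a flat YAML-style list; edges are recovered from each
# feature's declared sources via topological sort. Aliases are illustrative.
from graphlib import TopologicalSorter

FEATURES = {
    # alias -> upstream aliases (what a SparkFeature.sources field would name)
    "orders_snapshot": [],
    "invoices_snapshot": [],
    "unified_billing": ["orders_snapshot", "invoices_snapshot"],
    "revenue_lines": ["unified_billing"],
}

def execution_order(features):
    # graphlib expects node -> predecessors, which matches a sources
    # mapping exactly; static_order() yields a valid run order
    return list(TopologicalSorter(features).static_order())
```

Adding a feature means adding one entry to `FEATURES` (one YAML line in Yelp's setup); no edge list is maintained anywhere.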
Systems extracted¶
- systems/yelp-revenue-data-pipeline — the named pipeline itself: MySQL snapshot → S3 → Spark ETL → REVREC service.
- systems/yelp-spark-etl — Yelp's internal package wrapping Spark feature definitions + YAML-declared DAG config + topological execution + checkpointing.
- systems/yelp-billing-system — the upstream source-of-truth for revenue contracts and invoices (covered in the first post of the Revenue Automation Series, 2024-12; referenced as prerequisite).
- systems/apache-spark — the underlying distributed dataflow engine; extended with a Yelp-specific production instance altitude (revenue-recognition reporting).
- systems/aws-s3 — the snapshot + publish + scratch substrate. Features read from `s3a://` paths; checkpoints are written to a scratch `s3a://` prefix.
- systems/jupyterhub — the post-facto checkpoint inspection surface.
Concepts extracted¶
- concepts/revenue-recognition-automation — the accounting-process primitive the whole pipeline exists to serve. Defined by external resources but canonicalised here with Yelp's concrete operational win ("close the books up to 50% faster") + real-time forecasting as downstream benefits.
- concepts/glossary-dictionary-requirement-translation — the three-step methodology (Glossary Dictionary → Purpose + Example → Engineering Requirement) for translating accountant / legal / business-analyst requirements into engineering-consumable ones.
- concepts/data-gap-analysis — the two-axis output format (immediate approximation / composite data vs long-term structural fix) for a third-party-integration data-mapping analysis.
- concepts/pyspark-udf-for-complex-business-logic — when SQL-like DataFrame expressions (`select`, `filter`, `join`) can't express multi-priority tie-break logic cleanly, a Python UDF is the preferred replacement over stacked window functions.
- concepts/spark-etl-feature-dag — the Spark-feature abstraction (web-API-like unit with input/transform/output schemas) + source/transformation categorisation.
- concepts/checkpoint-intermediate-dataframe-debugging — materialising intermediate Spark DataFrames to a scratch S3 path as a substitute for breakpoint-based debugging of lazy-evaluation distributed programs.
- concepts/mysql-snapshot-to-s3-data-lake — the reproducibility primitive: freeze the operational database's state into an immutable object-storage snapshot daily, then run downstream ETL against that snapshot. Same input → same output across reruns.
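For concepts/checkpoint-intermediate-dataframe-debugging, the flag-gated materialisation could look roughly like the following pure-Python stand-in (flag shape, scratch prefix, and the `write_parquet` hook are assumptions; in real PySpark the hook would be `df.write.parquet(path)`):

```python
# Hypothetical stand-in for a --checkpoint flag on an ETL runner: only the
# named features are materialised to a scratch prefix for later inspection.
import argparse

SCRATCH = "s3a://scratch-bucket/revenue-etl"  # hypothetical scratch prefix

def parse_checkpoints(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--checkpoint", default="",
                        help="comma-separated feature aliases to materialise")
    args = parser.parse_args(argv)
    return {alias.strip() for alias in args.checkpoint.split(",") if alias.strip()}

def maybe_checkpoint(alias, df, wanted, write_parquet):
    # write_parquet stands in for df.write.parquet(path) in real PySpark
    if alias in wanted:
        write_parquet(df, f"{SCRATCH}/{alias}")
```

A Jupyterhub session then reads the scratch path back (e.g. with `spark.read.parquet`), skipping every upstream feature.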
Patterns extracted¶
- patterns/daily-mysql-snapshot-plus-spark-etl — the chosen architecture: MySQL tables snapshotted daily to S3, processed downstream by Spark ETL jobs. Generalisable to any OLTP-to-analytics pipeline where immediate latency isn't needed.
- patterns/source-plus-transformation-feature-decomposition — decompose a Spark pipeline into (a) source-snapshot features that wrap raw inputs without transformation and (b) transformation features that consume other features. Keeps raw loads reusable and DAG shape encoded in Python class definitions.
- patterns/business-to-engineering-requirement-translation — the reusable Glossary Dictionary + Purpose + Example + Engineering Rewording playbook; lifts out of the revenue-recognition domain to any cross-functional requirement handoff.
- patterns/yaml-declared-feature-dag-topology-inferred — list features in YAML config without their edges; let the runtime topologically sort from each feature's declared sources. Keeps config DRY and makes adding features a one-line change.
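The source/transformation split in patterns/source-plus-transformation-feature-decomposition can be sketched as two tiny classes; names mimic the post's API shape but the implementation here is assumed, with plain lists standing in for DataFrames:

```python
# Sketch of the two feature categories: a source feature passes a snapshot
# through unchanged; a transformation feature consumes other features via
# its sources mapping. Class and field names are illustrative assumptions.

class SparkFeature:
    sources: dict = {}          # alias -> upstream feature class

    def transform(self, inputs):
        raise NotImplementedError

class OrdersSnapshot(SparkFeature):
    # source feature: thin wrapper over the raw snapshot, no transformation
    def transform(self, inputs):
        return inputs["raw"]    # in real life: a DataFrame read from s3a://

class OrdersByEventType(SparkFeature):
    # transformation feature: declares upstream features, reshapes their output
    sources = {"orders": OrdersSnapshot}

    def transform(self, inputs):
        grouped = {}
        for row in inputs["orders"]:
            grouped.setdefault(row["event_type"], []).append(row)
        return grouped
```

Because `OrdersSnapshot` does no work of its own, any number of downstream transformation features can share its single load, which is the reuse property the pattern is after.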
Operational numbers and scale datums¶
- ~50% faster book-close — Accounting-team cycle time reduction from real-time reconciliation vs prior manual process. Load-bearing business-value datum.
- 50+ Spark features in the 2-staged production pipeline (post's disclosure of the current maintenance burden).
- Discount-application UDF worked example: 5 business rules expressed in ~10 lines of Python; equivalent with window functions would be "hard to read and maintain" by Yelp's assessment.
- Pipeline stages: 2 (noted as a maintenance concern — adding a new product requires changes + testing across the whole job).
- Not disclosed: daily record volumes, S3 snapshot size, Spark cluster size, job runtime, REVREC API throughput, number of products / subscription SKUs, TCV magnitudes. The post focuses on methodology + structural-architecture decisions rather than throughput / latency numbers.
Cross-references¶
- Sibling to sources/2025-02-04-yelp-search-query-understanding-with-llms (the wiki's first Yelp ingest, 2 weeks earlier) at a very different stack altitude — search-query-understanding / LLM-serving-infra vs financial-systems batch ETL. Together they establish Yelp's multi-axis engineering output: the search post was the LLM / serving-infra axis; this post opens the data-platform / financial-systems axis.
- Sibling to sources/2024-08-01-segment-0-6m-year-savings-by-using-s3-for-change-data-capture-for-dynamodb at data-lake-for-reproducibility altitude — both move data out of the mutable operational database into an immutable object-storage substrate for downstream processing. Segment does it via CDC for cost savings ($600K/year); Yelp does it via daily snapshots for rerun reproducibility + peak-time scalability.
- Sibling to sources/2024-04-29-canva-scaling-to-count-billions (referenced via concepts/elt-vs-etl) at ELT-vs-ETL altitude — both teams evaluated warehouse + dbt and found it lacking, but for opposite reasons: Canva adopted warehouse + dbt because it reduced code volume and simplified aggregations; Yelp rejected warehouse + dbt because revenue-recognition logic exceeded SQL's expressive power. Different workload shapes drive different answers.
- Earlier in the Revenue Automation Series: 2024-12 legacy-billing-system modernisation (referenced in the body as prerequisite; not separately ingested yet).
- Series continuation: the post closes by flagging a third post in the series on "ensuring data integrity, reconciling data discrepancies and all the learnings from working with 3rd party systems." Stay tuned.
Caveats¶
- Tier-3 on-scope — Yelp is Tier-3, which per AGENTS.md requires strict filtering. This post passes decisively on grounds of (a) concrete system-design evaluation (four architectures weighed, three rejected with load-bearing reasons), (b) architectural density (Data Lake + Spark ETL mechanism — YAML-declarative DAG + source/transformation feature decomposition + checkpointing debugging + UDF discipline), (c) reusable methodology (Glossary Dictionary + Data Gap Analysis as portable playbooks).
- Operational numbers are sparse — the 50% book-close speed-up is the only hard quantitative business metric. No throughput, latency, cost, or scale disclosures on the pipeline itself. This reflects financial-systems posts generally leaning on methodology over benchmarks; subsequent posts in the series may fill the gap.
- "REVREC service" is unnamed — the post treats it as a black-box third-party SaaS; no vendor name (likely one of Zuora RevPro, NetSuite ARM, Sage Intacct, or similar revenue-recognition SaaS given Yelp's scale and the 2024-2025 timeframe), no API shape, no SLA. Downstream integration specifics (required templates, field mappings beyond example) are abstracted.
- Series is partial at ingest time — the third post on "data integrity, reconciling data discrepancies" is flagged as forthcoming. Anything this wiki page claims about reconciliation mechanisms is from this one post only.
- Stack specifics are minimised — Yelp's internal `spark-etl` package is not open-source; the code samples (`SparkFeature` base class, `S3PublishedSource` source type, `ConfiguredSparkFeature` composition) show the API shape but not the implementation. A consumer of this wiki page can rebuild the pattern from the shape, but not adopt Yelp's code directly.
- Future-improvements section names the limits — 50+ features is "really challenging for a single team to maintain"; UDFs are "expensive to run and can penalize performance and reliability". Yelp explicitly flags the architecture as evolving — the canonicalisation here captures the current state, not the target.
Source¶
- Original: https://engineeringblog.yelp.com/2025/02/revenue-automation-series-building-revenue-data-pipeline.html
- Raw markdown: `raw/yelp/2025-02-19-revenue-automation-series-building-revenue-data-pipeline-e4eb51eb.md`
Related¶
- companies/yelp
- systems/yelp-revenue-data-pipeline
- systems/yelp-spark-etl
- systems/yelp-billing-system
- systems/apache-spark
- systems/aws-s3
- systems/jupyterhub
- concepts/revenue-recognition-automation
- concepts/glossary-dictionary-requirement-translation
- concepts/data-gap-analysis
- concepts/pyspark-udf-for-complex-business-logic
- concepts/spark-etl-feature-dag
- concepts/checkpoint-intermediate-dataframe-debugging
- concepts/mysql-snapshot-to-s3-data-lake
- concepts/elt-vs-etl
- concepts/data-lakehouse
- patterns/daily-mysql-snapshot-plus-spark-etl
- patterns/source-plus-transformation-feature-decomposition
- patterns/business-to-engineering-requirement-translation
- patterns/yaml-declared-feature-dag-topology-inferred