
Yelp Revenue Data Pipeline

Definition

Yelp Revenue Data Pipeline is the named batch data pipeline that collects revenue data from Yelp's ecosystem (billing system, order-to-cash system, product catalogs) and delivers it to a third-party Revenue Recognition SaaS ("REVREC service") for automated revenue recognition. Centerpiece of Yelp's multi-year Revenue Automation project.

Architecture (from 2025-02-19 post)

MySQL operational tables
       │ (daily snapshot)
       ▼
AWS S3 (data lake, snapshot prefix)
       │
       ▼
[Spark ETL](<./yelp-spark-etl.md>) (Yelp's spark-etl package)
       │   — source snapshot features (read raw)
       │   — transformation features (aggregate, join,
       │     apply UDFs for revenue logic, map templates)
       ▼
AWS S3 (publish prefix, REVREC-template shape)
       │
       ▼
REVREC service (3rd-party SaaS)
       │
       ▼
Accounting team: books close + real-time revenue forecasts

Why the pipeline exists

The REVREC service needs clean, REVREC-template-shaped revenue data to do its job. Yelp's operational systems emit that data in native shapes that don't match, so the Revenue Data Pipeline is the bridge that reshapes Yelp-native data into REVREC-template-shaped data.

Three benefits attributed to REVREC adoption (from the post):

  • Recognise any revenue stream (one-time purchases + subscriptions, flat + variable pricing) with minimal cost / risk.
  • Close the books up to 50% faster via real-time revenue reconciliation.
  • Real-time revenue forecasting with out-of-the-box reports + dashboards.

Substrate choice: Data Lake + Spark ETL

Chosen after rejecting three alternatives (see sources/2025-02-19-yelp-revenue-automation-series-building-revenue-data-pipeline for the full evaluation):

| Option | Verdict |
| --- | --- |
| MySQL + Python Batch | Rejected: inconsistent rerun results (production data mutates mid-rerun); slow batch processing at peak volumes. |
| Data Warehouse + dbt | Rejected: complex revenue-recognition logic "difficult to represent in SQL." |
| Event Streams + Stream Processing | Rejected: immediate data presentation not necessary; the third-party REVREC interfaces don't support stream integration. |
| Data Lake + Spark ETL | Chosen: independent reproducibility (daily snapshots are immutable, so same input gives same output); peak-time scalability; strong community support. |

The key differentiator versus MySQL+Python is daily snapshot reproducibility — reruns against a frozen snapshot produce identical output regardless of when they run.
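
The reproducibility argument can be sketched in plain Python. All names here are hypothetical (the bucket layout and revenue logic are invented for illustration, not Yelp's actual prefixes or code): the point is that a rerun keyed to a frozen snapshot date always reads the same bytes, so the output is a pure function of the date.

```python
from datetime import date

def snapshot_prefix(table: str, snapshot_date: date) -> str:
    """Build a (hypothetical) immutable S3 prefix for one daily snapshot.

    Because the prefix is keyed by snapshot date and snapshots are never
    mutated after landing, every rerun for the same date reads identical
    input, unlike querying live MySQL tables that mutate mid-rerun.
    """
    return f"s3://revenue-lake/snapshots/{table}/dt={snapshot_date:%Y-%m-%d}/"

def recognize_revenue(rows):
    """Deterministic transformation: total revenue per program."""
    totals = {}
    for row in rows:
        totals[row["program"]] = totals.get(row["program"], 0) + row["amount"]
    return totals

# Two reruns of the same logical day resolve to the same frozen input...
assert snapshot_prefix("billing_invoices", date(2025, 2, 19)) == \
    snapshot_prefix("billing_invoices", date(2025, 2, 19))

# ...and a deterministic transform over frozen rows gives identical output.
frozen_snapshot = [{"program": "ads", "amount": 100},
                   {"program": "ads", "amount": 50}]
assert recognize_revenue(frozen_snapshot) == recognize_revenue(frozen_snapshot)
```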

Two-stage Spark ETL structure

The post discloses that the pipeline runs as a two-stage Spark job comprising 50+ Spark features; the team names both numbers as a current maintenance risk (see Future Improvements below).

Stages are:

  1. Source data snapshot features — read raw MySQL snapshots from S3; pass through unchanged.
  2. Transformation features — consume source features (or other transformation features); apply filtering, projection, joins, UDF-based revenue-calculation logic, and finally map the output to REVREC-template schemas.
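
The two-stage structure can be sketched in plain Python (function names, fields, and logic are hypothetical; the real features run as Spark jobs over S3 data):

```python
# Stage 1: a source snapshot feature passes raw snapshot rows through unchanged.
def source_invoices(snapshot):
    """Source feature: read the raw MySQL snapshot as-is, no transformation."""
    return list(snapshot)

# Stage 2: a transformation feature consumes upstream features and applies
# filtering, revenue logic, and mapping to a (hypothetical) REVREC-template shape.
def recognized_revenue(invoices):
    out = []
    for inv in invoices:
        if inv["status"] != "paid":                  # filtering
            continue
        out.append({                                 # REVREC-template mapping
            "contract_id": inv["contract_id"],
            "revenue_amount": inv["amount"] - inv.get("discount", 0),
        })
    return out

raw = [
    {"contract_id": "c1", "status": "paid", "amount": 100, "discount": 10},
    {"contract_id": "c2", "status": "void", "amount": 40},
]
published = recognized_revenue(source_invoices(raw))
assert published == [{"contract_id": "c1", "revenue_amount": 90}]
```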

Known implementation details

  • Snapshot frequency: daily.
  • Processing engine: Apache Spark via PySpark.
  • Orchestration: internal spark-etl package + YAML config, with a topological-sort runtime (see systems/yelp-spark-etl).
  • Debugging: checkpointing intermediate DataFrames to an S3 scratch prefix; loaded in systems/jupyterhub for post-facto inspection.
  • Complex business logic (e.g. multi-priority discount application): PySpark UDFs (see concepts/pyspark-udf-for-complex-business-logic).
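
The topological-sort runtime can be illustrated with a small dependency resolver. The feature DAG below is invented for illustration (the real config lives in spark-etl's YAML; see systems/yelp-spark-etl):

```python
from graphlib import TopologicalSorter

# Hypothetical feature DAG, as it might look after parsing the YAML config:
# each feature lists the upstream features it consumes.
feature_deps = {
    "source_invoices": [],
    "source_contracts": [],
    "joined_billing": ["source_invoices", "source_contracts"],
    "recognized_revenue": ["joined_billing"],
}

# Running features in topological order guarantees every feature's
# inputs have been produced before it executes.
order = list(TopologicalSorter(feature_deps).static_order())
assert order.index("source_invoices") < order.index("joined_billing")
assert order[-1] == "recognized_revenue"
```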

Future improvements (named by the post)

  • Enhanced Data Interfaces and Ownership — feature teams own + maintain standardised data interfaces for offline consumption, decoupling reporting from implementation.
  • Simplified Data Models — reduce UDF count by simplifying source data so SQL-like PySpark expressions suffice.
  • Unified Implementation — standardise schemas across products to shrink input-table count and collapse DAG stages.
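
The "Simplified Data Models" idea can be illustrated in plain Python (both snippets are invented examples): when discount priority is pre-resolved upstream, a branching UDF collapses into a simple SQL-like column expression.

```python
# Before: complex source data forces a UDF-style function that re-derives
# which discount applies, priority by priority (cf. multi-priority
# discount application).
def applied_discount_udf(row):
    for kind in ("contractual", "promotional", "loyalty"):  # priority order
        if row.get(f"{kind}_discount"):
            return row[f"{kind}_discount"]
    return 0

# After: the simplified source model pre-resolves a single `discount`
# column upstream, so the feature reduces to `amount - discount`.
complex_row = {"amount": 100, "promotional_discount": 15}
simple_row = {"amount": 100, "discount": 15}

assert complex_row["amount"] - applied_discount_udf(complex_row) == 85
assert simple_row["amount"] - simple_row["discount"] == 85
```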

Caveats

  • "REVREC service" is unnamed; the post treats the third-party SaaS as a black box. Likely candidates at 2024-2025 Yelp scale include Zuora RevPro, NetSuite ARM, Sage Intacct, and Workiva, but none is confirmed.
  • No throughput / latency disclosures — the post emphasises methodology over benchmarks. Daily record volume, Spark cluster size, runtime, and cost are not published.
  • The 50+ features across two stages are a named risk: adding a new product requires changes and testing across the whole job. Yelp's future-improvements section explicitly flags this as unsustainable at the next step of product-catalog growth.

Testing + verification (from 2025-05-27 post)

The 2025-05-27 Revenue Automation Series post on integration testing describes the verification machinery around this pipeline. The same pipeline code runs in two configurations:

  • Production — output published via Redshift Connector to Redshift tables; consumed by BI + REVREC upload.
  • Staging — the parallel Yelp Staging Pipeline consuming the same production data with code-under-test, publishing to AWS Glue tables queryable via Redshift Spectrum. Enables same-day verification that bypasses the ~10-hour Redshift Connector latency.
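
The two configurations share pipeline code and differ mainly in the publish target, which could be modelled roughly as follows (this config shape is hypothetical, not Yelp's actual spark-etl configuration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PipelineConfig:
    """Same pipeline code, two publish targets (illustrative model)."""
    env: str
    publish_target: str   # where output lands
    query_path: str       # how verifiers read it back

PRODUCTION = PipelineConfig(
    env="production",
    publish_target="redshift",        # via Redshift Connector (~10 h latency)
    query_path="Redshift tables",
)
STAGING = PipelineConfig(
    env="staging",
    publish_target="glue",            # AWS Glue tables
    query_path="Redshift Spectrum",   # queryable same-day
)

assert PRODUCTION.publish_target != STAGING.publish_target
```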

Integrity is checked at two cadences (see patterns/monthly-plus-daily-dual-cadence-integrity-check):

  • Daily (over staging Glue tables via Redshift Spectrum): lightweight pipeline-internal invariants — zero negative revenues, zero unknown programs, count of contracts missing parent category, etc.
  • Monthly (over Redshift tables against billing-system truth): the full four-metric suite with a 99.99% contract-invoice match threshold.
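
The daily invariants can be sketched as checks over published rows (field names are hypothetical; the real checks run as queries over the staging Glue tables via Redshift Spectrum):

```python
def daily_invariants(rows):
    """Lightweight pipeline-internal checks from the daily cadence:
    no negative revenue, no unknown programs, plus a count (not a hard
    failure) of contracts missing a parent category."""
    violations = []
    missing_parent = 0
    for row in rows:
        if row["revenue"] < 0:
            violations.append(f"negative revenue: {row['contract_id']}")
        if row["program"] == "unknown":
            violations.append(f"unknown program: {row['contract_id']}")
        if row.get("parent_category") is None:
            missing_parent += 1
    return violations, missing_parent

rows = [
    {"contract_id": "c1", "revenue": 90, "program": "ads",
     "parent_category": "advertising"},
    {"contract_id": "c2", "revenue": -5, "program": "unknown",
     "parent_category": None},
]
violations, missing = daily_invariants(rows)
assert violations == ["negative revenue: c2", "unknown program: c2"]
assert missing == 1
```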

Output to the third-party REVREC system is guarded by a Schema Validation Batch that polls the REVREC mapping API before every upload and aborts on mismatch. Standard delivery is SFTP (4-5 files/day at 500k-700k records each), adopted after the REST interface proved flaky and its 50k-records/file cap meant roughly 15 files/day.
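
The pre-upload guard can be sketched as below. The mapping-API response shape and function names are assumptions (the post does not document the REVREC API); the record-per-file numbers come from the post.

```python
def validate_schema(local_columns, revrec_mapping):
    """Abort the upload when the local publish schema no longer matches
    the columns the REVREC mapping API expects (hypothetical shape)."""
    expected = set(revrec_mapping["columns"])
    actual = set(local_columns)
    if actual != expected:
        raise RuntimeError(
            f"schema mismatch: missing={expected - actual}, "
            f"extra={actual - expected}"
        )

def chunk_for_sftp(records, max_per_file=700_000):
    """Split a day's records into SFTP files within the observed
    500k-700k records/file range (vs REST's 50k cap, ~15 files/day)."""
    return [records[i:i + max_per_file]
            for i in range(0, len(records), max_per_file)]

# Matching schemas pass silently; 1.5M records fit in 3 SFTP files
# (the same day over REST would need 30 files at the 50k cap).
validate_schema(["contract_id", "revenue"],
                {"columns": ["contract_id", "revenue"]})
assert len(chunk_for_sftp(list(range(1_500_000)))) == 3
```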
