
Yelp Revenue Data Pipeline

Definition

Yelp Revenue Data Pipeline is the named batch data pipeline that collects revenue data from Yelp's ecosystem (billing system, order-to-cash system, product catalogs) and delivers it to a third-party Revenue Recognition SaaS ("REVREC service") for automated revenue recognition. Centerpiece of Yelp's multi-year Revenue Automation project.

Architecture (from 2025-02-19 post)

MySQL operational tables
       │ (daily snapshot)
       ▼
AWS S3 (data lake, snapshot prefix)
       │
       ▼
[Spark ETL](<./yelp-spark-etl.md>) (Yelp's spark-etl package)
       │   — source snapshot features (read raw)
       │   — transformation features (aggregate, join,
       │     apply UDFs for revenue logic, map templates)
       ▼
AWS S3 (publish prefix, REVREC-template shape)
       │
       ▼
REVREC service (3rd-party SaaS)
       │
       ▼
Accounting team: books close + real-time revenue forecasts

Why the pipeline exists

The REVREC service needs clean, REVREC-template-shaped revenue data to do its job. Yelp's operational systems emit that data in native shapes that don't match, so the Revenue Data Pipeline is the bridge that reshapes Yelp-native data into REVREC-template-shaped data.

Three benefits attributed to REVREC adoption (from the post):

  • Recognise any revenue stream (one-time purchases + subscriptions, flat + variable pricing) with minimal cost / risk.
  • Close the books up to 50% faster via real-time revenue reconciliation.
  • Real-time revenue forecasting with out-of-the-box reports + dashboards.

Substrate choice: Data Lake + Spark ETL

Chosen after rejecting three alternatives (see sources/2025-02-19-yelp-revenue-automation-series-building-revenue-data-pipeline for the full evaluation):

| Option | Verdict |
| --- | --- |
| MySQL + Python Batch | Rejected: inconsistent rerun results (production data mutates mid-rerun); slow batch processing at peak volumes. |
| Data Warehouse + dbt | Rejected: complex revenue-recognition logic "difficult to represent in SQL." |
| Event Streams + Stream Processing | Rejected: immediate data presentation not necessary; the third-party REVREC interfaces don't support stream integration. |
| Data Lake + Spark ETL | Chosen: independent reproducibility (daily snapshots are immutable, so same input gives same output); peak-time scalability; strong community support. |

The key differentiator versus MySQL+Python is daily snapshot reproducibility — reruns against a frozen snapshot produce identical output regardless of when they run.
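
The reproducibility argument can be sketched in plain Python. All names here are hypothetical (the bucket layout and revenue logic are invented for illustration, not Yelp's actual prefixes or code): the point is that a rerun keyed to a frozen snapshot date always reads the same bytes, so the output is a pure function of the date.

```python
from datetime import date

def snapshot_prefix(table: str, snapshot_date: date) -> str:
    """Build a (hypothetical) immutable S3 prefix for one daily snapshot.

    Because the prefix is keyed by snapshot date and snapshots are never
    mutated after landing, every rerun for the same date reads identical
    input, unlike querying live MySQL tables that mutate mid-rerun.
    """
    return f"s3://revenue-lake/snapshots/{table}/dt={snapshot_date:%Y-%m-%d}/"

def recognize_revenue(rows):
    """Deterministic transformation: total revenue per program."""
    totals = {}
    for row in rows:
        totals[row["program"]] = totals.get(row["program"], 0) + row["amount"]
    return totals

# Two reruns of the same logical day resolve to the same frozen input...
assert snapshot_prefix("billing_invoices", date(2025, 2, 19)) == \
    snapshot_prefix("billing_invoices", date(2025, 2, 19))

# ...and a deterministic transform over frozen rows gives identical output.
frozen_snapshot = [{"program": "ads", "amount": 100},
                   {"program": "ads", "amount": 50}]
assert recognize_revenue(frozen_snapshot) == recognize_revenue(frozen_snapshot)
```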

Two-stage Spark ETL structure

The post discloses that the pipeline runs as a two-stage Spark job comprising 50+ Spark features; the team names both numbers as a current maintenance risk (see Future Improvements below).

Stages are:

  1. Source data snapshot features — read raw MySQL snapshots from S3; pass through unchanged.
  2. Transformation features — consume source features (or other transformation features); apply filtering, projection, joins, UDF-based revenue-calculation logic, and finally map the output to REVREC-template schemas.
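
The two-stage structure can be sketched in plain Python (function names, fields, and logic are hypothetical; the real features run as Spark jobs over S3 data):

```python
# Stage 1: a source snapshot feature passes raw snapshot rows through unchanged.
def source_invoices(snapshot):
    """Source feature: read the raw MySQL snapshot as-is, no transformation."""
    return list(snapshot)

# Stage 2: a transformation feature consumes upstream features and applies
# filtering, revenue logic, and mapping to a (hypothetical) REVREC-template shape.
def recognized_revenue(invoices):
    out = []
    for inv in invoices:
        if inv["status"] != "paid":                  # filtering
            continue
        out.append({                                 # REVREC-template mapping
            "contract_id": inv["contract_id"],
            "revenue_amount": inv["amount"] - inv.get("discount", 0),
        })
    return out

raw = [
    {"contract_id": "c1", "status": "paid", "amount": 100, "discount": 10},
    {"contract_id": "c2", "status": "void", "amount": 40},
]
published = recognized_revenue(source_invoices(raw))
assert published == [{"contract_id": "c1", "revenue_amount": 90}]
```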

Known implementation details

  • Snapshot frequency: daily.
  • Processing engine: Apache Spark via PySpark.
  • Orchestration: internal spark-etl package + YAML config, with a topological-sort runtime (see systems/yelp-spark-etl).
  • Debugging: checkpointing intermediate DataFrames to an S3 scratch prefix; loaded in systems/jupyterhub for post-facto inspection.
  • Complex business logic (e.g. multi-priority discount application): PySpark UDFs (see concepts/pyspark-udf-for-complex-business-logic).
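
The topological-sort runtime can be illustrated with a small dependency resolver. The feature DAG below is invented for illustration (the real config lives in spark-etl's YAML; see systems/yelp-spark-etl):

```python
from graphlib import TopologicalSorter

# Hypothetical feature DAG, as it might look after parsing the YAML config:
# each feature lists the upstream features it consumes.
feature_deps = {
    "source_invoices": [],
    "source_contracts": [],
    "joined_billing": ["source_invoices", "source_contracts"],
    "recognized_revenue": ["joined_billing"],
}

# Running features in topological order guarantees every feature's
# inputs have been produced before it executes.
order = list(TopologicalSorter(feature_deps).static_order())
assert order.index("source_invoices") < order.index("joined_billing")
assert order[-1] == "recognized_revenue"
```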

Future improvements (named by the post)

  • Enhanced Data Interfaces and Ownership — feature teams own + maintain standardised data interfaces for offline consumption, decoupling reporting from implementation.
  • Simplified Data Models — reduce UDF count by simplifying source data so SQL-like PySpark expressions suffice.
  • Unified Implementation — standardise schemas across products to shrink input-table count and collapse DAG stages.
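
The "Simplified Data Models" idea can be illustrated in plain Python (both snippets are invented examples): when discount priority is pre-resolved upstream, a branching UDF collapses into a simple SQL-like column expression.

```python
# Before: complex source data forces a UDF-style function that re-derives
# which discount applies, priority by priority (cf. multi-priority
# discount application).
def applied_discount_udf(row):
    for kind in ("contractual", "promotional", "loyalty"):  # priority order
        if row.get(f"{kind}_discount"):
            return row[f"{kind}_discount"]
    return 0

# After: the simplified source model pre-resolves a single `discount`
# column upstream, so the feature reduces to `amount - discount`.
complex_row = {"amount": 100, "promotional_discount": 15}
simple_row = {"amount": 100, "discount": 15}

assert complex_row["amount"] - applied_discount_udf(complex_row) == 85
assert simple_row["amount"] - simple_row["discount"] == 85
```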

Caveats

  • "REVREC service" is unnamed; the post treats the third-party SaaS as a black box. Likely candidates at 2024-2025 Yelp scale include Zuora RevPro, NetSuite ARM, Sage Intacct, and Workiva, but none is confirmed.
  • No throughput / latency disclosures — the post emphasises methodology over benchmarks. Daily record volume, Spark cluster size, runtime, and cost are not published.
  • The 50+ features across two stages are a named risk: adding a new product requires changes and testing across the whole job. Yelp's future-improvements section explicitly flags this as unsustainable at the next step of product-catalog growth.

Testing + verification (from 2025-05-27 post)

The 2025-05-27 Revenue Automation Series post on integration testing describes the verification machinery around this pipeline. The same pipeline code runs in two configurations:

  • Production — output published via Redshift Connector to Redshift tables; consumed by BI + REVREC upload.
  • Staging — the parallel Yelp Staging Pipeline consuming the same production data with code-under-test, publishing to AWS Glue tables queryable via Redshift Spectrum. Enables same-day verification that bypasses the ~10-hour Redshift Connector latency.
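
The two configurations share pipeline code and differ mainly in the publish target, which could be modelled roughly as follows (this config shape is hypothetical, not Yelp's actual spark-etl configuration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PipelineConfig:
    """Same pipeline code, two publish targets (illustrative model)."""
    env: str
    publish_target: str   # where output lands
    query_path: str       # how verifiers read it back

PRODUCTION = PipelineConfig(
    env="production",
    publish_target="redshift",        # via Redshift Connector (~10 h latency)
    query_path="Redshift tables",
)
STAGING = PipelineConfig(
    env="staging",
    publish_target="glue",            # AWS Glue tables
    query_path="Redshift Spectrum",   # queryable same-day
)

assert PRODUCTION.publish_target != STAGING.publish_target
```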

Integrity is checked at two cadences (see patterns/monthly-plus-daily-dual-cadence-integrity-check):

  • Daily (over staging Glue tables via Redshift Spectrum): lightweight pipeline-internal invariants — zero negative revenues, zero unknown programs, count of contracts missing parent category, etc.
  • Monthly (over Redshift tables against billing-system truth): the full four-metric suite with a 99.99% contract-invoice match threshold.
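
The daily invariants can be sketched as checks over published rows (field names are hypothetical; the real checks run as queries over the staging Glue tables via Redshift Spectrum):

```python
def daily_invariants(rows):
    """Lightweight pipeline-internal checks from the daily cadence:
    no negative revenue, no unknown programs, plus a count (not a hard
    failure) of contracts missing a parent category."""
    violations = []
    missing_parent = 0
    for row in rows:
        if row["revenue"] < 0:
            violations.append(f"negative revenue: {row['contract_id']}")
        if row["program"] == "unknown":
            violations.append(f"unknown program: {row['contract_id']}")
        if row.get("parent_category") is None:
            missing_parent += 1
    return violations, missing_parent

rows = [
    {"contract_id": "c1", "revenue": 90, "program": "ads",
     "parent_category": "advertising"},
    {"contract_id": "c2", "revenue": -5, "program": "unknown",
     "parent_category": None},
]
violations, missing = daily_invariants(rows)
assert violations == ["negative revenue: c2", "unknown program: c2"]
assert missing == 1
```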

Output to the third-party REVREC system is guarded by a Schema Validation Batch that polls the REVREC mapping API before every upload and aborts on mismatch. Standard delivery is SFTP (4-5 files/day at 500k-700k records each), adopted after the REST interface proved flaky and its 50k-records/file cap meant roughly 15 files/day.
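
The pre-upload guard can be sketched as below. The mapping-API response shape and function names are assumptions (the post does not document the REVREC API); the record-per-file numbers come from the post.

```python
def validate_schema(local_columns, revrec_mapping):
    """Abort the upload when the local publish schema no longer matches
    the columns the REVREC mapping API expects (hypothetical shape)."""
    expected = set(revrec_mapping["columns"])
    actual = set(local_columns)
    if actual != expected:
        raise RuntimeError(
            f"schema mismatch: missing={expected - actual}, "
            f"extra={actual - expected}"
        )

def chunk_for_sftp(records, max_per_file=700_000):
    """Split a day's records into SFTP files within the observed
    500k-700k records/file range (vs REST's 50k cap, ~15 files/day)."""
    return [records[i:i + max_per_file]
            for i in range(0, len(records), max_per_file)]

# Matching schemas pass silently; 1.5M records fit in 3 SFTP files
# (the same day over REST would need 30 files at the 50k cap).
validate_schema(["contract_id", "revenue"],
                {"columns": ["contract_id", "revenue"]})
assert len(chunk_for_sftp(list(range(1_500_000)))) == 3
```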
