Yelp Revenue Data Pipeline¶
Definition¶
Yelp Revenue Data Pipeline is the named batch data pipeline that collects revenue data from Yelp's ecosystem (billing system, order-to-cash system, product catalogs) and delivers it to a third-party Revenue Recognition SaaS ("REVREC service") for automated revenue recognition. It is the centerpiece of Yelp's multi-year Revenue Automation project.
Architecture (from 2025-02-19 post)¶
```
MySQL operational tables
    │ (daily snapshot)
    ▼
AWS S3 (data lake, snapshot prefix)
    │
    ▼
Spark ETL (Yelp's spark-etl package; see ./yelp-spark-etl.md)
    │ — source snapshot features (read raw)
    │ — transformation features (aggregate, join,
    │   apply UDFs for revenue logic, map templates)
    ▼
AWS S3 (publish prefix, REVREC-template shape)
    │
    ▼
REVREC service (3rd-party SaaS)
    │
    ▼
Accounting team: books close + real-time revenue forecasts
```
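The daily flow above can be sketched as a plain-Python composition. All the callables here are injected stand-ins for the real S3, Spark, and REVREC integrations; names are illustrative, not Yelp's actual code.

```python
def run_daily(run_date, read_snapshot, transform, publish, upload_to_revrec):
    """One logical daily run mirroring the architecture diagram.

    Every dependency is injected, so the same skeleton can be exercised
    with stubs; the real steps are S3 reads, Spark ETL, and SaaS upload.
    """
    raw = read_snapshot(run_date)           # S3 data lake, snapshot prefix
    shaped = transform(raw)                 # Spark ETL feature graph
    location = publish(shaped, run_date)    # S3 publish prefix (REVREC-template shape)
    upload_to_revrec(location)              # third-party REVREC service ingestion
    return location
```

A stubbed invocation makes the data flow visible without any infrastructure: `run_daily("2025-02-19", lambda d: ["raw"], lambda r: ["shaped"], lambda s, d: f"s3://publish/{d}", print)`.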
Why the pipeline exists¶
The REVREC service needs clean, REVREC-template-shaped revenue data to do its job. Yelp's operational systems emit that data in native shapes that don't match, so the Revenue Data Pipeline is the bridge that reshapes Yelp-native data into REVREC-template-shaped data.
Three benefits attributed to REVREC adoption (from the post):
- Recognise any revenue stream (one-time purchases + subscriptions, flat + variable pricing) with minimal cost / risk.
- Close the books up to 50% faster via real-time revenue reconciliation.
- Real-time revenue forecasting with out-of-the-box reports + dashboards.
Substrate choice: Data Lake + Spark ETL¶
Chosen after rejecting three alternatives (see sources/2025-02-19-yelp-revenue-automation-series-building-revenue-data-pipeline for the full evaluation):
| Option | Evaluation |
|---|---|
| MySQL + Python batch | Rejected: inconsistent rerun results (production data mutates mid-rerun); slow batch processing at peak volumes. |
| Data warehouse + dbt | Rejected: complex revenue-recognition logic "difficult to represent in SQL." |
| Event streams + stream processing | Rejected: immediate data presentation not necessary; third-party REVREC interfaces don't support stream integration. |
| Data lake + Spark ETL ✅ | Chosen: independent reproducibility (daily snapshots are immutable, so same input → same output); peak-time scalability; strong community support. |
The key differentiator versus MySQL+Python is daily snapshot reproducibility — reruns against a frozen snapshot produce identical output regardless of when they run.
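A minimal sketch of why date-keyed snapshots give this property. The bucket name and prefix layout are illustrative, not Yelp's actual S3 paths; the point is only that the input location is a pure function of the logical run date.

```python
from datetime import date

def snapshot_prefix(table: str, run_date: date,
                    bucket: str = "s3://example-data-lake") -> str:
    """Build the immutable daily-snapshot prefix for a table and date.

    Because the prefix is keyed by the logical date, a rerun for
    2025-02-19 always reads the same frozen bytes, regardless of when
    the rerun actually happens; MySQL mutations after the snapshot
    cannot leak into it.
    """
    return f"{bucket}/snapshots/{table}/dt={run_date.isoformat()}/"

# Same logical date in, same input location out:
assert snapshot_prefix("billing_invoices", date(2025, 2, 19)) == \
    "s3://example-data-lake/snapshots/billing_invoices/dt=2025-02-19/"
```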
Two-stage Spark ETL structure¶
The post discloses that the pipeline runs as a two-stage Spark job with 50+ Spark features; the team names both figures as a current maintenance risk (see Future improvements below).
Stages are:
- Source data snapshot features — read raw MySQL snapshots from S3; pass through unchanged.
- Transformation features — consume source features (or other transformation features); apply filtering, projection, joins, UDF-based revenue-calculation logic, and finally map the output to REVREC-template schemas.
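A hedged sketch of the two feature kinds, using a list of dicts as a stand-in for a Spark DataFrame; the feature names, fields, and filter rules here are illustrative, not Yelp's actual features.

```python
def source_invoices(read_snapshot):
    """Source feature: read the raw MySQL snapshot from S3, pass through unchanged."""
    return read_snapshot("billing_invoices")

def transform_recognized_revenue(invoices):
    """Transformation feature: filter, project, and map rows onto a
    REVREC-template shape (here just two fields)."""
    return [
        {"contract_id": row["contract_id"], "amount": row["amount"]}
        for row in invoices
        if row["status"] == "paid" and row["amount"] >= 0
    ]

def fake_snapshot(table):
    """Stub for the S3 snapshot read."""
    return [
        {"contract_id": "c1", "amount": 100, "status": "paid"},
        {"contract_id": "c2", "amount": 50, "status": "void"},
    ]

out = transform_recognized_revenue(source_invoices(fake_snapshot))
# out keeps only the paid invoice, projected to the template fields
```

In the real pipeline each feature is a node in the YAML-declared DAG and transformation features may also consume other transformation features.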
Known implementation details¶
- Snapshot frequency: daily.
- Processing engine: Apache Spark via PySpark.
- Orchestration: internal `spark-etl` package + YAML config, with a topological-sort runtime (see systems/yelp-spark-etl).
- Debugging: checkpointing intermediate DataFrames to an S3 scratch prefix; loaded in systems/jupyterhub for post-facto inspection.
- Complex business logic (e.g. multi-priority discount application): PySpark UDFs (see concepts/pyspark-udf-for-complex-business-logic).
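To make the UDF discipline concrete, here is a plain-Python sketch of what a multi-priority discount body might look like. The discount rules (compounding percentages, flat subtractions, priority ordering) are invented for illustration and are not Yelp's actual revenue logic.

```python
def apply_discounts(list_price: float, discounts: list) -> float:
    """Apply discounts in priority order (lowest priority number first).

    Percent discounts compound on the running price; flat discounts
    subtract from it. The rule set is illustrative only.
    """
    price = list_price
    for d in sorted(discounts, key=lambda d: d["priority"]):
        if d["kind"] == "percent":
            price *= (1 - d["value"])
        else:  # flat amount
            price -= d["value"]
    return max(price, 0.0)

# In the pipeline, a body like this would be registered as a PySpark UDF:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import DoubleType
#   apply_discounts_udf = udf(apply_discounts, DoubleType())
```

For example, a 20% priority-1 discount then a flat $10 priority-2 discount on a $100 list price yields 100 × 0.8 − 10 = 70.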
Future improvements (named by the post)¶
- Enhanced Data Interfaces and Ownership — feature teams own + maintain standardised data interfaces for offline consumption, decoupling reporting from implementation.
- Simplified Data Models — reduce UDF count by simplifying source data so SQL-like PySpark expressions suffice.
- Unified Implementation — standardise schemas across products to shrink input-table count and collapse DAG stages.
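The unified-implementation idea can be sketched as a single schema-mapping step that absorbs per-product differences, so one downstream transformation serves every product. Product names, field names, and the mapping table are all hypothetical.

```python
# Illustrative per-product field mappings onto one standard schema.
FIELD_MAP = {
    "ads":           {"id": "campaign_id", "revenue": "spend"},
    "subscriptions": {"id": "sub_id",      "revenue": "mrr"},
}

def unify(row: dict, product: str) -> dict:
    """Map a product-specific record onto the shared schema, so a single
    downstream feature (not one per product) can consume it."""
    m = FIELD_MAP[product]
    return {"contract_id": row[m["id"]], "amount": row[m["revenue"]],
            "product": product}

rows = [
    unify({"campaign_id": "a1", "spend": 30}, "ads"),
    unify({"sub_id": "s9", "mrr": 12}, "subscriptions"),
]
# Both products now share one shape, shrinking input-table count and DAG stages.
```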
Caveats¶
- "REVREC service" is unnamed; the post treats the third-party SaaS as a black box. Likely candidates at 2024-2025 Yelp scale: Zuora RevPro, NetSuite ARM, Sage Intacct, Workiva; none confirmed.
- No throughput / latency disclosures — the post emphasises methodology over benchmarks. Daily record volume, Spark cluster size, runtime, and cost are not published.
- 50+ features across 2 stages is a named risk: adding a new product requires changes and testing across the whole job. Yelp's future-improvements section explicitly flags this as unsustainable as the product catalog keeps growing.
Testing + verification (from 2025-05-27 post)¶
The 2025-05-27 Revenue Automation Series post on integration testing describes the verification machinery around this pipeline. The same pipeline code runs in two configurations:
- Production — output published via Redshift Connector to Redshift tables; consumed by BI + REVREC upload.
- Staging — the parallel Yelp Staging Pipeline consuming the same production data with code-under-test, publishing to AWS Glue tables queryable via Redshift Spectrum. Enables same-day verification that bypasses the ~10-hour Redshift Connector latency.
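A hedged sketch of how one codebase might switch publish targets per environment. The ~10-hour latency figure comes from the post; the config keys and target names are stand-ins.

```python
# Illustrative environment configs for the same pipeline code.
PIPELINE_CONFIGS = {
    "production": {"publish": "redshift_connector",  # ~10-hour visibility lag
                   "query_path": "redshift"},
    "staging":    {"publish": "glue_table",          # queryable the same day
                   "query_path": "redshift_spectrum"},
}

def publish_target(env: str) -> str:
    """Select the publish target; everything else in the run is identical,
    which is what makes staging a faithful verification sandbox."""
    return PIPELINE_CONFIGS[env]["publish"]
```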
Integrity is checked at two cadences (see patterns/monthly-plus-daily-dual-cadence-integrity-check):
- Daily (over staging Glue tables via Redshift Spectrum): lightweight pipeline-internal invariants — zero negative revenues, zero unknown programs, count of contracts missing parent category, etc.
- Monthly (over Redshift tables against billing-system truth): full four-metric suite with a 99.99% contract-invoice match threshold.
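The two cadences can be sketched as plain predicates over output rows. The invariant names follow the post; field names and the exact rule set are illustrative.

```python
def daily_invariants(rows: list) -> tuple:
    """Lightweight pipeline-internal checks, daily-cadence style:
    flag negative revenues and unknown programs, and count contracts
    missing a parent category."""
    failures = []
    if any(r["amount"] < 0 for r in rows):
        failures.append("negative_revenue")
    if any(r.get("program") in (None, "UNKNOWN") for r in rows):
        failures.append("unknown_program")
    missing_parent = sum(1 for r in rows if r.get("parent_category") is None)
    return failures, missing_parent

def monthly_match_ok(matched: int, total: int,
                     threshold: float = 0.9999) -> bool:
    """Monthly-cadence check: contract-invoice match rate against the
    99.99% threshold named in the post."""
    return matched / total >= threshold
```

At the threshold, 9,999 matches out of 10,000 passes while 9,998 fails.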
Output to the third-party REVREC system is guarded by a Schema Validation Batch that polls the REVREC mapping API before every upload and aborts on mismatch. Standardised delivery is SFTP (4-5 files/day at 500k-700k records each) after REST was found to be flaky and capped at 50k records/file → 15 files/day.
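The pre-upload guard can be sketched as a set comparison between local output columns and the column names the mapping API reports. The mapping API is stubbed here as an injected callable; the real Schema Validation Batch polls the third-party REVREC mapping API before every upload.

```python
def validate_schema(local_columns, fetch_mapping) -> bool:
    """Abort the upload on any column mismatch against the REVREC mapping.

    `fetch_mapping` stands in for the mapping-API call and returns the
    column names REVREC currently expects.
    """
    expected = set(fetch_mapping())
    actual = set(local_columns)
    missing, extra = expected - actual, actual - expected
    if missing or extra:
        raise RuntimeError(
            f"schema mismatch: missing={sorted(missing)} extra={sorted(extra)}"
        )
    return True
```

Failing loudly before upload is the point: a silent mismatch would surface much later, inside the third-party system's recognition results.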
Seen in¶
- sources/2025-02-19-yelp-revenue-automation-series-building-revenue-data-pipeline — canonical disclosure. System design evaluation (four architectures, three rejected), Spark ETL feature structure, YAML DAG config, checkpointing, UDF discipline for discount application, future-improvements roadmap.
- sources/2025-05-27-yelp-revenue-automation-series-testing-an-integration-with-third-party-system — how Yelp validates this pipeline in production: parallel staging pipeline, dual-cadence integrity checks, pre-upload schema validation, SFTP over REST for bulk delivery.
Related¶
- systems/yelp-spark-etl — the internal package
- systems/yelp-billing-system — upstream source-of-truth
- systems/yelp-staging-pipeline — verification sandbox
- systems/yelp-schema-validation-batch — pre-upload guard
- systems/yelp-redshift-connector — publication path (and latency constraint)
- systems/apache-spark — underlying engine
- systems/aws-s3 — snapshot + publish + scratch substrate
- systems/aws-glue — staging catalog
- systems/amazon-redshift — production BI substrate
- systems/amazon-redshift-spectrum — staging query path
- systems/jupyterhub — debugging surface
- companies/yelp
- concepts/revenue-recognition-automation
- concepts/mysql-snapshot-to-s3-data-lake
- concepts/staging-pipeline
- concepts/data-integrity-checker
- concepts/redshift-connector-latency
- concepts/data-upload-format-validation
- patterns/daily-mysql-snapshot-plus-spark-etl
- patterns/source-plus-transformation-feature-decomposition
- patterns/parallel-staging-pipeline-for-prod-verification
- patterns/monthly-plus-daily-dual-cadence-integrity-check
- patterns/schema-validation-pre-upload-via-mapping-api
- patterns/sftp-for-bulk-daily-upload