YELP 2025-05-27 Tier 3

Yelp — Revenue Automation Series: Testing an Integration with Third-Party System

Summary

Yelp Engineering post (2025-05-27) by the Revenue Recognition Team — third post in the Revenue Automation Series after the 2024-12 billing-system modernisation and the 2025-02-19 Revenue Data Pipeline build-out. Where the prior post described what was built (the Revenue Data Pipeline feeding a third-party REVREC SaaS), this post describes how Yelp gained confidence to ship it — a six-step production testing + integration strategy that replaced a 10-hour-latency, manual-verification status quo with a near-realtime, self-validating pipeline. The load-bearing content is organised as six named steps:

  1. Staging Pipeline Setup — run a parallel staging pipeline alongside production that consumes production data, publishes to AWS Glue tables (not Redshift), and lets engineers query via Redshift Spectrum immediately rather than waiting on the Redshift Connector's ~10-hour latency. Canonical parallel-staging-pipeline-for-prod-verification instance.
  2. Test Data Generation — dev-environment data is too sparse to cover Yelp's product + edge-case matrix, so edge cases discovered via integrity-checker runs in production are backported into the dev environment as new fixture data. The process is explicitly manual "due to the number of input tables" and flagged for future automation.
  3. Data Integrity Checkers — monthly + daily dual-cadence integrity checks compare pipeline output against the billing-system truth. Four named metrics:
    • Contracts matching invoice gross revenue (target: 99.99% match rate).
    • Contracts with mismatched discount invoice (discount application close to billing).
    • Contract lines with no equivalent invoice (manually-billed products recognition).
    • Duplicate or past contract lines (dedup + no-stale guarantee).
  4. Data Format Validation — before upload, a Schema Validation Batch pulls the third-party system's current field-mapping via REST API and compares it against the revenue data schema. Prevents upload failures from column-mapping or schema-drift mismatch.
  5. Data Ingestion — two upload methods evaluated (REST API vs SFTP). SFTP chosen on reliability + file-size tolerance + setup simplicity. Concrete numbers:
    • REST API: ~50,000 records/file limit → ~15 files/day, ~50 files at month-end close.
    • SFTP: 500,000-700,000 records/file → 4-5 files/day.
  6. External System Support — the irreducible dependency on third-party ops: SFTP server downtime, external-upload job trigger failure, insufficient table space. Mitigation is documented escalation paths + support channels, not a technical fix.

A Future Improvements section names five further work items: automate test-data generation, optimise discount calculation, opt-in/out product-type maintainability, enhance integrity checker, and a user-facing upload UI.

Context inheritance

This post presupposes the two prior Revenue Automation Series entries:

  • 2025-02-19 (Building Revenue Data Pipeline) — the pipeline itself, the architectural evaluation of four substrates, the Spark ETL feature DAG, the daily MySQL snapshot → S3 → Spark ETL → REVREC template flow.
  • 2024-12 (billing-system modernisation, not yet ingested) — the upstream source-of-truth that the integrity checkers compare pipeline output against.

The testing strategy described here assumes that pipeline exists and wraps it with production verification machinery.

Key takeaways

  1. The Redshift Connector's ~10-hour latency is the motivating constraint. The status quo was: data pipeline runs → Redshift Connector publishes to Redshift → 10 hours later the data is available → engineers query to verify. Yelp cites this as the specific pain: "this introduced a latency of approximately 10 hours before the data was available in the data warehouse for verification." The fix is a parallel pipeline that bypasses the connector entirely by writing to AWS Glue data-catalog tables on S3, queryable immediately via Redshift Spectrum. (Source: "Its results were published to AWS Glue tables which enabled us to solve the Redshift Connector latency problem by making data immediately available for querying via Redshift Spectrum.")
  2. The staging pipeline mutates — production doesn't. The discipline is explicit: "the production pipeline and its data were left untouched until the new changes were verified before updating it." This makes rollback free (staging is throwaway), isolates prod from experiments, and lets engineers test code changes against real production data without the "revert the change" ceremony typical of direct-to-prod testing.
  3. Dev-environment data diversity is insufficient for a product catalog of Yelp's size. "Since the development environments have limited data, the different edge cases that occur in production could not be covered during dev testing." The post names this as something that "was discovered when the data pipeline was executed in production for the first time" — a production-data-diversity gap acknowledged retrospectively. The workflow-level fix: when the staging pipeline discovers a new edge case in production, a matching synthetic record is created in the dev environment to replicate it.
  4. Integrity checking runs at two cadences for a reason. The source of truth is monthly (Yelp's billing system closes monthly). But waiting for end-of-month makes code iteration impossibly slow. Yelp runs monthly checks on billing-system truth (SQL over Redshift tables) as the final authority, plus daily checks on staging-pipeline output via Redshift Spectrum for rapid iteration on bug fixes. The 99.99% contract match rate threshold is named: "we wanted to verify that at least 99.99 percent of the contracts matched the billed revenue as this level of accuracy was required to ensure the reliability of the system."
  5. Schema drift in a third-party system is an operational reality, not an edge case. The Schema Validation Batch polls the third-party system's field-mapping API before every upload: "To avoid such failures, we developed a Schema Validation Batch, which retrieves the field mapping from the external system using a REST API and compares it against the revenue data schema before proceeding with the data upload." This is not a one-time deploy-time check; it's a pre-upload runtime guard against partner-side schema churn.
  6. SFTP beats REST for bulk daily upload on three named axes: (a) reliability ("API responses were found to be flaky, which lead to inconsistent uploads and multiple retries"), (b) file-size ceiling (REST capped at 50k records/file → ~15 files/day and ~50 at month-end close vs SFTP at 500k-700k records/file → 4-5 files/day), (c) setup simplicity ("the SFTP upload setup process was found to be less complex when compared to the API set up"). The post explicitly notes REST was the default first choice ("standard HTTP methods, which are easy to use and scalable") — then production experience overrode the default. Yelp standardises on SFTP across multiple pipelines after this finding.
  7. Third-party ops dependencies aren't solvable — only managed. Three failure modes are enumerated (SFTP downtime, upload-job trigger failure, external-system table space exhaustion), each requiring external-team intervention. The engineering response is not a technical fix but an operational one: "establish internal guidelines for escalating high-priority issues and contacting the appropriate support teams." Honest acknowledgement that no amount of code makes a third-party system your own.

Architecture: the parallel-pipeline verification topology

The post's key diagram shows two pipelines running in parallel with different output substrates and different verification latencies:

                  Production data
           ┌─────────────┴─────────────┐
           ▼                           ▼
  Production pipeline          Staging pipeline
  (stable code)                (code under test)
           │                           │
           ▼                           ▼
  Redshift tables              AWS Glue data catalog
   (via Redshift Connector)    tables on S3
           │                           │
           │ ~10 hr latency            │ immediate
           ▼                           ▼
  Production queries           Verification queries
  (business reporting)         (via Redshift Spectrum)
                              Data integrity checkers
                              (daily, SQL over staging)
                              Bug discovered?
                               ├─ fix code in staging
                               └─ edge case? → backport
                                  test data to dev env

After verification, the staging pipeline's validated changes are promoted to production. The production pipeline remains the authoritative data source for the REVREC integration; the staging pipeline is a verification sandbox.

Six-step testing strategy — mechanisms and numbers

Step 1: Staging Pipeline Setup

  • Configuration: "a separate data pipeline configuration, referred to as the staging pipeline, in parallel with the production pipeline."
  • Data source: production data (not dev data).
  • Output substrate: AWS Glue tables (not Redshift).
  • Query path: Redshift Spectrum queries Glue tables directly — "immediately after the reports were generated."
  • Key discipline: "the production pipeline and its data were left untouched until the new changes were verified." Staging mutates, prod doesn't.
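The staging/production split above can be sketched as two configurations that differ only in output substrate and mutability. This is a minimal illustration; the class, field names, and sink identifiers are assumptions, not Yelp's actual configuration format.

```python
# Hypothetical sketch of the parallel-pipeline setup: same production input,
# different output substrate, different mutability discipline.
from dataclasses import dataclass

@dataclass(frozen=True)
class PipelineConfig:
    name: str
    source: str       # both pipelines consume the same production data
    sink: str         # where results are published
    query_path: str   # how engineers verify the output
    mutable: bool     # whether code changes are allowed here

PRODUCTION = PipelineConfig(
    name="revenue-pipeline-prod",
    source="production-data",
    sink="redshift",                 # via Redshift Connector, ~10 h latency
    query_path="redshift",
    mutable=False,                   # left untouched until changes are verified
)

STAGING = PipelineConfig(
    name="revenue-pipeline-staging",
    source="production-data",        # same input as production
    sink="glue-catalog-s3",          # AWS Glue tables on S3
    query_path="redshift-spectrum",  # queryable immediately
    mutable=True,                    # code under test lives here
)
```

The frozen dataclass makes the discipline explicit in code: the only thing that varies is where output lands and which pipeline is allowed to change.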

Step 2: Test Data Generation

  • Problem named: "the different edge cases that occur in production could not be covered during dev testing. This was discovered when the data pipeline was executed in production for the first time."
  • Workflow: edge case discovered in staging integrity check → new synthetic data point added to dev environment to replicate it.
  • Named constraint: "the large number of database tables required as input to the pipeline made this process very tedious as it involved manual creation of data points."
  • Future work flagged: automate test-data creation.
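The backport workflow can be sketched as follows. Everything here is illustrative — the record shapes, table names, and helper function are assumptions; the post describes this step as manual across many input tables, which is exactly why it is flagged for automation.

```python
# Illustrative sketch of backporting a production edge case into dev fixtures:
# one discovered edge case becomes synthetic rows in each input table it touches.
def backport_edge_case(edge_case: dict, fixtures: dict) -> dict:
    """Replicate a production edge case as synthetic dev-environment rows."""
    for table, row in edge_case["rows_by_table"].items():
        synthetic = {**row, "synthetic": True, "origin": edge_case["id"]}
        fixtures.setdefault(table, []).append(synthetic)
    return fixtures

fixtures: dict = {}
edge_case = {
    "id": "EC-001",  # hypothetical identifier for the discovered edge case
    "rows_by_table": {
        "contracts": {"contract_id": "c-1", "discount_pct": 100},  # fully discounted line
        "invoices": {"invoice_id": "inv-1", "gross_revenue": 0},   # zero-revenue invoice
    },
}
backport_edge_case(edge_case, fixtures)
assert fixtures["contracts"][0]["synthetic"] is True
```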

Step 3: Data Integrity Checkers

  • What they compare: pipeline-produced revenue contract data (including estimated revenue) vs billing-system billed revenue.
  • Four named metrics:
    • Contract → invoice match rate (target: 99.99%).
    • Discount invoice mismatch count (measure discount application accuracy).
    • Contract lines with no equivalent invoice (recognise manually-billed products).
    • Duplicate or past contract lines (no dedup violations or stale entries).
  • Monthly check: SQL over Redshift tables comparing billing-system monthly close against contract pipeline output. Sample output shape (redacted):
---- Metrics ----
Number of contracts reviewed: 100000
Gross Revenue:  xxxx
Net Revenue:  xxxx

Number of contracts matching both gross & net:  99799
Number of contracts with mismatch discount: 24
Number of contracts with mismatch gross revenue:  6
Number of contracts with no equivalent invoice:  1
Number of invoice with no equivalent contract:  10

Line match %: 99.997
Net revenue mismatch difference:  1883.12

The 99.997% match rate shown in the example is above the 99.99% target.

  • Daily check: SQL over staging-pipeline output via Redshift Spectrum on Glue tables. Lighter metric set:

---- Metrics ----
number of contracts with negative revenue:  0
number of contracts passed or expired:  28
number of contracts with unknown program:   0
number of contracts missing parent category:   66

Daily checks surface implementation-level bugs fast — discrepancies are investigated, fixed, and rerun the same day.

  • Iteration discipline: "determining the right SQL queries was done iteratively, accounting for different edge cases based on the variety of products offered at Yelp."
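The daily metric set shown above can be expressed compactly. The post implements these checks as SQL over Redshift Spectrum; the Python below is a behavioural sketch only, and the field names (`net_revenue`, `end_date`, `program`, `parent_category`) are assumptions.

```python
# Sketch of the four daily integrity metrics over staging-pipeline contract rows.
from datetime import date

def daily_metrics(contracts: list[dict], today: date) -> dict:
    return {
        "negative_revenue": sum(1 for c in contracts if c["net_revenue"] < 0),
        "passed_or_expired": sum(1 for c in contracts if c["end_date"] < today),
        "unknown_program": sum(1 for c in contracts if c.get("program") is None),
        "missing_parent_category": sum(1 for c in contracts if c.get("parent_category") is None),
    }

contracts = [
    {"net_revenue": 100.0, "end_date": date(2026, 1, 1), "program": "ads", "parent_category": "home"},
    {"net_revenue": 50.0, "end_date": date(2024, 1, 1), "program": "ads", "parent_category": None},
]
print(daily_metrics(contracts, today=date(2025, 5, 27)))
# → {'negative_revenue': 0, 'passed_or_expired': 1, 'unknown_program': 0, 'missing_parent_category': 1}
```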

Step 4: Data Format Validation

  • The risk: "internal file data format, column mapping sequence, or differences in the external table schema led to upload failures."
  • The mechanism: Schema Validation Batch — retrieves field mapping from the third-party system via REST API before every upload, compares to revenue data schema, aborts if mismatch.
  • Sample request/response (from the post, verbatim):
-- Request
curl -X GET --header "token: RandomBasicToken" \
  "https://yourHost/api/integration/v1/upload/mapping?templatename=SAMPLE_TEMPLATE"
-- Response
{
    "Dateformat":"MM/DD/YYYY",
    "Mapping":[
        {
            "sample_column_id":1,
            "sample_column_name":"random",
            "sample_column_data_type":"string"
        }
    ]
}

Note that date format and per-column name + data type are returned — the validation layer checks all three axes.
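A sketch of the comparison step is below. The remote mapping shape mirrors the sample response above; fetching it (the REST call) is omitted, and the local-schema format plus the `validate_schema` helper are assumptions — the post describes the comparison logic in prose only.

```python
# Hedged sketch of the Schema Validation Batch comparison: check date format,
# column names, and column data types before permitting the upload.
def validate_schema(remote_mapping: dict, local_schema: dict) -> list[str]:
    """Return a list of mismatches; an empty list means safe to upload."""
    errors = []
    if remote_mapping["Dateformat"] != local_schema["date_format"]:
        errors.append(f"date format: {remote_mapping['Dateformat']} != {local_schema['date_format']}")
    remote_cols = {m["sample_column_name"]: m["sample_column_data_type"]
                   for m in remote_mapping["Mapping"]}
    for name, dtype in local_schema["columns"].items():
        if name not in remote_cols:
            errors.append(f"missing remote column: {name}")
        elif remote_cols[name] != dtype:
            errors.append(f"type mismatch for {name}: {remote_cols[name]} != {dtype}")
    return errors

remote = {"Dateformat": "MM/DD/YYYY",
          "Mapping": [{"sample_column_id": 1,
                       "sample_column_name": "random",
                       "sample_column_data_type": "string"}]}
local = {"date_format": "MM/DD/YYYY", "columns": {"random": "string"}}
assert validate_schema(remote, local) == []   # schemas agree: upload proceeds
local["columns"]["random"] = "integer"
assert validate_schema(remote, local) != []   # schema drift detected: abort upload
```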

Step 5: Data Ingestion

Two upload methods evaluated:

  Axis                   REST API                   SFTP
  ─────────────────────  ─────────────────────────  ────────────────────────────
  File size limit        50,000 records/file        500,000-700,000 records/file
  Daily file count       ~15                        4-5
  Month-end file count   ~50                        few dozen
  Reliability            "flaky", retry-prone       stable
  Setup complexity       higher                     lower
  Security               HTTPS                      SSH/SFTP (encrypted channel)
  Yelp decision          initial dev-phase choice   production standardised

Sample REST request (verbatim):

curl -X POST https://yourHost/api/integration/v1/upload/file \
  -H 'cache-control: no-cache' \
  -H 'content-type: multipart/form-data; boundary=----WebKitFormBoundary' \
  -H 'templatename: Sample Template' \
  -H 'token: <generated token>' \
  -F 'file=sample_data_file.csv'

Response: {"Message":"File consumed successfully",
  "Result":{"file_name":"sample_data_file.csv","file_request_id":"1000"},
  "Status":"Success"}

The post explicitly calls out SFTP as the standardisation choice across "multiple pipelines at Yelp, such as the revenue contract and invoice data pipelines."
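The file-size contrast above comes down to a batching step before upload. The sketch below assumes CSV payloads and a helper name of my own; the 500,000-record cap is the floor of the SFTP range the post names, and the actual SFTP transfer is omitted.

```python
# Illustrative sketch of batching records into SFTP-sized CSV upload files.
import csv
import io

SFTP_MAX_RECORDS = 500_000  # SFTP tolerates 500k-700k records/file; stay at the floor

def to_upload_files(records: list[dict], fieldnames: list[str],
                    max_records: int = SFTP_MAX_RECORDS) -> list[str]:
    """Split records into CSV payloads of at most max_records rows each."""
    payloads = []
    for start in range(0, len(records), max_records):
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records[start:start + max_records])
        payloads.append(buf.getvalue())
    return payloads

# Small-scale demo: 10 records with a cap of 4 per file → 3 files.
files = to_upload_files([{"contract_id": i} for i in range(10)],
                        ["contract_id"], max_records=4)
assert len(files) == 3
```

With the real 500k cap, a day's worth of records lands in a handful of files instead of the ~15 the REST ceiling forces.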

Step 6: External System Support

Three named third-party failure modes:

  • SFTP Server downtime or service disruption — files don't land on the external system.
  • Upload-job trigger failure — files land but the external system doesn't pick them up.
  • Insufficient table space in the external system — external-side storage full.

Mitigation: internal escalation guidelines + support channel documentation — not a technical fix. Honest irreducibility.

Operational numbers

Compact list of numbers named by the post (and only those):

  • ~10 hours — Redshift Connector latency before verification data lands.
  • 99.99% — target contract-invoice match rate for pipeline reliability.
  • 99.997% — example Line match % shown in the sample monthly check output (above target).
  • 100,000 — example contract volume reviewed in a monthly check.
  • 50,000 records/file — REST API upload size ceiling.
  • 500,000-700,000 records/file — SFTP file size range.
  • 15 files/day (REST) vs 4-5 files/day (SFTP) — steady-state upload count.
  • ~50 files — month-end close REST file count.
  • 1,883.12 — example net-revenue mismatch difference in sample monthly output ($1,883.12 across 100k contracts).

Caveats

  • Third-party REVREC system remains unnamed (consistent with the 2025-02-19 post). No vendor disclosure.
  • No SLO/SLI numbers on staging-pipeline latency vs production pipeline latency. Just "immediately after reports were generated" qualitatively.
  • No code for the Schema Validation Batch — sample request/response shown but the comparison logic is described in prose only.
  • No load numbers — how many records/day flow through REST vs SFTP; what the month-end peak looks like vs steady state.
  • No test-data-generation tooling named — flagged as future work; current process is manual.

Positioning on the wiki

This is Yelp's fourth ingested article, following three prior entries on distinct axes:

  • 2025-02-04 — LLM-powered search query understanding (serving-infra axis).
  • 2025-02-19 — Revenue Data Pipeline build (financial-systems / data-platform axis).
  • 2025-04-15 — Zero Trust Access migration to Netbird (corporate-security / networking axis).

This post deepens the 2025-02-19 axis rather than opening a new one — it's a direct sequel documenting the integration testing strategy for the pipeline that post built. Together they establish the wiki's most-complete view of Yelp's financial-systems engineering stack.

Cross-source siblings across the corpus:

  • Pre-deploy schema validation: The Schema Validation Batch here parallels patterns/schema-validation-before-deploy from Datadog's CDC platform (2025-11-04) — both validate schema compatibility before an upstream change breaks a downstream consumer. Different consumer (3rd-party REVREC vs CDC-subscriber), same shape.
  • Parallel-pipeline verification: similar in spirit to Slack's systems/slack-deploy-safety-program automated violation detection, but tuned for a batch-data domain rather than a real-time request-handling domain.
  • Data-lake for reproducibility: consistent with Yelp's own concepts/mysql-snapshot-to-s3-data-lake rationale (daily snapshots produce same output regardless of rerun timing) but extended with a parallel staging pipeline that doesn't exist in 2025-02-19.
