
CONCEPT

Data upload format validation

Definition

Data upload format validation is the pre-upload check that a data file about to be sent to a downstream system (typically a third-party SaaS) actually conforms to that system's current schema: column names, column types, date formats, and column order.

It's the runtime cousin of patterns/schema-validation-before-deploy: that pattern validates schema compatibility at deploy time against a consumer-contract registry; this concept validates at upload time against a live-polled consumer contract, which is necessary when you don't own the downstream system's contract and it can drift between uploads.

Three validation axes

Yelp's canonical 2025-05-27 disclosure names three axes the pre-upload validation must cover:

  • Date format — e.g. MM/DD/YYYY vs YYYY-MM-DD. Easy to miss in a file diff because both formats parse; the failure is semantically silent if the downstream system misinterprets the dates.
  • Column name — missing or renamed columns on the downstream side.
  • Column data type — string vs number vs date; type drift on the downstream side.

Yelp's Schema Validation Batch checks all three via a REST mapping API provided by the third-party system.
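The date-format axis is the subtle one: because both conventions parse, a parse-success check alone cannot catch the mismatch. One way to enforce it is a strict round-trip check against the format string the downstream system declares — a minimal sketch, with illustrative format strings and sample values (the helper name is hypothetical):

```python
from datetime import datetime

def matches_format(value: str, fmt: str) -> bool:
    """True iff `value` round-trips through the declared strptime format.

    The round-trip (parse, then re-format) also rejects lenient matches
    such as "5/27/2025" against %m/%d/%Y, which strptime alone accepts.
    """
    try:
        return datetime.strptime(value, fmt).strftime(fmt) == value
    except ValueError:
        return False

matches_format("05/27/2025", "%m/%d/%Y")  # True: matches the declared format
matches_format("2025-05-27", "%m/%d/%Y")  # False: ISO date, wrong convention
matches_format("5/27/2025", "%m/%d/%Y")   # False: parses, but not zero-padded
```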

Why runtime, not deploy-time?

Three conditions together make deploy-time validation insufficient:

  1. You don't own the downstream schema. Third-party SaaS configs are mutable out of band.
  2. Schema drift is silent until uploads fail. The downstream system doesn't necessarily proactively notify integrators of config changes.
  3. Partial uploads have high blast radius. If half the records land before the schema mismatch surfaces, you're now in a partial-state recovery situation.

Pulling the schema fresh before each upload amortises the check cost (one API call) against the catastrophic cost of a partial upload failure.

Shape

Pipeline produces output file
Upload-format validator:
  ├─ GET downstream /mapping API
  │     returns: { date_format, columns[] }
  ├─ Compare to output file schema
  │     - every expected column present?
  │     - every data type matches?
  │     - date format matches?
  └─ Match? ─yes→ proceed to upload
             no  → abort + alert
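The compare step above can be sketched in code. The mapping payload shape ({"date_format": ..., "columns": [...]}) follows the diagram; the column names, the type vocabulary, and the idea that the payload arrives from a GET on the third-party's mapping endpoint are assumptions for illustration:

```python
# Stand-in for the GET /mapping response; in practice this would be
# fetched fresh from the third-party system before every upload.
EXPECTED_MAPPING = {
    "date_format": "%m/%d/%Y",
    "columns": [
        {"name": "location_id", "type": "string"},
        {"name": "visit_date",  "type": "date"},
        {"name": "rating",      "type": "number"},
    ],
}

def validate_upload(file_schema: dict, mapping: dict) -> list:
    """Compare the output file's schema to the live downstream mapping.

    Returns a list of mismatches; an empty list means safe to upload.
    Covers all three axes: column name, column type, date format.
    """
    errors = []
    file_cols = file_schema["columns"]
    for col in mapping["columns"]:
        name = col["name"]
        if name not in file_cols:                       # column-name axis
            errors.append(f"missing column: {name}")
        elif file_cols[name] != col["type"]:            # column-type axis
            errors.append(f"type mismatch on {name}: "
                          f"{file_cols[name]} != {col['type']}")
    if file_schema["date_format"] != mapping["date_format"]:  # date axis
        errors.append("date format mismatch")
    return errors

# A file whose `rating` column drifted to string: validator flags it.
file_schema = {
    "date_format": "%m/%d/%Y",
    "columns": {"location_id": "string", "visit_date": "date", "rating": "string"},
}
errors = validate_upload(file_schema, EXPECTED_MAPPING)
# non-empty errors -> abort + alert; empty -> proceed to upload
```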

Distinguishing features

From patterns/schema-validation-before-deploy (Datadog-style CDC schema validation):

  • When: runtime (per-upload) vs deploy-time (once per migration).
  • Against what: live downstream schema (polled) vs consumer-contract registry (static at deploy time).
  • Ownership: downstream not owned by you (third-party) vs downstream owned but federated (internal CDC consumer).
  • Failure mode blocked: partial/total upload failure + silent semantic drift vs consumer-breaking migration.

Both validate before the risky operation; both sit at a schema boundary; they differ on who owns each side.

Caveats

  • Still one-directional. The validator checks that Yelp's output matches the downstream schema; it doesn't check that Yelp's output is semantically correct, or that the downstream schema itself is sane.
  • Depends on downstream exposing a mapping API. Without an introspection endpoint, you're stuck validating against a local copy of the schema that you maintain by hand (and can fall out of sync).
  • Doesn't auto-remediate. On mismatch, someone has to update Yelp's output schema or re-negotiate the downstream mapping — the batch just prevents the broken upload.
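When there is no introspection endpoint (second caveat), the fallback is to pin a hand-maintained local copy of the schema and validate against that instead of the live mapping. A minimal sketch — the file name and payload shape are assumptions, and, as noted, the pinned copy can silently fall out of sync with the real downstream config:

```python
import json

# Hypothetical path to the hand-maintained schema copy, kept in the
# repo and updated whenever the downstream config is renegotiated.
LOCAL_SCHEMA_PATH = "downstream_schema.json"

def load_pinned_schema(path: str = LOCAL_SCHEMA_PATH) -> dict:
    """Load the locally pinned downstream schema instead of polling it.

    Drop-in replacement for the GET /mapping step: the returned dict
    feeds the same comparison logic, but freshness is now manual.
    """
    with open(path) as f:
        return json.load(f)
```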
