CONCEPT Cited by 1 source
Data upload format validation¶
Definition¶
Data upload format validation is the pre-upload check that a data file about to be sent to a downstream system (typically a third-party SaaS) actually conforms to the downstream system's current schema — column names, column types, date formats, column order.
It is the runtime cousin of patterns/schema-validation-before-deploy: that pattern validates schema compatibility at deploy time against a consumer-contract registry, while this concept validates at upload time against a live-polled consumer contract. The runtime check is necessary when the downstream system's contract is not owned by you and can drift between uploads.
Three validation axes¶
Yelp's canonical 2025-05-27 disclosure names three axes the pre-upload validation must cover:
- Date format — e.g. MM/DD/YYYY vs YYYY-MM-DD. Easy to miss in a file diff because both formats parse as valid dates; the failure is semantically silent if the downstream system misinterprets the values.
- Column name — missing or renamed columns on the downstream side.
- Column data type — string vs number vs date; type drift on the downstream side.
Yelp's Schema Validation Batch checks all three via a REST mapping API provided by the third-party system.
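The date-format axis is the subtlest of the three: a value can be a perfectly valid date and still be in the wrong format for the downstream system. A minimal round-trip check using only Python's stdlib illustrates it (the helper name is ours, not from Yelp's batch):

```python
from datetime import datetime

def matches_date_format(value: str, fmt: str) -> bool:
    """True iff `value` parses under the downstream's declared format
    AND round-trips back to the same string (rules out lenient parses)."""
    try:
        return datetime.strptime(value, fmt).strftime(fmt) == value
    except ValueError:
        return False

# Both strings below are valid dates, so a plain file-diff or "does it
# parse?" check passes — only a check against the declared format catches
# the drift:
matches_date_format("05/27/2025", "%m/%d/%Y")   # True: matches MM/DD/YYYY
matches_date_format("05/27/2025", "%Y-%m-%d")   # False: wrong format
```

The round-trip comparison (`strftime(fmt) == value`) guards against `strptime`'s leniency with zero-padding, so a mismatch in padding conventions is also flagged.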
Why runtime, not deploy-time?¶
Three conditions together make deploy-time validation insufficient:
- You don't own the downstream schema. Third-party SaaS configs are mutable out-of-band.
- Schema drift is silent until uploads fail. The downstream system doesn't necessarily proactively notify integrators of config changes.
- Partial uploads have high blast radius. If half the records land before the schema mismatch surfaces, you're now in a partial-state recovery situation.
Pulling the schema fresh before each upload trades a cheap check (one API call) for protection against the catastrophic cost of a partial upload failure.
Shape¶
Pipeline produces output file
│
▼
Upload-format validator:
├─ GET downstream /mapping API
│ returns: { date_format, columns[] }
│
├─ Compare to output file schema
│ - every expected column present?
│ - every data type matches?
│ - date format matches?
│
└─ Match? ─yes→ proceed to upload
no → abort + alert
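The flow above can be sketched as a pure gate function. This is a minimal illustration, not Yelp's implementation: the mapping payload shape (`date_format`, `columns[]` with `name`/`type` fields) is inferred from the diagram, and the fetcher is injected so the HTTP call to the /mapping API stays out of the comparison logic:

```python
from typing import Callable

def validate_before_upload(
    fetch_mapping: Callable[[], dict],   # e.g. wraps GET /mapping on the downstream API
    file_columns: dict[str, str],        # output file's column name -> data type
    file_date_format: str,               # output file's date format string
) -> list[str]:
    """Return a list of mismatches; an empty list means 'proceed to upload'."""
    mapping = fetch_mapping()            # pulled fresh per upload: contract can drift
    errors = []
    for col in mapping["columns"]:
        name, expected_type = col["name"], col["type"]
        if name not in file_columns:                     # axis: column name
            errors.append(f"missing column: {name}")
        elif file_columns[name] != expected_type:        # axis: column data type
            errors.append(f"type mismatch on {name}: "
                          f"{file_columns[name]} != {expected_type}")
    if file_date_format != mapping["date_format"]:       # axis: date format
        errors.append("date format mismatch")
    return errors
```

The caller implements the yes/no branch: an empty error list gates into the upload; a non-empty one aborts and alerts. Keeping the function pure (fetcher injected, no side effects) also makes the gate trivially testable against a canned mapping payload.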
Distinguishing features¶
From patterns/schema-validation-before-deploy (Datadog-style CDC schema validation):
- When: runtime (per-upload) vs deploy-time (once per migration).
- Against what: live downstream schema (polled) vs consumer-contract registry (static at deploy time).
- Ownership: downstream not owned by you (third-party) vs downstream owned but federated (internal CDC consumer).
- Failure mode blocked: partial/total upload failure + silent semantic drift vs consumer-breaking migration.
Both validate before the risky operation; both sit at a schema boundary; they differ on who owns each side.
Caveats¶
- Still one-directional. The validator checks that Yelp's output matches the downstream schema; it doesn't check that Yelp's output is semantically correct, or that the downstream schema itself is sane.
- Depends on downstream exposing a mapping API. Without an introspection endpoint, you're stuck validating against a local copy of the schema that you maintain by hand (and can fall out of sync).
- Doesn't auto-remediate. On mismatch, someone has to update Yelp's output schema or re-negotiate the downstream mapping — the batch just prevents the broken upload.
Seen in¶
- sources/2025-05-27-yelp-revenue-automation-series-testing-an-integration-with-third-party-system — canonical disclosure with API sample (request/response), three axes (date format + column name + column data type), and reliability rationale.
Related¶
- patterns/schema-validation-pre-upload-via-mapping-api — the repeatable pattern
- patterns/schema-validation-before-deploy — deploy-time cousin
- concepts/schema-evolution
- systems/yelp-schema-validation-batch — canonical implementation