
ZALANDO 2021-04-12


Zalando — Modeling Errors in GraphQL

Summary

A companion to Zalando's How we use GraphQL at Europe's largest fashion e-commerce company, this post, published as part of the planned post series on the UBFF (Unified Backend-For-Frontends), addresses one of the most underspecified parts of GraphQL at scale: how to model errors when the response.errors envelope is deliberately outside the schema. The author walks through four progressively richer modelings and lands on a classification framework that splits errors into two categories by who can act on the failure: the Developer (front-end code) or the Customer (end user). Schema-modeled errors are named Problem, after RFC 7807 (Problem Details for HTTP APIs), because the name error is already taken by the GraphQL response (response.errors). They are used only for the narrow case where only the Customer can recover, canonically mutation input validation. All other failures stay in the response.errors array, with a machine-readable extensions.code wherever the front-end needs to tell them apart.

The post is explicit that overusing Problem types for 404s, authorization failures, or internal errors makes the schema unusable ("When you have a hammer…"), because GraphQL clients would have to fan out into ... on Success { } ... on Error { } at every level of the hierarchy, defeating the query-shape discipline. The framework names three architectural concerns driving the split: (1) action-ability (who recovers), (2) bug vs. end-user input (schema types break encapsulation for bugs), and (3) error propagation, since schema-modeled Problems bypass GraphQL's default null-propagation semantics, which matters for correctness.

Key takeaways

  1. GraphQL's response.errors envelope is schema-free by design. The response shape is fixed at Array<{ message: string, path: string[] }>; the schema the organisation defines applies only to response.data. This creates a schema-discoverability gap: developers using strongly-typed tooling get no help with error payloads. The post calls this out explicitly: "We use such a powerful language like GraphQL to define each field in our data structure using Schemas, but when designing the errors, we went back to a loose mode of not using any of the ideas GraphQL brought us" (Source: sources/2021-04-12-zalando-modeling-errors-in-graphql).

  2. error.extensions.code is the first line of defence. GraphQL does allow extensions on error objects, and a well-known code field (e.g. NOT_FOUND, NOT_AUTHORIZED, INVALID_EMAIL) gives front-end code something parseable without relying on error.message text. The post is explicit that "parsing the message is a no-go because it is not reliable". This becomes the canonical home for all errors the Developer (front-end code) acts on (Source: sources/2021-04-12-zalando-modeling-errors-in-graphql).
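
As a minimal sketch of what "parse the code, not the message" could look like on the client: the interface and function names below (GraphQLErrorLike, getErrorCode) are hypothetical and not taken from the post; only the extensions.code convention and the NOT_FOUND value are. Because extensions is untyped, the sketch checks the value at runtime before trusting it.

```typescript
// Hypothetical shape of one entry in response.errors. `extensions` is
// free-form, so nothing inside it can be assumed without a runtime check.
interface GraphQLErrorLike {
  message: string;
  path?: ReadonlyArray<string | number>;
  extensions?: Record<string, unknown>;
}

// Returns extensions.code when it is a non-empty string, otherwise undefined.
// The message text is never inspected.
function getErrorCode(error: GraphQLErrorLike): string | undefined {
  const code = error.extensions?.["code"];
  return typeof code === "string" && code.length > 0 ? code : undefined;
}

// Sample values for illustration only.
const coded: GraphQLErrorLike = {
  message: "Product not found",
  path: ["product"],
  extensions: { code: "NOT_FOUND" },
};
const uncoded: GraphQLErrorLike = { message: "Something went wrong" };

console.log(getErrorCode(coded)); // NOT_FOUND
console.log(getErrorCode(uncoded)); // undefined
```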

  3. Problem types — schema-modelled errors — are for the Customer, not the Developer. The worked mutation example:

type Mutation {
  register(email: String!, password: String!): RegisterResult
}

union RegisterResult = RegisterSuccess | RegisterProblem

type RegisterSuccess {
  id: ID!
  email: String!
}

type RegisterProblem {
  "translated message encompassing all invalid inputs."
  title: String!
  invalidInputs: [RegisterInvalidInput]
}

type RegisterInvalidInput {
  field: RegisterInvalidInputField!
  "translated message."
  message: String!
}

enum RegisterInvalidInputField { EMAIL  PASSWORD }

The Problem type embeds both the machine-readable code (via field: RegisterInvalidInputField!) and the translated customer-facing message in the schema. That is the load-bearing property: multi-locale error messages sit in the schema, not in a separate i18n pipeline.
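
A hedged sketch of how a client might consume this union: the TypeScript types mirror the schema above, the mutation document uses one inline fragment per union member, and __typename (the standard GraphQL meta-field) discriminates between them. The helper name fieldMessages, the operation name Register, and the sample title text are assumptions made for illustration.

```typescript
// Client-side mirror of the schema types above.
type RegisterInvalidInputField = "EMAIL" | "PASSWORD";

interface RegisterSuccess {
  __typename: "RegisterSuccess";
  id: string;
  email: string;
}

interface RegisterInvalidInput {
  field: RegisterInvalidInputField;
  message: string; // already translated by the server
}

interface RegisterProblem {
  __typename: "RegisterProblem";
  title: string; // already translated by the server
  // The schema declares a nullable list of nullable items.
  invalidInputs: Array<RegisterInvalidInput | null> | null;
}

type RegisterResult = RegisterSuccess | RegisterProblem;

// Mutation document: one inline fragment per union member.
const REGISTER_MUTATION = `
  mutation Register($email: String!, $password: String!) {
    register(email: $email, password: $password) {
      __typename
      ... on RegisterSuccess { id email }
      ... on RegisterProblem {
        title
        invalidInputs { field message }
      }
    }
  }
`;

// Maps a result to per-field messages for the form; empty on success.
function fieldMessages(
  result: RegisterResult
): Partial<Record<RegisterInvalidInputField, string>> {
  const messages: Partial<Record<RegisterInvalidInputField, string>> = {};
  if (result.__typename === "RegisterProblem") {
    for (const input of result.invalidInputs ?? []) {
      if (input !== null) messages[input.field] = input.message;
    }
  }
  return messages;
}

// Sample data for illustration only.
const problem: RegisterResult = {
  __typename: "RegisterProblem",
  title: "Some inputs are invalid",
  invalidInputs: [{ field: "EMAIL", message: "Die E-Mail-Addresse ist ungültig" }],
};

console.log(fieldMessages(problem));
```

Because the field enum and the message are both part of the typed result, the form can attach each message to its input without any runtime shape checks.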

  4. The naming discipline is from RFC 7807. The post references RFC 7807 (Problem Details for HTTP APIs) explicitly: "Since the name error is already taken by the GraphQL language (response.errors), it would be confusing to name our error types in Schema as Error." Naming the schema type Problem avoids the collision and imports an industry-standard vocabulary.

  5. The classification framework has three axes. The post explicitly constructs the decision framework around:

     • Part 1: Action-ables. "Errors are containers of action-ables. We classify them into different groups depending on who can take that action."
     • Part 2: Bugs in the system. Any error conveying a bug must stay outside the schema, because exposing it as a Problem type forces every query consumer to fork on ... on Success / ... on Error at every hierarchy level, destroying the UX benefit of GraphQL.
     • Part 3: Error propagation. GraphQL's default behaviour is to propagate an error upwards until it hits a nullable field. Schema-modelled Problem types do not propagate; they are just a branch of the union. This is a semantic change, not a formatting change.
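
The propagation rule in Part 3 can be illustrated with a small model. This is a sketch under stated assumptions, not code from the post: PathStep and nullLandingPath are hypothetical names, the product/price/amount path is an invented example, and list positions are ignored for brevity. It computes where the null lands when the last field on a path raises an error.

```typescript
// One step on the path from the operation root down to the field that failed.
interface PathStep {
  name: string;
  nullable: boolean; // false models a non-null (`!`) field
}

// Returns the path of the field that is set to null in response.data.
// An empty array means every step was non-null, so `data` itself becomes null.
function nullLandingPath(steps: PathStep[]): string[] {
  // The nearest nullable field at or above the failing one absorbs the error.
  const lastNullable = steps.map((step) => step.nullable).lastIndexOf(true);
  return lastNullable === -1
    ? []
    : steps.slice(0, lastNullable + 1).map((step) => step.name);
}

// Hypothetical path: product (nullable) -> price (non-null) -> amount (non-null).
const failingPath: PathStep[] = [
  { name: "product", nullable: true },
  { name: "price", nullable: false },
  { name: "amount", nullable: false },
];

console.log(nullLandingPath(failingPath)); // the null lands on ["product"]
```

A resolver that returns a Problem object raises no error at all, so none of this applies: the union field simply resolves to the Problem branch and sibling fields are unaffected.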

  6. Four concrete case classifications are given. The post is explicit about where each type of error lands:

     • Resource Not Found (404): Error, code NOT_FOUND. It's a navigation bug, needs to propagate, not a Customer-recoverable action.
     • Authorization: Error, code NOT_AUTHORIZED. Action-taker looks like the Customer ("please log in") but is actually the Front-end (show a login dialog / navigate to login view). Developer-actionable, so Error.
     • Mutation Input Validation: Problem. "Mutation Inputs is the only case where it is crucial to construct Problem types." Customer-actionable; needs translated text; needs per-field granularity.
     • Runtime / Internal Server Errors: Error, no code. Front-end treats all non-coded errors as 500s and can uniformly retry / show an error page.
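
The three Error cases above amount to a small decision table on the front-end. The sketch below is one possible reading, not the post's implementation: the action names and the actionFor function are hypothetical, and treating an unknown code the same as a missing code is an assumption. Mutation input validation is deliberately absent, because it arrives in response.data as a Problem rather than in response.errors.

```typescript
// Hypothetical front-end actions; the names are illustrative only.
type FrontendAction =
  | "SHOW_NOT_FOUND_PAGE"
  | "SHOW_LOGIN"
  | "SHOW_ERROR_PAGE_OR_RETRY";

interface ResponseError {
  message: string;
  extensions?: { code?: string };
}

// Picks the Developer-side action for one entry of response.errors.
function actionFor(error: ResponseError): FrontendAction {
  switch (error.extensions?.code) {
    case "NOT_FOUND":
      return "SHOW_NOT_FOUND_PAGE";
    case "NOT_AUTHORIZED":
      return "SHOW_LOGIN";
    default:
      // No code: handled like a 500. Unknown codes are handled the same
      // way here, which is an assumption of this sketch.
      return "SHOW_ERROR_PAGE_OR_RETRY";
  }
}

console.log(actionFor({ message: "n/a", extensions: { code: "NOT_AUTHORIZED" } })); // SHOW_LOGIN
console.log(actionFor({ message: "boom" })); // SHOW_ERROR_PAGE_OR_RETRY
```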

  7. A single GraphQL error object can carry multiple validation failures via structured extensions. Before arriving at Problem types, the post shows an intermediate modelling where one error object encodes multiple invalid inputs:

{
  "data": {},
  "errors": [{
    "message": "Multiple inputs are invalid",
    "extensions": {
      "invalidInputs": [
        {"code": "INVALID_EMAIL",    "message": "Die E-Mail-Addresse ist ungültig"},
        {"code": "INVALID_PASSWORD", "message": "Das Passwort erfüllt nicht die Sicherheitsstandards"}
      ]
    }
  }]
}

This works but is called out as "not as friendly as the data modeled with a GraphQL schema" and — crucially — not discoverable. That's what motivates the move to Problem types for this specific case.

Systems / concepts / patterns extracted

Systems

Concepts

  • GraphQL error extensions — the error.extensions mechanism that keeps error metadata out of the schema but still machine-readable. The extensions.code convention is the minimum-viable discipline.
  • Error action-taker classification — the "classify errors by who can act on them" framework. The core design move.
  • Problem vs Error distinction — Zalando's naming split: Problem for schema-modeled errors (RFC 7807); Error for response.errors-envelope errors.
  • GraphQL error propagation: the null-propagation-until-nullable-field semantic that makes schema-modeled errors a semantic choice, not a formatting one.
  • Schema discoverability gap in errors — the fact that the GraphQL schema does not describe the shape of the errors envelope.

Patterns

Operational numbers disclosed

None. The post is a design-principles piece; no field counts, throughput, latency, or adoption metrics are given. (The companion Part-1 UBFF post covers those.)

Caveats

  • Zero production-incident evidence. The post is normative, not retrospective. There is no claim of the form "before we adopted this, we saw X% of tickets from Y category". The decision framework is motivated by first-principles API design, not by a postmortem.
  • No explicit federation or subgraph guidance. The post takes the UBFF single-service shape as given. In a federated world with subgraphs owned by different teams, the naming discipline around RegisterProblem / OrderProblem / CheckoutProblem has to be enforced at schema-review gates — the post doesn't address that.
  • Translated messages in schema couples i18n to resolvers. The Problem type quoted has title: String! as a translated message. That implicitly means resolvers need access to the requestor's locale and an i18n catalogue. The post does not explore the operational implications (catalogue hot-reload, missing-key behaviour, fallback language) of making this a first-class schema type.
  • Client-side library support undiscussed. Union-type queries require ... on TypeName fragments in every mutation caller. The tooling ergonomics — Apollo Client, Relay, codegen — are not discussed.
  • Contradiction with default null-propagation. Using Problem types means mutation errors don't propagate as nulls. This is called out as a feature (front-end gets rich data) but also a semantic divergence from the rest of the API surface: queries behave one way, error-modelled mutations behave another.
  • One installment of a planned series. The author signals future posts on Observability, Performance Optimization, Security, Tooling, Errors; this post is the Errors installment of that plan.

Source
