Zalando: Modeling Errors in GraphQL
Summary
A companion to Zalando's How we use GraphQL at Europe's
largest fashion e-commerce company, this post (published
as part of the UBFF's planned post series) tackles one of
the most underspecified parts of GraphQL at scale: how
to model errors when the response.errors envelope is
deliberately outside the schema. The author walks through
four progressively richer modelings and lands on a
classification framework that splits errors into two
categories by who can act on the failure: the Developer
(front-end code) or the Customer (end user). Schema-modeled
errors are named Problem, after RFC 7807 (Problem Details
for HTTP APIs), because the name error is already taken by
GraphQL's response.errors. They are used only for the
narrow case where only the Customer can recover, canonically
mutation input validation. All other failures stay in the
response.errors array, with a machine-readable
extensions.code wherever the front end needs to tell them
apart. The post is explicit that overusing Problem types for
404s, authZ failures, or internal errors makes the schema
unusable ("When you have a hammer…"), because GraphQL
clients would have to fan out into
... on Success { } ... on Error { } at every level of the
hierarchy, defeating the query-shape discipline. The
framework names three architectural concerns driving the
split: (1) action-ability (who recovers), (2) bug
vs. end-user input (modeling bugs as schema types forces
every consumer to branch on them), and (3) error
propagation: schema-modeled Problems bypass GraphQL's
default null-propagation semantics, which changes how a
failure affects the rest of the response.
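As an illustration of the developer-facing half of this split, the following minimal Python sketch routes entries of response.errors by extensions.code rather than by message text. The codes NOT_FOUND and NOT_AUTHORIZED are the ones the post names; the action names, the sample response, and the plan_action helper are hypothetical and not taken from the post.

```python
from typing import Any, Dict

# Codes named in the post; the action names are hypothetical placeholders.
ACTIONS_BY_CODE = {
    "NOT_FOUND": "render_not_found_page",
    "NOT_AUTHORIZED": "open_login_dialog",
}


def plan_action(error: Dict[str, Any]) -> str:
    """Pick a front-end action from extensions.code, never from message text."""
    code = (error.get("extensions") or {}).get("code")
    # Errors without a recognised code are handled like internal server errors.
    return ACTIONS_BY_CODE.get(code, "render_generic_error_page")


# Hypothetical response: one coded error and one uncoded error.
response = {
    "data": {"product": None},
    "errors": [
        {
            "message": "Product not found",
            "path": ["product"],
            "extensions": {"code": "NOT_FOUND"},
        },
        {"message": "Something went wrong", "path": ["recommendations"]},
    ],
}

actions = [plan_action(err) for err in response.get("errors", [])]
print(actions)
```

Errors without a recognised code fall through to the generic error page, which mirrors the post's treatment of uncoded errors as internal server errors.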
Key takeaways
- GraphQL's response.errors envelope is schema-free by design. The response shape is fixed at Array<{ message: string, path: string[] }>; the schema the organisation defines applies only to response.data. This creates a schema-discoverability gap: developers using strongly-typed tooling get no help with error payloads. The post calls this out explicitly as "We use such a powerful language like GraphQL to define each field in our data structure using Schemas, but when designing the errors, we went back to a loose mode of not using any of the ideas GraphQL brought us" (Source: sources/2021-04-12-zalando-modeling-errors-in-graphql).
- error.extensions.code is the first line of defence. GraphQL does allow extensions on error objects, and a well-known code field (e.g. NOT_FOUND, NOT_AUTHORIZED, INVALID_EMAIL) gives front-end code something parseable without relying on error.message text. The post is explicit that "parsing the message is a no-go because it is not reliable". This becomes the canonical home for all errors the Developer (front-end code) acts on (Source: sources/2021-04-12-zalando-modeling-errors-in-graphql).
- Problem types (schema-modelled errors) are for the Customer, not the Developer. The worked mutation example:
    type Mutation {
      register(email: String!, password: String!): RegisterResult
    }

    # A registration either succeeds or reports a customer-facing problem.
    union RegisterResult = RegisterSuccess | RegisterProblem

    type RegisterSuccess {
      id: ID!
      email: String!
    }

    type RegisterProblem {
      "translated message encompassing all invalid inputs."
      title: String!
      invalidInputs: [RegisterInvalidInput]
    }

    # One entry per invalid input field.
    type RegisterInvalidInput {
      field: RegisterInvalidInputField!
      "translated message."
      message: String!
    }

    enum RegisterInvalidInputField { EMAIL PASSWORD }
  The Problem type carries both a machine-readable
  discriminator (field: RegisterInvalidInputField!) and the
  translated customer-facing message in the schema.
  That is the load-bearing property: translated error
  messages are delivered through typed schema fields, not
  through a separate i18n pipeline.
- The naming discipline is from RFC 7807. The post references RFC 7807 (Problem Details for HTTP APIs) explicitly: "Since the name error is already taken by the GraphQL language (response.errors), it would be confusing to name our error types in Schema as Error." Naming the schema type Problem avoids the collision and imports an industry-standard vocabulary.
- The classification framework has three axes. The post explicitly constructs the decision framework around:
  - Part 1: Action-ables. "Errors are containers of action-ables. We classify them into different groups depending on who can take that action."
  - Part 2: Bugs in the system. Any error conveying a bug must stay outside the schema, because exposing it as a Problem type forces every query consumer to fork on ... on Success / ... on Error at every hierarchy level, destroying the ergonomic benefit of GraphQL.
  - Part 3: Error propagation. GraphQL's default behaviour is to propagate an error upwards until it hits a nullable field. Schema-modelled Problem types do not propagate; they are just a branch of the union. This is a semantic change, not a formatting change.
- Four concrete case classifications are given. The post is explicit about where each type of error lands:
  - Resource Not Found (404) → Error, code NOT_FOUND. It is a navigation bug that needs to propagate, not a Customer-recoverable action.
  - Authorization → Error, code NOT_AUTHORIZED. The action-taker looks like the Customer ("please log in") but is actually the Front-end (show a login dialog / navigate to login view). Developer-actionable, so Error.
  - Mutation Input Validation → Problem. "Mutation Inputs is the only case where it is crucial to construct Problem types." Customer-actionable; needs translated text; needs per-field granularity.
  - Runtime / Internal Server Errors → Error, no code. Front-end treats all non-coded errors as 500s and can uniformly retry / show an error page.
- A single GraphQL error object can carry multiple validation failures via structured extensions. Before arriving at Problem types, the post shows an intermediate modelling where one error object encodes multiple invalid inputs:
    {
      "data": {},
      "errors": [{
        "message": "Multiple inputs are invalid",
        "extensions": {
          "invalidInputs": [
            {"code": "INVALID_EMAIL", "message": "Die E-Mail-Addresse ist ungültig"},
            {"code": "INVALID_PASSWORD", "message": "Das Passwort erfüllt nicht die Sicherheitsstandards"}
          ]
        }
      }]
    }
  This works but is called out as "not as friendly as the data modeled with a GraphQL schema" and, crucially, not discoverable. That is what motivates the move to Problem types for this specific case.
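To make the discoverability point concrete, here is a minimal Python sketch of a client reading invalidInputs from the extensions payload shown above. Because no schema describes this shape, the hypothetical collect_invalid_inputs helper has to check every level defensively; the helper name and its tolerance rules are assumptions for illustration.

```python
from typing import Any, Dict, List


def collect_invalid_inputs(response: Dict[str, Any]) -> List[Dict[str, str]]:
    """Gather invalidInputs entries from every error's extensions.

    Nothing in the schema guarantees this shape, so each level is checked.
    """
    found: List[Dict[str, str]] = []
    for error in response.get("errors") or []:
        extensions = error.get("extensions") or {}
        for item in extensions.get("invalidInputs") or []:
            # Skip entries that do not carry both expected keys.
            if isinstance(item, dict) and "code" in item and "message" in item:
                found.append({"code": item["code"], "message": item["message"]})
    return found


# The response quoted above, as a Python dict.
response = {
    "data": {},
    "errors": [{
        "message": "Multiple inputs are invalid",
        "extensions": {
            "invalidInputs": [
                {"code": "INVALID_EMAIL",
                 "message": "Die E-Mail-Addresse ist ungültig"},
                {"code": "INVALID_PASSWORD",
                 "message": "Das Passwort erfüllt nicht die Sicherheitsstandards"},
            ]
        },
    }],
}

invalid = collect_invalid_inputs(response)
print([item["code"] for item in invalid])
```

Every guard in the loop stands in for something a typed schema would otherwise guarantee, which is the gap that Problem types close for this case.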
Systems / concepts / patterns extracted
Systems
- RFC 7807 (Problem Details for HTTP APIs): IETF RFC naming the Problem vocabulary Zalando adopts. The post cites it by number as the authority for the naming choice.
Concepts
- GraphQL error extensions: the error.extensions mechanism that keeps error metadata out of the schema but still machine-readable. The extensions.code convention is the minimum-viable discipline.
- Error action-taker classification: the "classify errors by who can act on them" framework. The core design move.
- Problem vs Error distinction: Zalando's naming split, with Problem for schema-modeled errors (RFC 7807) and Error for response.errors-envelope errors.
- GraphQL error propagation: the null-propagation-until-nullable-field semantic that makes schema-modeled errors a semantic choice, not a formatting one.
- Schema discoverability gap in errors: the fact that the GraphQL schema does not describe the shape of the errors envelope.
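The propagation concept can be sketched as a toy model in Python. It is not a GraphQL executor: it assumes dict-shaped data without lists and a hypothetical nullable map from field paths to nullability. The product, brand, and name fields are invented for the example.

```python
from typing import Any, Dict, Tuple

Path = Tuple[str, ...]


def propagate_null(
    data: Dict[str, Any], error_path: Path, nullable: Dict[Path, bool]
) -> Any:
    """Null out the nearest nullable field at or above error_path.

    Mutates and returns data, or returns None when every field on the
    path is non-null and the error therefore reaches the root.
    """
    # Walk from the failing field towards the root.
    for depth in range(len(error_path), 0, -1):
        candidate = error_path[:depth]
        if nullable.get(candidate, True):
            parent = data
            for key in candidate[:-1]:
                parent = parent[key]
            parent[candidate[-1]] = None
            return data
    return None


# Hypothetical nullability: product is nullable, brand and name are not.
nullable = {
    ("product",): True,
    ("product", "brand"): False,
    ("product", "brand", "name"): False,
}
data = {"product": {"brand": {"name": "placeholder"}}}

# A failure while resolving name wipes out the whole product field.
result = propagate_null(data, ("product", "brand", "name"), nullable)
print(result)
```

A Problem returned as a union branch never enters this walk, because it is ordinary data with no entry in response.errors.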
Patterns
- Result union type for mutation outcome: union Result = Success | Problem as the canonical mutation return shape. Imports a Result/Either discriminant into the schema.
- Problem type for customer-actionable errors: schema-level Problem types used only for the narrow case where the end user is the action-taker.
- Error extensions-code for developer-actionable errors: error.extensions.code as the canonical channel for everything the front-end code acts on.
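A minimal Python sketch of how a caller might consume the Result union, assuming the mutation selected __typename and the RegisterProblem fields from the schema quoted earlier. The field_messages helper and the sample payload, including the English message texts, are hypothetical.

```python
from typing import Any, Dict, List


def field_messages(register_result: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each invalid field (EMAIL, PASSWORD) to its translated messages.

    Returns an empty dict for any branch other than RegisterProblem.
    """
    if register_result.get("__typename") != "RegisterProblem":
        return {}
    grouped: Dict[str, List[str]] = {}
    for item in register_result.get("invalidInputs") or []:
        if item is None:  # list items are nullable in the quoted schema
            continue
        grouped.setdefault(item["field"], []).append(item["message"])
    return grouped


# Hypothetical data payload for a failed register mutation.
data = {
    "register": {
        "__typename": "RegisterProblem",
        "title": "Please check your details",
        "invalidInputs": [
            {"field": "EMAIL", "message": "The email address is invalid"},
            {"field": "PASSWORD", "message": "The password is too short"},
        ],
    }
}

messages = field_messages(data["register"])
print(messages["EMAIL"])
```

The branch on __typename is the Result/Either discriminant in practice: the caller handles the Problem as ordinary data and can attach each message to the matching form field.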
Operational numbers disclosed
None. The post is a design-principles piece; no field counts, throughput, latency, or adoption metrics are given. (The companion Part-1 UBFF post covers those.)
Caveats
- Zero production-incident evidence. The post is normative, not retrospective. There is no claim of the form "before we adopted this, we saw X% of tickets from Y category". The decision framework is motivated by first-principles API design, not by a postmortem.
- No explicit federation or subgraph guidance. The post takes the UBFF single-service shape as given. In a federated world with subgraphs owned by different teams, the naming discipline around RegisterProblem / OrderProblem / CheckoutProblem has to be enforced at schema-review gates; the post doesn't address that.
- Translated messages in the schema couple i18n to resolvers. The Problem type quoted has title: String! as a translated message. That implicitly means resolvers need access to the requestor's locale and an i18n catalogue. The post does not explore the operational implications (catalogue hot-reload, missing-key behaviour, fallback language) of making this a first-class schema type.
- Client-side library support undiscussed. Union-type queries require ... on TypeName fragments in every mutation caller. The tooling ergonomics (Apollo Client, Relay, codegen) are not discussed.
- Contradiction with default null-propagation. Using Problem types means mutation errors don't propagate as nulls. This is called out as a feature (front-end gets rich data) but also a semantic divergence from the rest of the API surface: queries behave one way, error-modelled mutations behave another.
- Part of a planned series. The author signals future posts on Observability, Performance Optimization, Security, Tooling, Errors; this post is the Errors installment of that plan.
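The i18n coupling noted in the caveats can be made concrete with a minimal Python sketch. Everything here is an assumption for illustration: the in-memory CATALOGUE, the TITLE key, the English fallback, and the choice to return the raw key when no translation exists. The post prescribes none of these.

```python
from typing import Any, Dict, List

# Hypothetical in-memory catalogue; a real service would load this elsewhere.
CATALOGUE: Dict[str, Dict[str, str]] = {
    "en": {
        "TITLE": "Some inputs are invalid",
        "EMAIL": "The email address is invalid",
        "PASSWORD": "The password does not meet the security requirements",
    },
    "de": {"EMAIL": "Die E-Mail-Adresse ist ungültig"},
}
FALLBACK_LOCALE = "en"  # assumed fallback language


def translate(key: str, locale: str) -> str:
    """Look up key in the locale, then in the fallback, else return the key."""
    for candidate in (locale, FALLBACK_LOCALE):
        message = CATALOGUE.get(candidate, {}).get(key)
        if message is not None:
            return message
    return key


def build_register_problem(invalid_fields: List[str], locale: str) -> Dict[str, Any]:
    """Shape a RegisterProblem payload matching the schema quoted earlier."""
    return {
        "__typename": "RegisterProblem",
        "title": translate("TITLE", locale),
        "invalidInputs": [
            {"field": field, "message": translate(field, locale)}
            for field in invalid_fields
        ],
    }


problem = build_register_problem(["EMAIL", "PASSWORD"], "de")
print(problem["invalidInputs"][0]["message"])
```

In this sketch a missing German entry silently falls back to English, which is exactly the kind of behaviour a team would need to decide on once translated text becomes part of the schema contract.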
Source
- Original: https://engineering.zalando.com/posts/2021/04/modeling-errors-in-graphql.html
- Raw markdown: raw/zalando/2021-04-12-modeling-errors-in-graphql-882cd6ed.md
Related
- systems/graphql — the substrate
- systems/zalando-graphql-ubff — the UBFF this post prescribes errors for
- systems/rfc-7807-problem-details — the naming source
- concepts/graphql-error-extensions — the fallback channel
- concepts/error-action-taker-classification — the core decision rule
- concepts/problem-vs-error-distinction — the naming split
- concepts/graphql-error-propagation — why the choice is semantic
- concepts/schema-discoverability-gap-in-errors — the underlying gap
- patterns/result-union-type-for-mutation-outcome — the schema shape
- patterns/problem-type-for-customer-actionable-errors — when to use Problem
- patterns/error-extensions-code-for-developer-actionable-errors — when to use Error
- companies/zalando