ZALANDO 2022-02-16

Zalando — GraphQL persisted queries and Schema stability

Summary

A follow-up to Zalando's 2021 UBFF post, this article describes the two-layer discipline Zalando uses to get a stable, evolvable GraphQL schema at fashion-e-commerce scale: (1) GraphQL is effectively disabled in production, meaning production endpoints accept only persisted query IDs, not raw GraphQL, so the set of queries running against the graph is a finite, inspectable database; and (2) three custom directives (@draft, @component, and @allowedFor) encode a field lifecycle directly in the schema, letting the platform refuse to persist queries that reference not-ready fields and constrain which UI components may use experimental ones. The combined effect: developers keep full GraphQL expressiveness at build time, the graph keeps a machine-readable truth table for which fields are used in production, and the operator can break fields before they are locked in. Apollo's Automatic Persisted Queries are contrasted as a hash-caching optimisation; Zalando's version is a hash-enforced production contract.

Key takeaways

  1. Development-time GraphQL, production-time query IDs. Developers write GraphQL at build time for codegen, batching, and dev ergonomics. At merge-time to the main deployment branch, the UI build pipeline sends each query to the GraphQL service; the service hashes the normalised query (formatting + operation selection removed), returns an ID, and the UI bundle ships with the ID in place of the query text. In production, the request body is {"id": "a1b2c3", "variables": {...}} — no query field is accepted (Source: sources/2022-02-16-zalando-graphql-persisted-queries-and-schema-stability).

  2. "We disable GraphQL in production" is the load-bearing framing. The post opens by saying it sounds counter-intuitive — Zalando runs a GraphQL service but the production endpoint refuses GraphQL queries. That inversion is the whole point: everything downstream (observability, schema stability, safe breaking changes) follows from the property that the set of production queries is closed and known at persist time (Source: sources/2022-02-16-zalando-graphql-persisted-queries-and-schema-stability).

  3. The persisted-queries DB becomes a usage index. Because the runtime set of queries is finite and enumerated, the platform can answer: which fields are actually used in production? Which are not? Which query IDs touch field X? The post says this directly: "we know at any time what parts of the schema are used in production and what are not used in production" — and this enables "better monitoring and alerting for each individual query separately" and "tell if certain fields can have a breaking change because the field is no longer used or never used in production" (Source: sources/2022-02-16-zalando-graphql-persisted-queries-and-schema-stability).

  4. The @draft directive blocks persist of not-ready fields. Declared as:

directive @draft on FIELD_DEFINITION

Applied to a new field:

type Product {
  fancyNewField: FancyNewType @draft
}

The persist-time validation rule walks the query AST, looks up each referenced field's AST node, checks astNode.directives for @draft, and raises "Cannot persist draft field" if any field is draft. The post ships the JS validator as a minimal GraphQL validation rule implementation (Source: sources/2022-02-16-zalando-graphql-persisted-queries-and-schema-stability).
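A dependency-free sketch of the draft check described above. It is not the validator from the post, which is a graphql-js validation rule working on real AST nodes. Here the schema is modelled as a plain object and the query as a nested selection tree, and the names `draftSchema`, `findDraftFields`, and `assertPersistable` are hypothetical.

```javascript
// Toy schema model (an assumption for illustration): type name -> field name ->
// { type: name of the field's return type, directives: directive names on the field }.
const draftSchema = {
  Query: {
    product: { type: "Product", directives: [] },
  },
  Product: {
    name: { type: "String", directives: [] },
    fancyNewField: { type: "FancyNewType", directives: ["draft"] },
  },
  FancyNewType: {
    label: { type: "String", directives: [] },
  },
};

// Walk a selection tree ({ fieldName: subSelection | true }) and collect the
// "Type.field" coordinates of every selected field that carries @draft.
function findDraftFields(typeName, selection, found = []) {
  const fields = draftSchema[typeName] || {};
  for (const [fieldName, subSelection] of Object.entries(selection)) {
    const field = fields[fieldName];
    if (!field) continue; // unknown fields are left to ordinary validation
    if (field.directives.includes("draft")) {
      found.push(`${typeName}.${fieldName}`);
    }
    if (subSelection !== true) {
      findDraftFields(field.type, subSelection, found);
    }
  }
  return found;
}

// Refuse to persist when any selected field is still a draft.
function assertPersistable(selection) {
  const drafts = findDraftFields("Query", selection);
  if (drafts.length > 0) {
    throw new Error(`Cannot persist draft field: ${drafts.join(", ")}`);
  }
}
```

In this model, `assertPersistable({ product: { name: true } })` passes, while a selection that reaches `fancyNewField` throws at persist time, so the query never receives an ID.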

  5. Draft gives three guarantees simultaneously. Zalando names them: "(1) The field cannot be used in production. (2) We can break it at will, since we allow ONLY persisted queries in production. (3) We can merge it to the main branch (and even deploy it)." The third is the important one: draft does not rely on branch deployments, because the unreleased schema is already in mainline and already deployed. Cross-domain feature work (the GraphQL layer aggregates 3-5 other services at Zalando) therefore doesn't have to maintain parallel multi-repo feature branches, which the post explicitly calls "a nightmare in reality" at Zalando's microservice topology (Source: sources/2022-02-16-zalando-graphql-persisted-queries-and-schema-stability).

  6. @component + @allowedFor gate experimental fields to named UI components. The directive pair:

directive @component(name: String!) on QUERY
directive @allowedFor(componentNames: [String!]!) on FIELD_DEFINITION

Schema side:

type Product {
  fancyProp: String @allowedFor(componentNames: ["web-product-card"])
}

Query side:

query productCard @component(name: "web-product-card") {
  product {
    fancyProp
  }
}

At persist time, if any query references a field whose @allowedFor set does not include the query's @component(name:), persistence fails. This is how Zalando promotes a field from @draft to production-eligible with restricted blast radius: an experiment can ship to production but only one UI surface can exercise it, so a subsequent breaking change only forces that component to migrate (Source: sources/2022-02-16-zalando-graphql-persisted-queries-and-schema-stability).
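The component gate above can be sketched as a filter over the fields a query selects. This is a simplified model, not the post's implementation: `allowedForByField` and `componentViolations` are hypothetical names, the field list is assumed to have been extracted from the query already, and treating a query with no @component as a violation is an assumption the post does not confirm.

```javascript
// Toy schema model (an assumption): field coordinate -> component names taken
// from @allowedFor. Fields without an entry are unrestricted.
const allowedForByField = {
  "Product.fancyProp": ["web-product-card"],
};

// `fieldsUsed` lists the "Type.field" coordinates a query selects; `component`
// is the value of its @component(name:) directive, or null if it has none.
// Returns the coordinates the query is not allowed to use.
function componentViolations(component, fieldsUsed) {
  return fieldsUsed.filter((coordinate) => {
    const allowed = allowedForByField[coordinate];
    return allowed !== undefined && !allowed.includes(component);
  });
}
```

A persist step would reject the query whenever the returned list is non-empty.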

  7. The lifecycle is explicit: draft → allowedFor → stable. "When we first extend the GraphQL schema, we start with the draft annotation. Then we promote new fields to a restricted usage in production using the allowedFor annotation. After we finally have stabilized the schema, we remove all of these annotations and have a non-breaking contract in form of persisted queries." The directives are scaffolding on the path to stable, not permanent load-bearing structure. Once a field is stable, both markers are removed, and the field joins the "non-breaking contract" that persisted queries underwrite (Source: sources/2022-02-16-zalando-graphql-persisted-queries-and-schema-stability).

  8. Apollo's Automatic Persisted Queries is named as the peer. The post links to Apollo Server APQ documentation as the point of reference, noting only that "we took a different approach." The Apollo model caches queries by hash for bandwidth reduction (first request sends hash; if unknown, client retries with full query). Zalando's model refuses unknown hashes entirely, which is the whole point. Same mechanism, different enforcement mode: cache vs. gate (Source: sources/2022-02-16-zalando-graphql-persisted-queries-and-schema-stability).

  9. Zalando rejects branch-deployments for experimental schema work at their topology. The post explicitly considers and rejects the branch-deployment alternative: "At Zalando, we have microservices and the GraphQL layer is an aggregator from multiple other services. So, maintaining multiple feature branches across 3-5 projects for 1 or 2 product features isn't going to help any developer or team move smoothly. The complexity increases non-linearly as we mix different features that must work together." The directive-based lifecycle is the alternative that keeps the aggregator stack on one mainline (Source: sources/2022-02-16-zalando-graphql-persisted-queries-and-schema-stability).
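The draft → allowedFor → stable lifecycle from the takeaways above can be read as a function from a field's markers to the set of queries allowed to persist it. A minimal sketch, using a hypothetical field shape (`{ draft, allowedFor }`) and assuming that @draft wins when both markers are present, a case the post does not address:

```javascript
// Hypothetical field shape: { draft?: boolean, allowedFor?: string[] }.
function lifecycleStage(field) {
  if (field.draft) return "draft"; // assumption: @draft takes precedence
  if (Array.isArray(field.allowedFor)) return "restricted";
  return "stable";
}

// May a query tagged with @component(name: component) persist this field?
function mayPersist(field, component) {
  switch (lifecycleStage(field)) {
    case "draft":
      return false; // no query may persist a draft field
    case "restricted":
      return field.allowedFor.includes(component); // only the named components
    default:
      return true; // stable: both markers removed, no restriction
  }
}
```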

Systems / concepts / patterns extracted

Systems

No new system pages — the content extends the existing Zalando UBFF page with the persisted-queries + directive-lifecycle discipline, and adds a persisted-queries-related facet to GraphQL.

Concepts

  • GraphQL persisted queries: build-time query-text → stable-ID registration so runtime requests carry only the ID; Zalando's variant enforces ID-only in production.
  • Draft schema field: a not-ready-for-production marker on a GraphQL field definition; persist-time validation refuses queries that touch drafts.
  • GraphQL schema usage observability: the inspectability property that falls out of persisted-queries-only-in-prod. For every field, the platform knows which query IDs reference it, or that no production query does.
  • Component-scoped field access: the @component / @allowedFor directive pair that restricts an experimental field to named UI component(s), so subsequent breaking changes have a known, small migration surface.
  • Schema evolution: this post is a named datapoint for schema-evolution at the GraphQL-API layer (vs. the more common DB / Avro / Protobuf framing).
  • Backward compatibility: persisted-queries-only is how Zalando turns "non-breaking API contract" from an aspiration into an enforceable build-time check.
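The usage-observability concept above amounts to inverting the persisted-queries database. The sketch below assumes the field coordinates of each query were recorded at persist time; the query IDs, field names, and the helpers `buildUsageIndex` and `unusedFields` are hypothetical, since the post does not describe the storage layout.

```javascript
// Hypothetical contents of the persisted-queries DB: query id -> the
// "Type.field" coordinates that query selects.
const persistedFieldUsage = {
  a1b2c3: ["Query.product", "Product.name"],
  d4e5f6: ["Query.product", "Product.name", "Product.price"],
};

// Invert the DB into field coordinate -> sorted list of query ids using it.
function buildUsageIndex(usageByQueryId) {
  const index = new Map();
  for (const [queryId, coordinates] of Object.entries(usageByQueryId)) {
    for (const coordinate of coordinates) {
      if (!index.has(coordinate)) index.set(coordinate, []);
      index.get(coordinate).push(queryId);
    }
  }
  for (const ids of index.values()) ids.sort();
  return index;
}

// Schema fields that no persisted query references: under the ID-only rule,
// these are the candidates for a breaking change.
function unusedFields(schemaCoordinates, index) {
  return schemaCoordinates.filter((coordinate) => !index.has(coordinate));
}
```

The same index answers "which query IDs touch field X?" with a single `index.get(...)` lookup, which is what makes per-query monitoring and safe removal tractable.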

Patterns

  • Automatic Persisted Queries: the canonical register-and-swap pattern, in which a query hash replaces the query text on the wire. Zalando's variant and Apollo APQ are both instances; the distinction is whether an unknown hash triggers a retry with the full query, which is then cached (Apollo), or is rejected outright (Zalando).
  • Disable GraphQL in production: the counter-intuitive framing Zalando uses. The graph is a development-time artifact; the production endpoint is a lookup service over a closed set of registered queries.
  • Directive-based field lifecycle: encode field lifecycle stages (draft → component-scoped → stable) in the schema itself via custom directives, enforced by persist-time validation. The @draft + @component + @allowedFor triple is the canonical Zalando instance.
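The cache-versus-gate distinction between the two persisted-query variants can be shown with one registry and two policies for an unknown hash. This is a simplified model: the `{ id, query }` request shape and the `lookup` helper are hypothetical, and a real APQ server would also verify that the supplied query text matches the hash before registering it, which is omitted here.

```javascript
// One registry, two policies for a hash the server has never seen.
// mode is "cache" (APQ-style) or "gate" (Zalando-style).
function lookup(registry, mode, request) {
  const known = registry.get(request.id);
  if (known !== undefined) return { ok: true, query: known };

  if (mode === "cache" && request.query !== undefined) {
    // Cache: the retry carries the full query text, which is registered.
    registry.set(request.id, request.query);
    return { ok: true, query: request.query };
  }
  if (mode === "cache") {
    // Cache: signal the client to retry with the full query text.
    return { ok: false, error: "PersistedQueryNotFound" };
  }
  // Gate: unknown ids are refused, with or without query text.
  return { ok: false, error: "Unknown persisted query id" };
}
```

In gate mode the registry can only grow through the build-time persist step, which is what keeps the production query set closed.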

Operational numbers / mechanisms

  • Query ID derivation: hash of the normalised query (formatting + operation selection stripped). No hash algorithm disclosed.
  • Persist path: UI merge to main → build-time request to GraphQL service → ID returned → bundled into built UI artefact.
  • Production request body: {"id": "...", "variables": {...}}. No query key is accepted.
  • Validation rule (provided verbatim): walks each Field node, looks up parentType.getFields()[node.name.value], checks field.astNode.directives for @draft, and emits a GraphQLError("Cannot persist draft field").
  • Directive slate: @draft (field-definition), @component(name: String!) (query), @allowedFor(componentNames: [String!]!) (field-definition).
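The persist path and production request shape listed above can be sketched end to end with an in-memory registry. Several details are assumptions, because the post does not disclose them: SHA-256 as the hash, truncation to 12 hex characters, whitespace-only normalisation (which also ignores string literals), and the status codes. The names `persistQuery` and `handleProductionRequest` are hypothetical.

```javascript
const { createHash } = require("node:crypto");

// Stand-in for the persisted-queries database (the post does not name a store).
const persistedQueries = new Map();

// Assumption: normalisation only collapses whitespace. The post says formatting
// and operation selection are removed but gives no further detail.
function normalise(queryText) {
  return queryText.replace(/\s+/g, " ").trim();
}

// Build-time step: register the query and return the ID the UI bundle ships.
// Assumption: SHA-256, truncated for readability.
function persistQuery(queryText) {
  const normalised = normalise(queryText);
  const id = createHash("sha256").update(normalised).digest("hex").slice(0, 12);
  persistedQueries.set(id, normalised);
  return id;
}

// Production step: only { id, variables } is accepted.
function handleProductionRequest(body) {
  if ("query" in body) {
    return { status: 400, error: "Raw GraphQL is not accepted in production" };
  }
  const query = persistedQueries.get(body.id);
  if (query === undefined) {
    return { status: 404, error: "Unknown persisted query id" };
  }
  // A real service would execute `query` with the variables here.
  return { status: 200, query, variables: body.variables || {} };
}
```

Two differently formatted copies of the same query map to one ID, and a request that carries query text is refused even if that text happens to be registered.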

Caveats / gaps

  • No hash algorithm disclosed — the post says "just the hash of the normalized query" but doesn't name a specific algorithm (SHA-256, xxhash, etc.). Apollo APQ uses SHA-256 by convention; Zalando's choice is unspecified.
  • Normalisation semantics under-specified — the post names only "formatting and operation selection" as the things stripped. Fragment ordering, field ordering within a selection set, variable renaming, and comment handling are not addressed. Different normalisations give different IDs for semantically identical queries.
  • Storage of the persisted-queries DB is not described — the post refers to it as "a database of queries" without naming a backing store, replication model, or versioning scheme.
  • Persist-time auth / ownership — who can add entries to the persisted-queries DB? The post doesn't say. In a shared-ownership UBFF monorepo with 150+ contributors across 12+ domains, that is a real question.
  • No production numbers — no query count, no distribution of draft / @allowedFor / stable fields, no "fields safely broken because unused" count, no rollback anecdote.
  • No treatment of query deletion — presumably old IDs stop being referenced by newer UI bundles; the post doesn't describe whether entries are garbage-collected, versioned, or retained forever.
  • No SSR / BFF-to-BFF persistence story — the post is UI-client-centric. What about server-side GraphQL calls from the Rendering Engine or other BFFs? The Part 2 micro-frontends post mentions APQ-on-by-default for the Rendering Engine's GraphQL client, suggesting the same discipline extends there, but details are not in this post.
  • Borderline cases between @allowedFor and @draft — the post frames them as two stages, but a field could simultaneously be experimental and still evolving shape; the lifecycle doesn't discuss mid-flight transitions.
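The normalisation caveat above is easy to demonstrate. Assuming whitespace-only normalisation and SHA-256 (neither is confirmed by the post), reformatting a query keeps its ID stable, while reordering two fields in a selection set yields a different ID even though the result is semantically the same. `queryId` is a hypothetical helper.

```javascript
const { createHash } = require("node:crypto");

// Assumption: whitespace-only normalisation and SHA-256; the post specifies neither.
function queryId(queryText) {
  const normalised = queryText.replace(/\s+/g, " ").trim();
  return createHash("sha256").update(normalised).digest("hex");
}

const original = "query card { product { name price } }";
const reformatted = "query card {\n  product {\n    name\n    price\n  }\n}";
const reordered = "query card { product { price name } }";

const sameAfterFormatting = queryId(original) === queryId(reformatted); // true
const sameAfterReordering = queryId(original) === queryId(reordered); // false
```

Whether Zalando's normaliser treats the reordered query as identical is not stated; a stronger normaliser that sorts selections would collapse both into one ID.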

Why this ingest

This post turns out to be the first canonical source on the wiki for the persisted-queries + directive-lifecycle discipline at Zalando, filling a gap already flagged in UBFF's "Gaps in the public record" (where "Schema evolution / deprecation tooling — undisclosed" was explicitly listed). It also creates the patterns/automatic-persisted-queries page that was referenced but not previously canonicalised from Part 2 micro-frontends. The raw-file frontmatter had marked this as a batch-skip "pure marketing", but the 9,926-char body is in fact an API-design architecture post with two directive schemas, a working persist-time validation rule, a three-stage field lifecycle, and an explicit peer-comparison with Apollo APQ — well above the Tier-2 bar.

Source
