ZALANDO 2021-03-03

Zalando — How we use GraphQL at Europe's largest fashion e-commerce company

Summary

Rajesh Jain describes Zalando's migration from a fleet of per-surface Backend-for-Frontend (BFF) microservices to a single unified GraphQL service that acts as a Unified Backend-For-Frontends (UBFF) for all Web and mobile-app feature teams. The BFF fleet was adopted alongside Zalando's 2015 move to microservices and is structurally identical to SoundCloud's BFF and Netflix's Embracing the Differences. Development of the UBFF began in H1 2018, and the service has been in production since the end of 2018. By February 2021 the unified schema spans more than 12 domains in a monorepo shared by 150+ contributors, with 200+ developers across 25-30 feature teams consuming it. It serves >80% of Web and >50% of App use cases. The service runs on graphql-jit, Zalando's in-house open-source JIT-compiled GraphQL execution engine, and enforces a strict "No Business Logic" principle: the GraphQL layer aggregates and shapes data, while platform- and domain-specific logic lives in downstream presentation-layer backends. Operational concerns are addressed with Circuit Breakers, Timeouts, Retry, and (specifically called out) the Bulkhead pattern, applied by deploying separate instances of the same service per platform (Web vs mobile Apps) for fault isolation. The post names the UBFF as Zalando's answer to four BFF-at-scale pathologies rooted in Conway's Law: duplicated effort, fragmented security/auth, fragmented observability, and, most subtly, inconsistent customer experience across platforms when the same business logic is reimplemented in N BFFs. Four adoption levers are described: one-stop documentation with embedded GraphiQL and Voyager, a support chat, company-wide training with 150+ attendees, and schema-design consultation hours for new domains.

Key takeaways

  1. BFF-per-surface at 2015-era microservices scale produces five named pathologies. Adopting the BFF pattern surface-by-surface (Web product page, Web wishlist, App wishlist, App home, …) at Zalando's team count produced: (i) lack of balance between fast feature delivery and developer experience, (ii) duplication of effort across BFFs, (iii) inconsistent customer experience across platforms (the post's worked example: mobile shows delivery "5-9 Feb" while desktop shows "1-3 Feb" because different BFFs computed the window independently), (iv) fragmented security/auth, (v) fragmented observability. The root cause is named explicitly as Conway's Law: different BFF teams independently re-derive the same business logic (Source: sources/2021-03-03-zalando-how-we-use-graphql-at-europes-largest-fashion-e-commerce-company).

  2. The UBFF is one GraphQL service, not a federation. The post distinguishes Zalando's approach from Apollo Federation explicitly: "instead of having multiple Graphs connected via a library and gateway we have a single service at Zalando which connects all the domains in a single schema Graph." The named trade-off: give up per-domain deploy independence in exchange for unified tooling, single deployment, and single-point governance. This is the unified-monolith GraphQL end of the spectrum and the explicit counter-example to federated subgraph-per-domain (Source: sources/2021-03-03-zalando-how-we-use-graphql-at-europes-largest-fashion-e-commerce-company).

  3. Shared-ownership monorepo with contribution principles, not a centralised team. 12+ domain teams contribute to one repo under shared ownership guided by a documented contribution framework. Contributor count grew 50 → 150+ in 2020; adoption grew 70 → 200 developers in the same window. The pattern is Apollo's One Graph concept from Principled GraphQL, operationalised at Zalando scale without an Apollo-style federation runtime (Source: sources/2021-03-03-zalando-how-we-use-graphql-at-europes-largest-fashion-e-commerce-company).

  4. Entities are first-class citizens in the graph. Zalando built an Entity system on top of GraphQL: named entities like Product and Campaign are the first-class types domain teams contribute. The entity layer is a stable, cross-domain abstraction the schema is organised around — flagged in the post as the subject of its own upcoming article in the series. Zalando's adjective for the overall schema is "a dense graph" — deliberately tangled, with cross-entity navigation as the value proposition for clients (Source: sources/2021-03-03-zalando-how-we-use-graphql-at-europes-largest-fashion-e-commerce-company).

  5. Built on graphql-jit, an in-house open-source JIT compiler for GraphQL execution. Zalando replaced the reference graphql-js execution engine with graphql-jit, an open-source JIT-compiled GraphQL executor they built for performance optimisation. The repo is zalando-incubator/graphql-jit. That Zalando built and open-sourced a JIT engine is itself a scale signal: with 200+ developers' worth of schema density and >80% of Web use cases served by one service, the reference executor was evidently judged insufficient, although the post gives no benchmark (Source: sources/2021-03-03-zalando-how-we-use-graphql-at-europes-largest-fashion-e-commerce-company).

  6. "No Business Logic" principle at the GraphQL layer. Domain- and platform-specific logic is explicitly forbidden in the GraphQL layer. Instead, domain teams implement that logic in backend services behind the graph, which the post calls the "presentation layer". The rationale: keeping the aggregation layer business-logic-agnostic "allows domain specific backend APIs to steer domain or platform (Web vs. App) specific content on their own" and simplifies operational maintenance. Captured as patterns/business-logic-free-data-aggregation-layer (Source: sources/2021-03-03-zalando-how-we-use-graphql-at-europes-largest-fashion-e-commerce-company).

  7. Monorepo is a known "god component" design-smell risk, addressed via two orthogonal mechanisms. The post names the risk: a 12+ domain monorepo is a God Component (an architectural smell marked by excessive LOC or class count). The architectural mitigation is shared ownership plus explicit contribution principles. The operational mitigation is Reliability Patterns (Circuit Breakers, Timeouts, Retry) plus the Bulkhead pattern applied as "separate deployments for Web and mobile Apps". This is a deployment-level bulkhead: the same codebase is deployed separately per platform, so a Web-only regression cannot take the mobile app down and vice versa. Captured as patterns/per-platform-deployment-bulkhead (Source: sources/2021-03-03-zalando-how-we-use-graphql-at-europes-largest-fashion-e-commerce-company).

  8. Four-lever adoption programme supported growth from 50 to 150+ contributors and from 70 to 200 consumers in one year. The post names the four levers: (1) One-stop-shop Documentation: a single structured docs site (following Divio's doc framework) with embedded GraphQL editor, schema docs, Voyager for schema exploration, and practice exercises. (2) Support chat for user and contributor queries. (3) Trainings: one company-wide GraphQL adoption training had 150+ participants. (4) Consultation: the platform team provides schema-design consultation hours for new domains integrating into the graph (because schema design is hard even for developers who can use GraphQL). The named outcomes: contributors 50 → 150+ in 2020; consumers 70 → 200 across 25-30 feature teams (Source: sources/2021-03-03-zalando-how-we-use-graphql-at-europes-largest-fashion-e-commerce-company).

  9. Positioning against peers is explicit. The post names six peer approaches to a unified graph and positions Zalando among them: GitHub (single-graph GraphQL API covering repos, users, marketplace), Shopify (separate StoreFront + Admin unified graphs), Airbnb (working on unified schema, 2019 GraphQL Summit), Expedia (REST-to-graph migration, with the observation that "developers spent more time figuring out which service to call than shipping features"), Apollo Federation (library + gateway model), and Netflix (one-graph in the Studio ecosystem; see systems/netflix-enterprise-graphql-gateway). Zalando picks the single-service unified-graph lane explicitly; Apollo Federation is named as the alternative they are not using (Source: sources/2021-03-03-zalando-how-we-use-graphql-at-europes-largest-fashion-e-commerce-company).
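
The "No Business Logic" principle in takeaway 6, and the delivery-window inconsistency in takeaway 1, can be illustrated with a minimal sketch. Everything here is hypothetical: the post does not describe the presentation-layer services, their protocol, or any resolver code, so `presentation_backend`, `make_product_resolver`, and the field names are assumptions for illustration only. Python is used for brevity and is not the UBFF's implementation language.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a presentation-layer backend. In this sketch it
# owns the delivery-window rule; the post does not describe the real services.
def presentation_backend(product_id: str, platform: str) -> Dict[str, str]:
    # The business rule lives here, once, for every platform.
    return {"product_id": product_id, "delivery_window": "1-3 Feb"}


def make_product_resolver(backend: Callable[[str, str], Dict[str, str]]):
    """Build a resolver that only aggregates and shapes backend data."""

    def resolve_product(product_id: str, platform: str) -> Dict[str, str]:
        data = backend(product_id, platform)
        # Shape the response for the graph: rename fields only. No date
        # arithmetic and no platform-specific branching in this layer.
        return {
            "id": data["product_id"],
            "deliveryWindow": data["delivery_window"],
        }

    return resolve_product


resolve_product = make_product_resolver(presentation_backend)
web = resolve_product("sku-1", "web")
app = resolve_product("sku-1", "app")
```

Because the aggregation layer never computes the window itself, Web and App clients receive whatever the single backend rule produced, which is the consistency property the post argues per-surface BFFs lost.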

Systems

  • Zalando Unified Backend-For-Frontends (UBFF) GraphQL (new) — the single-service unified-schema GraphQL layer described in the post; in production since end of 2018; 12+ domains; 200+ consumers across 25-30 feature teams; serves >80% of Web and >50% of App use cases as of Feb 2021.
  • systems/graphql-jit (new) — Zalando's open-source JIT-compiled GraphQL execution engine (zalando-incubator/graphql-jit), used as the execution layer in the UBFF in place of the reference graphql-js implementation.
  • systems/graphql (new) — the query language itself; Facebook-developed, declarative data-fetching spec with a hierarchical + product-centric design philosophy; the substrate the UBFF is built on.
  • systems/apollo-federation (updated) — the explicit not-chosen alternative; Zalando's UBFF is a single service where Apollo Federation is library + gateway across N subgraphs.
  • systems/netflix-enterprise-graphql-gateway (updated) — Netflix's one-graph approach in the Studio ecosystem, named as a peer.
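
The general idea behind a compiled executor such as systems/graphql-jit can be sketched with a toy model: analyse the query shape once, then reuse a specialised function for every request instead of re-walking the query each time. This is not graphql-jit's implementation or API, which the post does not describe; the `Selection` type and both functions below are hypothetical, and real GraphQL execution also involves resolvers, arguments, types, and error handling that are omitted here.

```python
from typing import Any, Callable, Dict

# Toy selection set: field name -> nested selection, or None for a leaf field.
Selection = Dict[str, Any]


def interpret(selection: Selection, data: Dict[str, Any]) -> Dict[str, Any]:
    """Interpreter style: walk the selection tree on every execution."""
    result = {}
    for field, sub in selection.items():
        value = data[field]
        result[field] = value if sub is None else interpret(sub, value)
    return result


def compile_selection(
    selection: Selection,
) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Compiled style: walk the selection tree once, return a reusable function."""
    steps = []
    for field, sub in selection.items():
        child = None if sub is None else compile_selection(sub)
        steps.append((field, child))

    def execute(data: Dict[str, Any]) -> Dict[str, Any]:
        # Only the precomputed steps run per request.
        return {f: (data[f] if c is None else c(data[f])) for f, c in steps}

    return execute


query = {"product": {"name": None, "brand": {"name": None}}}
data = {
    "product": {
        "name": "Sneaker",
        "sku": "sku-1",
        "brand": {"name": "Acme", "id": 7},
    }
}
compiled = compile_selection(query)  # done once per distinct query
result = compiled(data)              # done once per request
```

Both paths return the same shaped result; the compiled path moves the query analysis out of the per-request work, which matters most when the same queries are executed repeatedly.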

Concepts

Patterns

Operational numbers disclosed

  • End of 2018: UBFF first production deploy.
  • February 2021 (snapshot date cited in the post, published 2021-03-03): figures below.
  • 12+ domain teams contributing to the monorepo.
  • 150+ contributors (up from 50 in 2020).
  • 200+ developers consuming GraphQL for feature work (up from 70 in 2020).
  • 25-30 feature teams served.
  • >80% of Web use cases served by UBFF.
  • >50% of App use cases served by UBFF.
  • 150+ attendees at Zalando's in-house GraphQL adoption training.
  • Separate Web + mobile deployments of the same UBFF service for bulkhead fault isolation.
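
The post names Circuit Breakers, Timeouts, and Retry alongside the per-platform bulkhead but shows no implementation. As a minimal sketch of one of those patterns, the circuit breaker below stops calling a failing backend for a cool-down period. The class, its parameters, and its thresholds are hypothetical and are not taken from Zalando's code.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the backend."""


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable for testing
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], object]) -> object:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("backend call skipped")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single further failure reopens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

A breaker like this protects one downstream dependency inside a process; the bulkhead in the last bullet above works at a different level, by keeping Web and App traffic in separate deployments so that a fault in one cannot exhaust the other.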

Caveats

  • Genre is architectural retrospective + advocacy, not production deep-dive. Numbers are team/adoption counts and percentages. No latency SLOs, no QPS, no p99, no resolver-fan-out depth, no cache hit rates, no specific outage stories. The post is Part 1 of a series the author promises on Observability, Performance Optimization, Security, Tooling, and Errors, each of which would be its own deep-dive.
  • The Entity system is named but not described. Named as "will be its own post in the series." The wiki's systems/zalando-graphql-ubff entry flags this as a known gap.
  • The "presentation layer" is a black box. The post says domain/platform logic lives there but does not specify: is it one service per domain? One per (domain × platform)? What protocol does the GraphQL layer call them with? What's the fan-out shape on a complex query? All unstated.
  • graphql-jit is mentioned as a named dependency but not benchmarked. We know it's JIT-compiled and that Zalando chose to build it for "performance optimization" — we don't know against what baseline, at what query shape, with what measured delta.
  • No disclosure of schema size. 12+ domains tells us about team count; it doesn't tell us the number of types, fields, or resolvers in the graph.
  • No disclosure of evolution tooling. Schema versioning, deprecation tracking, and breaking-change policies are all implied by a 3-year production history across 150+ contributors but not described.

Source
