PATTERN Cited by 1 source
Per-Platform Deployment Bulkhead¶
When to use¶
You have:
- One service with one codebase serving multiple distinct traffic classes (Web clients, mobile Apps, partner APIs, internal tooling).
- The traffic classes have different load profiles or different criticality — a problem on one should not take out the others.
- You want fault isolation between traffic classes without the operational cost of maintaining multiple codebases.
- The service is a potential single point of failure (e.g., a shared API gateway or aggregation layer that all clients depend on).
The pattern¶
Deploy the same service binary as N independent deployments, one per traffic class. Each deployment has its own pods/hosts/autoscaling group, its own resource budget, and serves one class of client:
Web clients mobile App clients
│ │
▼ ▼
┌───────────────────┐ ┌───────────────────┐
│ Service (Web │ │ Service (App │
│ deployment) │ │ deployment) │
│ ─ same code │ │ ─ same code │
│ ─ own pods │ │ ─ own pods │
│ ─ own auto- │ │ ─ own auto- │
│ scaling │ │ scaling │
└─────────┬─────────┘ └─────────┬─────────┘
│ │
└────────┬───────────────────┘
▼
shared downstream services
A bug, traffic spike, or resource exhaustion in one deployment does not affect the other.
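To make "same binary, N deployments" concrete, here is a minimal sketch (in Python, with invented names) of how a single codebase might select its per-platform configuration from an environment variable that the deployment infrastructure sets. The `PLATFORM` variable, the budget values, and the `configure` helper are all assumptions for illustration, not details from the source:

```python
import os

# One binary: which traffic class this deployment serves is decided by
# the infrastructure, not by the code. "PLATFORM" is a hypothetical
# env var set per deployment (e.g. in the pod spec).
PLATFORM = os.environ.get("PLATFORM", "web")  # "web" or "app"

# Per-platform resource budgets would normally live in deployment
# config; inlined here only to make the idea concrete.
BUDGETS = {
    "web": {"max_workers": 32, "cache_mb": 512},
    "app": {"max_workers": 64, "cache_mb": 256},
}

def configure():
    """Return this deployment's effective config for its traffic class."""
    budget = BUDGETS[PLATFORM]
    return {"platform": PLATFORM, **budget}

if __name__ == "__main__":
    print(configure())
```

The point of the sketch: nothing in the code routes between platforms. Each running deployment knows only its own class; isolation comes from there being two separate sets of processes.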
The standard Bulkhead pattern, deployed at service granularity¶
The Bulkhead pattern classically partitions a service internally — separate thread pools, separate connection pools, separate queues per client class. Per-platform deployment bulkhead is the same idea applied at the deployment level: rather than partition one running process, run N separate processes of the same binary.
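For contrast, a minimal sketch of the classic in-process bulkhead the paragraph describes: each traffic class gets its own bounded pool of slots, so one class exhausting its slots cannot starve the others. Names and limits are illustrative, not from the source, and note the limitation the next list calls out: both classes still share the same process:

```python
import threading

class Bulkhead:
    """Bounded concurrency compartment for one traffic class."""

    def __init__(self, max_concurrent):
        self._slots = threading.Semaphore(max_concurrent)

    def call(self, fn, *args):
        # Reject immediately when the compartment is full, rather than
        # queueing and letting one class's backlog grow unbounded.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return fn(*args)
        finally:
            self._slots.release()

# One compartment per traffic class, inside a single process.
bulkheads = {"web": Bulkhead(10), "app": Bulkhead(20)}

def handle(traffic_class, fn, *args):
    return bulkheads[traffic_class].call(fn, *args)
```

A flood of web requests exhausts only the `"web"` compartment; `"app"` calls still go through. But heap, GC pauses, and CPU are still shared, which is exactly the gap deployment-level bulkheading closes.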
Why go to deployment granularity instead of internal partitioning:
- Process-level isolation. An internal partition still shares the JVM heap, the GC, the CPU scheduler, the network stack. A separate deployment doesn't.
- Autoscaling can diverge. Each deployment scales independently on its own traffic — mobile's morning peak doesn't force Web to spin up too.
- Deploys can diverge (when needed). Normally the same binary deploys to both; in emergencies you can canary a fix to one and not the other.
- No internal partitioning to design. You get isolation for free from the infrastructure, not from in-process bookkeeping.
Canonical wiki instance: Zalando UBFF¶
Zalando's Unified Backend-For-Frontends (UBFF) GraphQL service applies this pattern explicitly (Source: sources/2021-03-03-zalando-how-we-use-graphql-at-europes-largest-fashion-e-commerce-company):
"We also introduced Bulkhead pattern to provide more Fault tolerance and isolation by deploying the application to serve traffic per platform (separate deployments for Web and mobile Apps)."
This sits alongside Circuit Breakers, Timeouts, and Retry as one of four reliability patterns Zalando uses to manage the operational risk of the UBFF being a single point of failure across all frontends.
What this buys you¶
- Web regression doesn't break App. A bad query, resolver bug, or traffic spike confined to Web-client queries stays in the Web deployment.
- Independent scaling. Mobile and Web traffic often peak at different times; each deployment scales on its own observed load.
- Independent resource budgets. Memory, CPU, pod counts can be tuned per platform based on actual load profile.
- Cheap canary surface. A risky change can be shipped to just the Web deployment, observed, then rolled to mobile.
What this costs¶
- Duplicate infrastructure. Twice the pods, twice the autoscaling cost floor. Cheap in cloud environments, not free.
- Config drift risk. Two deployments means two sets of environment variables, feature flags, scaling policies — one can diverge from the other accidentally.
- Cache coldness on deploy. Each deployment has its own cache population curve.
- Observability doubling. Dashboards and alerts need to cover both deployments; you're filtering every query by platform=web|app.
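The config drift risk above is mechanical to guard against. A hedged sketch, with invented keys and values: diff the two deployments' configs, allowing only an explicit set of keys to diverge intentionally (in practice the inputs would come from the deployment manifests or a config service, not inline dicts):

```python
def config_drift(web_cfg, app_cfg, allowed_diverging=("MAX_WORKERS",)):
    """Return keys whose values differ outside the intentionally-divergent set."""
    drift = {}
    for key in set(web_cfg) | set(app_cfg):
        if key in allowed_diverging:
            continue  # scaling knobs etc. are expected to differ
        if web_cfg.get(key) != app_cfg.get(key):
            drift[key] = (web_cfg.get(key), app_cfg.get(key))
    return drift

# Illustrative configs for the two deployments of the same binary.
web = {"FEATURE_X": "on", "MAX_WORKERS": "32", "TIMEOUT_MS": "500"}
app = {"FEATURE_X": "off", "MAX_WORKERS": "64", "TIMEOUT_MS": "500"}

print(config_drift(web, app))  # FEATURE_X diverged; MAX_WORKERS is allowed to
```

Running a check like this in CI turns accidental divergence between the Web and App deployments into a visible failure instead of a silent drift.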
When not to use¶
- Low-scale services where a single deployment is already adequate — the duplicate infra cost doesn't earn its keep.
- Services without meaningful traffic class differences — if all traffic behaves the same, splitting by client doesn't give fault isolation, it just doubles ops.
- Services where intra-process state is load-bearing (stateful caches, sessions) — now you need sticky routing on top.
Contrast with adjacent patterns¶
- Bulkhead (internal partitioning) — same motivation, smaller unit of isolation (thread pool / connection pool / queue). Cheaper but shares runtime failure modes.
- Cell-based architecture — fault isolation at the customer or tenant level, not the client-platform level. Larger unit, different partition axis.
- Per-platform codebase — each platform has its own service with its own code. More isolation, but loses the "one schema, one repo" commitment of patterns like UBFF.
Seen in¶
- sources/2021-03-03-zalando-how-we-use-graphql-at-europes-largest-fashion-e-commerce-company — Zalando's UBFF deploys two instances of the same GraphQL service — one for Web, one for mobile Apps — as a Bulkhead for fault isolation. First wiki instance.
Related¶
- patterns/unified-graphql-backend-for-frontend — the pattern this bulkheads
- systems/zalando-graphql-ubff
- concepts/blast-radius — the underlying reliability concept
- concepts/cell-based-architecture — the adjacent larger-grain partitioning pattern