PATTERN Cited by 1 source
Embedded SRE team from customer pull¶
The customer-pull pattern for provisioning an embedded SRE team: the product area whose code the embedded team will work on requests the team, rather than the SRE department unilaterally deploying it. The request goes to the SRE department, which provisions the team, and both sides pre-commit to dual KPIs: one from the product area (Availability via SLOs) and one from the SRE department (On-Call Health).
Problem¶
SRE departments in Phase 3 (concepts/sre-organizational-evolution) face a staffing and scope question: how do you deploy a hands-on SRE team into a product area without (a) recreating the lone-SRE-per-team anti-pattern, (b) forcing the embedded team to report into the product org and lose its reliability focus, or (c) being perceived as empire-building by the SRE department?
Pattern¶
Solve (c) — and make (a) and (b) less likely — by requiring customer pull:
- Product area senior management initiates the ask. The pitch comes from product-area leadership, not from SRE. Zalando names this directly: "the senior management of that department officially pitched for the creation of an Embedded SRE team."
- SRE department agrees to provision. Team reports to SRE dept; not a handoff into product-area headcount.
- Dual KPIs agreed up front. Zalando: "both SRE and product area management have aligned on a set of KPIs like Availability and On Call Health." The two KPIs anchor accountability across the two reporting chains.
- Scope finite and hands-on. Team works on the product area's critical paths only. Doesn't get pulled into company-wide platform work (that's Enablement's remit).
Why customer pull matters¶
- Demand is pre-validated. The product area signed up for the embedded team; SRE isn't imposing the intervention. Reduces early-stage friction.
- Product area commits resources. Senior management sponsorship means the embedded team gets access to product code review, design reviews, SLO conversations — which are political resources that only product-area leadership can grant.
- Escape hatch is legitimate. If the embedded model doesn't work for this product area, the pull framing lets the product area pull back without reading as SRE failure. Lower political cost on both sides.
- Signals department maturity. A product area asking for an embedded SRE team proves SRE Enablement has delivered enough value that other orgs now want more of it. Internal marketing in the best sense.
Implementation shape (Zalando, Checkout 2020)¶
- Prior state: SRE Enablement had collaborated repeatedly with Checkout on ad-hoc projects.
- Trigger: Checkout senior management pitched the embedded team.
- New team structure:
    - Reports to SRE department (not Checkout).
    - Scoped to Checkout product area.
    - KPIs:
        - Availability — from the product area's SLOs.
        - On-Call Health — paging rate + individual on-call frequency.
    - Mandate: more hands-on on code/tooling than Enablement.
- Role: "a voice for reliability within that product area, able to influence the prioritization of topics which ensure a reliable customer experience in our Fashion Store."
- Feedback to SRE dept: "The SRE department will also benefit from having a source providing precious feedback on whatever the department is trying to roll out to the wider engineering community."
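The On-Call Health KPI above is described as paging rate plus individual on-call frequency. The source does not give formulas, so the following is only a minimal sketch under assumed definitions: paging rate as average pages per shift, and on-call frequency as the fraction of observed weeks each engineer was on call. The `Shift` record, its fields, and the sample data are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Shift:
    """One on-call shift. All fields are illustrative assumptions."""
    engineer: str
    week: int   # index of the week the shift covers
    pages: int  # pages received during the shift


def paging_rate(shifts: list[Shift]) -> float:
    """Average number of pages per shift (assumed definition)."""
    if not shifts:
        return 0.0
    return sum(s.pages for s in shifts) / len(shifts)


def on_call_frequency(shifts: list[Shift]) -> dict[str, float]:
    """Fraction of the observed weeks each engineer spent on call."""
    observed_weeks = {s.week for s in shifts}
    if not observed_weeks:
        return {}
    weeks_by_engineer: dict[str, set[int]] = {}
    for s in shifts:
        weeks_by_engineer.setdefault(s.engineer, set()).add(s.week)
    return {
        engineer: len(weeks) / len(observed_weeks)
        for engineer, weeks in weeks_by_engineer.items()
    }


# Made-up rotation data for a four-week window.
shifts = [
    Shift("ana", week=1, pages=4),
    Shift("ben", week=2, pages=2),
    Shift("ana", week=3, pages=6),
    Shift("cy", week=4, pages=0),
]

rate = paging_rate(shifts)
frequency = on_call_frequency(shifts)
```

With this sample data, one engineer carries half of the weeks while the others carry a quarter each, which is the kind of imbalance an individual on-call frequency figure is meant to surface.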
The dual-KPI contract¶
The KPI choice is deliberate and load-bearing:
| KPI | Owner | Opposing pressure |
|---|---|---|
| Availability | Product area | Cuts against on-call health (alert more to catch more) |
| On-Call Health | SRE dept | Cuts against availability (mute noisy alerts) |
The tension forces the team toward root-cause work: the only way to hold both numbers is to fix underlying reliability issues rather than trade availability for on-call sanity or vice versa. This is the same error-budget logic applied to team structure.
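One way to picture the contract is as a check that passes only when both KPIs hold at once. The sketch below is illustrative and not Zalando's implementation; the threshold values, class names, and the pages-per-shift proxy for On-Call Health are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KpiTargets:
    """Agreed targets. The numbers used below are hypothetical."""
    min_availability: float     # owned by the product area
    max_pages_per_shift: float  # owned by the SRE department


@dataclass(frozen=True)
class KpiReading:
    availability: float
    pages_per_shift: float


def evaluate(reading: KpiReading, targets: KpiTargets) -> str:
    """Classify a reading against both KPIs of the contract."""
    availability_ok = reading.availability >= targets.min_availability
    on_call_ok = reading.pages_per_shift <= targets.max_pages_per_shift
    if availability_ok and on_call_ok:
        return "both KPIs met"
    if availability_ok:
        return "availability held at the cost of on-call health"
    if on_call_ok:
        return "on-call health held at the cost of availability"
    return "both KPIs missed"


targets = KpiTargets(min_availability=0.999, max_pages_per_shift=5)

# Alerting on everything: incidents are caught, but the rotation is paged heavily.
noisy = evaluate(KpiReading(availability=0.9995, pages_per_shift=9), targets)

# Muting noisy alerts: the rotation is quiet, but incidents slip through.
muted = evaluate(KpiReading(availability=0.9970, pages_per_shift=2), targets)

# Fixing the underlying issue: fewer incidents and fewer pages.
fixed = evaluate(KpiReading(availability=0.9995, pages_per_shift=3), targets)
```

Neither shortcut satisfies the check; only the reading that improves both numbers does, which mirrors the root-cause argument above.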
Preconditions¶
- SRE department exists. Phase 3 has landed; there is a reporting chain to belong to. Without this, the embedded team collapses into the lone-SRE-per-team anti-pattern.
- SLO portfolio exists in the product area. KPI negotiation requires Availability to be measurable per operation. Per-service SLOs don't suffice.
- On-call health can be measured. Rotation size and paging volume data must exist. See concepts/on-call-health-metric.
- Enablement primitives exist. The embedded team doesn't build its own alerting / tracing / SLO tools — it uses the department's. Shared tooling reduces duplication.
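The source does not say why per-service SLOs fall short. One plausible reason, sketched here with made-up numbers, is that a service-level aggregate can hide a breach in a low-volume but critical operation. The operation names, request counts, and the 99.9% target are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class OperationCounts:
    """Request counts for one operation over the SLO window (assumed shape)."""
    total: int
    failed: int


def availability(counts: OperationCounts) -> float:
    """Share of requests that succeeded; an empty window counts as available."""
    if counts.total == 0:
        return 1.0
    return (counts.total - counts.failed) / counts.total


def aggregate(operations: dict[str, OperationCounts]) -> OperationCounts:
    """Collapse per-operation counts into a single per-service figure."""
    return OperationCounts(
        total=sum(c.total for c in operations.values()),
        failed=sum(c.failed for c in operations.values()),
    )


TARGET = 0.999  # hypothetical availability target

# Made-up traffic: a high-volume healthy operation and a low-volume failing one.
checkout_service = {
    "view_cart": OperationCounts(total=1_000_000, failed=100),
    "place_order": OperationCounts(total=10_000, failed=200),
}

per_service = availability(aggregate(checkout_service))
per_operation = {name: availability(c) for name, c in checkout_service.items()}
breaching = sorted(name for name, a in per_operation.items() if a < TARGET)
```

In this example the per-service figure stays above the target while `place_order` sits at 98%, so only the per-operation view gives the two sides something concrete to negotiate the Availability KPI against.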
Caveats¶
- One customer at a time is manageable. Each additional customer multiplies the challenges. Zalando acknowledges: "we're planning to bootstrap another team, so that cannot be making things any easier." Hiring, consistency across embedded teams, and cross-team coordination all become harder.
- Cross-reporting-chain performance review is unsolved in the post. How an engineer on the embedded team is reviewed when their work serves two stakeholders is open.
- Pull can mask avoidance. If a product area never asks for an embedded team despite reliability issues, customer pull alone won't fix it. It may need to be combined with push mechanisms (e.g. PRR findings that recommend embedding).
- Zalando's data point is new. At publication time (2021-10) the team was recently formed; long-term outcomes not yet reported.
Seen in¶
- sources/2021-10-14-zalando-tracing-sres-journey-part-iii — Checkout senior management pitched for the embedded SRE team in 2020; team reports to SRE dept with dual-KPI alignment (Availability + On-Call Health); hands-on in Checkout code and tooling.