
CONCEPT Cited by 1 source

Embedded SRE team

An Embedded SRE team is a full SRE team that sits within a product area but reports up the SRE department chain. It is distinct from the lone-SRE-per-team anti-pattern (one engineer embedded into a product team, who degrades into the team's Ops engineer) and from SRE Enablement (a platform team that ships capabilities but does no hands-on work within a product area).

Definition

Zalando's 2020 instantiation, pitched by the Checkout senior management rather than by the SRE department:

"After another collaboration between SRE and the Checkout teams, the senior management of that department officially pitched for the creation of an Embedded SRE team. [...] The Embedded team will report to the SRE department, and both SRE and product area management have aligned on a set of KPIs like Availability and On Call Health." (sources/2021-10-14-zalando-tracing-sres-journey-part-iii)

Four defining properties:

  1. Full team, not individual. The team has its own engineers, its own backlog, its own charter — not a single SRE sprinkled across the product org.
  2. Reports to the SRE department. Performance evaluation, headcount, role definition, hiring stay with SRE. Prevents the classic lone-SRE pathology where the individual is pulled into product work and loses reliability focus.
  3. Scoped to one product area. Unlike Enablement (company-wide scope, ship tools to everyone), an Embedded team's scope is one product domain. This lets them be hands-on in the product's code and tooling — "more hands-on on the code and tooling used within the product development team" — while keeping the scope finite.
  4. Dual KPI alignment with the product area. The team is accountable on agreed KPIs to both the SRE dept (their reporting chain) and product area management (their customer). Zalando names two: On-Call Health + Availability (driven by the product area's SLOs).
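
To make property 4 concrete, here is a minimal sketch of how the two agreed KPIs might be reported side by side. It is an illustration only: the names `KpiTargets`, `KpiReport`, and `build_report`, the use of "pages per on-call shift" as the on-call health measure, and the example thresholds are all assumptions, not definitions from the Zalando post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KpiTargets:
    """Targets agreed by SRE dept and product-area management (hypothetical)."""
    availability_slo: float      # e.g. 0.999, driven by the product area's SLOs
    max_pages_per_shift: float   # assumed proxy for On-Call Health


@dataclass(frozen=True)
class KpiReport:
    availability: float
    pages_per_shift: float
    availability_ok: bool
    on_call_ok: bool


def build_report(good_requests: int, total_requests: int,
                 pages: int, shifts: int, targets: KpiTargets) -> KpiReport:
    """Compute both KPIs for one review period and compare them to targets."""
    if total_requests <= 0 or shifts <= 0:
        raise ValueError("total_requests and shifts must be positive")
    availability = good_requests / total_requests
    pages_per_shift = pages / shifts
    return KpiReport(
        availability=availability,
        pages_per_shift=pages_per_shift,
        availability_ok=availability >= targets.availability_slo,
        on_call_ok=pages_per_shift <= targets.max_pages_per_shift,
    )
```

Reporting both values in one structure reflects the contract described above: neither KPI is owned by only one side, so both are reviewed together.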

How it differs from adjacent shapes

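| Shape                        | Reports to   | Scope               | Hands-on?                |
|------------------------------|--------------|---------------------|--------------------------|
| Lone SRE per team            | Product team | One team            | Yes, becomes team Ops    |
| SRE team per product cluster | SRE dept     | One product cluster | Medium                   |
| Embedded SRE team (this)     | SRE dept     | One product area    | Yes, explicitly hands-on |
| SRE Enablement               | SRE dept     | Company-wide        | No, ships platform       |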

The distinction between "embedded team" and "team-per-product-cluster" (Zalando 2016 Part I) is real but subtle:

  • Part I's team-per-product-cluster was attempted in the grassroots phase and stalled — it covered multiple services across a cluster but did not embed in product code/tooling.
  • Part III's Embedded SRE team is post-department, scoped to a single product area, and explicitly hands-on in the product's code path.

Why it works in Phase 3

An Embedded team only becomes viable after the SRE department exists because:

  • Reporting-chain contract requires a department to report to. Without a department, the embedded engineers end up reporting to product management, which reverts to lone-SRE Ops.
  • Enablement team provides the platform the embedded team uses. The embedded SRE team doesn't reinvent observability, alerting, or incident process — it consumes the Enablement team's primitives (concepts/adaptive-paging, concepts/multi-window-multi-burn-rate, systems/zalando-service-level-management-tool) and applies them inside the product area.
  • Dual-KPI contract requires a KPI portfolio. Without an SRE KPI portfolio (concepts/sre-kpi-portfolio) that both the SRE dept and product-area management can agree on, the dual-accountability model has nothing to anchor on.
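
As an illustration of the kind of Enablement primitive an embedded team consumes rather than rebuilds, here is a minimal sketch of a multi-window, multi-burn-rate check. The window pairs and thresholds (14.4 over 1h/5m, 6 over 6h/30m) are the commonly cited values from the Google SRE Workbook, not confirmed Zalando settings, and the names `BurnRateRule`, `burn_rate`, and `firing_rules` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str    # window that shows the burn is sustained
    short_window: str   # window that shows the burn is still happening
    threshold: float    # burn-rate multiple that triggers the rule


# Assumed rule set; values follow the commonly cited SRE Workbook example.
RULES = (
    BurnRateRule("1h", "5m", 14.4),
    BurnRateRule("6h", "30m", 6.0),
)


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("slo must be below 1.0")
    return error_ratio / budget


def firing_rules(error_ratios: dict[str, float], slo: float) -> list[BurnRateRule]:
    """Return rules whose long AND short windows both exceed the threshold."""
    return [
        rule for rule in RULES
        if burn_rate(error_ratios[rule.long_window], slo) >= rule.threshold
        and burn_rate(error_ratios[rule.short_window], slo) >= rule.threshold
    ]
```

Requiring both windows to exceed the threshold is what keeps the alert from firing on a brief spike (short window only) or on an incident that has already recovered (long window only).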

Benefits

  • Deep product context. The team can participate in design reviews, make code changes, and run load tests on the product's specific critical paths. Enablement can't do this at scale.
  • Voice for reliability in product prioritisation. Zalando names this directly: "able to influence the prioritization of topics which ensure a reliable customer experience."
  • Listening post for the SRE department. The embedded team feeds real-world usage signals back to SRE dept, shaping what Enablement builds next.

Failure modes

  • Scope creep into product delivery. Without discipline, the embedded team ends up implementing product features because the product team is understaffed. The SRE dept reporting chain is the guardrail.
  • Divergence from Enablement's tools. If the embedded team builds its own reliability tooling rather than using Enablement's, the department fragments. Shared tooling should be the default.
  • Dual-KPI disagreement. If the product area prioritises availability and the SRE dept prioritises on-call health, the embedded team gets pulled between them. The KPIs must be agreed in writing and reviewed periodically.
  • Hiring becomes harder. Zalando acknowledges: "hiring was always a challenge [...] now we're planning to bootstrap another team, so that cannot be making things any easier." An embedded team still has the SRE-profile hiring filter, so a second team to staff adds to the demand on an already constrained hiring pipeline.

Caveats

  • Zalando's embedded-team instance (Checkout) was new at publication time (2021-10). The post describes it as "we're still figuring out most things as we go along". Long-term durability is not proven in this post.
  • Whether the shape scales to N product areas — as opposed to being a one-off — is not disclosed. Zalando only has one instance documented here.
  • The post does not discuss compensation or career-ladder mechanics for embedded SREs. How performance review works for a team that reports to the SRE dept but is also accountable to product-area management on KPIs is an open question, not a solved one.

Seen in

  • sources/2021-10-14-zalando-tracing-sres-journey-part-iii: Zalando's 2020 Embedded SRE team for Checkout, which was requested by Checkout senior management, reports to the SRE dept, carries dual KPIs (Availability + On-Call Health), and is hands-on in product code and tooling. First documented embedded SRE team at Zalando.