CONCEPT Cited by 1 source

Critical business operation

A Critical Business Operation (CBO) is a named, top-level user-facing business action (checkout, item view, order placement, sign-in, add-to-cart) that is elevated to a first-class alerting + SLO primitive. The CBO's error rate, latency, and throughput — not the individual backing service's — are what get monitored, alerted on, and held to SLO.

Definition

Zalando's instantiation:

"This alert handler monitors the error rate of what we call Critical Business Operations (CBO) and when it is triggered it uses the tracing data to determine where the error comes from across the entire distributed system, and pages the team that is closest to the problem."

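The quoted behaviour (use tracing data to locate the error, then page the closest team) can be sketched as a walk down the trace tree. This is only an illustration: the source does not describe Zalando's actual algorithm, so the "deepest errored span" heuristic, the `Span` type, the `OWNERS` map, and all service and team names below are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    service: str
    operation: str
    error: bool = False
    children: List["Span"] = field(default_factory=list)


# Hypothetical service-to-team ownership map (illustrative names only).
OWNERS: Dict[str, str] = {
    "checkout-api": "team-checkout",
    "payment-gateway": "team-payments",
    "fraud-check": "team-fraud",
}


def deepest_error_span(span: Span) -> Optional[Span]:
    """Return the deepest errored span reachable through errored spans only."""
    if not span.error:
        return None
    for child in span.children:
        found = deepest_error_span(child)
        if found is not None:
            return found
    # No errored child: treat this span as the origin of the error.
    return span


def team_to_page(root: Span) -> Optional[str]:
    """Pick the owning team of the span where the error appears to originate."""
    culprit = deepest_error_span(root)
    if culprit is None:
        return None
    return OWNERS.get(culprit.service, "unowned")


# A failed checkout whose error originates in the payment gateway.
trace = Span("checkout-api", "POST /checkout", error=True, children=[
    Span("pricing", "GET /price"),
    Span("payment-gateway", "POST /charge", error=True, children=[
        Span("fraud-check", "POST /score"),
    ]),
])
print(team_to_page(trace))  # team-payments
```

The alert fires on the CBO (the root span), but the page goes to the team owning the span where the error chain ends, not to the team owning the entry point.
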
Three defining properties:

  1. CBO is the alertable unit, not the service. One CBO ("checkout") maps to a set of distributed-tracing root spans whose parent-child graph spans many microservices.
  2. Error rate is measured at the CBO level. A CBO fails when the root span carries an error tag or a 5xx status, regardless of which service inside the trace caused it.
  3. Per-CBO SLO, alerting, and dashboards become the canonical way to talk about reliability; per-service metrics are diagnostic but not primary.

Why CBOs are load-bearing

Classical service-centric observability breaks down at scale:

  • Service A can be 99.99% available while checkout is broken. If 3 of the 20 services on the checkout path are degraded to 99.5% each, every service can still report green against its own SLO, yet the user-visible compound failure rate is ~1.5% (1 - 0.995^3 ≈ 1.49%, assuming independent failures).
  • User journeys cross team boundaries. "Checkout" touches inventory, pricing, payments, fraud, notifications; no single team owns the compound success rate.
  • Service-level alerting produces noise and misses. Alerts fire for services not on any hot path; outages that matter hide in compound rates nobody tracks.
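
The compound-rate arithmetic in the first bullet above can be checked with a few lines. This is a minimal sketch that assumes every checkout request needs all services on the path and that failures are independent. The 99.99% figure for the 17 healthy services is borrowed from the "Service A" example and is an assumption, not a number the source gives for them.

```python
def compound_failure_rate(availabilities):
    """Failure rate of a request that needs every listed service to succeed.

    Assumes independent failures, so success probabilities multiply.
    """
    success = 1.0
    for availability in availabilities:
        success *= availability
    return 1.0 - success


degraded = [0.995] * 3    # the three degraded services from the example
healthy = [0.9999] * 17   # assumed availability of the remaining services

# Only the degraded services: 1 - 0.995**3
print(f"{compound_failure_rate(degraded):.4f}")  # 0.0149

# All 20 services on the path; slightly worse than the degraded three alone.
print(f"{compound_failure_rate(degraded + healthy):.4f}")
```

Each individual availability stays at or above 99.5%, yet roughly one checkout in 67 fails, which is the gap a CBO-level error rate is meant to expose.
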

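Defining CBOs as first-class lets the organization monitor, alert on, and hold to SLO the compound, user-visible outcome rather than each backing service in isolation.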

Relationship to the trace

A CBO maps to a root span identity pattern plus some attribute filters:

  CBO "checkout_v2" :=
    root_span.operation_name == "POST /checkout"
    AND root_span.service == "checkout-api"
    AND root_span.tag.version >= 2

The error rate of the CBO is the fraction of root spans matching the pattern that carry error=true or http.status >= 500. A CBO catalogue is usually small (single-digit to low hundreds of entries); its entries are curated business-critical operations, not an auto-discovered enumeration.
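
The definition and the error-rate rule above can be sketched over in-memory root spans. The `RootSpan` type and the function names are hypothetical, and the tag keys `error` and `http.status_code` follow the OpenTracing standard tags, which is an assumption about how the fleet is instrumented.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class RootSpan:
    operation_name: str
    service: str
    tags: Dict[str, object] = field(default_factory=dict)


def matches_checkout_v2(span: RootSpan) -> bool:
    """Root-span identity pattern plus attribute filter for the CBO."""
    return (
        span.operation_name == "POST /checkout"
        and span.service == "checkout-api"
        and int(span.tags.get("version", 0)) >= 2
    )


def is_failure(span: RootSpan) -> bool:
    """A root span fails if it is error-tagged or carries a 5xx status."""
    status = int(span.tags.get("http.status_code", 0))
    return span.tags.get("error") is True or status >= 500


def cbo_error_rate(spans: Iterable[RootSpan]) -> Optional[float]:
    """Fraction of matching root spans that failed; None if nothing matched."""
    matched = [s for s in spans if matches_checkout_v2(s)]
    if not matched:
        return None
    return sum(is_failure(s) for s in matched) / len(matched)


spans = [
    RootSpan("POST /checkout", "checkout-api", {"version": 2, "http.status_code": 200}),
    RootSpan("POST /checkout", "checkout-api", {"version": 2, "http.status_code": 503}),
    RootSpan("POST /checkout", "checkout-api", {"version": 3, "error": True}),
    RootSpan("POST /checkout", "checkout-api", {"version": 2, "http.status_code": 201}),
    # Ignored: wrong version and wrong operation respectively.
    RootSpan("POST /checkout", "checkout-api", {"version": 1, "error": True}),
    RootSpan("GET /item", "catalog-api", {"version": 2, "http.status_code": 500}),
]
print(cbo_error_rate(spans))  # 0.5
```

Only root spans are inspected: which downstream service caused the failure does not change whether the CBO counts as failed.
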

CBOs vs Google's SLI/SLO model

Google's SRE-book SLI framework has always allowed per-endpoint or per-user-journey SLIs, and Google's recent user-journey SLO writings align with CBO thinking. The Zalando term Critical Business Operation emphasises two things the generic SRE-book framing soft-pedals:

  • Business-critical curation: CBOs are explicitly picked by product + SRE together as the operations whose reliability matters to revenue / customer trust.
  • Cross-service compound measurement: the CBO's success rate necessarily crosses service boundaries and must be measured at the root span — this makes OpenTracing / OpenTelemetry a hard prerequisite.

Dependencies

  • Distributed tracing deployed fleet-wide with consistent root-span identification — see OpenTracing semantic conventions.
  • CBO catalogue maintained by product + SRE, typically a small curated list rather than an auto-discovered one.
  • Error taxonomy agreed per CBO — what counts as a "checkout failure"? HTTP 5xx? User-abandoned after a 5xx? 5xx including those retried successfully by the client? This is typically the hardest design choice.
  • Alerting + SLO tooling aware of CBO as a dimension: vanilla per-service dashboards are insufficient, so the CBO needs a first-class definition in the alerting platform.
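
The error-taxonomy question in the list above changes the measured rate on identical traffic. The sketch below applies three candidate definitions to the same attempts. The `Attempt` fields, the policy names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Attempt:
    status: int
    retried_ok: bool = False  # a client retry later succeeded
    abandoned: bool = False   # the user gave up after this response


def any_5xx(a: Attempt) -> bool:
    return a.status >= 500


def unrecovered_5xx(a: Attempt) -> bool:
    return a.status >= 500 and not a.retried_ok


def abandoned_after_5xx(a: Attempt) -> bool:
    return a.status >= 500 and a.abandoned


POLICIES: Dict[str, Callable[[Attempt], bool]] = {
    "any_5xx": any_5xx,
    "unrecovered_5xx": unrecovered_5xx,
    "abandoned_after_5xx": abandoned_after_5xx,
}


def failure_rates(attempts: List[Attempt]) -> Dict[str, float]:
    """Failure rate of the same attempts under each candidate definition."""
    if not attempts:
        return {name: 0.0 for name in POLICIES}
    n = len(attempts)
    return {name: sum(p(a) for a in attempts) / n for name, p in POLICIES.items()}


attempts = (
    [Attempt(200)] * 7
    + [Attempt(503, retried_ok=True)]
    + [Attempt(500, abandoned=True)]
    + [Attempt(502)]
)
print(failure_rates(attempts))
```

On this sample the three definitions give 30%, 20%, and 10% respectively, so the choice has to be agreed per CBO before an SLO target means anything.
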

Downstream primitives

Seen in

  • Zalando names CBO as the alertable primitive underlying Adaptive Paging and Symptom-Based Alerting. Quote: "monitors the error rate of what we call Critical Business Operations."
  • sources/2022-04-27-zalando-operation-based-slos: technical deep-dive. Names the origin of the CBO catalogue: the internal "User Functions" list (drawn up by SREs and experienced engineers for Cyber-Week load-testing work, ordered by revenue impact) was renamed "Critical Business Operation" to encompass non-user operations (e.g. SRE's own "Ingest Metrics" and "Query Traces" CBOs). Introduces the senior-manager ownership model: each CBO's SLO is signed off by the Director / VP owning the customer experience, not by any component service's team. Also names per-CBO error-budget tracking across three 28-day windows in the new Service Level Management Tool. Ties the "transport-agnostic SLI" shift (OpenTracing error tag rather than 5xx-rate) to CBOs crossing protocol boundaries, so graceful-degradation fallbacks can still register as CBO failures.
  • CBO as the priority-class assignment axis for admission control. Order confirmations are canonical Zalando CBOs (SLO-protected, revenue-linked); marketing / brand-alert / campaign notifications are non-CBO bulk traffic. The Communication Platform's three-tier priority system (P1/P2/P3) maps directly onto the CBO/non-CBO distinction: the platform's Stream Consumer assigns per-event-type AIMD coefficients such that P1 (CBO-carrying) event types barely slow under congestion while P3 contracts sharply. Canonical instance of the CBO classification surfacing in an admission-control coefficient table rather than only in SLO definitions and paging policy; the business taxonomy thus reaches into runtime rate control (concepts/per-priority-aimd-coefficients, patterns/priority-differentiated-load-shedding).
  • CBOs as probe-scenario scoping unit. Zalando's 2024 e2e test probe tier scopes its probe scenarios 1-to-1 to CBOs: home→gender→product, catalog→filter→product, product→size→cart→checkout. Declared growth path: "include more of our Critical Business Operations (CBOs) and we also [are] looking at extending this idea to our mobile apps." Canonical wiki instance of CBOs as a probe-level scoping unit that ties browser-altitude synthetic monitoring to the same symptom-primitive as trace-derived CBO error-rate alerts. The probe surfaces a CBO failure mode (frontend interactivity crash from React hydration breakdown) that the trace-derived CBO alert misses because HTTP 200 still flows.
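
The per-CBO error-budget tracking mentioned in the operation-based-SLOs entry above comes down to simple arithmetic per window. The sketch below is an illustration only: the 99.9% target, the request counts, and the function name are invented, and the source does not say how the Service Level Management Tool arranges its three 28-day windows.

```python
def error_budget(slo_target: float, total: int, failed: int) -> dict:
    """Error-budget usage for one window of a CBO.

    slo_target is the required success ratio, e.g. 0.999 for 99.9%.
    """
    allowed = (1.0 - slo_target) * total  # failures the SLO tolerates
    consumed = failed / allowed if allowed > 0 else float("inf")
    return {
        "allowed_failures": allowed,
        "consumed": consumed,          # 1.0 means the budget is fully spent
        "remaining": 1.0 - consumed,   # negative once the SLO is breached
    }


# Three hypothetical 28-day windows for one CBO: (total requests, failed requests).
windows = [(1_000_000, 250), (1_200_000, 900), (900_000, 1_100)]
for index, (total, failed) in enumerate(windows, start=1):
    report = error_budget(0.999, total, failed)
    state = "breached" if report["remaining"] < 0 else "within budget"
    print(f"window {index}: {report['consumed']:.0%} of budget used, {state}")
```

Because the budget belongs to the CBO rather than to a service, the same numbers apply no matter which component on the path produced the failures.
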