CONCEPT Cited by 1 source

Critical business operation

A Critical Business Operation (CBO) is a named, top-level user-facing business action (checkout, item view, order placement, sign-in, add-to-cart) that is elevated to a first-class alerting + SLO primitive. The CBO's error rate, latency, and throughput — not the individual backing service's — are what get monitored, alerted on, and held to SLO.

Definition

Zalando's instantiation:

"This alert handler monitors the error rate of what we call Critical Business Operations (CBO) and when it is triggered it uses the tracing data to determine where the error comes from across the entire distributed system, and pages the team that is closest to the problem." (Source: sources/2021-09-20-zalando-tracing-sres-journey-part-ii)

Three defining properties:

  1. CBO is the alertable unit, not the service. One CBO ("checkout") maps to a set of distributed-tracing root spans whose parent-child graph spans many microservices.
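  2. Error rate is measured at the CBO level. A CBO request fails when the root span carries an error tag or a failing HTTP status (5xx in the definition under "Relationship to the trace"; the exact taxonomy is agreed per CBO), regardless of which service inside the trace caused it.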
  3. Per-CBO SLO, alerting, and dashboards become the canonical way to talk about reliability; per-service metrics are diagnostic but not primary.
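The quoted alert handler "uses the tracing data to determine where the error comes from". One plausible reading, sketched below, is to follow errored spans from the root downwards and page the owner of the deepest one. The `Span` shape, the `OWNERS` mapping, and the team names are all hypothetical, and the source does not describe the actual traversal Zalando uses.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    service: str
    parent_id: Optional[str] = None  # None marks the root span
    error: bool = False


def deepest_error_span(spans: List[Span]) -> Optional[Span]:
    """Follow errored children down from the root; return the last errored span."""
    children: Dict[Optional[str], List[Span]] = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)
    roots = children.get(None, [])
    if not roots or not roots[0].error:
        return None  # the CBO request did not fail
    current = roots[0]
    while True:
        errored = [c for c in children.get(current.span_id, []) if c.error]
        if not errored:
            return current
        current = errored[0]  # simplification: follow the first errored child


# Hypothetical service-to-team ownership table.
OWNERS = {
    "checkout-api": "team-checkout",
    "payments": "team-payments",
    "fraud": "team-fraud",
}


def team_to_page(spans: List[Span]) -> Optional[str]:
    culprit = deepest_error_span(spans)
    return OWNERS.get(culprit.service) if culprit else None


trace = [
    Span("1", "checkout-api", None, error=True),
    Span("2", "payments", "1", error=True),
    Span("3", "fraud", "2", error=False),
    Span("4", "inventory", "1", error=False),
]
paged = team_to_page(trace)
```

In this sample trace the root and the payments span are errored while fraud is not, so the sketch selects the payments owner rather than the team that owns the root span.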

Why CBOs are load-bearing

Classical service-centric observability breaks down at scale:

  • Service A can be 99.99% available while checkout is broken. If 3 of 20 services on the checkout path are degraded (each at 99.5%) but still within their own SLOs, every service reports green, yet the user-visible compound failure rate is ~1.5% (1 − 0.995³, assuming independent failures and the other 17 services near 100%).
  • User journeys cross team boundaries. "Checkout" touches inventory, pricing, payments, fraud, notifications; no single team owns the compound success rate.
  • Service-level alerting produces noise and misses. Alerts fire for services not on any hot path; outages that matter hide in compound rates nobody tracks.
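The compound-rate arithmetic in the first bullet can be checked with a few lines. This is a minimal sketch that assumes failures are independent and that the 17 healthy services never fail; both are simplifications made for illustration, not claims from the source.

```python
from math import prod


def compound_failure_rate(availabilities):
    """Failure probability of a request that needs every service to succeed,
    assuming the services fail independently of each other."""
    return 1.0 - prod(availabilities)


# 3 degraded services at 99.5%; the other 17 are assumed perfect for simplicity.
path = [0.995] * 3 + [1.0] * 17
rate = compound_failure_rate(path)
print(f"{rate:.4%}")  # about 1.49%
```

Replacing the `1.0` entries with realistic availabilities (for example `0.9999`) only pushes the compound rate higher, which is the point of measuring at the CBO rather than per service.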

Defining CBOs as first-class lets the organization:

Relationship to the trace

A CBO maps to a root span identity pattern plus some attribute filters:

  CBO "checkout_v2" :=
    root_span.operation_name == "POST /checkout"
    AND root_span.service == "checkout-api"
    AND root_span.tag.version >= 2

The error rate of the CBO is the fraction of root spans matching the pattern that carry error=true or http.status >= 500. A CBO catalogue is usually small (single-digit to low-hundreds); they are curated business-critical operations, not an auto-discovered enumeration.
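The definition above translates fairly directly into a matcher plus a failure predicate. The sketch below assumes root spans are available as simple records with a `tags` dictionary; `RootSpan`, `is_checkout_v2`, `is_failed`, and `cbo_error_rate` are illustrative names, not part of any tracing library.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Optional


@dataclass
class RootSpan:
    operation_name: str
    service: str
    tags: Dict[str, object] = field(default_factory=dict)


def is_checkout_v2(span: RootSpan) -> bool:
    """Mirrors the "checkout_v2" pattern: operation, service, and version filter."""
    return (
        span.operation_name == "POST /checkout"
        and span.service == "checkout-api"
        and int(span.tags.get("version", 0)) >= 2
    )


def is_failed(span: RootSpan) -> bool:
    """A root span fails if it carries error=true or http.status >= 500."""
    return span.tags.get("error") is True or int(span.tags.get("http.status", 0)) >= 500


def cbo_error_rate(
    spans: Iterable[RootSpan],
    matches: Callable[[RootSpan], bool] = is_checkout_v2,
) -> Optional[float]:
    matched = [s for s in spans if matches(s)]
    if not matched:
        return None  # no traffic for this CBO in the window
    return sum(is_failed(s) for s in matched) / len(matched)


spans = [
    RootSpan("POST /checkout", "checkout-api", {"version": 2, "http.status": 200}),
    RootSpan("POST /checkout", "checkout-api", {"version": 2, "http.status": 201}),
    RootSpan("POST /checkout", "checkout-api", {"version": 3, "http.status": 503}),
    RootSpan("POST /checkout", "checkout-api", {"version": 2, "http.status": 200, "error": True}),
    RootSpan("GET /item", "catalog-api", {"version": 2, "http.status": 500}),  # not this CBO
]
rate = cbo_error_rate(spans)
```

Four of the five sample spans match the pattern and two of those fail, so `rate` is 0.5; the failing `GET /item` span belongs to a different operation and is ignored.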

CBOs vs Google's SLI/SLO model

Google's SRE-book SLI framework has always allowed per-endpoint or per-user-journey SLIs, and Google's recent user-journey SLO writings align with CBO thinking. The Zalando term Critical Business Operation emphasises two things the generic SRE-book framing soft-pedals:

  • Business-critical curation: CBOs are explicitly picked by product + SRE together as the operations whose reliability matters to revenue / customer trust.
  • Cross-service compound measurement: the CBO's success rate necessarily crosses service boundaries and must be measured at the root span — this makes OpenTracing / OpenTelemetry a hard prerequisite.

Dependencies

  • Distributed tracing deployed fleet-wide with consistent root-span identification — see OpenTracing semantic conventions.
  • CBO catalogue maintained by product + SRE, typically a small curated list rather than an auto-discovered one.
  • Error taxonomy agreed per CBO — what counts as a "checkout failure"? HTTP 5xx? User-abandoned after a 5xx? 5xx including those retried successfully by the client? This is typically the hardest design choice.
  • Alerting + SLO tooling aware of CBO as a dimension: vanilla per-service dashboards are insufficient; the CBO needs a first-class definition in the alerting platform.
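To see why the error taxonomy is a real design choice, the sketch below applies the three candidate definitions from the list to the same traffic. The `Attempt` record and its `retried_ok` and `abandoned` fields are hypothetical signals that a real system would have to derive from traces or client telemetry; the sample data is invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Attempt:
    status: int
    retried_ok: bool = False  # hypothetical: a client retry later succeeded
    abandoned: bool = False   # hypothetical: the user gave up after this attempt


# Three candidate answers to "what counts as a checkout failure?"
TAXONOMIES: Dict[str, Callable[[Attempt], bool]] = {
    "any_5xx": lambda a: a.status >= 500,
    "5xx_not_recovered": lambda a: a.status >= 500 and not a.retried_ok,
    "5xx_then_abandoned": lambda a: a.status >= 500 and a.abandoned,
}


def error_rate(attempts: List[Attempt], is_failure: Callable[[Attempt], bool]) -> float:
    return sum(is_failure(a) for a in attempts) / len(attempts)


attempts = (
    [Attempt(200)] * 6
    + [Attempt(503, retried_ok=True)] * 2
    + [Attempt(500, abandoned=True)]
    + [Attempt(502)]
)
rates = {name: error_rate(attempts, rule) for name, rule in TAXONOMIES.items()}
```

On these ten attempts the three definitions yield 40%, 20%, and 10% respectively, so the same incident could breach or satisfy an SLO depending on which taxonomy was agreed.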

Downstream primitives

Seen in
