Skip to content

PATTERN Cited by 1 source

Standardize observability SDK per language

Standardize observability SDK per language is the pattern of shipping language-specific observability SDKs — one for each major language on the engineering Tech Radar — that expose a common observability API (traces, metrics, logs) to applications. Application developers don't see the underlying protocols; they see a SDK that produces correctly-shaped telemetry regardless of which service they're instrumenting.

This is the structural answer to why "standardise the dashboards" is not enough for fleet-wide observability.

Problem

An engineering fleet with hundreds of services in five languages can converge on shared dashboards — same Grafana templates, same alert rules — while still having inconsistent underlying telemetry:

  • Java service emits trace spans with http.status_code attribute.
  • Go service emits trace spans with status_code attribute.
  • Python service emits metrics with _ms suffix on histograms while Kotlin emits _seconds.
  • Node.js service logs in JSON; C++ service logs in key= value text.

Dashboards built on these can paper over the differences at great cost, but they can't fix:

  • Cross-service queries. "Which endpoints had p99 > 1s last week?" requires consistent latency units across all services.
  • Trace navigation. Adaptive-paging-like tooling (see concepts/adaptive-paging) that walks a trace graph needs consistent span structure to work.
  • Onboarding burden. Every new service author re- implements instrumentation from scratch, with subtle mistakes.

Pattern

Build and maintain one SDK per language, each exposing the same conceptual observability API:

  • TracingstartSpan(name) / endSpan(), with consistent attribute-naming conventions.
  • Metrics — histogram, counter, gauge with consistent units (seconds for latencies, bytes for sizes).
  • Logging — structured logs with common fields (trace_id, span_id, service, severity).

Each SDK wraps the underlying exporter (OpenTelemetry, a custom collector, etc.) so application developers don't touch the protocol layer. Attribute/metric naming conventions are baked into the SDKs — not documented as voluntary standards.

Zalando names the commitment directly:

"Our strategy set a target standardizing Observability across Zalando. Through that standardization we could achieve a common understanding of Observability within the company, reduce overhead of operating multiple services and make it easier to build on top of well defined signals (like we did before with OpenTracing). The concrete step for making this possible was to develop SDKs for the major programming languages at use in Zalando."sources/2021-10-14-zalando-tracing-sres-journey-part-iii

Three outcomes the SDKs enable

  1. Common understanding of Observability. Engineers across teams use the same vocabulary for spans, metrics, and logs because the SDKs impose it.
  2. Reduced overhead operating multiple services. Instead of per-team observability debt, the SDK is maintained centrally (typically by the SRE department / observability team).
  3. Easier to build on top of well-defined signals. Tooling like Adaptive Paging that consumes traces across services works because every service's traces follow the same attribute conventions. Without per-language SDKs, each tool would need N adapters.

The third is the largest: the upstream standardization pays out downstream as tooling leverage.

Relationship to Tech Radar

Zalando explicitly scopes the SDKs to "the major programming languages at use in Zalando" per their Tech Radar. The Tech Radar provides the authoritative language list, so the SDK investment is bounded — you don't build an SDK for every language engineers happen to use for side projects, only for the endorsed ones.

This is a concrete benefit of Tech Radar governance: SDK investment is scoped by language endorsement, and language endorsement is gated on SDK availability.

Relationship to OpenTelemetry

The pattern predates OpenTelemetry's maturity but the OpenTelemetry ecosystem is the clean way to implement it today:

  • OpenTelemetry API in each language → a single SDK extension layer that enforces Zalando conventions on top.
  • Auto-instrumentation agents (Java, Python) cover the bulk of the instrumentation.
  • The Zalando-specific layer adds internal attribute schemas, log formats, and dashboards.

Zalando's 2020 history used OpenTracing (which merged into OpenTelemetry in 2019). The specific protocol is secondary; what matters is that the SDK is owned centrally and imposes conventions.

Caveats

  • SDK surface can bloat. If each language's SDK grows its own idiosyncratic helpers, the "common understanding" erodes. Requires discipline to keep APIs parallel across languages.
  • Language coverage lags. New languages added to the Tech Radar need new SDK builds. There is a time window where the language is endorsed but observability-incomplete.
  • Forcing SDK usage is political. Teams with pre- existing observability code must migrate. Requires Phase 3 SRE department authority to enforce.
  • Not all observability fits one SDK. Specialised telemetry (GPU metrics, RUM browser events) may have their own pipelines. The language-SDK pattern covers the common 80%.

Seen in

  • sources/2021-10-14-zalando-tracing-sres-journey-part-iii — Zalando's 2020 SRE Strategy anchors on Observability standardisation, with language-specific SDKs for the major Tech Radar languages as the concrete step. Explicitly framed as the evolution of the earlier OpenTracing rollout.
Last updated · 550 distilled / 1,221 read