SYSTEM Cited by 2 sources

Zalando Observability SDK (Node)

What it is

@zalando/observability-sdk-node is Zalando's thin internal wrapper over upstream OpenTelemetry core Node.js packages, built by the SRE Enablement and Web Platform teams at the end of 2022. It pre-configures Zalando-specific defaults and acts as a proxy for all underlying OTel dependencies so service owners can instrument a Node.js application with a single statement (Source: sources/2024-07-28-zalando-opentelemetry-for-javascript-observability-at-zalando):

import { SDK } from "@zalando/observability-sdk-node";
new SDK().start();

Why it exists

The SDK was built in direct response to the worker-threads incident (sources/2024-07-24-zalando-nodejs-and-the-tale-of-worker-threads): before 2023, the on-call team had "almost zero visibility" into what a misbehaving Node.js service was doing during production incidents. The original 2022-04 positive-feedback-loop incident was fixed without event-loop-lag instrumentation; the team guessed at the root cause. After the incident, the post's author called for a dedicated Node.js observability effort; this SDK is that effort.

What it configures

Without any constructor arguments, new SDK().start() does the following (Source: sources/2024-07-28-zalando-opentelemetry-for-javascript-observability-at-zalando):

  • Parses platform environment variables set by Zalando's Kubernetes platform for all deployed applications — the SDK is auto-configured without service-owner action.
  • Registers auto-instrumentations:
      • HTTP module functions are monkey-patched to produce span data on network calls.
      • Optional Express.js instrumentation, enabled via a boolean flag in the (optional) constructor config.
  • Enables built-in metrics collection at a configured interval:
      • CPU usage
      • Memory usage
      • Garbage collection metrics
      • Event-loop lag, the specific signal the worker-threads incident was blind to
  • Enables span and metric exporters on a specified interval, shipping data to the telemetry backend (Lightstep / ServiceNow Cloud Observability).
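
As a rough illustration of the event-loop lag signal listed above, the sketch below samples it with Node's built-in perf_hooks histogram. It is not taken from the SDK: the helper name readEventLoopLagMs and the 10 ms resolution are assumptions chosen for the example, and how the SDK actually measures or exports this metric is not stated in the source.

```typescript
import { monitorEventLoopDelay } from "node:perf_hooks";

// Node's built-in event-loop delay histogram; it records values in nanoseconds.
// The 10 ms sampling resolution is an arbitrary choice for this sketch.
const histogram = monitorEventLoopDelay({ resolution: 10 });
histogram.enable();

// Hypothetical helper: read the current window, convert to milliseconds,
// then reset so the next read covers only the following interval.
function readEventLoopLagMs(): { mean: number; p99: number; max: number } {
  const snapshot = {
    mean: histogram.mean / 1e6, // NaN until at least one sample is recorded
    p99: histogram.percentile(99) / 1e6,
    max: histogram.max / 1e6,
  };
  histogram.reset();
  return snapshot;
}
```

A collector could call readEventLoopLagMs() on each metrics interval and hand the values to whatever exporter is in use; a sustained rise in p99 or max suggests something is blocking the event loop.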

OTel-context trade-off

The SDK does not use OTel's context API on the server side. Instead, the team resorted to manual span passing through function parameters (tracer.startSpan("name", {}, context)). The reason given is that the differing API shape between OpenTelemetry's tracer.startActiveSpan and OpenTracing's manual span passing "makes it difficult to migrate existing instrumentation code, especially in a large codebase like ours": Zalando already had substantial OpenTracing instrumentation in place and needed migration to be easy.
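
The difference between the two styles can be sketched with the upstream @opentelemetry/api package. This is an illustration, not Zalando's code: the tracer name, span names, and function names are hypothetical, and without a registered OTel SDK the tracer returned here is a no-op.

```typescript
import { context, trace } from "@opentelemetry/api";
import type { Span } from "@opentelemetry/api";

// Hypothetical tracer name; returns a no-op tracer if no SDK is registered.
const tracer = trace.getTracer("example-service");

// Manual span passing: the parent span travels through function parameters.
function loadUser(parent: Span, userId: string): string {
  // Build a context that carries the parent and pass it explicitly.
  const parentContext = trace.setSpan(context.active(), parent);
  const child = tracer.startSpan("load-user", {}, parentContext);
  try {
    child.setAttribute("user.id", userId);
    return `user:${userId}`;
  } finally {
    child.end();
  }
}

function handleRequestManual(userId: string): string {
  const root = tracer.startSpan("handle-request");
  try {
    return loadUser(root, userId);
  } finally {
    root.end();
  }
}

// Context-based style: startActiveSpan makes the span active for the callback,
// so nested spans find their parent via context.active() instead of a parameter
// (this relies on a context manager being registered by an SDK).
function handleRequestActive(userId: string): string {
  return tracer.startActiveSpan("handle-request", (span) => {
    try {
      const child = tracer.startSpan("load-user");
      child.end();
      return `user:${userId}`;
    } finally {
      span.end();
    }
  });
}
```

In the manual style every function signature carries the parent span, which mirrors how OpenTracing code is typically written; the context-based style drops that parameter and wraps the body in a callback, so migrating existing call sites means restructuring them.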

Adoption

Two years after it was built (by 2024-07), 53 Node.js applications at Zalando were instrumented with this SDK (Source: sources/2024-07-24-zalando-nodejs-and-the-tale-of-worker-threads).

Relationship to sibling SDKs

Part of a 3-package family, tied together by an isomorphic API package:

@zalando/observability-api             # shared types + API
@zalando/observability-sdk-node        # this package
@zalando/observability-sdk-browser     # sibling

The isomorphic structure makes it easy to instrument isomorphic applications — those running on both the server and the browser, typically pages served by the Rendering Engine.
