Skip to content

Datadog

Datadog Engineering — Datadog is an observability / APM / monitoring vendor; its engineering blog documents the internals of their ingest, storage, query, and analytics stack at "trillions of events per day" scale.

Tier: Datadog is not in AGENTS.md's formal Tier 1/2/3 lists; treat as Tier-3-equivalent for scope — apply the Tier-3 selectivity filter (skip pure-ML research / product-PR). Distributed-systems- internals posts (e.g., Husky) are on-topic and ingested in full; foundation-model-research announcements (e.g., Toto) are skipped.

Key systems / series covered

Key concepts / patterns introduced into the wiki from Datadog

  • concepts/lsm-compaction — size-tiered + leveled hybrid.
  • concepts/columnar-storage-format — with adaptive row-group sizing + inline column headers for streaming.
  • concepts/fragment-pruning — per-file skip-index design axis.
  • patterns/streaming-k-way-merge — 1-GET-per-input bounded- memory CPU-saturated compaction merge.
  • patterns/trimmed-automaton-predicate-filter — FSA → regex in metadata, Bloom-analog with no false negatives.
  • concepts/go-runtime-memory-model — Go runtime virtual-vs-RSS divergence; /proc/[pid]/smaps as the ground truth the runtime/ metrics package cannot see.
  • concepts/binary-size-bloat — 5-year Agent .deb growth 428 → 1,248 MiB (+192 %) as the canonical bloat datum; 6-month reduction program as the canonical cure recipe.
  • concepts/dead-code-elimination — Go linker's method-DCE surface: what disables it (reflect / plugin-mode), what re-enables it (source patches).
  • concepts/go-build-tags — file-level compile guard; the mechanism behind Datadog's per-binary / per-platform build matrix and the upstream containerd fix.
  • concepts/transitive-dependency-reachability — graph-theoretic reason a single function can drag a 570-package k8s cluster into an unrelated binary.
  • concepts/reflect-methodbyname-linker-pessimism — 16-25 % per-binary cost of a non-constant reflect.MethodByName call site; Datadog patched ~a dozen dependencies + forked stdlib text/template + html/template.
  • concepts/go-plugin-dynamic-linking-implication — 245 MiB cost of importing stdlib plugin transitively via containerd on amd64 only.
  • patterns/upstream-the-fix — Datadog's four PRs (containerd / kubernetes / uber-go/dig / google/go-cmp) canonicalised the Go-toolchain variant of Cloudflare's 2025-10 shape; Kubernetes inherited 16-37 % of its own.
  • patterns/build-tag-dependency-isolation — upstream containerd plugin-import build tag; the Agent applies it.
  • patterns/single-function-forced-package-split — trace-agent one-function-pulls-k8s case: 570 packages / 36 MiB removed by moving one function.
  • patterns/bisect-driven-regression-hunt — production-signal → env bisect → feature-flag A/B → drop-one-observability-layer → minimal reproducer → upstream git bisect → maintainer fix.
  • systems/ebpf — kernel programmable runtime; the substrate under Workload Protection's FIM.
  • concepts/in-kernel-filtering — push per-event match logic into eBPF programs so the ring buffer carries only plausible matches.
  • concepts/edge-filtering — Agent-side rule evaluation as the pipeline-volume-reduction move (producer-side filter; composes with streaming aggregation).
  • patterns/approver-discarder-filter — compile-time positive filters + runtime-learned negative filters in eBPF maps.
  • patterns/two-stage-evaluation — cheap O(1) kernel filter protecting a rich user-space rule engine.
  • concepts/ebpf-verifier — both safety guarantee and the primary source of operational variability across kernel versions / distros; motivates CI matrices + lifecycle abstractions + minimum-viable hook-set gating.
  • systems/co-re — Compile Once – Run Everywhere; BTF-backed offset patching; with fallback offset-guessing + hardcoded offsets covers kernels back to 4.14.
  • systems/ebpf-manager — Datadog's OSS Go library consolidating eBPF program lifecycle across Workload Protection, Cloud Network Monitoring, Universal Service Monitoring.
  • patterns/shared-kernel-resource-coordination — multi-vendor eBPF coexistence (TC priorities + handles, cgroup ordering); the named case study is the 2022 Datadog × systems/cilium TC-handle-collision incident.
  • concepts/agent-context-window — the fixed-size LLM working set as the dominant scarce resource in MCP-server design; Datadog reframes format, pagination, query surface, tool count, and error design as applications of this one discipline.
  • patterns/token-budget-pagination — cut at N tokens, return cursor; robust against record-size variance (Datadog log records span 100 B to 1 MB).
  • patterns/query-language-as-agent-tool — expose SQL, not raw retrieval; ~40% cheaper eval runs in some Datadog scenarios.
  • patterns/tool-surface-minimization — flexible tools + opt-in toolsets + layering (tool chaining) as three complementary ways to stay within agent tool-calling accuracy + context budget.
  • patterns/actionable-error-messages — specific corrective errors ("did you mean 'status'?") as the agent-recovery primitive; paired with discoverable-docs tool + advisory-guidance-in-success-responses.
  • concepts/wal-write-ahead-logging — Postgres WAL as both the durability primitive and the hard throughput ceiling (~1,000 8-KiB fsyncs/sec on gp3 EBS per pg_test_fsync); every committed transaction costs one fsync, and row locks that assign a transaction ID force a COMMIT record whether or not data actually changed.
  • concepts/postgres-mvcc-hot-updates — Postgres MVCC writes a new tuple per UPDATE; HOT updates skip index writes when no indexed column changes and the same page has free space (fillfactor<100); HOT composes with update cost, not lock cost — the Datadog case study where a HOT-optimized table still suffered 2× IOPS because ON CONFLICT DO UPDATE locks before WHERE.
  • patterns/cte-emulated-upsert — Postgres-specific query pattern: WITH insert_attempt AS (INSERT ... ON CONFLICT DO NOTHING RETURNING ...) UPDATE ... WHERE ... AND NOT EXISTS (SELECT FROM insert_attempt). Avoids the implicit conflict-row lock of ON CONFLICT DO UPDATE; common fresh-row path emits zero WAL records. Trade-off: small concurrent-delete race, accepted only when workload tolerates imprecision.
  • concepts/evaluation-label — two-part eval unit: ground truth
  • world snapshot (telemetry queries, not raw bytes). Agent never sees the ground truth; world snapshot survives raw-telemetry TTL.
  • concepts/trajectory-evaluation — score how the agent investigated (depth, telemetry surfaced, distance to correct answer), not only final-answer correctness. Unlocked at Datadog by a ~30% uplift in label RCA quality.
  • concepts/pass-at-k — over k attempts, does the agent succeed on at least one? Separates capability from sampling stability; standard label attribute in the Bits eval platform.
  • concepts/noise-injection-in-evaluation — simulated eval environments must include unrelated components on the same platform/team/monitor/naming cluster. Without it, evals are an open-book exam with only the relevant pages; scores over-state production quality.
  • concepts/telemetry-ttl-one-way-door — raw telemetry expires; decisions to defer snapshotting are decisions to permanently lose it. Short-term 11% pass-rate + 35% label-count hit when Datadog regenerated too-narrow labels with broader scope, for long-term eval fidelity.
  • patterns/product-feedback-to-eval-labels — every user thumbs-up/-down + free-text feedback becomes a candidate eval label; label volume scales with adoption (+order-of-magnitude over manual internal labelling).
  • patterns/agent-assisted-label-validation — once alignment studies with human judges clear a quality bar, use the agent itself to aggregate signals + derive RCAs + propose labels; humans shift from assembling to refining. Validation time per label ↓ >95% in one week at Datadog.
  • patterns/noisy-simulated-evaluation-environment — the operational shape for noise injection: expand the world snapshot along platform/team/monitor/naming adjacency edges, reconstruct signals for the noisy set, isolate per-label at the data layer.
  • concepts/change-data-capture — Datadog's 2025-11-04 CDC platform is the wiki's canonical instance of CDC as a first-class multi-tenant platform (vs. the earlier table-format-compaction and real-time-cache-invalidation framings).
  • concepts/asynchronous-replication — deliberately chosen as the foundation of Datadog's CDC platform ("favouring scalability over strict consistency"); ~500 ms replication lag as the operating point.
  • concepts/schema-evolution — the "hard problem" of async CDC. Datadog's two-layer answer (offline migration-SQL validator + runtime backward-compat Schema Registry) is the canonical wiki reference.
  • concepts/logical-replication — Postgres logical replication with wal_level=logical as a CDC source; 7-step source-side operator runbook documented via Datadog's 2025-11-04 post.
  • systems/debezium — Kafka Connect-based CDC source- connector family; core ingestion component of Datadog's CDC platform. Datadog maintains custom forks for Datadog-specific logic.
  • systems/kafka-connect"backbone for scalable, fault- tolerant data movement between systems" in Datadog's CDC platform; single-message transforms as the per-tenant customisation surface.
  • systems/kafka-schema-registry — multi-tenant, backward- compat mode, integrated with source + sink connectors; protects external custom consumers too.
  • systems/temporal — durable-workflow engine used to automate the 7-step CDC-pipeline provisioning runbook.
  • patterns/managed-replication-platform — Datadog's platform shape; five-pattern bundle above.
  • patterns/debezium-kafka-connect-cdc-pipeline — OSS CDC transport backbone.
  • patterns/workflow-orchestrated-pipeline-provisioning — Temporal-decomposed provisioning runbook.
  • patterns/schema-validation-before-deploy — offline migration-SQL validator (blocks SET NOT NULL on potentially-null columns); half of Datadog's two-layer schema-evolution safety answer.
  • patterns/schema-registry-backward-compat — runtime half.
  • patterns/connector-transformations-plus-enrichment-api — two-axis per-tenant customisation: Kafka Connect SMTs at transport + centralised enrichment API at storage.

Recent articles

Skipped articles

  • Toto time-series foundation model (2024-07-11) — pure ML research announcement, no serving-infra architecture, explicitly "not currently deployed in any production systems" per the post itself. See log entry [2026-04-21 11:17].
Last updated · 200 distilled / 1,178 read