DATADOG 2026-03-04 Tier 3

Designing MCP tools for agents: Lessons from building Datadog's MCP server

Summary

Datadog's retrospective on shipping its official MCP (Model Context Protocol) server — the company's first observability interface built specifically for customer AI agents rather than humans or programmatic clients. The V1 was "a thin wrapper around existing APIs" — it worked well enough to validate the idea, then failed in characteristic ways once real agents drove it. Agents filled context windows with log data and lost track of their task; they blew token budgets on variable-size records; they inferred trends from raw samples instead of aggregating. The post reframes tool design around an LLM context-window budget as the dominant operational resource, and proposes a set of patterns Datadog uses to live within it.

Key takeaways

  • An agent's context window is a hard scarce resource, and the tool owns how much of it each call consumes. The entire tool result ends up in context; "I can't predict how much I'll eat" is a design bug. This reframes classic ergonomic questions (output format, pagination unit, tool count, error messages) into a single discipline around context efficiency (Source: sources/2026-03-04-datadog-mcp-server-agent-tools).

  • Response format is a 2× lever without changing semantics. CSV uses about half as many tokens per record as JSON for tabular data (no nesting). YAML trims roughly 20% off JSON for nested data. Trimming rarely-used fields from the default response (and letting agents re-request them) compounded further: Datadog reports fitting ~5× more records per token budget on some tools after applying format + trim together.
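The format lever is easy to see with a few flat records. A minimal sketch, using character counts as a rough proxy for tokens (exact savings depend on the tokenizer, and these records are illustrative, not Datadog's schema):

```python
import csv
import io
import json

# Hypothetical flat log records (tabular: same keys on every row, no nesting).
records = [
    {"service": "checkout", "status": "error", "count": 128},
    {"service": "payments", "status": "error", "count": 97},
    {"service": "search", "status": "warn", "count": 401},
]

def as_json(rows):
    # One JSON object per record: the key names repeat on every row.
    return json.dumps(rows)

def as_csv(rows):
    # Header once, then values only: key names are not repeated per record.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# CSV drops per-record key repetition, so the gap widens as rows are added.
print(f"JSON chars: {len(as_json(records))}, CSV chars: {len(as_csv(records))}")
```

The same reasoning explains why the trick only applies to tabular data: nested structures need keys (or YAML indentation) to preserve shape.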

  • Paginate by token budget, not by record count — Datadog log messages range from 100 bytes to 1 MB, so fixed-record-count pages have unbounded token cost; the server cuts at N tokens and returns a cursor (pattern: patterns/token-budget-pagination). This may matter less when clients like Cursor and Claude Code write long tool results to disk instead of inlining into context — "this isn't in the MCP spec yet."
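A minimal sketch of the pagination pattern; `paginate_by_tokens` and the chars/4 estimate are assumptions for illustration (a real server would use the model's tokenizer):

```python
def paginate_by_tokens(records, max_tokens, start=0, count_tokens=None):
    """Return (page, next_cursor): cut the page when the token budget is
    spent, not after a fixed number of records."""
    if count_tokens is None:
        # Crude estimate: ~4 characters per token; a stand-in, not a tokenizer.
        count_tokens = lambda text: max(1, len(text) // 4)
    page, spent = [], 0
    for i in range(start, len(records)):
        cost = count_tokens(records[i])
        if page and spent + cost > max_tokens:
            return page, i  # cursor: index of the first record not sent
        # Always send at least one record, even an oversized one,
        # so pagination can't stall on a 1 MB log line.
        page.append(records[i])
        spent += cost
    return page, None  # no more pages

# Records of wildly different sizes, like the 100 B - 1 MB spread in the post:
logs = ["a" * 40, "b" * 4000, "c" * 40]
page, cursor = paginate_by_tokens(logs, max_tokens=50)
# The small record fits; the huge one is deferred behind the cursor.
```

Fixed-count pages would have returned all three records regardless of size; here each call's context cost is bounded by `max_tokens` (plus at most one oversized record).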

  • Expose a query language, not raw retrieval. For questions like "which services are logging the most errors in the last hour?", V1 agents pulled samples of raw logs and guessed; some brute-forced until the context window filled. Datadog added SQL-style querying over observability data — agents are "quite good at writing it" — and reports ~40% cheaper eval runs in some scenarios because agents SELECT only the fields they need and LIMIT / COUNT instead of retrieving. Pattern: patterns/query-language-as-agent-tool. Supporting this at Datadog scale was "a significant lift" because traditional relational databases do not apply.
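The retrieval-vs-query difference can be sketched with `sqlite3` as a stand-in store; the post is explicit that Datadog's real backend is not a traditional relational database, so only the agent-facing interface is illustrated here, and `run_agent_query` is a hypothetical tool entry point:

```python
import sqlite3

# Stand-in store. Datadog's actual engine is unspecified in the post;
# sqlite3 only models the SQL surface the agent sees.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (service TEXT, status TEXT, message TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?, ?)",
    [("checkout", "error", "timeout"), ("checkout", "error", "500"),
     ("payments", "error", "409"), ("search", "info", "ok")],
)

def run_agent_query(sql, max_rows=100):
    # The server returns only the aggregated rows the agent asked for,
    # never the raw log bodies.
    return conn.execute(sql).fetchmany(max_rows)

# "Which services are logging the most errors?" becomes one aggregate
# query instead of sampling raw logs and guessing:
top = run_agent_query(
    "SELECT service, COUNT(*) AS errors FROM logs "
    "WHERE status = 'error' GROUP BY service ORDER BY errors DESC"
)
```

The context-window cost is now proportional to the answer (a few rows), not to the volume of underlying data.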

  • Tool count must be actively managed. Cited evidence: agents' tool-calling accuracy degrades as the tool inventory grows (arXiv 2411.15399), and each tool description consumes context window. Three tactics used together (pattern: patterns/tool-surface-minimization):

  • Flexible tools — one schema covering many use cases, not one tool per API endpoint.
  • Toolsets — a minimal default set + opt-in toolsets for specialized workflows; the user must anticipate the agent's needs.
  • Layering — chained tools ("how do I do X?" → "do X") to hide specialized functionality behind a discovery tool (Block's "MCP tools like ogres with layers"); tradeoff is +1 tool call of latency per task.
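The toolset and layering tactics might be sketched as follows; all registry entries and tool names here are hypothetical, not Datadog's actual surface:

```python
# Hypothetical toolset registry: names are illustrative only.
TOOLSETS = {
    "core": ["search_logs", "query_metrics"],
    "dashboards": ["list_dashboards", "get_dashboard"],
    "incidents": ["list_incidents", "get_incident"],
}

def active_tools(opted_in=()):
    """Toolsets: expose only the minimal 'core' set by default, so
    specialized tool descriptions don't consume context unless the
    user opts in ahead of time."""
    tools = list(TOOLSETS["core"])
    for name in opted_in:
        tools.extend(TOOLSETS[name])
    return tools

def discover(task):
    """Layering: a discovery tool answers 'how do I do X?' by naming
    the tool to call next, at the cost of one extra tool call."""
    routes = {
        "investigate incident": "get_incident",
        "find dashboard": "list_dashboards",
    }
    return routes.get(task, "search_datadog_docs")
```

The flexible-tools tactic is the degenerate case of the same registry: one schema-rich tool per row instead of one tool per API endpoint.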

  • Error messages are agent-recovery primitives. Generic "invalid query" gets agents stuck in retry loops; specific "unknown field 'stauts' – did you mean 'status'?" gives a clear next step. Pattern: patterns/actionable-error-messages. Two adjunct moves: (a) a search_datadog_docs RAG tool reachable from server instructions so agents look up syntax on demand instead of cramming it into tool descriptions; (b) tool responses can carry advisory guidance alongside data ("you searched for payment, did you mean payments?") — departure from REST API conventions where no party on the other end can reason over prose.
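The did-you-mean move is cheap to implement. A sketch using `difflib`; the field list and the fallback tool name are assumptions for illustration:

```python
import difflib

# Hypothetical list of queryable fields.
KNOWN_FIELDS = ["status", "service", "host", "message", "timestamp"]

def field_error(bad_field):
    """Turn a generic 'invalid query' into a concrete next step the
    agent can act on, instead of leaving it to retry blindly."""
    matches = difflib.get_close_matches(bad_field, KNOWN_FIELDS, n=1)
    if matches:
        return f"unknown field '{bad_field}' - did you mean '{matches[0]}'?"
    # No close match: point the agent at documentation it can query.
    return (f"unknown field '{bad_field}'; call search_datadog_docs "
            "for the list of queryable fields")

print(field_error("stauts"))  # unknown field 'stauts' - did you mean 'status'?
```

Either branch ends with an action the agent can take, which is the property that breaks retry loops.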

  • General-purpose MCP server vs specialized agent is a deliberate trade-off, not a winner-takes-all. Datadog also ships Bits AI SRE, a hosted agent with a purpose-built web UI for alert investigation. The hosted agent can assume the workflow (user is investigating an alert) and pre-load related data; the MCP server can't. The stated roadmap is to expose Bits AI SRE capabilities through MCP and to broaden what the specialized agent can investigate — "over time the line between 'specialized agent' and 'MCP server with good defaults' may get blurry."

  • Future may relax some constraints. Clients are adding tool-search (Claude Code) and on-demand skills (Claude skills, Kiro Powers) so servers needn't front-load every tool. Clients writing long tool results to disk (Cursor, Claude Code) could reduce the importance of format-level token efficiency. "How skills and MCP fit together is still an open question."

Distilled principles (author's summary)

  1. Don't just wrap your APIs. Design tools around agents' constraints.
  2. Be frugal with context windows, and give agents the tools to be frugal too. Query languages help.
  3. Guide agents with good error messages and discoverable documentation.

Caveats

  • Numbers in the post are approximate and unbenchmarked ("~5× more records in the same token budget", "~40% cheaper in some evals", "~50% token savings CSV-vs-JSON"). No absolute scale (requests/sec, token-budget distribution, query-language coverage) is disclosed.
  • The SQL surface is described but the engine/semantics/indexing layer behind it is not. At Datadog scale it is implicitly built on top of (or alongside) systems/husky-class infrastructure; the post does not confirm which engine serves the agent-issued SQL.
  • The post is first-person by one MCP-server engineer, not a formal postmortem or whitepaper; treat claims as design testimony.
