
Meta Capacity Efficiency Platform

Definition

The Meta Capacity Efficiency Platform is Meta's unified AI-agent platform for hyperscale performance engineering — one substrate that serves both offense (proactively finding and shipping code-change optimizations) and defense (catching and resolving performance regressions). It is the production infrastructure underneath Meta's Capacity Efficiency program, which has recovered "hundreds of megawatts of power" (Source: sources/2026-04-16-meta-capacity-efficiency-at-meta-how-unified-ai-agents-optimize-performance-at-hyperscale).

Two-layer architecture

  1. MCP Tools layer — standardized Model Context Protocol interfaces that let LLMs invoke code. "Each tool does one thing: query profiling data, fetch experiment results, retrieve configuration history, search code, or extract documentation." Five named categories in the post:

    • profiling-data query
    • experiment-results fetch
    • configuration-history retrieval
    • code search
    • documentation extraction
  2. Skills layer — modules that encode domain expertise about performance efficiency. A skill tells an LLM which tools to use and how to interpret results. "It captures reasoning patterns that experienced engineers developed over years, such as 'consult the top GraphQL endpoints for endpoint latency regressions' or 'look for recent schema changes if the affected function handles serialization'."

Together, tools + skills "promote a generalized language model into something that can apply the domain expertise typically held by senior engineers." Canonical wiki instance of patterns/mcp-tools-plus-skills-unified-platform.
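The two-layer decomposition can be sketched in a few lines: a registry of single-purpose tools, plus skills that encode which tools to call and how to read the results. All names, signatures, and data shapes below are illustrative assumptions; Meta's actual MCP schemas are not public.

```python
# Sketch of the tools + skills decomposition (illustrative names only).
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ToolLayer:
    """MCP-style registry: each tool does exactly one thing."""
    tools: dict[str, Callable] = field(default_factory=dict)

    def register(self, name: str, fn: Callable) -> None:
        self.tools[name] = fn

    def invoke(self, name: str, **kwargs):
        return self.tools[name](**kwargs)

@dataclass
class Skill:
    """Encodes a reasoning pattern: which tools to use, how to interpret results."""
    name: str
    playbook: Callable[["ToolLayer", dict], dict]

    def run(self, tools: ToolLayer, context: dict) -> dict:
        return self.playbook(tools, context)

# The five tool categories named in the post, as trivial stubs.
layer = ToolLayer()
layer.register("query_profiling_data", lambda fn: {"cpu_pct": 4.2, "function": fn})
layer.register("fetch_experiment_results", lambda exp: {"exp": exp, "delta": -0.8})
layer.register("retrieve_config_history", lambda path: [{"path": path, "rev": "r1"}])
layer.register("search_code", lambda query: [f"www/{query}.php"])
layer.register("extract_documentation", lambda topic: f"docs for {topic}")

# One example skill, modeled on the serialization heuristic quoted above:
# "look for recent schema changes if the affected function handles serialization".
def serialization_playbook(tools: ToolLayer, ctx: dict) -> dict:
    profile = tools.invoke("query_profiling_data", fn=ctx["function"])
    schema_changes = tools.invoke("retrieve_config_history", path="schemas/")
    return {"suspect_function": profile["function"],
            "recent_schema_changes": schema_changes}

skill = Skill("serialization-regression", serialization_playbook)
result = skill.run(layer, {"function": "serialize_payload"})
```

The point of the split is that the `ToolLayer` is built once, while each new `Skill` is just a playbook composed over it.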

The insight that made it one platform

"The breakthrough was realizing that both problems share the same structure... We didn't need two separate AI systems. We needed one platform that could serve both."

  • Same tools across offense and defense: profiling data, code search, documentation, configuration history.
  • Different skills per use case: regression-mitigation skills for defense (e.g. "regressions from logging can be mitigated by increasing sampling"); optimization-pattern skills for offense (e.g. "memoizing a given function to reduce CPU usage").
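The offense/defense split reduces to routing: one shared tool set, two skill catalogs. A minimal sketch, using only the two example skills the post names (everything else is an assumption):

```python
# "Same tools, different skills": shared tool layer, per-use-case skill catalogs.
SHARED_TOOLS = {"profiling_data", "code_search", "documentation", "config_history"}

SKILLS = {
    "defense": {  # regression mitigation
        "logging-regression": "regressions from logging can be mitigated by increasing sampling",
    },
    "offense": {  # optimization patterns
        "memoization": "memoize a given function to reduce CPU usage",
    },
}

def select_skill(use_case: str, skill_type: str) -> str:
    """Route to the right skill; both routes invoke the same SHARED_TOOLS layer."""
    return SKILLS[use_case][skill_type]
```

Adding a third use case means adding a third catalog entry, not a third tool layer.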

Agent compositions built on the platform

Defense: AI Regression Solver

Component of FBDetect (Meta's regression-detection tool). Three-phase pipeline:

  1. Gather context with tools — find regressed functions, look up the root-cause PR, pull exact files/lines changed.
  2. Apply domain expertise with skills — select the right mitigation skill for the codebase / language / regression type.
  3. Create resolution — produce a new PR, send to original root-cause author for review.

Canonical patterns/ai-generated-fix-forward-pr instance.
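The three phases can be sketched as a single function over stubbed tool calls. The helper names and data shapes are hypothetical stand-ins, not FBDetect's real interfaces:

```python
# Runnable sketch of the three-phase defensive pipeline (hypothetical helpers).
def find_regressed_function(regression_id):       # phase-1 tool: profiling data
    return "StatusFeed::render"

def lookup_root_cause_pr(function):               # phase-1 tool: code/config history
    return {"pr": "D123", "author": "alice", "files": ["feed/render.php"]}

def select_mitigation_skill(regression_kind):     # phase-2 skill routing
    catalog = {"logging": "increase-sampling", "cpu": "memoize-function"}
    return catalog[regression_kind]

def solve_regression(regression_id, regression_kind):
    fn = find_regressed_function(regression_id)          # 1. gather context with tools
    pr = lookup_root_cause_pr(fn)
    skill = select_mitigation_skill(regression_kind)     # 2. apply domain expertise
    return {"new_pr": f"fix({fn}) via {skill}",          # 3. create resolution PR...
            "reviewer": pr["author"]}                    # ...routed to root-cause author

fix = solve_regression("R42", "logging")
```

Routing the review back to the original root-cause author keeps a human with full context in the loop before the fix-forward PR lands.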

Offense: Opportunity Resolver

Mirrors the defensive pipeline:

  1. Gather context with tools — opportunity metadata + pattern documentation + prior-resolution examples + specific files/functions + validation criteria.
  2. Apply domain expertise with skills — expert-encoded knowledge per opportunity type (e.g. memoization).
  3. Create resolution — candidate fix with guardrails (syntax / style / right-issue verification) → surfaced in engineer's editor, apply with one click.

Canonical patterns/opportunity-to-pr-ai-pipeline instance.
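The offensive pipeline differs mainly in its final gate, so a sketch is most useful with the guardrail step made explicit. The three checks mirror the syntax / style / right-issue gate named above; their implementations here are trivial placeholders, since the post does not decompose the real verification layer:

```python
# Sketch of the opportunity-to-PR pipeline with an explicit guardrail gate.
def generate_candidate_fix(opportunity):             # skills propose a candidate fix
    return {"patch": f"memoize {opportunity['function']}",
            "target": opportunity["function"]}

def guardrails_pass(fix, opportunity):
    syntax_ok = bool(fix["patch"])                   # placeholder: verify syntax
    style_ok = fix["patch"].islower()                # placeholder: verify style
    right_issue = fix["target"] == opportunity["function"]  # addresses the right issue
    return syntax_ok and style_ok and right_issue

def resolve_opportunity(opportunity):
    fix = generate_candidate_fix(opportunity)
    if guardrails_pass(fix, opportunity):
        return {"status": "surfaced_in_editor", "fix": fix}  # one-click apply
    return {"status": "discarded"}

out = resolve_opportunity({"function": "compute_ranking", "type": "memoization"})
```

Only fixes that clear every guardrail reach the engineer's editor; everything else is dropped before a human sees it.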

Additional skills composed over the same tool layer

"Within a year, the same foundation powered additional applications: conversational assistants for efficiency questions, capacity-planning agents, personalized opportunity recommendations, guided investigation workflows, and AI-assisted validation. Each new capability requires few to no new data integrations since they can just compose existing tools with new skills."

Why this is the right abstraction

The platform is the canonical wiki example of tool-skill decomposition as an operational-AI leverage mechanism:

  • Tools amortize data-integration cost. Adding profiling / experiment / config-history access is expensive; done once, it serves everyone.
  • Skills amortize domain-expertise cost. A skill encodes a senior engineer's playbook once and applies it uniformly everywhere the platform runs.
  • New use cases compose freely. Capacity-planning ≈ existing tools + new skill. Conversational assistant ≈ existing tools + new skill. No new pipeline, no new data backfill.

Sibling framing to Meta's AI Pre-Compute Engine (2026-04-06): both bet on markdown-level encoded knowledge as the model-agnostic substrate. Pre-Compute Engine's version is offline compass-shape context files; Capacity Efficiency Platform's version is online invocable skills over a shared tool layer.

Operational outcomes

  • Hundreds of megawatts recovered program-wide; "enough to power hundreds of thousands of American homes for a year."
  • ~10 hours → ~30 minutes compression of manual-investigation time ("automating diagnoses can compress ~10 hours of manual investigation into ~30 minutes"), roughly a 20× compression.
  • Thousands of regressions weekly caught by FBDetect; faster automated resolution prevents compounding fleet waste.
  • "AI-assisted opportunity resolution is expanding to more product areas every half, handling a growing volume of wins that engineers would never get to manually."

Position in Meta's operational-AI lineage

Meta now has three complementary operational-AI systems on the wiki:

System | Date | Problem domain | Substrate
--- | --- | --- | ---
Meta RCA System | 2024-08-23 | Web-monorepo incident triage | Fine-tuned Llama-2 ranker + heuristic retriever
AI Pre-Compute Engine | 2026-04-06 | Config-as-code data pipeline navigation | Offline multi-agent swarm → 59 compass-shape context files
Meta Capacity Efficiency Platform | 2026-04-16 | Performance offense + defense | MCP tools + skills, per-use-case agents

The 2026-04-16 platform is the MCP-standardized + runtime-composed variant of the operational-AI primitive: tools + skills rather than offline files + rankers + retrievers.

Caveats

  • No total skill-catalogue size disclosed. Two example skills named (logging-sampling on defense; memoization on offense).
  • Model / vendor identity opaque. "In-house coding agent" is named but not specified (no LLM identity, parameter scale, or inference cost).
  • Guardrail mechanism thin. Offense's "verify syntax and style, confirm it addresses the right issue" is named but not decomposed into the verification layer (unit-test execution? static analysis? ML judge?).
  • Platform size not disclosed. No tool count, skill count, agent count, invocations-per-day, or platform compute footprint.
  • Attribution across offense vs defense unspecified in the megawatt figure.
  • Integration surface for opportunities not specified (IDE plugin extension name, commit attribution flow).
