
PATTERN

Agentic RL from Production Signal

Problem

An LLM-based agent deployed in production generates trajectories — sequences of prompts, reasoning, tool calls, code changes, evaluations — that contain rare and valuable signal: the reward data that no public dataset has. Ignoring this signal means the agent is frozen at whichever base model was deployed last quarter. But collecting it as training data for weight updates is non-trivial: the reward must come from measured production outcomes (speedup, correctness), the agent's own trajectories must be structured + logged cleanly, and the downstream RL training loop must stay bounded in cost.
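
What "structured + logged cleanly" means in practice isn't specified anywhere in this page's sources; below is a minimal sketch of one plausible trajectory record, in Python. Every name and field is illustrative, not a published schema:

    from dataclasses import dataclass, field

    @dataclass
    class Step:
        """One agent step: prompt in, model output out, plus any tool result."""
        prompt: str
        completion: str           # reasoning + proposed action from the LLM
        tool_call: str | None     # e.g. "compile", "benchmark", "edit_file"
        tool_result: str | None   # eval feedback fed back into the next prompt

    @dataclass
    class Trajectory:
        """One full agent session, logged as a natural byproduct of serving."""
        session_id: str
        steps: list[Step] = field(default_factory=list)
        reward: float = 0.0       # measured production outcome (e.g. speedup
                                  # over baseline), never a human preference label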

Shape

Close the loop between production deployment and model training (the full loop is sketched after this list):

  1. Every agent session generates structured training data as a natural byproduct — prompts, LLM outputs, tool calls, evaluation results. Log it.
  2. Reward signal comes directly from measured production outcomes — kernel performance (for code agents), query latency (for plan agents), incident-resolution time (for on-call agents). Not human-labelled.
  3. Post-train smaller specialized models on the collected trajectories — via agentic RL. Because the base-model trajectories already encode successful strategies, a small model can learn to produce similar trajectories with far fewer reasoning tokens and search steps.
  4. Deploy the specialized smaller models as the new agent substrate — cheaper to run at scale, with most of the capability of the originating frontier model.
  5. Iterate. Better models → better kernels in fewer reasoning tokens + fewer search steps → higher-quality training data → still-better models.
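
Put together, the loop is small enough to sketch. Here `collect_sessions`, `measured_wall_clock_ms`, `post_train`, and `deploy` are hypothetical hooks into the deployer's own infrastructure, not a real API:

    def flywheel_iteration(agent_model, baseline_ms):
        # (1) Run the deployed agent; each session logs a Trajectory
        #     (see the Problem sketch) as a natural byproduct.
        trajectories = collect_sessions(agent_model)        # hypothetical hook

        # (2) Reward is a measured production outcome, not a human label.
        for t in trajectories:
            t.reward = baseline_ms / measured_wall_clock_ms(t)  # speedup ratio

        # (3) Post-train a smaller specialized model on the trajectories.
        small_model = post_train(base="small-base",         # hypothetical hook
                                 data=trajectories,
                                 objective="agentic_rl")

        # (4) The specialized model becomes the new, cheaper agent substrate.
        deploy(small_model)                                 # hypothetical hook

        # (5) Iterate: a better model yields better trajectories next round.
        return small_model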

Canonical instance — Meta KernelEvolve (2026-04-02)

Meta's KernelEvolve is the canonical instance.

From Meta's post:

"Every optimization session generates structured training data as a natural byproduct: agentic trajectories capturing the reasoning, code transformations, and evaluation feedback behind high-performing kernels. This domain-specific data is rare and valuable. It encodes optimization intuition that no public dataset contains.

We use this data to post-train smaller, specialized models through agentic reinforcement learning, where the reward signal comes directly from measured kernel performance. The result is a virtuous cycle where better models produce better kernels in fewer reasoning tokens and fewer search steps, which in turn generate higher-quality training data. Over successive iterations, this compounding flywheel enables us to self-host increasingly efficient models that are compact enough to run cost-effectively at scale while retaining the optimization capability of much larger frontier models."

Why this shape matters

Three structural properties distinguish this pattern from RLHF or instruction fine-tuning:

  1. Reward is from production measurement, not human preference. A kernel either makes Andromeda 60% faster or it doesn't. No preference labels needed. Kernel-performance reward is objective, measurable, and production-grounded — the ideal RL reward function.
  2. Trajectory structure is rich. Not just prompt → completion pairs; the data captures the full reasoning + tool-use + evaluation chain the agent followed to land on the winning solution.
  3. Specialized smaller models can clone the high-performing subspace of the frontier model's behavior. Meta's flywheel reasoning: the frontier model explores; the specialized model learns what the frontier model did when it succeeded.
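
One simple reading of property 3 is reward-filtered cloning (rejection sampling) rather than full policy-gradient RL: keep only the sessions where the frontier model beat the baseline, and train the small model on those. A sketch reusing the Trajectory record from the Problem section; the threshold is an assumption, not a published number:

    def high_performing_subspace(trajectories, reward_threshold=1.2):
        """Keep only sessions where the frontier agent beat the baseline.

        The frontier model explores; the specialized model trains only on
        what the frontier model did when it succeeded. The 1.2x speedup
        cutoff is illustrative.
        """
        return [t for t in trajectories if t.reward >= reward_threshold]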

The end-state is self-hosted efficient models — Meta retains frontier-model optimization capability at compact-model serving cost.

Relationship to in-context RL

In-context RL is the zeroth-order learning loop — writes to the retrieval-augmented knowledge base, no weight updates. Agentic RL from production signal is the first-order loop — actual weight updates on specialized models trained from the same trajectory data. The two loops feed each other: in-context skills → better trajectories → better RL training data → better models → better trajectories.
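
Schematically, the two loops nest. This reuses the hypothetical hooks from the flywheel sketch, plus an equally hypothetical `extract_skills`:

    def combined_learning(agent_model, knowledge_base):
        while True:
            # Zeroth-order loop: in-context RL. Sessions retrieve from and
            # write back to the knowledge base; no weight updates.
            trajectories = collect_sessions(agent_model,
                                            retrieve_from=knowledge_base)
            knowledge_base.update(extract_skills(trajectories))

            # First-order loop: the same trajectory data, with measured
            # rewards attached, drives actual weight updates.
            agent_model = post_train(base="small-base",
                                     data=trajectories,
                                     objective="agentic_rl")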

Relationship to evaluation-harness-in-agent-loop

Evaluation harness in agent loop produces the reward signal this pattern consumes. Without structured evaluation (memory-bound vs compute-bound, not just wall-clock time), the RL training data is shallower and the specialized models learn less.
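
A sketch of the difference: the scalar reward can stay as plain measured speedup, but a structured result lets the logged trajectory record why an attempt was fast or slow. All field names here are assumptions:

    from dataclasses import dataclass

    @dataclass
    class EvalResult:
        """Structured harness output, richer than a single wall-clock number."""
        wall_clock_ms: float
        bottleneck: str              # "memory_bound" | "compute_bound" | ...
        achieved_bandwidth_gbs: float
        achieved_tflops: float

    def reward(result: EvalResult, baseline: EvalResult) -> float:
        # The scalar reward is still just measured speedup...
        return baseline.wall_clock_ms / result.wall_clock_ms

    # ...but the full EvalResult is logged into the trajectory as evaluation
    # feedback, so the specialized model also learns what kind of bottleneck
    # each change removed, not merely that the kernel got faster.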

Consequences

Positive:

  • Compounding capability without continuous frontier-model upgrades. Each iteration of the flywheel improves the agent's effective performance without waiting on the next OpenAI / Anthropic / DeepSeek release.
  • Cost-effective at production scale. Specialized smaller models are cheap to run for the bulk of straightforward optimization work; the frontier model can be reserved for hard cases.
  • Data moat specific to the deployer. Meta's trajectories encode MTIA-specific optimization intuition no public dataset has.

Negative / care required:

  • Reward signal quality is critical. If the evaluation harness mis-measures, the agent learns the wrong lessons at scale.
  • Catastrophic-forgetting risk on specialized models. Narrow training on one domain can regress general capability. Meta likely mixes broad + specialized training data; the post doesn't specify (a mixing sketch follows this list).
  • Trajectory privacy / IP. Internal trajectories contain proprietary kernel source + production metrics. These models cannot be shared outside the deployer's organization.
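
One standard mitigation for the forgetting risk above (an assumption on this page's part, not something the post describes) is replaying a slice of broad general data alongside the specialized trajectories. A sketch with an illustrative ratio:

    import random

    def mixed_batch(specialized_data, broad_data, batch_size=32, broad_frac=0.3):
        """Mix broad general-capability data into specialized training batches.

        Training only on kernel-optimization trajectories can regress general
        capability; the 30% replay fraction here is illustrative.
        """
        n_broad = int(batch_size * broad_frac)
        batch = random.sample(specialized_data, batch_size - n_broad)
        batch += random.sample(broad_data, n_broad)
        random.shuffle(batch)
        return batch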

Seen in
