PATTERN
Critic tool-call introspection suite¶
Intent¶
Give a critic agent a dedicated set of tools for inspecting peer agents' tool-use methodology — not just their claimed findings. The tools expose tool-call arguments, actual tool-call results, per-tool documentation, and the toolset inventory each peer agent had access to. This turns the critic from a claims auditor into a methodology auditor.
Without this tool suite, a critic reviewing an Expert's findings can only evaluate "does the claim look reasonable?" — a surface-level judgement that any confident hallucination passes. With the tool suite, the critic can evaluate "did the Expert use the right tool correctly and interpret the result defensibly?" — the actual methodology.
Canonicalised by Slack's Security Engineering team for Spear's Critic (Source: sources/2026-04-13-slack-managing-context-in-long-run-agentic-applications).
The canonical four tools (Slack's Critic)¶
| Tool | Purpose |
|---|---|
| get_tool_call | Inspect the arguments and metadata of any tool call |
| get_tool_result | Examine the actual output returned by a tool use |
| get_toolset_info | List what tools were available to a specific Expert |
| list_toolsets | List all available toolsets, organised by Expert |
Why four specific tools¶
The four tools are not arbitrary — they are the minimal surface for answering four distinct methodology questions:
Q1: Did the Expert use the tool correctly?¶
Answered by get_tool_call + get_toolset_info (which exposes the tool's documentation).
When an Expert cites "tooluse_abc123 supports my finding," the Critic can:
- Fetch the actual tool arguments used (get_tool_call)
- Read the tool's documentation (get_toolset_info)
- Decide whether the arguments were consistent with the tool's intended use
Example failure mode caught: an Expert ran a tool with arguments outside its documented range, got a misleading result, and confidently interpreted the misleading result.
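A minimal sketch of this check, assuming the inline documentation carries (or can be parsed into) machine-readable argument ranges; the structured `doc_ranges` form is a hypothetical convenience, not part of the pattern:

```python
def args_within_documented_range(arguments, doc_ranges):
    """Flag tool-call arguments that fall outside documented ranges.
    doc_ranges: {param: (lo, hi)} -- a hypothetical structured form
    of the tool's inline documentation."""
    violations = []
    for name, value in arguments.items():
        if name in doc_ranges:
            lo, hi = doc_ranges[name]
            if not (lo <= value <= hi):
                violations.append((name, value, (lo, hi)))
    return violations

# e.g. an Expert queried 90 days of data from a tool documented
# for a 1-30 day window:
#   args_within_documented_range({"days": 90}, {"days": (1, 30)})
```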
Q2: What data did the Expert actually see?¶
Answered by get_tool_result.
LLMs regularly summarise tool results into claims without retaining fidelity. The Critic can fetch the raw result to check:
- Does the Expert's claim actually match what the tool returned?
- Did the Expert notice key details the summary omitted? (Slack's canonical specimen: an Expert summarised process ancestry as "secure" but missed a credential exposure in the raw output.)
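One concrete fidelity check in the spirit of Slack's specimen, as a sketch: after fetching the raw result via get_tool_result, scan it for credential-like tokens a "looks secure" summary may have glossed over. The patterns are illustrative, not exhaustive:

```python
import re

def summary_missed_credentials(raw_result: str) -> list[str]:
    """Scan a raw tool result for credential-like tokens that an
    Expert's summary may have omitted. Patterns are illustrative."""
    patterns = [
        r"AWS_SECRET_ACCESS_KEY=\S+",
        r"(?i)password=\S+",
        r"(?i)api[_-]?key=\S+",
    ]
    hits = []
    for p in patterns:
        hits.extend(re.findall(p, raw_result))
    return hits
```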
Q3: Was this Expert properly equipped for the question?¶
Answered by list_toolsets.
Slack verbatim (Source: sources/2026-04-13-slack-managing-context-in-long-run-agentic-applications):
"It can also use get_tool_info to access each tool's inline documentation to determine if the tool was correctly used, and list_toolsets to understand if the Director made an error by posing a question to an Expert that was not properly equipped to answer"
This is the meta-audit: maybe the Expert's finding is shaky not because the Expert was wrong but because the Director sent the question to the wrong Expert. The Critic catches this, and the Director can use the Critic's annotation to redirect the question.
Q4: Did the Expert pick the right tool?¶
Answered by get_toolset_info (what tools were available to the Expert) + get_tool_call (what the Expert chose).
Slack verbatim: "or if an Expert made a poor tool selection."
An Expert presented with 15 tools may have picked the 8th-best tool for a query. The Critic, seeing the full toolset, can flag this.
Design mechanics¶
1. Tools, not passed-in transcript¶
The naive implementation is to pass the Expert's entire tool-call transcript into the Critic's prompt. This breaks at scale:
- Transcripts can be megabytes (per Slack's "hundreds of inference requests and megabytes of output")
- The Critic rarely needs the full transcript — it needs specific tool calls when investigating specific findings
- Token cost scales linearly with transcript size
The tool-suite approach lets the Critic pull on-demand, keeping its own context small and letting it drill into specific findings as needed.
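The pull model can be sketched as follows, assuming each finding cites the call IDs it rests on (a hypothetical finding shape): the Critic resolves only those IDs rather than ingesting the whole transcript.

```python
def fetch_cited_calls(findings, get_tool_call):
    """Resolve only the tool calls that findings actually cite.
    `findings`: list of {"claim": str, "evidence": [call_id, ...]}
    (hypothetical shape); `get_tool_call`: the suite's accessor."""
    cited = sorted({cid for f in findings for cid in f.get("evidence", [])})
    return {cid: get_tool_call(cid) for cid in cited}
```

Context cost now scales with the number of cited calls under audit, not with transcript size.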
2. Four tools, not twenty¶
The temptation is to expose every inspection primitive the execution framework already has. Slack cut this down to four.
Rationale (structural):
- Each tool added bloats the Critic's prompt and dilutes the Critic's tool-selection accuracy.
- The four tools cover the four distinct methodology questions; additional tools would overlap or introduce new failure modes.
- Four tools are easy to document inline in the Critic's system prompt.
3. Read-only by construction¶
All four tools are read-only. The Critic does not re-run tools with different arguments, doesn't modify the Expert's findings directly, and doesn't write to the Journal. This is an architectural invariant: the Critic's role is to audit, not to re-do the Expert's work.
4. Composable with credibility scoring¶
The four tools are the input to the credibility-scoring rubric's evidence criteria. "Supported by multiple sources with no contradictory indicators" (Trustworthy / 0.9-1.0) requires actually inspecting the sources — which requires these tools.
Without the introspection suite, the rubric's evidence criteria collapse back into surface-level claim evaluation.
Operational properties¶
- Keeps Critic prompt bounded. The Critic's prompt contains the Expert findings + the rubric + the four tools' documentation, not the raw Expert transcript. Token count is manageable even for sprawling investigations.
- Enables lazy audit. The Critic only fetches what it needs to score a given finding. A 0.95-scored finding may not require deep introspection; a borderline 0.55 one might.
- Integrates with tool-call persistence. The suite assumes some form of tool-call persistence (the Hub's persistent storage in Spear). Event-stream infrastructure is a prerequisite.
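The lazy-audit property can be sketched as a score-conditioned policy; the thresholds and tier names below are illustrative assumptions, not Slack's:

```python
def audit_depth(score: float) -> str:
    """Lazy audit policy sketch: how deep the Critic digs depends on
    how borderline the credibility score is. Thresholds are
    illustrative assumptions."""
    if score >= 0.9:
        return "spot-check"       # trustworthy: sample one cited call
    if score >= 0.7:
        return "verify-evidence"  # fetch each cited result
    return "full-introspection"   # args + docs + results + toolset fit
```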
When to reach for this pattern¶
- Multi-agent systems with tool-using experts. If your Experts take actions via tool calls, a Critic without introspection can only audit surface claims.
- Hallucination-sensitive domains. Security, legal, medical, financial — any domain where an Expert hallucinating a tool result and its interpretation has real-world cost.
- Credibility scoring with evidence criteria. If your rubric says "supported by multiple sources," the Critic needs tools to actually check.
When not to reach for it¶
- Single-agent systems. No peer to introspect.
- Experts without tool calls. If the Expert is pure reasoning (no external tools), introspection collapses to reading the Expert's own output — which the Critic already has.
- Trust-by-default workflows. If the domain tolerates Expert findings going to the final report without audit, the introspection overhead is wasted.
Composes with¶
- patterns/director-expert-critic-investigation-loop — the loop shape this pattern plugs into.
- patterns/three-channel-context-architecture — the Critic's Review (output of the introspection-aided scoring) is channel 2 of 3.
- concepts/credibility-scoring-rubric — the rubric whose evidence criteria require these tools to evaluate faithfully.
- concepts/weakly-adversarial-critic — the critic stance; this pattern gives it the "adversarial" teeth.
- patterns/timeline-assembly-from-scored-findings — consumes the scored output this pattern enables.
Contrasts¶
- vs. claims-only auditor — classic LLM-as-judge looks only at the final output. Surface-level; misses methodology errors.
- vs. full-transcript ingestion — feed the full Expert transcript into the Critic's prompt. Doesn't scale past toy problems.
- vs. re-execution — the Critic re-runs tool calls with different arguments to validate. More expensive, more side-effectful, and changes the pattern from audit to re-execution.
- vs. external eval harness — run a separate evaluation pipeline offline. Useful for aggregate quality tracking, but can't gate individual findings in real-time.
Tradeoffs¶
- Requires tool-call persistence. The introspection suite's tools must be able to fetch historical tool calls and results — requires event-stream or transcript persistence infrastructure.
- Increases Critic's tool-selection cognitive load. Four additional tools the Critic must choose among; risk of mis-selection or over-use.
- Doesn't validate tool correctness. If a data-source tool returns wrong data (bug in the tool itself), the Critic reading the result will inherit the wrong data. This pattern audits Expert methodology, not tool correctness — a separate concern.
- Documentation drift risk. get_toolset_info returns the tool's inline documentation; if the tool's actual behaviour drifts from its docs, the Critic's audit will be based on stale assumptions.
Seen in¶
- systems/slack-spear — canonical first wiki instance.
Four tools: get_tool_call, get_tool_result, get_toolset_info, list_toolsets. Critic uses them to "examine evidence and data gathering methodology" and specifically to catch (a) incorrect tool use, (b) misinterpretation of tool output, (c) wrong Expert for the question (Director error), (d) poor tool selection by the Expert. (Source: sources/2026-04-13-slack-managing-context-in-long-run-agentic-applications)
Related¶
- systems/slack-spear
- patterns/director-expert-critic-investigation-loop
- patterns/three-channel-context-architecture
- patterns/timeline-assembly-from-scored-findings
- concepts/weakly-adversarial-critic
- concepts/credibility-scoring-rubric
- concepts/llm-hallucination
- concepts/structured-output-reliability
- concepts/llm-as-judge