Slack — Managing context in long-run agentic applications¶
Summary¶
Second post in Slack's Security Engineering series on the Spear multi-agent security-investigation service (first post canonicalised at sources/2025-12-01-slack-streamlining-security-investigations-with-agents). Where the first post established the three-persona team (Director / Expert / Critic) + phased progression + knowledge pyramid, this second post canonicalises how context is maintained across a long-running multi-agent investigation that may span "hundreds of inference requests and generate megabytes of output."
The core architectural claim: three complementary context channels replace raw message history between agent invocations. The Director's Journal (structured working memory — decisions, observations, findings, questions, actions, hypotheses), the Critic's Review (annotated findings report with credibility scores against a 5-level rubric), and the Critic's Timeline (consolidated chronological narrative built from credible findings) — each serves a different purpose, each is consumed by different agents, and together they provide online context summarisation that negates the need to carry raw message history forward. Verbatim canonical claim: "Besides these resources, we do not pass any message history forward between agent invocations."
Three load-bearing mechanisms canonicalised:
- Director's Journal — a journaling tool with six typed entries (decision / observation / finding / question / action / hypothesis), plus priority + follow-ups + citation refs, auto-annotated with phase/round/timestamp. The Journal is the Director's working memory, and every other agent receives its current content in their prompt. It is how the Director is able to "lead the investigation towards a conclusion, to observe and measure its progress, to identify dead-ends, and to make course corrections."
- Critic's Review tools — the Critic gets four tools (get_tool_call, get_tool_result, get_toolset_info, list_toolsets) that let it inspect Expert methodology: tool arguments, actual returned data, tool documentation, and which Expert had which toolset. This is how the Critic audits methodology, not just claims. Findings are then scored on a 5-level credibility rubric (0.9-1.0 Trustworthy → 0.0-0.29 Misguided). Disclosed distribution over 170,000 reviewed findings: 37.7% / 25.4% / 11.1% / 10.4% / 15.4% — "slightly over a quarter of findings don't meet the plausibility threshold."
- Critic's Timeline — a separate Critic task that prunes hallucinations via narrative coherence. Inputs: most recent Review + previous Timeline + Director's Journal. Consolidation rules explicit: credible-citation-only, dedup, strongest-evidence-wins-on-conflict, chronological ordering. Gap identification capped at top 3 (evidential / temporal / logical). Canonical framing: "A hallucination can only survive this process if it is more coherent with the body of evidence than any real observation it competes with."
Additional mitigations against the Critic itself hallucinating: stronger model tier (narrow scope keeps token budget manageable), instruction narrowing ("judgement on the submitted findings" only), and the Timeline task as downstream-gate forcing narrative coherence.
Key takeaways¶
- Three context channels replace message history. The Director's Journal, the Critic's Review, and the Critic's Timeline are the only state that flows between agent invocations in Spear. Verbatim: "Besides these resources, we do not pass any message history forward between agent invocations. Collectively, these channels provide a means of online context summarisation, negating the need for extensive message histories." Not just a token-budget optimisation — the post argues explicitly that even with unlimited context windows, you wouldn't want to carry raw history forward: "Even if context windows were infinitely large, passing message history between rounds would not necessarily be desirable: the accumulated context could impede the agents' capacity to respond appropriately to new information." (Source: sources/2026-04-13-slack-managing-context-in-long-run-agentic-applications)
- The Journal is structured working memory, not free-form scratch. Six entry types, each with a named purpose: decision (strategic choices), observation (patterns noticed), finding (confirmed facts), question (open items), action (steps taken/planned), hypothesis (working theories). Each is annotated with phase, round, timestamp, optional priority, optional follow-ups, optional citation refs. The journaling tool itself "does nothing more than accumulate entries" — structure comes from typing + auto-annotation, not from database queries or scoring. Every other agent (Experts, Critic) receives the current Journal content in their prompt, rendered as chronology. This is the canonical "shared narrative that keeps other agents on track." (Source: sources/2026-04-13-slack-managing-context-in-long-run-agentic-applications)
- Critics need tool-call introspection, not just claim inspection. Slack gives the Critic four tools that go beyond reading what the Expert said: get_tool_call (arguments + metadata of any tool call), get_tool_result (actual data returned), get_toolset_info (tool's inline documentation), list_toolsets (every toolset organised by Expert). This closes the methodology-audit loop: when an Expert cites tooluse_abc123, the Critic can check if the tool was correctly used (via documentation), whether it was well-chosen (via toolset inventory), whether the arguments were sensible, and whether the Expert's interpretation of the result is defensible. Canonical instance of the patterns/critic-tool-call-introspection-suite pattern — a Critic without these tools is a claims auditor, not a methodology auditor. (Source: sources/2026-04-13-slack-managing-context-in-long-run-agentic-applications)
- Credibility is scored on a 5-level rubric with disclosed distribution. Slack's rubric maps numeric ranges to labels with explicit criteria. From highest to lowest: 0.9-1.0 Trustworthy (multiple corroborating sources, no contradictions), 0.7-0.89 Highly-plausible (single-source corroboration), 0.5-0.69 Plausible (mixed evidence), 0.3-0.49 Speculative (poor evidence), 0.0-0.29 Misguided (no evidence / misinterpreted). Disclosed distribution over 170,000 reviewed findings: 37.7% / 25.4% / 11.1% / 10.4% / 15.4%. The 25.8% sub-plausibility rate is the canonical evidence that without the Critic, a quarter of findings would reach the Director as equally authoritative as the rest. (Source: sources/2026-04-13-slack-managing-context-in-long-run-agentic-applications)
- Narrative coherence as a hallucination filter. The Timeline task is the downstream check: if a finding survived the Review with a decent credibility score but contradicts the broader chronology, it gets pruned. Verbatim canonical framing: "A hallucination can only survive this process if it is more coherent with the body of evidence than any real observation it competes with." This is the load-bearing claim for concepts/narrative-coherence-as-hallucination-filter — a pairing of point-wise credibility scoring (Review) with narrative consistency scoring (Timeline) to filter hallucinations twice: once at the claim level, once at the story level. (Source: sources/2026-04-13-slack-managing-context-in-long-run-agentic-applications)
- Timeline scoring rubric is narrative-building, not point-credibility. Distinct 5-level scale: Trustworthy (strong cross-source corroboration, consistent timestamps, no significant gaps), Highly-plausible (good evidence, minor gaps, mostly consistent), Plausible (uncertainty in ordering, notable gaps), Speculative (poor evidence, significant gaps, conflicted), Invalid (no evidence, confounding inconsistencies). Same numeric bands as the findings rubric but rebased to a coherence metric. Two rubrics, two stages, two filters. (Source: sources/2026-04-13-slack-managing-context-in-long-run-agentic-applications)
- Consolidation rules are explicit and few. Timeline assembly follows four stated rules: (1) "include only events supported by credible citations — speculation doesn't belong on the Timeline" — the scoring output of the Review becomes a membership filter; (2) "remove duplicate entries describing the same event" — two Experts describing the same event should not double-count; (3) "when timestamps conflict, prefer sources with stronger evidence" — a log-entry timestamp beats an inferred time; (4) "maintain chronological ordering based on best available evidence". These four rules are small enough to fit in a prompt, short enough to audit, and concrete enough to explain Timeline output to a human reviewer. (Source: sources/2026-04-13-slack-managing-context-in-long-run-agentic-applications)
- Gap identification is capped at top 3. "We limit gap identification to the top 3 most significant gaps. This focuses the Director's attention on what matters most rather than presenting an exhaustive list of unknowns." Three gap types: Evidential (missing data that would strengthen conclusions), Temporal (unexplained periods between events), Logical (events that don't fit the emerging narrative). Canonical instance of deliberate-scarcity-in-agent-output design: infinite gap-lists produce reader fatigue; top-3 forces triage in the Critic and makes the Director's next question concrete. (Source: sources/2026-04-13-slack-managing-context-in-long-run-agentic-applications)
- Mitigations against Critic hallucination (meta-problem). Slack explicitly addresses the recursive concern — "whether the Critic's Review provides a false sense of assurance; it's also conducted by model inference" — with three stacked mitigations: (1) stronger model tier for the Critic (justified by narrow scope → manageable token budget → less hallucination-prone; cites [arxiv 2411.04368] for the stronger-models-err-less claim); (2) narrow instructions — "the agent is instructed to only make a judgement on the submitted findings" because LLMs "are more likely to hallucinate when posed larger, open-ended questions"; (3) the Timeline task as downstream coherence check. These three mitigations are the canonical answer to who critiques the critic? (Source: sources/2026-04-13-slack-managing-context-in-long-run-agentic-applications)
- Each agent gets a tailored view of the same investigation state. Director sees condensed Timeline + its own Journal; Experts see the Journal + the Director's question; Critic sees Expert findings + tool-call introspection + rubrics + the running Timeline. "For each agent to optimally execute its role, it requires a tailored view of the investigation state. Each view must be carefully balanced. If agents are not anchored to the wider team, the investigation will be disconnected and incoherent. Conversely, sharing too much information stifles creativity and encourages confirmation bias." This is the confirmation-bias cost of over-sharing — a novel framing the wiki lacked: more context is not strictly better in multi-agent systems. (Source: sources/2026-04-13-slack-managing-context-in-long-run-agentic-applications)
- The Journal is a planning artifact, not an event log. Distinct from the Hub/Worker/Dashboard event stream (which carries tool calls, model invocations, system events). The Journal is what the Director thinks, not what the system does. This separation — journal for reasoning state, event stream for execution state — is architecturally load-bearing: replaying an event stream doesn't reconstruct a planning history the way replaying a Journal does, and journaling an event stream doesn't produce coherent reasoning the way the Director's explicit entries do. (Source: sources/2026-04-13-slack-managing-context-in-long-run-agentic-applications)
Three context channels — canonical table¶
| Channel | Produced by | Consumed by | Contents | Purpose |
|---|---|---|---|---|
| Director's Journal | Director (via journaling tool) | All agents (in prompt as chronology) | Six-typed entries (decision / observation / finding / question / action / hypothesis) + priority + follow-ups + citations + phase/round/timestamp | Director's working memory; shared narrative anchor |
| Critic's Review | Critic (one pass over Expert findings) | Director (for decision), Critic (as Timeline input) | Annotated findings with credibility scores against 5-level rubric (0.0-1.0); overall summary | Credibility filter; methodology audit |
| Critic's Timeline | Critic (one pass after Review) | Director (consolidated view) | Chronological event sequence from credible citations + top-3 gaps + narrative-coherence score | Narrative coherence filter; second hallucination pass |
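A minimal sketch of how these three channels could serve as the only carried-forward state, with a tailored view assembled per role. The `InvestigationState` container, the `build_prompt` helper, and the exact section mix are assumptions layered on the table above, not Slack's disclosed implementation:

```python
from dataclasses import dataclass


@dataclass
class InvestigationState:
    journal: str   # Director's Journal, rendered as a chronology
    review: str    # Critic's most recent annotated Review
    timeline: str  # Critic's consolidated Timeline


def build_prompt(agent: str, state: InvestigationState, question: str = "") -> str:
    """Assemble one agent's tailored view of the shared investigation state."""
    if agent == "director":
        # Director sees the condensed Timeline plus its own Journal.
        sections = [state.timeline, state.journal]
    elif agent == "expert":
        # Experts see the Journal plus the Director's question -- not the
        # Review or Timeline, so over-sharing doesn't stifle creativity.
        sections = [state.journal, f"Question: {question}"]
    elif agent == "critic":
        # Critic sees the running Timeline; Expert findings, rubrics,
        # and introspection tools are supplied separately.
        sections = [state.timeline]
    else:
        raise ValueError(f"unknown agent role: {agent}")
    return "\n\n".join(s for s in sections if s)
```

No raw message history appears anywhere in the assembled prompt — the channels themselves are the summarisation.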
Director Journal — entry types (verbatim)¶
| Type | Purpose | Example |
|---|---|---|
| decision | Strategic choices | "Focus investigation on authentication anomalies rather than network activity" |
| observation | Patterns noticed | "Multiple failed logins preceded the successful authentication" |
| finding | Confirmed facts | "User authenticated from IP 203.0.113.45, not in historical baseline" |
| question | Open items | "Was the VPN connection established before or after the suspicious activity?" |
| action | Steps taken/planned | "Requested Cloud Expert to examine EC2 instance activity" |
| hypothesis | Working theories | "This pattern suggests credential stuffing rather than account compromise" |
Plus: priority (assigned by Director), follow-up actions, citation references to evidential artifacts. Tool auto-annotates each entry with phase + round + timestamp.
Critic Review — credibility rubric (verbatim + distribution)¶
| Score | Label | Criteria | % of 170,000 findings |
|---|---|---|---|
| 0.9-1.0 | Trustworthy | Supported by multiple sources with no contradictory indicators | 37.7% |
| 0.7-0.89 | Highly-plausible | Corroborated by a single source | 25.4% |
| 0.5-0.69 | Plausible | Mixed evidence support | 11.1% |
| 0.3-0.49 | Speculative | Poor evidence support | 10.4% |
| 0.0-0.29 | Misguided | No evidence provided or misinterpreted | 15.4% |
Sub-plausibility rate: 25.8% (speculative + misguided). Without the Critic, these would reach the Director as equally-weighted input alongside the 74.2% of findings at or above the plausibility threshold.
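The rubric's numeric bands can be sketched as an ordered threshold lookup. The band boundaries and labels come from the post; the function name and threshold constant are assumptions:

```python
# Ordered (lower-bound, label) bands matching the published rubric.
CREDIBILITY_BANDS = [
    (0.9, "Trustworthy"),
    (0.7, "Highly-plausible"),
    (0.5, "Plausible"),
    (0.3, "Speculative"),
    (0.0, "Misguided"),
]

# Findings below this band are the "sub-plausibility" fraction (25.8%).
PLAUSIBILITY_THRESHOLD = 0.5


def credibility_label(score: float) -> str:
    """Map a 0.0-1.0 credibility score to its rubric label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("credibility score must be in [0, 1]")
    for lower_bound, label in CREDIBILITY_BANDS:
        if score >= lower_bound:
            return label
    return "Misguided"
```

For example, the specimen investigation's 0.83 Timeline confidence lands in the Highly-plausible band (albeit on the Timeline's coherence rubric, which shares these numeric boundaries).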
Critic Review — tool suite¶
| Tool | Purpose |
|---|---|
| get_tool_call | Inspect the arguments and metadata of any tool call |
| get_tool_result | Examine the actual output returned by a tool use |
| get_toolset_info | List what tools were available to a specific Expert |
| list_toolsets | List all available toolsets organised by Expert |
The four tools let the Critic answer four distinct methodology questions:
- Did the Expert use the tool correctly? (via get_tool_call + get_toolset_info documentation)
- What data did the Expert actually see? (via get_tool_result)
- Was this Expert properly equipped for the question? (via list_toolsets — the Director may have posed a question to the wrong domain)
- Did the Expert pick the right tool? (via get_toolset_info inventory — sometimes an Expert selected a poor tool for the job)
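A hedged sketch of the introspection suite over an in-memory tool-call store. Only the four tool names come from the post; the store layout and field names are assumptions:

```python
# Hypothetical record stores populated by the Hub as Experts run.
TOOL_CALLS: dict[str, dict] = {}  # call_id -> {expert, tool, arguments, result}
TOOLSETS: dict[str, dict] = {}    # expert -> {tool_name: documentation}


def get_tool_call(call_id: str) -> dict:
    """Arguments and metadata of any tool call cited in a finding."""
    call = TOOL_CALLS[call_id]
    return {"tool": call["tool"], "arguments": call["arguments"],
            "expert": call["expert"]}


def get_tool_result(call_id: str):
    """The actual data returned -- does the evidence say what the Expert claims?"""
    return TOOL_CALLS[call_id]["result"]


def get_toolset_info(expert: str) -> dict:
    """Inline documentation for the tools available to one Expert."""
    return TOOLSETS[expert]


def list_toolsets() -> dict:
    """Every toolset organised by Expert -- was the question posed to the right domain?"""
    return TOOLSETS
```

The point of the suite is that the Critic never has to take a cited call_id on faith: it can replay the arguments, the returned data, and the documentation behind every claim.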
Critic Timeline — rubric (verbatim)¶
| Score | Label | Meaning |
|---|---|---|
| 0.9-1.0 | Trustworthy | Strong corroboration across multiple sources, consistent timestamps, no significant gaps |
| 0.7-0.89 | Highly-plausible | Good evidence support, minor gaps present, mostly consistent Timeline |
| 0.5-0.69 | Plausible | Some uncertainty in event ordering, notable gaps exist |
| 0.3-0.49 | Speculative | Poor evidence support, significant gaps, conflicted narrative |
| 0.0-0.29 | Invalid | No evidence, confounding inconsistencies present |
Same numeric bands as the Review rubric, rebased from credibility-per-finding to coherence-of-the-narrative.
Critic Timeline — consolidation rules (verbatim)¶
- Include only events supported by credible citations — speculation doesn't belong on the Timeline.
- Remove duplicate entries describing the same event — an event shouldn't appear twice because two Experts mentioned it.
- When timestamps conflict, prefer sources with stronger evidence — a log entry timestamp beats an inferred time.
- Maintain chronological ordering based on best available evidence — events must flow logically in time.
Specimen extracts (from a real false-positive investigation)¶
The post includes edited specimen extracts from a real investigation where a kernel-module-loading alert turned out to be a false positive caused by a developer installing a package in a dev environment. The detection rule matched kmod in the script pathname rather than actual modprobe execution. Timeline confidence: 0.83 (Highly-plausible). Four Experts agreed (Cloud / Endpoint / Identity / Config Management). 6,046 session events retrieved by the Cloud Expert's query. Event sequence reconstructed from 09:29:01Z → 09:31:26Z (ALERT) → 09:31:29Z (modprobe queries complete). Three evidential gaps identified (session init timestamp unknown, triggering command not documented, secondary analyst searched wrong path).
Canonical operational numbers (disclosed)¶
- 170,000 findings reviewed to date (Critic Review rubric distribution disclosed across this corpus).
- Credibility distribution: 37.7% Trustworthy / 25.4% Highly-plausible / 11.1% Plausible / 10.4% Speculative / 15.4% Misguided.
- Sub-plausibility rate: 25.8% (speculative + misguided combined — Slack frames as "slightly over a quarter").
- Specimen investigation: 6,046 session events, 4 Experts, 0.83 Timeline confidence, 3 identified gaps.
- Timeline confidence scale: 0.0-1.0 (narrative-coherence axis, distinct from per-finding credibility).
- Gap identification cap: top 3 (not top 5, not all, not exhaustive).
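The top-3 gap cap can be sketched as a triage step. Only the three gap types and the cap itself come from the post; the `significance` score and dict shape are assumptions — no scoring mechanism is disclosed:

```python
GAP_TYPES = {"evidential", "temporal", "logical"}


def top_gaps(gaps: list[dict], n: int = 3) -> list[dict]:
    """Keep only the n most significant gaps so the Director's attention
    lands on what matters most, not an exhaustive list of unknowns."""
    for g in gaps:
        if g["type"] not in GAP_TYPES:
            raise ValueError(f"unknown gap type: {g['type']}")
    return sorted(gaps, key=lambda g: g["significance"], reverse=True)[:n]
```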
Caveats¶
- First-party numbers limited. 170,000 reviewed findings + distribution table + one specimen investigation (6,046 events, 0.83 confidence, 3 gaps). Not disclosed: total investigation count, throughput (investigations/day), latency per round, token cost per investigation, false-positive rate, time-to-triage, human override rate, Critic agreement rate with human auditors.
- Rubric dimensions unclear. The five levels are named and the distribution disclosed, but how the Critic decides between e.g. 0.85 and 0.90 is not specified — no multi-dimensional breakdown (evidence strength × source reliability × interpretation defensibility), no calibration methodology, no inter-rater reliability with humans.
- Journal schema partial. Six entry types disclosed with example text, but full JSON schema, enum constraints for priority, follow-up action shape, citation ref format are not shown.
- Mutation semantics of Journal. Whether entries are append-only, correctable, or revisable is not disclosed. "The tool itself does nothing more than accumulate entries" suggests append-only but does not rule out edit-in-place.
- Cross-investigation learning. Whether the rubric, the Journal template, or the Timeline rules evolve based on post-investigation review is not disclosed. No mention of rubric version-drift, calibration-over-time, or tuning-from-false-positives.
- Second Critic / disagreement resolution. The worked example mentions "secondary analyst failed to locate parent process using incorrect field name" — suggesting multiple analyst passes. Whether this is a second Critic, a human reviewer, or another Expert round is not clarified.
- Meta-gap: human-in-the-loop boundary. The post promises a future article on "human in the loop: human / agent collaboration" — this ingest does not disclose how the Dashboard's human supervisor interacts with the Journal, Review, or Timeline, or whether they can edit any of them mid-investigation.
- No Claude-specific / Bedrock-specific disclosure. Model family, provider, and per-tier model identity remain undisclosed (consistent with the first post's caveats).
- Token-budget numbers undisclosed. Claim is that the Critic's narrow scope keeps tokens manageable, but no absolute token counts per Review / Timeline task are given.
- This is the second post in a continuing series. The post explicitly flags that a future article will discuss "artifacts as a communication channel between investigation participants, examining the artifact system that connects findings to evidence and enables the verification workflows described in this article" — so the full picture of how evidence is stored, referenced, and verified is still forthcoming.
Relationship to prior Spear coverage¶
- Supersedes at context-mechanism altitude — the first post (2025-12-01) canonicalised the three-persona team + knowledge pyramid + phased progression + hub/worker/dashboard service shape; this second post provides the context-plumbing underneath. The first post's claim that "the Critic annotates findings with credibility scores" is now specified with a five-level rubric + distribution; the claim that "the Director gets a condensed timeline" is now specified with consolidation rules + narrative-coherence rubric + gap-identification cap.
- Reinforces concepts/weakly-adversarial-critic — the credibility-score distribution (25.8% sub-plausibility) is empirical support for the "weakly adversarial" stance's value. An over-cooperative critic wouldn't produce a sub-plausibility fraction this large.
- Reinforces patterns/one-model-invocation-per-task — the Journal write, Review scoring, and Timeline assembly are three separate invocations, each with its own schema, each running on different model tiers.
- Operationalises concepts/investigation-phase-progression — each Journal entry is tagged with phase + round, making phase-progression visible in the reasoning state, not just in the event stream.
Acknowledgements (per the post)¶
Chris Smith, Abhi Rathod, Dave Russell, Nate Reeves (same team cited in the 2025-12-01 post).
Source¶
- Original: https://slack.engineering/managing-context-in-long-run-agentic-applications/
- Raw markdown: raw/slack/2026-04-13-managing-context-in-long-run-agentic-applications-faf455bb.md
- Series: second post; first post at sources/2025-12-01-slack-streamlining-security-investigations-with-agents
Related¶
- companies/slack
- systems/slack-spear
- concepts/weakly-adversarial-critic
- concepts/knowledge-pyramid-model-tiering
- concepts/investigation-phase-progression
- concepts/prompt-is-not-control
- concepts/structured-output-reliability
- concepts/structured-journaling-tool
- concepts/credibility-scoring-rubric
- concepts/narrative-coherence-as-hallucination-filter
- concepts/gap-identification-top-n
- concepts/no-message-history-carry-forward
- concepts/online-context-summarisation
- concepts/context-engineering
- concepts/llm-hallucination
- patterns/director-expert-critic-investigation-loop
- patterns/one-model-invocation-per-task
- patterns/three-channel-context-architecture
- patterns/critic-tool-call-introspection-suite
- patterns/timeline-assembly-from-scored-findings