Skip to content

ATLASSIAN

Read original ↗

Long Horizon: How Atlassian Built a Reasoning Engine for Complex AI Tasks

Summary

Atlassian replaced Rovo Chat's hierarchical multi-agent orchestrator (the "Hybrid Orchestrator" — a coordinator dispatching to per-product sub-agents like JiraAgent, ConfluenceAgent, etc.) with Long Horizon, a single-LLM, single-context, iterative reasoning loop that can execute up to 150 tool-call iterations in one user turn. The architecture flattens every product's tools into a unified namespaced surface the model calls directly, uses progressive disclosure (two meta-tools per product namespace) to avoid paying schema cost for unused tools, manages context-window pressure via a dedicated compaction service, spawns child instances for wide (parallel) research tasks, and orders prompt layers from most-stable to most-volatile to maximise prefix-cache hit rates. Production results: +8.5% offline accuracy, +23% task completion on Confluence tasks, +0.83% chat success rate, 37% perceived latency reduction via streaming progress updates.

Key takeaways

  1. Single-loop replaces multi-agent hierarchy. The Hybrid Orchestrator's per-product sub-agents created lossy hand-offs: the orchestrator never saw raw tool outputs, only summaries from sub-agents. Long Horizon keeps everything — tool calls, raw responses, errors, recovery decisions — in one LLM context. This eliminates information loss and enables end-to-end error recovery as part of the same reasoning loop. (Section: "The Long Horizon architecture")

  2. Flattened tool surface. Every operation across first-party products (Jira, Confluence, Bitbucket, Jira Service Management, Compass) and third-party connectors (Google Calendar, Google Drive, Slack, GitHub, Microsoft Teams) is exposed as a typed, namespaced action (e.g. jira__search_issues, google_calendar__list_events) called directly by the orchestrator LLM. No intermediate agent paraphrases. (Section: "Flattened tool architecture")

  3. Progressive disclosure via meta-tools. Each product namespace is collapsed to two meta-tools: {product}__get_tool_schema (returns full input schema on demand; its description carries one-line summaries of all tools in that namespace) and {product}__invoke_tool (executes by name with arguments). Frequently-used tools (search, todo-list, file read/write, memory retrieval) stay flat at the top level. Schema cost is bounded: fetched once per tool per task. (Section: "Progressive disclosure")

  4. SKILL.md per product namespace. Each product ships a hand-authored guide encoding domain-specific business logic — which tool to reach for, how product concepts map to user intent, common multi-step recipes, gotchas. This replaces the implicit per-product expertise that previously lived inside sub-agent prompts. (Section: "Progressive disclosure")

  5. Context compaction service. A dedicated service runs before each model call. When the conversation approaches the token limit, older tool outputs are trimmed/summarised while recent results stay at full resolution. Pruned outputs are offloaded (not discarded) and can be read back on demand. Keeps long runs within the context window without losing earlier reasoning. (Section: "Context Compaction")

  6. Task decomposition via child instances. For wide (not deep) tasks, Long Horizon spawns child instances — each a full Long Horizon loop with its own clean context — to own independent research strands. Strands run concurrently; the slowest sets response time. The parent synthesises finished results without carrying every intermediate finding. This is structurally the same as patterns/context-segregated-sub-agents but the child is a full reasoning loop, not a narrow tool call. (Section: "Task decomposition via child instances")

  7. Prompt layer ordering for prefix-cache maximisation. Prompts are assembled from most-stable to most-volatile: (1) static system prompt, (2) stable session context (org, user, timezone, skill instructions), (3) conversation history (grows but earlier turns immutable), (4) turn-dependent context (current iteration's tool results). Longest possible byte-identical prefix stays stable across iterations. Explicit cache_control markers placed for Anthropic; OpenAI/Gemini use implicit prefix caching. The longer a task runs, the more caching pays off. (Section: "Prompt assembly and caching")

  8. Adaptive reasoning effort. The model calibrates reasoning depth per query: minimal overhead for simple lookups, deeper planning for multi-step research. Same architecture handles both extremes without a one-size-fits-all latency trade-off. (Section: "Adaptive reasoning")

  9. Observability as hierarchical trace trees. Structured LLM tracing captures every orchestrator decision, tool invocation, latency breakdown, and token cost as a hierarchical trace tree. Engineers drill from top-level orchestrator span down through each reasoning iteration. Debugging a 40-step research task works like debugging a distributed microservice call, except "services" are LLM reasoning steps and "RPCs" are tool calls. (Section: "Observability for Long-Running Agent Tasks")

  10. Production metrics. Offline accuracy: 77% vs 71% (Hybrid Orchestrator + model updates). Online chat success rate: +0.83%. Task completion on Confluence: +23% relative. Perceived latency: −37% via streaming progress. Iteration budget: 150 (up from low single-digit). Timeout: 20 min (up from 10 min). LLM calls per tool: 1 (down from 2). (Section: "Results")

Operational numbers

Metric Hybrid Orchestrator Long Horizon
Iteration budget Low single-digit 100+ (up to 150)
LLM calls per tool use 2 1
Timeout 10 min 20 min
Offline accuracy 71% 77% (+8.5%)
Task completion (Confluence) baseline +23% relative
Chat success rate (online) baseline +0.83%
Perceived latency baseline −37% (streaming)

Architecture diagram (text)

User query
┌──────────────────────────────────────────────────────────┐
│  LONG HORIZON ORCHESTRATOR (single LLM, single context)  │
│                                                          │
│  Prompt assembly (layered: static → session → history → turn) │
│       │                                                  │
│       ▼                                                  │
│  ┌─────────────────────┐                                 │
│  │  Reasoning Loop      │  (up to 150 iterations)        │
│  │  ┌─────────────────┐│                                 │
│  │  │ LLM call        ││                                 │
│  │  │   ↓             ││                                 │
│  │  │ Tool call(s)?   ││─── yes ──→ Execute (parallel    │
│  │  │   ↓             ││            if independent)      │
│  │  │ Final answer?   ││─── yes ──→ Stream to user       │
│  │  └─────────────────┘│                                 │
│  └─────────────────────┘                                 │
│       │                                                  │
│       ├── Context Compaction Service (before each call)   │
│       │                                                  │
│       └── Spawn child instances (for wide tasks)         │
│                                                          │
│  Tool surface:                                           │
│  ├── Top-level flat tools (search, memory, file I/O)     │
│  ├── {product}__get_tool_schema (meta-tool per namespace) │
│  └── {product}__invoke_tool (execution per namespace)    │
└──────────────────────────────────────────────────────────┘

Concepts and patterns extracted

Systems: - systems/atlassian-long-horizon — the Long Horizon reasoning engine itself - systems/rovo-chat — Atlassian's user-facing AI chat product (powered by Long Horizon)

Concepts: - concepts/context-window-as-token-budget — context window pressure drives compaction, decomposition, and progressive disclosure - concepts/context-compaction — trimming/summarising older tool outputs to stay within token limits - concepts/adaptive-reasoning-effort — model calibrating reasoning depth to task complexity - concepts/progressive-tool-disclosure — exposing tool schemas on demand rather than upfront - concepts/prompt-prefix-caching — ordering prompt layers for maximal byte-identical prefix reuse - concepts/agent-observability-trace-tree — hierarchical trace trees for LLM agent debugging

Patterns: - patterns/single-loop-agent-orchestration — one LLM, one context, one iterative loop replacing multi-agent hierarchy - patterns/flattened-tool-architecture — collapsing per-product sub-agents into typed namespaced tools called directly - patterns/progressive-tool-disclosure-meta-tools — two meta-tools per namespace (get_schema + invoke) for on-demand schema loading - patterns/prompt-layer-ordering-for-cache-hits — assembling prompts from stable-to-volatile to maximise prefix caching - patterns/context-compaction-service — dedicated service trimming/summarising older context before each LLM call - patterns/context-segregated-sub-agents — (existing) child instances for parallel wide-task decomposition

Caveats

  • No model cost/token-usage numbers disclosed; no $/query economics.
  • Context compaction heuristics unspecified (what gets trimmed, threshold, summarisation quality).
  • No failure-recovery details for when child instances fail or hit their own context limits.
  • No discussion of tool-selection accuracy degradation with the flattened surface (progressive disclosure is positioned as sufficient mitigation but no empirical evidence shown).
  • Adaptive reasoning mechanism unspecified — unclear whether it's model-native (extended thinking budgets) or orchestrator-controlled (iteration limits).
  • SKILL.md authoring and maintenance burden not discussed.
  • 150-iteration limit and 20-min timeout without discussion of long-tail cost or runaway-agent guardrails.

Source

Last updated · 542 distilled / 1,571 read