A Guided Tour of Multi-Agent Architectures
Single-agent to hierarchical swarms — every major multi-agent pattern, its trace shape, where it shines, and where it breaks.
The word "agent" covers a lot of ground. A single-agent loop calling three tools is architecturally nothing like a supervisor orchestrating a dozen specialized workers. The failure modes are different, the trace shapes are different, and what it takes to debug them is different.
This article walks through every major pattern in use today. For each one: what it looks like, where it wins, and where it breaks — with particular attention to the trace shape that falls out of each design, because that shape is what you're debugging when things go wrong.
The Comparison Table
| Architecture | Strengths | Failure modes | Span shape |
|---|---|---|---|
| Single-agent loop | Simple, debuggable, predictable | Context saturation, tool overuse | Linear chain; easy to read |
| Supervisor / worker | Parallelism, specialization | Coordination overhead, context loss at handoff | Star topology from supervisor |
| Hierarchical (multi-tier) | Scale, complex decomposition | Error propagation across tiers, debug depth | Deep nested tree |
| Swarm | Resilience, emergent solutions | Non-determinism, consensus failures | Dense graph, parallel branches |
| Planner-executor | Separation of concerns | Plan-reality mismatch, stale plans | Two-phase: plan phase + exec phase |
| ReAct | Tight thought-action loop | Premature convergence, hallucinated tool results | Alternating reasoning + action spans |
| Reflection | Self-correction, quality improvement | Over-correction, infinite refinement loops | Pairs of generation + critique spans |
| Tree of Thought | Exploration, complex reasoning | Combinatorial explosion, branch coherence | Fan-out tree with pruning spans |
| Multi-step with tool use | Grounded, verifiable | Cascading tool failures, context fragmentation | Interleaved LLM + tool spans |
1. Single-Agent Loop
The simplest architecture: one agent, a system prompt, a set of tools, and a loop that runs until the task is done or the context is full.
┌─────────────────────────────────────────┐
│ Single Agent │
│ │
│ [Prompt] → [LLM] → [Tool Call] │
│ ↑ ↓ │
│ [Observe] ← [Result] │
└─────────────────────────────────────────┘
Trace shape: A single root span with child spans for each tool call. Whether the LLM reasoning step gets its own span depends on the framework: some frameworks collapse the reasoning into the parent span and create children only for tool invocations.
trace: single-agent-run
└── span: agent.loop (root)
    ├── span: tool.search_web
    ├── span: tool.read_file
    ├── span: tool.write_file
    └── span: agent.final_response
Where it shines: Tasks with bounded scope and clear completion criteria. Code generation for a single function, document summarization, question answering with RAG. The simplicity means debugging is straightforward — the trace is linear and readable.
Where it breaks: Context saturation is the primary failure mode. Long-running tasks accumulate context until the model can no longer attend reliably to all of it. The agent starts ignoring earlier instructions, dropping constraints, or misstating facts that appeared in the context 50,000 tokens earlier. The trace shape is fine; the problem is semantic: reasoning quality degrades silently as context grows. At the trace level, all spans succeed. Only attribution reveals that a mid-run LLM call began ignoring the task constraints.
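One way to make saturation visible is to give the loop explicit budgets, so it stops with a named reason rather than degrading quietly. The sketch below is a minimal illustration, not any framework's API: `fake_llm`, `run_agent`, and the character-based context budget are all hypothetical, and characters are only a rough proxy for tokens.

```python
from typing import Callable

# Hypothetical stand-ins: a real system would call a model API and real tools.
Tool = Callable[[str], str]

def fake_llm(context: list[str]) -> dict:
    """Scripted 'model': search once, then answer. Illustration only."""
    if not any(line.startswith("result:") for line in context):
        return {"action": "search_web", "input": "multi-agent patterns"}
    return {"action": "final", "input": "done"}

def run_agent(task: str, tools: dict[str, Tool], llm=fake_llm,
              max_steps: int = 8, max_context_chars: int = 4000) -> dict:
    context = [f"task: {task}"]
    spans = []  # one entry per tool call, mirroring the trace shape above
    for _ in range(max_steps):
        decision = llm(context)
        if decision["action"] == "final":
            spans.append("agent.final_response")
            return {"answer": decision["input"], "spans": spans, "stopped": "done"}
        result = tools[decision["action"]](decision["input"])
        spans.append(f"tool.{decision['action']}")
        context.append(f"result: {result}")
        # Stop with an explicit reason once the context budget is exceeded.
        if sum(len(c) for c in context) > max_context_chars:
            return {"answer": None, "spans": spans, "stopped": "context_budget"}
    return {"answer": None, "spans": spans, "stopped": "step_budget"}

tools = {"search_web": lambda q: f"3 hits for {q!r}"}
outcome = run_agent("summarize patterns", tools)
overflow = run_agent("summarize patterns", {"search_web": lambda q: "y" * 5000})
```

Here `outcome["stopped"]` is `"done"`, while `overflow["stopped"]` is `"context_budget"`: the second run ends with a reason you can search for in logs, instead of a chain of spans that all look successful.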
2. Supervisor / Worker
A coordinator agent breaks the task into subtasks and delegates each to a specialized worker agent. Workers report back; the supervisor aggregates and decides what to do next.
┌─────────────────────────────────────────────────────┐
│ Supervisor │
│ (decompose + coordinate) │
└──────┬──────────────┬──────────────┬────────────────┘
↓ ↓ ↓
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Worker A │ │ Worker B │ │ Worker C │
│ (search) │ │ (analyze)│ │ (write) │
└──────────┘ └──────────┘ └──────────┘
Trace shape: A star topology. The supervisor span is the root; each worker invocation is a direct child. Workers may themselves have children (tool calls, LLM steps). The depth is bounded.
trace: supervisor-run
└── span: supervisor.orchestrate (root)
    ├── span: worker.search_agent
    │   ├── span: tool.web_search
    │   └── span: llm.summarize_results
    ├── span: worker.analyze_agent
    │   └── span: llm.analyze
    └── span: worker.write_agent
        └── span: llm.draft_report
Where it shines: Parallelism. Workers can run concurrently. Specialization: each worker is prompted for a narrow task, so its reasoning is tighter. The supervisor's context stays clean because it only sees summaries, not the full working state of each worker.
Where it breaks: The handoff. When the supervisor passes context to a worker, it must compress the task state into a message. If that compression loses critical information — a constraint, a prior decision, a failure from an earlier worker — the receiving worker operates on incomplete context. The failure shows up in the worker's span as a seemingly correct execution that produces a wrong output. Attribution: Context Handling Failures.
Another failure mode: the supervisor's aggregation step. If two workers return contradictory results, the supervisor must adjudicate. If it doesn't have a clear rubric (or if the prompt doesn't anticipate this case), it may silently pick one and proceed. The trace shows all workers succeeding; the failure is in the supervisor's synthesis span.
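Both failure modes can be guarded against structurally. The following sketch assumes a hypothetical `Handoff` record that carries constraints verbatim rather than through a summary, and a supervisor that flags contradictory worker claims instead of silently picking one. The worker functions are stubs standing in for LLM-backed agents; none of the names come from a real framework.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass(frozen=True)
class Handoff:
    subtask: str
    constraints: tuple[str, ...]  # passed through unchanged, never summarized

# Hypothetical workers that disagree on the same claim.
def search_worker(h: Handoff) -> dict:
    return {"worker": "search", "claim": "api_is_stable", "value": True,
            "constraints_seen": h.constraints}

def analyze_worker(h: Handoff) -> dict:
    return {"worker": "analyze", "claim": "api_is_stable", "value": False,
            "constraints_seen": h.constraints}

def supervise(task: str, constraints: list[str], workers: list) -> dict:
    handoffs = [Handoff(f"{task} (part {i + 1})", tuple(constraints))
                for i in range(len(workers))]
    # Workers run concurrently; pool.map returns results in input order.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda w, h: w(h), workers, handoffs))
    # Group reported values by claim; more than one value means a conflict.
    values_by_claim: dict[str, set] = {}
    for r in results:
        values_by_claim.setdefault(r["claim"], set()).add(r["value"])
    conflicts = sorted(c for c, v in values_by_claim.items() if len(v) > 1)
    return {"results": results, "conflicts": conflicts}

report = supervise("evaluate vendor API", ["budget <= 100 USD"],
                   [search_worker, analyze_worker])
```

Surfacing `report["conflicts"]` as its own span attribute would let the trace show the disagreement directly, rather than leaving it buried in the synthesis step.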
3. Hierarchical (Multi-Tier)
Supervisors supervising supervisors. The task is decomposed recursively — a top-level orchestrator breaks the problem into sub-problems, each sub-problem is handled by a mid-tier agent that further decomposes, and leaf agents handle atomic tasks.
┌───────────────────────────────────────────────────────┐
│ Top Orchestrator │
└────────────────────┬─────────────────────────────────┘
↓
┌────────────┴────────────┐
↓ ↓
┌──────────────┐ ┌──────────────┐
│ Sub-manager │ │ Sub-manager │
│ A │ │ B │
└──────┬───────┘ └──────┬───────┘
↓ ↓ ↓ ↓ ↓
[leaf][leaf][leaf] [leaf][leaf]
Trace shape: A deep nested tree. Root span → tier-1 spans → tier-2 spans → leaf spans. A complex workflow can produce hundreds of spans spread across these tiers.
Where it shines: Complex, long-horizon tasks that genuinely decompose hierarchically. Software project planning, multi-document research synthesis, multi-step code refactoring across a large codebase. The hierarchical structure maps to the problem structure.
Where it breaks: Error propagation. An error at a leaf span doesn't stay at the leaf; it can propagate upward through every tier. A mid-tier agent that receives a failure from a leaf might retry, handle it gracefully, or silently move on. If it moves on, the top-level orchestrator receives a subtly degraded result with no indication that something went wrong two tiers down. By the time the failure becomes visible at the top level, the causal chain spans many hops. This is the hardest debugging problem in multi-agent systems. As a rule of thumb, attribution difficulty grows with both trace depth and the error propagation rate.
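A mid-tier agent that moves on after a leaf failure does not have to do so silently. As a rough sketch, each tier can return a result that records which spans below it failed, so the degradation stays visible at the top. `TierResult`, `leaf`, and `combine` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class TierResult:
    value: str
    degraded_by: list[str] = field(default_factory=list)  # names of failed spans below

def leaf(name: str, ok: bool) -> TierResult:
    """Simulated leaf agent: either produces a value or reports itself as failed."""
    if ok:
        return TierResult(value=f"{name}:ok")
    return TierResult(value="", degraded_by=[name])

def combine(name: str, children: list[TierResult]) -> TierResult:
    """Mid-tier agent that proceeds despite child failures but records them."""
    merged = [c.value for c in children if c.value]
    failures = [f for c in children for f in c.degraded_by]
    return TierResult(value=f"{name}({', '.join(merged)})", degraded_by=failures)

sub_a = combine("sub_a", [leaf("leaf1", True), leaf("leaf2", False)])
sub_b = combine("sub_b", [leaf("leaf3", True)])
top = combine("top", [sub_a, sub_b])
```

In this toy run `top.degraded_by` is `["leaf2"]`: the top-level result still looks complete, but it carries the name of the span that failed two tiers down.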
ProveAI Origin's class-6 compression was built specifically for this pattern — it walks the parent-child chain from the failure span to the root, keeping only the causally relevant subgraph. See Common Failure Modes for the specific failure signatures that appear in hierarchical traces.
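The general idea of walking parent links from a failing span to the root can be sketched in a few lines. This is an illustration of the technique only, not ProveAI Origin's implementation: the span dictionary layout and the `causal_chain` function are assumptions made for the example.

```python
def causal_chain(spans: dict[str, dict], failure_id: str) -> list[str]:
    """Return span ids from the root down to the failing span."""
    chain = []
    seen = set()
    current = failure_id
    while current is not None:
        if current in seen:
            # Malformed trace: parent links should never form a loop.
            raise ValueError(f"cycle in parent links at {current!r}")
        seen.add(current)
        chain.append(current)
        current = spans[current]["parent"]
    return list(reversed(chain))

# Hypothetical trace: two sub-managers, three leaves, one failing leaf.
spans = {
    "root":  {"parent": None,    "name": "top.orchestrate"},
    "mgr_a": {"parent": "root",  "name": "sub_manager.a"},
    "mgr_b": {"parent": "root",  "name": "sub_manager.b"},
    "leaf1": {"parent": "mgr_a", "name": "leaf.fetch"},
    "leaf2": {"parent": "mgr_a", "name": "leaf.parse"},
    "leaf3": {"parent": "mgr_b", "name": "leaf.write"},
}
chain = causal_chain(spans, "leaf2")
```

For the failing `leaf2` span the chain is `["root", "mgr_a", "leaf2"]`; the sibling leaf and the other sub-manager's branch drop out, which is the reduction that makes a deep tree readable.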
4. Swarm
A collection of agents with no fixed supervisor. Agents communicate with each other, share state via a shared context or message bus, and collectively converge on a solution. Popular in research; rare in production.
┌──────┐ ←→ ┌──────┐
│ A │ │ B │
└──────┘ └──────┘
↕ ↕
┌──────┐ ←→ ┌──────┐
│ C │ │ D │
└──────┘ └──────┘
(shared state)
Trace shape: A dense graph. Every agent-to-agent communication is a span; the topology is a graph, not a tree. Parent-child relationships may be ambiguous — multiple agents can contribute to the same downstream span.
Where it shines: Tasks where the optimal decomposition isn't known in advance. Adversarial debate (one agent proposes, another critiques, a third arbitrates). Brainstorming. Exploratory research. The emergent coordination can find solutions that a fixed hierarchy would miss.
Where it breaks: Non-determinism. Run the same swarm twice and you get different traces — different agent activations, different communication patterns, different convergence paths. This makes debugging hard (you can't replay the exact failure) and regression testing harder (there's no canonical "correct" trace to compare against). Consensus failures are another class of failure unique to swarms: agents converge on an answer that no individual agent would have endorsed, driven by the dynamics of the group rather than the quality of the reasoning.
From an observability standpoint, swarms are the hardest architecture to work with. The trace graph resists compression; attribution is genuinely ambiguous when multiple agents contributed to a failure. If you're debugging swarm behavior, start by reducing it to a simpler architecture first.
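Part of the replay problem can be reduced by pinning the randomness you control and logging every message in order. The toy below assumes agent selection is driven by a seeded `random.Random`; `run_swarm` and its message format are hypothetical. It does not address non-determinism inside the model itself (sampling, provider-side changes), which a seed on your side cannot fix.

```python
import random

def run_swarm(seed: int, rounds: int = 5) -> list[tuple[str, str, int]]:
    """Toy swarm: each round one agent messages another and nudges shared state."""
    rng = random.Random(seed)  # all scheduling randomness flows through this
    agents = ["A", "B", "C", "D"]
    shared = {"estimate": 0}
    log = []
    for _ in range(rounds):
        sender, receiver = rng.sample(agents, 2)  # two distinct agents
        shared["estimate"] += rng.randint(-2, 3)
        # The ordered message log is what you would replay or diff later.
        log.append((sender, receiver, shared["estimate"]))
    return log

first = run_swarm(seed=7)
second = run_swarm(seed=7)
```

With the same seed, `first == second`, so the message log can serve as a reference trace for regression tests on the scheduling layer.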
5. Planner-Executor
A dedicated planning agent generates a plan — a structured list of steps. A separate executor agent (or set of agents) carries out the steps. The planning and execution phases are architecturally distinct.
┌─────────────────┐
│ Planner │ → generates [step1, step2, step3, ...]
└─────────────────┘
↓
┌─────────────────┐
│ Executor │ → carries out each step
└─────────────────┘
↓
[step1] → [step2] → [step3] → ...
Trace shape: Two phases separated by a handoff span. The planning phase produces a structured artifact (often JSON). The execution phase produces a sequence of spans, one per step. A plan-validation span may sit between the two phases.
Where it shines: Tasks with a well-defined structure where the planning and execution competencies are genuinely different. Code generation pipelines (plan the architecture, then implement each file). Research workflows (plan the questions, then answer each one). The separation means you can inspect and edit the plan before execution begins.
Where it breaks: Plan-reality mismatch. The planner generates a plan based on its understanding of the environment. By the time the executor runs step 3, the environment has changed (a file was modified, an API returned an unexpected response, step 2 failed partially). The plan is stale. The executor either fails hard or silently compensates, producing an output that diverges from the plan's intent. Attribution: Goal Deviation or Incorrect Problem Identification, depending on whether the planner's model of the world was wrong or whether external state changed mid-execution.
The plan artifact itself can be the failure point. If the planner generates a syntactically valid but semantically wrong plan — a step that's feasible in isolation but impossible given the dependencies — the executor will dutifully execute incorrect steps before hitting the wall. Tool Definition Issues is a common L4 label in planner-executor traces.
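A plan-validation step can catch dependency problems before the executor starts. The sketch below assumes a hypothetical plan format where each step has an `id` and a `needs` list; it uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) to reject unknown dependencies and cycles. It checks structure only and cannot tell whether a step is semantically right.

```python
from graphlib import CycleError, TopologicalSorter

def validate_plan(plan: list[dict]) -> list[str]:
    """Return an execution order, or raise ValueError if the plan cannot run."""
    ids = {step["id"] for step in plan}
    graph = {}
    for step in plan:
        missing = set(step["needs"]) - ids
        if missing:
            raise ValueError(
                f"step {step['id']!r} depends on unknown steps: {sorted(missing)}")
        graph[step["id"]] = set(step["needs"])
    try:
        # static_order() yields each step after all of its dependencies.
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError(f"plan has a dependency cycle: {exc.args[1]}") from exc

plan = [
    {"id": "design", "needs": []},
    {"id": "implement", "needs": ["design"]},
    {"id": "test", "needs": ["implement"]},
]
order = validate_plan(plan)
```

Running this check in the span between the planning and execution phases turns "impossible given the dependencies" into an early, attributable failure rather than one discovered several steps into execution.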
6. ReAct (Reason + Act)
ReAct is a prompting pattern more than an architecture: the agent alternates between explicit reasoning steps ("Thought: I need to find the current price of X") and action steps ("Action: search_web(query='current price of X')"). The alternation is the loop.
[Thought] → [Action] → [Observation]
↑ ↓
└────────────────────────┘
(loop)
Trace shape: Alternating pairs of reasoning spans and action spans, wrapped in a loop span. The thought content is usually captured as a span attribute or event.
trace: react-run
└── span: react.loop (root)
    ├── span: react.step[1]
    │   ├── event: thought="I should search for..."
    │   └── span: tool.search_web
    ├── span: react.step[2]
    │   ├── event: thought="The result shows... I should now..."
    │   └── span: tool.read_document
    └── span: react.finalize
        └── event: thought="I have enough information to answer..."
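A minimal loop that produces this kind of trace might look like the following. It is a sketch under stated assumptions: `scripted_policy` is a hypothetical stand-in for the model, the tool is a stub, and the trace is a plain list of dictionaries rather than real spans.

```python
def scripted_policy(observations: list[str]) -> dict:
    """Hypothetical stand-in for the model: returns a thought plus an action."""
    if not observations:
        return {"thought": "I should search first.",
                "action": "search_web", "arg": "price of X"}
    return {"thought": "I have enough information to answer.",
            "action": "finish", "arg": observations[-1]}

def react(policy, tools: dict, max_steps: int = 5):
    observations, trace = [], []
    for step in range(1, max_steps + 1):
        decision = policy(observations)
        record = {"span": f"react.step[{step}]", "thought": decision["thought"]}
        if decision["action"] == "finish":
            record["span"] = "react.finalize"
            trace.append(record)
            return decision["arg"], trace
        # Act, then feed the real tool output back in as the observation.
        observation = tools[decision["action"]](decision["arg"])
        record["tool"] = f"tool.{decision['action']}"
        record["observation"] = observation
        observations.append(observation)
        trace.append(record)
    return None, trace  # step budget exhausted without a final answer

answer, trace = react(scripted_policy, {"search_web": lambda q: "X costs 42"})
```

Each trace record keeps the thought next to the tool call it motivated, which is what makes the pattern easy to audit.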
Where it shines: Tasks that benefit from explicit reasoning transparency. The thought-action-observation loop makes the agent's reasoning process auditable. If you're debugging a ReAct agent, you can read the thought spans and understand exactly what the agent believed at each step. It's the most human-readable trace pattern.
Where it breaks: Premature convergence — the agent decides it has enough information before it actually does. The thought span says "I have enough to answer" when the observations were actually insufficient, incomplete, or misinterpreted. Attribution: Poor Information Retrieval or Incorrect Problem Identification. The trace will show a perfectly valid loop that terminates too early; only the content of the final thought span reveals the error.
Hallucinated tool results are another failure mode: the agent fabricates an observation in its reasoning rather than using the actual tool output. This is subtle — the tool call span succeeds, but the subsequent thought span references a "result" that differs from what the tool actually returned. Attribution: Tool Output Misinterpretation.
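If thoughts are logged in a structured form, a crude check for fabricated observations is possible. The sketch assumes each step record has a hypothetical `claimed_observation` field (what the thought says the previous tool returned) and a `tool_output` field (what the tool actually returned). Substring matching is deliberately simplistic and will miss paraphrases.

```python
def ungrounded_steps(steps: list[dict]) -> list[int]:
    """Flag steps whose claimed observation is absent from the prior tool output."""
    flagged = []
    for previous, current in zip(steps, steps[1:]):
        claimed = current.get("claimed_observation")
        actual = previous.get("tool_output")
        if claimed is not None and actual is not None and claimed not in actual:
            flagged.append(current["step"])
    return flagged

# Hypothetical step records: step 2 misquotes what step 1's tool returned.
steps = [
    {"step": 1, "tool_output": "price: 42 USD"},
    {"step": 2, "claimed_observation": "price: 24 USD", "tool_output": "stock: 7"},
    {"step": 3, "claimed_observation": "stock: 7"},
]
flagged = ungrounded_steps(steps)
```

Here `flagged` is `[2]`: every tool span succeeded, and the mismatch only appears when the thought content is compared against the recorded output.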
7. Reflection
The agent generates an output, then critiques its own output, then revises based on the critique. Sometimes the critic is a separate model or agent; sometimes it's the same model with a different prompt.
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Generate │ → │ Critique │ → │ Revise │
└──────────────┘ └──────────────┘ └──────────────┘
↓
(repeat N times or until quality threshold)
Trace shape: Pairs of generation + critique spans, possibly inside a revision loop span.
Where it shines: Quality-sensitive tasks where the first-pass output is likely to have errors. Code with edge cases, arguments that need to anticipate counterarguments, plans that need to check feasibility. Reflection tends to improve output quality on well-defined tasks, where the critic has a concrete standard to check against.
Where it breaks: Two patterns. First, over-correction: the critique identifies a real flaw, the revision overcorrects, introducing a new flaw, the next critique catches that, and the loop oscillates. The trace shows successful critique and revision spans that collectively move away from a good answer. Attribution: Goal Deviation. Second, infinite refinement: the loop has no exit condition strong enough to terminate. The trace grows unboundedly. Attribution: Resource Abuse. Budget your reflection loops explicitly.
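Both failure patterns suggest the same guardrails: a hard round limit, a quality threshold, and keeping the best version seen so far rather than the latest. A minimal sketch follows, with `critique` and `revise` passed in as plain functions; the scores and version names are made up to simulate an oscillating loop.

```python
def reflect(draft, critique, revise, max_rounds: int = 3, threshold: float = 0.9):
    """Budgeted reflection: stop at the threshold or after max_rounds."""
    best, best_score = draft, critique(draft)
    history = [best_score]
    current = draft
    for _ in range(max_rounds):
        if best_score >= threshold:
            break
        current = revise(current)
        score = critique(current)
        history.append(score)
        # Keep the best version seen, so a bad revision cannot overwrite it.
        if score > best_score:
            best, best_score = current, score
    return best, history

# Hypothetical critic scores: quality peaks at v1, then oscillates.
scores = {"v0": 0.5, "v1": 0.8, "v2": 0.6, "v3": 0.7}
next_version = lambda text: f"v{int(text[1:]) + 1}"
best, history = reflect("v0", scores.__getitem__, next_version)
```

The run returns `"v1"` with a score history of `[0.5, 0.8, 0.6, 0.7]`. The history itself is a useful span attribute: a score sequence that rises and then wobbles is the signature of over-correction.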
8. Tree of Thought
The agent generates multiple candidate next steps, evaluates each, and selects the most promising to continue. This turns the linear reasoning chain into a tree, enabling backtracking and exploration.
[root]
/ | \
[A] [B] [C]
/ | | |
[A1][A2][B1] [C1]
|
[B1a] ← selected
Trace shape: A fan-out tree with evaluation spans at each branch point and pruning spans where branches are terminated.
Where it shines: Planning problems, puzzle-solving, tasks where the right approach is not obvious and exploration helps. Tree of Thought can substantially outperform linear reasoning on problems with a well-defined correctness criterion, where candidate branches can be scored reliably.
Where it breaks: Combinatorial explosion. The number of nodes in the tree grows exponentially with depth if not carefully controlled. Evaluation spans become the bottleneck — if the evaluation heuristic is imprecise, the wrong branches are pruned. Attribution: Incorrect Problem Identification (wrong evaluation function) or Task Orchestration (poorly designed branching policy).
The trace for a failed ToT run is genuinely hard to read: hundreds of spans, many of which were dead ends. The failure is often not in the span where the run ends but in an earlier evaluation span that scored a promising branch below a dead-end branch.
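One common way to bound the explosion is to cap how many branches survive each level. The sketch below is a beam-limited search, which is a simplification of Tree of Thought: it prunes but never backtracks to a discarded branch. The `expand` and `score` functions are toy stand-ins for LLM-generated candidates and an LLM-based evaluator.

```python
def tree_search(root, expand, score, beam_width: int = 2, max_depth: int = 3):
    """Keep only the top `beam_width` candidates at each depth."""
    frontier = [root]
    evaluated = 0
    for _ in range(max_depth):
        candidates = [child for node in frontier for child in expand(node)]
        if not candidates:
            break
        evaluated += len(candidates)
        # Pruning step: everything outside the beam is dropped for good.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=score), evaluated

# Toy problem: build a 3-letter string, scoring by the number of "a" characters.
expand = lambda s: [s + ch for ch in "abc"]
score = lambda s: s.count("a")
best, evaluated = tree_search("", expand, score)
```

With a branching factor of 3 and depth 3, the full tree has 3 + 9 + 27 = 39 nodes; the beam of width 2 evaluates 3 + 6 + 6 = 15 and still finds `"aaa"`. The cost is that a weak `score` function now decides irrevocably which branches die, which is exactly where the failure described above originates.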
9. Multi-Step with Tool Use
The most common production pattern: an agent executes a sequence of steps, each potentially involving tool calls, with LLM reasoning steps between tool invocations. This is not one of the named patterns above — it's the default behavior of most production agents.
┌──────────────────────────────────────────────────────────┐
│ Agent Execution │
│ │
│ [LLM step] → [Tool call] → [LLM step] → [Tool call] │
│ ↑ ↓ ↑ ↓ │
│ [Context] ← [Tool result] → [Context] ← [Tool result] │
└──────────────────────────────────────────────────────────┘
Trace shape: Interleaved LLM spans and tool spans. The context threading between steps is critical: if a tool result is dropped or truncated before being handed to the next LLM call, every subsequent step operates on incomplete context.
Where it shines: Grounded tasks where every reasoning step can be verified against external state. Code generation that reads files, runs tests, reads output, and iterates. Research that fetches pages, extracts facts, and synthesizes. The tool results provide ground truth that constrains hallucination.
Where it breaks: Context fragmentation. As the step count grows, the context window fills with tool results. The LLM sees a long interleaved context and may lose track of earlier results, constraints from the original prompt, or the overall task goal. Attribution: Context Handling Failures or Incorrect Memory Usage. The failure is not in any individual span — it's in the accumulated state across many spans.
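A simple mitigation is to assemble the context deliberately at each step: pin the task and its constraints, keep the most recent tool results that fit, and say explicitly when older ones were left out. The sketch below is one possible policy, not a recommendation for every workload; `build_context` is a hypothetical helper, and the budget is counted in characters as a rough proxy for tokens.

```python
def build_context(task: str, constraints: list[str],
                  tool_results: list[str], budget_chars: int) -> list[str]:
    """Pin the task and constraints, then fit the newest tool results."""
    pinned = [f"TASK: {task}"] + [f"CONSTRAINT: {c}" for c in constraints]
    remaining = budget_chars - sum(len(p) for p in pinned)
    kept = []
    for result in reversed(tool_results):  # newest first
        if len(result) > remaining:
            break
        kept.append(result)
        remaining -= len(result)
    dropped = len(tool_results) - len(kept)
    # Make the omission explicit instead of truncating silently.
    # (The note itself is not counted against the budget in this sketch.)
    note = [f"NOTE: {dropped} older tool results omitted"] if dropped else []
    return pinned + note + list(reversed(kept))

context = build_context(
    task="fix bug",
    constraints=["do not touch tests"],
    tool_results=["a" * 10, "b" * 10, "c" * 10],
    budget_chars=60,
)
```

In this example only the newest result fits, so the context keeps the task, the constraint, a note that two older results were omitted, and the final tool result. Logging the `dropped` count on each LLM span gives the trace a direct signal for when fragmentation began.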
What to Take Away
No architecture is universally better. The right choice depends on:
- Task complexity — single-agent loops handle bounded tasks; hierarchical architectures handle complex decompositions
- Parallelism needs — supervisor/worker enables concurrent execution; single-agent is serial
- Debugging requirements — ReAct traces are the most readable; swarm traces are the hardest
- Failure tolerance — reflection adds robustness; it also adds latency and cost
The trace shape an architecture produces is not incidental — it's the primary diagnostic surface. Design your architecture with observability in mind. When a failure happens (and it will), you want to be able to read the trace and find the causal span in seconds, not hours.
Try It
Upload a trace from your agent system at /app/new and see how ProveAI Origin parses its span structure. The /app/insights/heatmap shows which span names appear most frequently in cited failure causes — a direct window into which part of your architecture is breaking most often.
Continue reading: Common Failure Modes maps the 14 TRAIL categories to the architectures where they appear most often. Best-Practice Architectures covers which frameworks implement these patterns and how to pick between them.