Common Failure Modes in Multi-Agent Systems
The complete TRAIL taxonomy of 14 failure categories — with span-level signatures, real-world examples, and which architectures are most prone to each.
Multi-agent failures fall into two fundamentally different categories. Execution failures are mechanical: the tool returned an error, the API timed out, the output schema was wrong. Semantic failures are inferential: the reasoning was flawed, the information was hallucinated, the instructions were misread. Both kinds of failure can produce the same observable outcome — a wrong answer — but they require entirely different fixes.
ProveAI Origin classifies every failure on two axes:
- L1 axis: execution (the plan was correct, the execution went wrong) or semantic (the reasoning itself was wrong)
- L4 category: one of 14 specific failure types from the TRAIL taxonomy
This article walks through all 14 L4 categories: what they look like in a span, which architectures they hit hardest, and how to recognize them in your traces.
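As a rough illustration of the two-axis label, here is a minimal Python sketch. The `FailureLabel` type and its `span_id` and `evidence` fields are hypothetical names chosen for this example, not ProveAI Origin's actual schema; the category names are taken from the numbered sections of this article.

```python
from dataclasses import dataclass
from enum import Enum


class L1Axis(Enum):
    EXECUTION = "execution"  # the plan was correct, the execution went wrong
    SEMANTIC = "semantic"    # the reasoning itself was wrong


# The 14 L4 category names, in the order this article covers them.
L4_CATEGORIES = (
    "Formatting Errors",
    "Instruction Non-compliance",
    "Goal Deviation",
    "Resource Abuse",
    "Tool-related",
    "Language-only",
    "Task Orchestration",
    "Tool Selection Errors",
    "Context Handling Failures",
    "Poor Information Retrieval",
    "Incorrect Problem Identification",
    "Tool Output Misinterpretation",
    "Tool Definition Issues",
    "Incorrect Memory Usage",
)


@dataclass(frozen=True)
class FailureLabel:
    """Hypothetical container for one classified failure."""
    l1: L1Axis
    l4: str
    span_id: str   # assumed: the span that carries the evidence
    evidence: str  # assumed: the quoted text or error message

    def __post_init__(self):
        # Reject labels that are not one of the 14 categories above.
        if self.l4 not in L4_CATEGORIES:
            raise ValueError(f"unknown L4 category: {self.l4}")
```

Keeping the category list closed means a typo in a label fails loudly instead of silently creating a fifteenth category.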
The Failure Decision Tree
Use this to triage a failing trace before diving into span-level detail:
Did the agent produce any output?
│
├─ No → Did a tool call fail?
│   ├─ Yes → execution failure
│   │   ├─ Schema/format issue? → Formatting Errors
│   │   ├─ Tool crashed / timeout? → Tool-related
│   │   ├─ Wrong tool invoked? → Tool Selection Errors
│   │   └─ Tool definition was wrong? → Tool Definition Issues
│   └─ No → Did the agent get stuck?
│       └─ Yes → Task Orchestration
│
└─ Yes → Is the output wrong?
    ├─ Wrong facts or fabricated content? → semantic
    │   ├─ Hallucinated data → Language-only
    │   ├─ Wrong interpretation of question → Incorrect Problem ID
    │   ├─ Ignored constraints in prompt → Instruction Non-compliance
    │   └─ Answered a different question → Goal Deviation
    └─ Correct facts, wrong behavior?
        ├─ Misread tool result → Tool Output Misinterpretation
        ├─ Lost prior context → Context Handling Failures
        ├─ Retrieved the wrong info → Poor Information Retrieval
        ├─ Memory corrupted/lost → Incorrect Memory Usage
        ├─ Unsafe output → Safety Violations
        ├─ Excessive resource use → Resource Abuse
        └─ Repeated same action → Repetition
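The tree above can be sketched as a lookup in Python. This is only an illustration: the symptom keys (`crash_or_timeout`, `bad_retrieval`, and so on) and the `triage` function are hypothetical names, and a human or an upstream classifier is assumed to supply the answers to each question. The function returns an L1 axis only where the tree states one.

```python
from typing import Optional, Tuple

# Leaves of the "no output, tool call failed" branch.
TOOL_FAILURE_LEAVES = {
    "schema_or_format": "Formatting Errors",
    "crash_or_timeout": "Tool-related",
    "wrong_tool": "Tool Selection Errors",
    "bad_definition": "Tool Definition Issues",
}

# Leaves of the "wrong facts or fabricated content" branch (semantic).
WRONG_FACTS_LEAVES = {
    "hallucinated_data": "Language-only",
    "misread_question": "Incorrect Problem Identification",
    "ignored_constraints": "Instruction Non-compliance",
    "different_question": "Goal Deviation",
}

# Leaves of the "correct facts, wrong behavior" branch.
WRONG_BEHAVIOR_LEAVES = {
    "misread_tool_result": "Tool Output Misinterpretation",
    "lost_context": "Context Handling Failures",
    "bad_retrieval": "Poor Information Retrieval",
    "bad_memory": "Incorrect Memory Usage",
    "unsafe_output": "Safety Violations",
    "excessive_resources": "Resource Abuse",
    "repeated_action": "Repetition",
}


def triage(
    produced_output: bool,
    symptom: str = "",
    tool_call_failed: bool = False,
    agent_stuck: bool = False,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (L1 axis, category); None where the tree gives no answer."""
    if not produced_output:
        if tool_call_failed:
            return "execution", TOOL_FAILURE_LEAVES.get(symptom)
        if agent_stuck:
            return None, "Task Orchestration"
        return None, None
    if symptom in WRONG_FACTS_LEAVES:
        return "semantic", WRONG_FACTS_LEAVES[symptom]
    return None, WRONG_BEHAVIOR_LEAVES.get(symptom)
```

For example, `triage(False, "crash_or_timeout", tool_call_failed=True)` walks the left branch and lands on Tool-related.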
Execution Failures
These are the mechanical failures — the trace shows a span that raised an error, returned an unexpected status, or produced structurally invalid output. They're easier to detect than semantic failures, but not always easier to fix.
1. Formatting Errors
What it is: The agent produces output that is syntactically incorrect for the expected format — malformed JSON, invalid XML, broken markdown, wrong field names, missing required keys.
Span signature:
{
  "operationName": "llm.completion",
  "status": { "code": "OK" },
  "attributes": {
    "gen_ai.response.finish_reason": "stop"
  },
  "events": [{
    "name": "gen_ai.completion",
    "attributes": {
      "gen_ai.event.content": "{ \"action\": \"search\", \"query\": unterminated"
    }
  }]
}
The span succeeds from the LLM's perspective. The failure is in the content — the JSON is unterminated. A downstream parser will error; the agent may retry or fail.
Real-world example: An agent is asked to return a structured action plan as JSON. The LLM produces a response that mixes prose and JSON, breaking the parser that feeds the result into the executor.
Most prone architectures: Planner-executor (the plan artifact is structured), multi-step with tool use (tool input schemas must be exact).
Fix direction: Add output validation as a post-step, not a retry. If the formatter fails, classify it before retrying — otherwise you accumulate latency without understanding the cause.
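One way to classify before retrying is a validation step that returns a reason instead of raising. The sketch below uses only Python's standard `json` module; the function name and the reason strings are assumptions for illustration.

```python
import json


def classify_completion(content: str, required_keys: set) -> str:
    """Return 'ok' or a short reason string. This step never retries."""
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError as exc:
        # Syntactically broken output, e.g. prose mixed into the JSON.
        return f"formatting_error: invalid JSON at char {exc.pos}"
    if not isinstance(parsed, dict):
        return "formatting_error: expected a JSON object"
    missing = required_keys - set(parsed)
    if missing:
        return f"formatting_error: missing keys {sorted(missing)}"
    return "ok"
```

The caller can then log the reason on the span and decide whether a retry is worth the latency.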
2. Instruction Non-compliance
What it is: The agent received clear instructions and didn't follow them. Not because it couldn't — because it didn't prioritize them correctly in the context.
Span signature:
{
  "operationName": "llm.completion",
  "attributes": {
    "gen_ai.request.system": "You must respond in French. Never use English.",
    "gen_ai.event.content": "Here is the summary in English..."
  }
}
The system prompt is visible in the span attributes. The output contradicts it directly. Attribution is clear; the fix requires understanding why the instruction was deprioritized (context length? conflicting instructions? temperature?).
Real-world example: A compliance agent is instructed to always cite sources for regulatory claims. It produces a correct-seeming answer without citations because the citation instruction was buried 3,000 tokens into the system prompt and the response ran out of completion tokens.
Most prone architectures: Single-agent loop (growing context buries early instructions), ReAct (the thought steps can override the original instruction).
3. Goal Deviation
What it is: The agent completed a task — but not the task it was given. It substituted a related but incorrect goal, usually because the original goal was ambiguous or the context accumulated enough drift to shift the target.
Span signature: No single span shows an error. The failure is visible only in the final output span: a task was accomplished, but not the one that was requested. Diagnosis requires comparing the root span's task description against the final output.
Real-world example: An agent is asked to "summarize the meeting and identify action items." It produces a summary but lists decisions (not action items) as the action item list. The agent completed a plausible interpretation of the task, not the intended one.
Most prone architectures: Hierarchical (goal drift accumulates across tiers), planner-executor (the executor's interpretation of the plan diverges from the planner's intent).
4. Resource Abuse
What it is: The agent consumes significantly more tokens, API calls, compute, or time than expected for the task. Often a symptom of another failure (a retry loop, a reasoning loop that doesn't terminate, a context that keeps growing).
Span signature:
{
  "operationName": "agent.loop",
  "duration": 847392,
  "attributes": {
    "agent.step_count": 47,
    "gen_ai.usage.input_tokens": 284920,
    "gen_ai.usage.output_tokens": 18472
  }
}
47 steps for a task that should take 5-8. Input tokens 10x expected. The loop span's duration and step count are the diagnostic signals.
Most prone architectures: Reflection (refinement loop without termination), Tree of Thought (exponential branch growth), swarm (agents repeatedly query each other).
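A budget check over the loop span is one simple way to surface this. The sketch below reads the attribute names shown in the span above; the default budgets (8 steps, 30,000 input tokens) are assumptions picked for this example and would need tuning per task.

```python
def resource_flags(span: dict, max_steps: int = 8, max_input_tokens: int = 30_000) -> list:
    """Return a human-readable flag for each budget the loop span exceeds."""
    attrs = span.get("attributes", {})
    flags = []
    steps = attrs.get("agent.step_count", 0)
    if steps > max_steps:
        flags.append(f"agent.step_count {steps} > {max_steps}")
    tokens = attrs.get("gen_ai.usage.input_tokens", 0)
    if tokens > max_input_tokens:
        flags.append(f"gen_ai.usage.input_tokens {tokens} > {max_input_tokens}")
    return flags
```

An empty list means the span stayed within both budgets; anything else is a prompt to look for the underlying loop or retry pattern.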
5. Tool-related
What it is: A tool call failed — the external API returned an error, the tool timed out, the tool threw an exception. The agent may or may not handle this gracefully.
Span signature:
{
  "operationName": "tool.web_search",
  "status": { "code": "ERROR", "message": "HTTPError: 429 Too Many Requests" },
  "duration": 30001,
  "attributes": {
    "gen_ai.tool.name": "web_search",
    "gen_ai.tool.call.id": "call_x7f2"
  }
}
The span status is ERROR. The message tells you the failure mechanism. If the agent catches this and retries without attribution, you've lost the diagnostic signal.
Real-world example: A research agent calls a third-party API that rate-limits after 10 requests per minute. The agent hits the limit mid-task, the tool fails with 429, and the agent halts without producing any output or surfacing the cause to the user.
Most prone architectures: Any architecture with external tool dependencies.
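To keep the diagnostic signal when retrying, each failed attempt can be recorded before the next one starts. This is a minimal sketch: `call_with_attribution` is a hypothetical wrapper, the tool is assumed to be a plain Python callable, and the exponential backoff policy is an assumption.

```python
import time


def call_with_attribution(tool, kwargs: dict, attempts: int = 3, base_delay: float = 0.0):
    """Call tool(**kwargs), recording every failure instead of swallowing it.

    Returns (result, failures). result is None if every attempt failed.
    """
    failures = []
    for attempt in range(1, attempts + 1):
        try:
            return tool(**kwargs), failures
        except Exception as exc:
            # Keep the error class and message so the cause stays attributable.
            failures.append({"attempt": attempt, "error": f"{type(exc).__name__}: {exc}"})
            if attempt < attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    return None, failures
```

Even when the final attempt succeeds, the `failures` list still shows that the tool was rate-limited along the way, and when every attempt fails the list gives the agent something concrete to surface to the user.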
6. Language-only
What it is: The agent produces a confident, fluent answer that contains fabricated information — hallucinated facts, invented citations, made-up statistics. The language is correct; the content is wrong.
Span signature: The span succeeds. The failure is entirely in the event content. Detecting language-only failures requires comparing the output against ground truth or running a verification step.
{
  "operationName": "llm.completion",
  "status": { "code": "OK" },
  "events": [{
    "name": "gen_ai.completion",
    "attributes": {
      "gen_ai.event.content": "According to Smith et al. (2024), the latency improvement was 47%..."
    }
  }]
}
"Smith et al. (2024)" doesn't exist. The 47% is invented. The span is fine.
Most prone architectures: Single-agent loop (no grounding), ReAct (when the agent fabricates observations in its thought steps).
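A verification step can be as small as checking cited sources against the documents the agent actually had access to. The sketch below is a crude illustration: the regex only matches the "Name et al. (YYYY)" citation style, and `known_sources` is an assumed set of citation strings built from the agent's retrieved material.

```python
import re

# Matches citations written as "Name et al. (YYYY)"; other styles are ignored.
CITATION = re.compile(r"[A-Z][A-Za-z-]+ et al\. \(\d{4}\)")


def unverified_citations(text: str, known_sources: set) -> list:
    """Return citations in text that do not appear in known_sources."""
    return [c for c in CITATION.findall(text) if c not in known_sources]
```

This only catches invented references. Invented numbers, such as the 47% above, still need a comparison against ground truth.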
7. Task Orchestration
What it is: The agent fails to coordinate subtasks correctly — wrong ordering, missing dependencies, deadlocks, incomplete task graphs. The individual steps might be correct; their coordination is not.
Span signature: Parallel spans that should have been sequential, or a span that starts before its dependency completed. Often visible in span timestamps.
span: step_b start: 14:22:31.000 (depends on step_a)
span: step_a start: 14:22:31.100 (started after step_b — wrong order)
Real-world example: A code-generation agent generates a test file before generating the implementation it's supposed to test, because the planner didn't enforce the dependency.
Most prone architectures: Supervisor/worker (coordination logic is in the supervisor prompt), hierarchical (dependency graph grows complex with depth).
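The timestamp comparison can be automated once each span records what it depends on. In the sketch below the `depends_on` and `end` fields are assumptions (the example above only shows start times), and timestamps are assumed to use the `HH:MM:SS.mmm` format shown.

```python
from datetime import datetime


def _parse(ts: str) -> datetime:
    return datetime.strptime(ts, "%H:%M:%S.%f")


def ordering_violations(spans: list) -> list:
    """Return (span, dependency, reason) for each span that ran out of order."""
    by_name = {s["name"]: s for s in spans}
    violations = []
    for span in spans:
        for dep_name in span.get("depends_on", []):
            dep = by_name.get(dep_name)
            if dep is None:
                violations.append((span["name"], dep_name, "dependency missing"))
            elif _parse(span["start"]) < _parse(dep["end"]):
                violations.append((span["name"], dep_name, "started before dependency ended"))
    return violations
```

Run against the two spans above (with an assumed end time for step_a), this reports that step_b started before step_a ended.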
8. Tool Selection Errors
What it is: The agent invokes the wrong tool for the task — calling a database read tool when a write was needed, using a summarization tool when extraction was required, or calling a deprecated tool when an updated version exists.
Span signature: The tool call span succeeds but the result doesn't satisfy the task requirement. The error becomes apparent one or more steps later when the downstream step receives the wrong type of data.
Real-world example: An agent needs to update a user record. It has access to user.read and user.write. It calls user.read, gets the current data, and returns it as if the update happened — because the tool names were similar and the prompt was ambiguous about which to call.
Most prone architectures: Multi-step with tool use (large tool sets increase selection ambiguity), planner-executor (planner may not have precise tool descriptions).
9. Context Handling Failures
What it is: Information is lost, corrupted, or incorrectly carried between steps. The agent's working context — the accumulated state of the conversation — fails to correctly maintain information across the span boundary.
Span signature: A step whose input doesn't include information that was clearly established in a prior step. The prior step's span shows the information was generated; the current step's input span doesn't contain it.
Real-world example: A document analysis agent reads five files in sequence. By the time it analyzes the fifth file, the findings from the first file have been truncated from the context. The final synthesis is missing 20% of the information it was supposed to integrate.
Most prone architectures: Single-agent loop (context fills linearly), hierarchical (summarization at each tier can drop information), multi-step with tool use (long interleaved contexts).
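The span comparison described above can be approximated with a substring check. This sketch assumes earlier findings are available as exact strings keyed by their source; a real pipeline that paraphrases or summarizes between steps would need a looser match.

```python
def dropped_findings(established: dict, step_input: str) -> list:
    """Return the sources whose finding no longer appears in a later step's input.

    established maps a source label to the finding text an earlier step produced.
    """
    return [label for label, finding in established.items() if finding not in step_input]
```

For the five-file example, a non-empty result before the synthesis step would show which files' findings had already fallen out of context.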
10. Poor Information Retrieval
What it is: The agent retrieves information but retrieves the wrong information — wrong chunk from a RAG index, wrong document from a search, wrong records from a database. The retrieval mechanism works; the retrieval result is wrong.
Span signature: The retrieval tool span succeeds and returns results. The failure is that the results don't match the query intent. Detectable only by comparing the query in the tool call input to the results in the tool call output.
Real-world example: A support agent is asked about pricing for Plan B. It retrieves a chunk about Plan A pricing because the embedding similarity was higher (Plan A and Plan B are described in adjacent paragraphs with similar language).
Most prone architectures: RAG-based single agents, any architecture where retrieval is a first step before reasoning.
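A cheap guard for cases like the Plan A / Plan B mix-up is to require that each retrieved chunk mentions the entity the query is about. This is a deliberately crude lexical sketch, and `must_mention` is assumed to be extracted from the query by some earlier step.

```python
def off_target_chunks(must_mention: str, chunks: list) -> list:
    """Return indices of retrieved chunks that never mention the required entity."""
    needle = must_mention.lower()
    return [i for i, chunk in enumerate(chunks) if needle not in chunk.lower()]
```

It will not catch a chunk that mentions Plan B only in passing, but it flags the case where the top result is entirely about something else.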
11. Incorrect Problem Identification
What it is: The agent misunderstands what problem it's solving. Not goal drift (where the problem shifts mid-task) — this is a failure at the very first step, where the agent's initial framing is wrong.
Span signature: The root span's first reasoning step or planning step reflects a problem framing that doesn't match the user's actual request. All subsequent steps are internally consistent with the wrong framing.
Real-world example: A user asks "why is my agent slow?" The agent interprets this as a latency question and analyzes network timing. The actual problem is context saturation causing quadratic attention overhead — a fundamentally different issue.
Most prone architectures: Planner-executor (the plan is built on the wrong problem model), ReAct (the initial thought sets the wrong direction for all subsequent steps).
12. Tool Output Misinterpretation
What it is: The agent receives a valid tool result but interprets it incorrectly. The tool succeeded; the agent's understanding of what the tool returned is wrong.
Span signature:
[
  {
    "operationName": "tool.database_query",
    "events": [{
      "name": "gen_ai.tool.message",
      "attributes": {
        "gen_ai.event.content": "{\"count\": 0, \"rows\": []}"
      }
    }]
  },
  {
    "operationName": "llm.reasoning",
    "events": [{
      "name": "gen_ai.completion",
      "attributes": {
        "gen_ai.event.content": "The query returned results. Proceeding with the assumption that records exist..."
      }
    }]
  }
]
The database returned zero rows. The agent said "the query returned results." Classic misinterpretation — the agent pattern-matched on "the query ran successfully" and missed the empty result.
Most prone architectures: ReAct (observation step can misread tool results), multi-step with tool use.
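The empty-result case can be caught with a heuristic that pairs a tool result with the reasoning that follows it. The sketch below assumes the tool returns JSON with `count` and `rows` fields, as in the span above; the list of acknowledgement phrases is an assumption and far from exhaustive.

```python
import json

# Phrases that suggest the agent noticed the result was empty (assumed list).
EMPTY_ACKNOWLEDGEMENTS = ("no results", "no rows", "zero rows", "0 rows", "empty", "returned nothing")


def ignores_empty_result(tool_content: str, reasoning: str) -> bool:
    """True if the tool returned an empty result and the reasoning never says so."""
    try:
        result = json.loads(tool_content)
    except json.JSONDecodeError:
        return False  # not JSON, so this heuristic does not apply
    if not isinstance(result, dict):
        return False
    is_empty = result.get("count") == 0 or result.get("rows") == []
    acknowledged = any(phrase in reasoning.lower() for phrase in EMPTY_ACKNOWLEDGEMENTS)
    return is_empty and not acknowledged
```

Applied to the two spans above, the check returns `True`: the result is empty and the reasoning proceeds as if records exist.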
13. Tool Definition Issues
What it is: The tool itself is incorrectly defined — wrong parameter schema, incorrect docstring, missing required parameters, incorrect return type documentation. The agent calls the tool correctly according to the definition, but the definition is wrong.
Span signature: The tool call matches the schema exactly. The tool errors, or returns unexpected results. The failure is in the tool definition, not the agent's call.
Real-world example: A tool is defined with required: ["query"] but actually requires ["query", "limit"]. The agent generates a valid call (according to the schema), the tool rejects it (because limit is missing), and the error message blames a parameter that the agent had no reason to include.
Most prone architectures: Any — but especially planner-executor where the planner generates tool calls from schema descriptions without runtime validation.
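When tools are plain Python functions, this class of mismatch can be caught before any agent runs by comparing the declared schema with the function signature. The sketch below uses the standard `inspect` module; `schema_mismatch` and the `search` tool are hypothetical, and the schema is assumed to carry a JSON-Schema-style `required` list.

```python
import inspect


def schema_mismatch(func, schema: dict) -> dict:
    """Compare a tool schema's 'required' list with the function's real signature."""
    params = inspect.signature(func).parameters
    required_by_code = {
        name
        for name, p in params.items()
        if p.default is inspect.Parameter.empty
        and p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
    }
    declared = set(schema.get("required", []))
    return {
        "required_but_undeclared": sorted(required_by_code - declared),
        "declared_but_unknown": sorted(declared - set(params)),
    }


def search(query, limit):
    """Hypothetical tool implementation that needs both arguments."""
    return []
```

Calling `schema_mismatch(search, {"required": ["query"]})` reports `limit` as required but undeclared, which is the situation in the example above.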
14. Incorrect Memory Usage
What it is: The agent has a memory system (vector store, key-value store, episodic memory) and uses it incorrectly — writing to the wrong slot, reading stale data, failing to write, or writing corrupted data.
Span signature: A memory read span returns stale or incorrect data that was never overwritten. Or a memory write span succeeds but stores data in a key that differs from the key the subsequent read will use.
Real-world example: A long-running agent stores intermediate results in a vector store. On a re-run, it retrieves results from a previous run (similar embedding) rather than running fresh. The output is confidently wrong, based on a stale previous result that happened to be nearest-neighbor.
Most prone architectures: Agents with explicit memory systems, multi-session agents.
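Both signatures can be checked from an ordered log of memory operations. In this sketch the log format (`op`, `key`, `run_id`) is an assumption made for the example; for a vector store, the analogous safeguard would be filtering reads by a run identifier stored as metadata.

```python
def suspicious_reads(ops: list, current_run: str) -> list:
    """Flag reads in the current run whose key this run never wrote.

    ops is an ordered list of dicts with keys 'op' ('read' or 'write'),
    'key', and 'run_id'.
    """
    written_this_run = set()
    written_other_runs = set()
    flagged = []
    for op in ops:
        key, run = op["key"], op["run_id"]
        if op["op"] == "write":
            (written_this_run if run == current_run else written_other_runs).add(key)
        elif op["op"] == "read" and run == current_run and key not in written_this_run:
            if key in written_other_runs:
                reason = "stale: written only by another run"
            else:
                reason = "never written: possible key mismatch"
            flagged.append((key, reason))
    return flagged
```

A read flagged as stale is not always a bug, since some memory is meant to persist across runs, so the output is a list of spans to inspect rather than a verdict.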
Patterns to Watch
Three L4 categories account for the majority of failures in production systems:
- Context Handling Failures — universal; grows with task length
- Tool-related — grows with external dependency count
- Instruction Non-compliance — grows with prompt complexity
If you're building a multi-agent system for the first time: instrument for these three first. The rest will show up as your system scales.
See the heatmap of your own system's failure distribution at /app/insights/heatmap. If one L4 category dominates, that's your highest-leverage fix.
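If you keep your own list of assigned labels, the "one category dominates" check is a few lines of Python. The 50% threshold here is an assumption, not a value used by the heatmap.

```python
from collections import Counter
from typing import Optional


def dominant_category(labels: list, threshold: float = 0.5) -> Optional[str]:
    """Return the most common L4 label if its share meets the threshold, else None."""
    if not labels:
        return None
    category, count = Counter(labels).most_common(1)[0]
    return category if count / len(labels) >= threshold else None
```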
Try It
Paste a failing trace at /app/new and see which of these 14 categories ProveAI Origin assigns — with the specific span and evidence quoted. The L1 + L4 combination is the starting point for every repair.
Continue reading: Agent Architectures shows which architectures are structurally prone to each failure category. Agentic Coding and Failures explains why coding agents can't detect their own failures — and what you need to add to your workflow to catch them.