What Are Agent Traces?
From OpenTelemetry's HTTP roots to LLM span DAGs — a complete history of distributed tracing and where gen-ai observability is heading.
Before you can fix a multi-agent failure, you need to know what happened. Not at the system level — at the step level. Which tool call returned malformed JSON? Which sub-agent was handed a corrupted context? Which prompt crossed a token limit three hops into a five-hop chain?
That's what a trace is for. And understanding how traces work — where they came from, how they evolved, and where they're going — is the foundation for everything else in this wiki.
The Origin: Distributed Tracing in HTTP Systems
The idea of tracing a request across multiple services predates LLMs by roughly two decades.
In 2010, Google published Dapper, a paper describing how they tracked requests through their internal microservices. The core insight was simple: give every request a unique trace ID, propagate that ID through every downstream call, and collect structured timing data at each hop. You end up with a tree of spans that shows you exactly where time was spent and which service caused the bottleneck.
Dapper wasn't open-source. But it inspired two open-source projects that shaped the industry:
- Zipkin (2012, Twitter) — a distributed tracing system that implemented Dapper-style request tracking with a web UI
- Jaeger (2016, Uber) — a more scalable take on the same idea, eventually donated to the CNCF
Both systems shared the same conceptual model: a trace is a directed acyclic graph (DAG) of spans, each representing a unit of work. Spans have a start time, duration, status, and arbitrary key-value attributes. A span can have a parent span. The root span has no parent.
In 2019, the OpenTelemetry project merged the OpenTracing and OpenCensus standards into a single vendor-neutral instrumentation framework. Today, OpenTelemetry (OTEL) is the de facto standard for distributed tracing across the industry. If you're using Datadog, Honeycomb, New Relic, or Grafana Tempo, OTEL is almost certainly how your data gets in.
The Anatomy of a Span
A span is the atomic unit of observability. Here's what a minimal span looks like as simplified JSON (the field names are illustrative, not the exact OTLP wire format):
{
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "parentSpanId": "00f067aa0ba902b6",
  "operationName": "api.search",
  "startTime": "2026-05-19T03:12:44.000Z",
  "duration": 342,
  "status": { "code": "OK" },
  "attributes": {
    "http.method": "GET",
    "http.url": "https://api.example.com/search",
    "http.status_code": 200,
    "service.name": "search-service"
  }
}
The parentSpanId is the link that turns a flat list of spans into a tree. Every span knows who called it. The root span has no parent. This parent-child relationship is the foundation of causal attribution — you can walk the tree from any leaf span back to the root, following the chain of what caused what.
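The walk from a leaf back to the root can be sketched in a few lines. This is a minimal illustration, assuming spans are plain dicts with the simplified field names used above; the three sample spans and the `path_to_root` helper are hypothetical.

```python
from typing import Optional

# Hypothetical, simplified spans: only the fields needed to follow parent links.
spans = [
    {"spanId": "a1", "parentSpanId": None, "operationName": "api.gateway"},
    {"spanId": "b2", "parentSpanId": "a1", "operationName": "api.search"},
    {"spanId": "c3", "parentSpanId": "b2", "operationName": "db.query"},
]


def path_to_root(spans: list, span_id: str) -> list:
    """Return operation names from the given span up to the root span."""
    by_id = {s["spanId"]: s for s in spans}
    path = []
    seen = set()
    current: Optional[str] = span_id
    while current is not None:
        if current in seen:
            # Parent links should never loop; fail loudly if they do.
            raise ValueError(f"cycle detected at span {current}")
        seen.add(current)
        span = by_id.get(current)
        if span is None:
            # The parent was never collected (dropped, or lives in another system).
            break
        path.append(span["operationName"])
        current = span.get("parentSpanId")
    return path


print(path_to_root(spans, "c3"))  # ['db.query', 'api.search', 'api.gateway']
```

The early `break` on a missing parent matters in practice: a trace whose parent was never collected still yields a partial path instead of an exception.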
The Evolution: From HTTP to LLM
For a decade, distributed tracing was almost entirely about HTTP services. The failure modes were predictable: slow database queries, network timeouts, misconfigured load balancers. Spans tracked latency and status codes.
Then large language models arrived in production systems, and the tracing model cracked.
The problem is that LLM calls aren't like HTTP calls. An HTTP call to a search API either returns results or it doesn't. An LLM call returns a token stream that might contain correct information, hallucinated information, a perfectly formatted response, or a response that ignores the instructions in the prompt. The "success" status code is meaningless. You need to understand the content, not just the timing.
The early LLM tracing tools emerged from the framework ecosystem:
- LangSmith (2023, LangChain) — first-party tracing for LangChain runs. Made it possible to view prompt inputs, model outputs, and tool call results for a LangChain agent run. Framework-specific: if you weren't using LangChain, you got nothing.
- Langfuse (2023, open-source) — framework-agnostic LLM observability. Introduced the concept of generation spans (one per LLM call) and trace as a first-class object with user/session metadata. Crucially, Langfuse introduced the idea of evaluation scores attached to spans — a precursor to attribution.
- Helicone, Arize Phoenix, Braintrust — various tools that emerged in 2023-2024, each with a different take on what LLM observability should look like.
Each had its own schema. A LangSmith trace and a Langfuse trace described the same run in incompatible formats. The ecosystem fragmented.
The Convergence: OTEL Gen-AI Semantic Conventions
OpenTelemetry's answer was the gen-ai semantic conventions — a standard vocabulary for LLM-related spans. The goal: any compliant instrumentation library produces spans that any compliant backend can understand.
The gen-ai semconv introduced a new set of span attributes:
| Attribute | Description | Example |
|---|---|---|
| gen_ai.system | The LLM provider | "openai", "anthropic" |
| gen_ai.request.model | The model requested | "gpt-4o", "claude-3-7-sonnet" |
| gen_ai.response.model | The model that actually responded | "gpt-4o-2026-03-01" |
| gen_ai.usage.input_tokens | Prompt token count | 1247 |
| gen_ai.usage.output_tokens | Completion token count | 384 |
| gen_ai.request.temperature | Temperature setting | 0.7 |
| gen_ai.tool.name | Name of a tool call | "search_web" |
| gen_ai.tool.call.id | Unique tool call ID | "call_abc123" |
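Because the attribute names are shared, a backend can aggregate across spans without knowing which library produced them. The sketch below sums token usage per requested model. It assumes span attributes arrive as plain dicts keyed by the names in the table; the sample values and the `token_totals_by_model` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical attribute dicts, one per span, using the names from the table.
span_attributes = [
    {"gen_ai.request.model": "gpt-4o",
     "gen_ai.usage.input_tokens": 1247, "gen_ai.usage.output_tokens": 384},
    {"gen_ai.request.model": "gpt-4o",
     "gen_ai.usage.input_tokens": 3100, "gen_ai.usage.output_tokens": 120},
    {"gen_ai.request.model": "claude-3-7-sonnet",
     "gen_ai.usage.input_tokens": 900, "gen_ai.usage.output_tokens": 50},
    {"http.method": "GET"},  # a non-LLM span: no gen_ai.* attributes at all
]


def token_totals_by_model(attrs_list):
    """Sum input and output tokens per requested model, skipping non-LLM spans."""
    totals = defaultdict(lambda: {"input": 0, "output": 0})
    for attrs in attrs_list:
        model = attrs.get("gen_ai.request.model")
        if model is None:
            continue  # not an LLM call
        totals[model]["input"] += attrs.get("gen_ai.usage.input_tokens", 0)
        totals[model]["output"] += attrs.get("gen_ai.usage.output_tokens", 0)
    return dict(totals)


for model, usage in token_totals_by_model(span_attributes).items():
    print(model, usage)
```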
Here's what an annotated gen-ai span looks like in practice:
{
  "traceId": "7f2c9a4d1e8b3f6c9d2e5a8b1f4c7d0e",
  "spanId": "3a5f8c2d9b1e4f7a",
  "parentSpanId": "1c4e7a0d3f6b9c2e",
  "operationName": "gen_ai.chat",
  "startTime": "2026-05-19T14:22:31.100Z",
  "duration": 2341,
  "status": { "code": "OK" },
  "attributes": {
    "gen_ai.system": "anthropic",
    "gen_ai.request.model": "claude-sonnet-4-6",
    "gen_ai.response.model": "claude-sonnet-4-6",
    "gen_ai.usage.input_tokens": 3847,
    "gen_ai.usage.output_tokens": 212,
    "gen_ai.request.temperature": 0.0,
    "gen_ai.tool.name": "search_codebase",
    "gen_ai.tool.call.id": "call_7x9q2m",
    "service.name": "code-review-agent",
    "rj.span.role": "tool_call"
  },
  "events": [
    {
      "name": "gen_ai.tool.message",
      "timestamp": "2026-05-19T14:22:33.441Z",
      "attributes": {
        "gen_ai.tool.call.id": "call_7x9q2m",
        "gen_ai.event.content": "{\"files\": [], \"error\": \"index out of range\"}"
      }
    }
  ]
}
The events array is critical. This is where tool call results live — and this is where failures hide. A span with status: OK can contain a tool call result that is silently malformed, an error the agent will misinterpret, or data that will corrupt the next step in the chain. Status codes don't tell you this. The event content does.
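A content-level check on the span above might look like the following sketch. It assumes tool results arrive as gen_ai.tool.message events with a JSON string in gen_ai.event.content, as in the example. Treating a non-empty "error" key as a failure signal is a heuristic chosen for illustration, and `suspicious_tool_results` is a hypothetical helper, not part of any library.

```python
import json

# The gen-ai span from above, trimmed to the fields this check reads.
span = {
    "spanId": "3a5f8c2d9b1e4f7a",
    "status": {"code": "OK"},
    "events": [
        {
            "name": "gen_ai.tool.message",
            "attributes": {
                "gen_ai.tool.call.id": "call_7x9q2m",
                "gen_ai.event.content": '{"files": [], "error": "index out of range"}',
            },
        }
    ],
}


def suspicious_tool_results(span: dict) -> list:
    """Return findings for tool results that look broken, whatever the span status says."""
    findings = []
    for event in span.get("events", []):
        if event.get("name") != "gen_ai.tool.message":
            continue
        attrs = event.get("attributes", {})
        call_id = attrs.get("gen_ai.tool.call.id", "<unknown>")
        raw = attrs.get("gen_ai.event.content", "")
        try:
            payload = json.loads(raw)
        except (TypeError, ValueError):
            findings.append(f"{call_id}: tool result is not valid JSON")
            continue
        # Heuristic (an assumption): a non-empty "error" key marks a failed tool call.
        if isinstance(payload, dict) and payload.get("error"):
            findings.append(f"{call_id}: tool result carries an error: {payload['error']}")
    return findings


print(span["status"]["code"], suspicious_tool_results(span))
```

The span reports OK, yet the check surfaces the "index out of range" error buried in the event payload.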
The Timeline
| Year | Event |
|---|---|
| 2010 | Google publishes Dapper — distributed tracing enters the field |
| 2012 | Zipkin released by Twitter, first major open-source tracer |
| 2016 | Jaeger released by Uber; OpenTracing standard proposed |
| 2018 | OpenCensus (Google) gains traction as alternative to OpenTracing |
| 2019 | OpenTracing + OpenCensus merge → OpenTelemetry (CNCF) |
| 2021 | OTEL tracing specification reaches 1.0; becomes de facto standard |
| 2022 | LangChain launches; LangSmith starts as internal tooling |
| 2023 | LangSmith public launch; Langfuse open-sources; gen-ai observability category emerges |
| 2024 | OTEL gen-ai semantic conventions (experimental); OpenLLMetry bridges OTEL → LLM calls |
| 2025 | OTEL gen-ai semconv hits stable; tool calls become first-class spans; multi-agent tracing formalized |
| 2026 | Span-level failure attribution enters the picture — tracing data becomes regression test input |
Parent-Child Relationships in Agent Traces
What makes agent traces different from service traces isn't the span format — it's the semantics of the parent-child relationship.
In a microservice trace, the parent-child relationship represents delegation: the parent service called the child service. The relationship is structural and mechanical.
In an agent trace, the parent-child relationship represents reasoning: the orchestrator decided to invoke a sub-agent, which decided to call a tool, which returned a result that informed a reasoning step, which produced an output that became the input to the next tool call. The chain is not just computational — it's inferential.
This matters for attribution. When a multi-agent system fails, the fault isn't necessarily in the span that raised the error. The fault is often in a span that produced a subtly wrong output that cascaded through the reasoning chain until it caused an observable failure three or four hops later. Tracing gives you the chain; attribution tells you which link broke.
A five-hop chain looks like this in span ID space:
trace: a3f9b2c1...
├── span: 001 orchestrator.plan            [root]
│   ├── span: 002 web_search.call          [child of 001]
│   │   └── span: 003 http.get             [child of 002]
│   ├── span: 004 synthesize.results       [child of 001]
│   │   └── span: 005 llm.completion       [child of 004] ← failure here
│   └── span: 006 format.output            [child of 001]
Span 005 is where the LLM call hallucinates a fact. Span 006 formats the output faithfully, propagating the error. The trace-level result is "format.output succeeded." The attribution is span 005: l4_category: "Incorrect Problem Identification".
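Trace backends usually store spans as a flat list, so a tree like the one above has to be rebuilt from the parent IDs before anyone can reason about the chain. The sketch below does that for the six spans in the diagram. The tuple layout and the `render_tree` helper are assumptions made for illustration.

```python
from collections import defaultdict

# (span_id, parent_id, name) tuples mirroring the diagram above.
SPANS = [
    ("001", None, "orchestrator.plan"),
    ("002", "001", "web_search.call"),
    ("003", "002", "http.get"),
    ("004", "001", "synthesize.results"),
    ("005", "004", "llm.completion"),
    ("006", "001", "format.output"),
]


def render_tree(spans):
    """Rebuild the span tree from parent IDs and return it as indented lines."""
    children = defaultdict(list)
    names = {}
    for span_id, parent_id, name in spans:
        names[span_id] = name
        children[parent_id].append(span_id)

    lines = []

    def visit(span_id, depth):
        lines.append("  " * depth + f"{span_id} {names[span_id]}")
        for child in children.get(span_id, []):
            visit(child, depth + 1)

    # Root spans are the ones recorded with no parent.
    for root in children.get(None, []):
        visit(root, 0)
    return lines


print("\n".join(render_tree(SPANS)))
```

The rebuilt tree shows that span 006 is a sibling of span 004, not a descendant of span 005. The bad output reaches format.output through the orchestrator's data flow, which is why walking parent links alone is not enough for attribution.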
Where Gen-AI Tracing Is Heading
Three trends are converging:
1. Structured tool calls as first-class spans. The gen-ai semconv is moving toward treating every tool call (and its result) as its own span, not just an event on the parent LLM call span. This means tool-level failure attribution becomes structurally precise — you can cite span_id: 003 instead of span_id: 002 with event_index: 1. ProveAI Origin already parses both formats; see how it handles these at /app/new.
2. Multi-agent propagation standards. When Agent A calls Agent B over HTTP or a message queue, how does the trace ID propagate? The W3C Trace Context standard handles this for HTTP, but agent-to-agent calls via function calling, tool use, or direct API don't always respect W3C headers. A trace might fragment across two systems with no shared context. The OTEL working group is actively defining multi-agent context propagation for exactly this scenario.
3. Tracing the coding agent itself. The next frontier is instrumenting the coding agent's own reasoning — not just the agent it's writing, but the agent doing the writing. When Cursor, Claude Code, or Devin makes a multi-step change to a codebase, each step in that reasoning chain is a span. If the coding agent hallucinates a dependency, misreads the test output, or makes a change that introduces a regression, those failures have a span-level signature too. The tools that capture and attribute those traces will define the next decade of AI engineering observability.
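For the HTTP case in the second trend, the W3C traceparent header carries the version, trace ID, parent span ID, and trace flags as dash-separated lowercase hex fields. The sketch below builds and parses version 00 headers only; the helper names are hypothetical, and a production system would normally rely on an OpenTelemetry propagator instead of hand-rolled parsing.

```python
import re

# Version 00 layout: 2 hex version, 32 hex trace ID, 16 hex parent ID, 2 hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a version-00 traceparent value for an outgoing call."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str):
    """Return the header's fields as a dict, or None if the value is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero trace or parent IDs are invalid.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


header = build_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
print(header)  # 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
print(parse_traceparent(header))
```

Agent B would start its root span with the parsed trace_id and use parent_id as its parent span ID, which keeps both agents in one trace.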
What This Means for You
If you're building a multi-agent system today, the practical takeaways are:
- Use OTEL gen-ai semconv if you're instrumenting from scratch. It's the converging standard, and every major backend (Langfuse, LangSmith, Phoenix, Honeycomb) either ships OTEL-native ingest already or plans to ship it soon.
- Capture event content, not just span metadata. Status codes lie. The failure is in the payload.
- Ensure parent-child propagation across agent boundaries. If you're calling a sub-agent over HTTP, pass the W3C traceparent header. If you're calling via tool use, embed the parent span ID in the tool call metadata.
- Treat your traces as test inputs, not just logs. Every production failure trace is a regression test waiting to be named. ProveAI Origin automates that labeling.
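For the tool-use case in the propagation takeaway, there is no standard carrier field, so the sketch below invents one. The reserved `_trace` key and both helpers are hypothetical conventions chosen for illustration; any key works as long as caller and callee agree on it.

```python
TRACE_KEY = "_trace"  # hypothetical reserved key, agreed between caller and callee


def attach_trace_context(tool_args: dict, trace_id: str, parent_span_id: str) -> dict:
    """Return a copy of the tool arguments with trace context embedded."""
    enriched = dict(tool_args)
    enriched[TRACE_KEY] = {"trace_id": trace_id, "parent_span_id": parent_span_id}
    return enriched


def extract_trace_context(tool_args: dict):
    """Split incoming arguments into (business arguments, trace context or None)."""
    args = dict(tool_args)
    context = args.pop(TRACE_KEY, None)
    return args, context


# Caller side: the orchestrator attaches its current span as the parent.
outgoing = attach_trace_context(
    {"query": "failing tests in auth module"},
    trace_id="7f2c9a4d1e8b3f6c9d2e5a8b1f4c7d0e",
    parent_span_id="3a5f8c2d9b1e4f7a",
)

# Callee side: the sub-agent strips the context before running the tool,
# then starts its own span as a child of parent_span_id.
args, context = extract_trace_context(outgoing)
print(args)
print(context)
```

Stripping the key on the callee side keeps trace plumbing out of the tool's real arguments, so the tool never sees a field it does not understand.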
Try It
Paste a failing agent trace at /app/new and see which span ProveAI Origin cites as the root cause — with evidence from the span's content, not just its status code. Browse the heatmap of your most-cited failure spans at /app/insights/heatmap.
Continue reading: Agent Architectures walks through how different multi-agent designs produce different trace shapes — and different failure patterns. Core Principles explains why span-level granularity is the only granularity that matters for attribution.