foundations | 10 min read | Updated 2026-05-20

What Are Agent Traces?

From OpenTelemetry's HTTP roots to LLM span DAGs — a complete history of distributed tracing and where gen-ai observability is heading.

Before you can fix a multi-agent failure, you need to know what happened. Not at the system level — at the step level. Which tool call returned malformed JSON? Which sub-agent was handed a corrupted context? Which prompt crossed a token limit three hops into a five-hop chain?

That's what a trace is for. And understanding how traces work — where they came from, how they evolved, and where they're going — is the foundation for everything else in this wiki.

The Origin: Distributed Tracing in HTTP Systems

The idea of tracing a request across multiple services predates production LLMs by more than a decade.

In 2010, Google published Dapper, a paper describing how they tracked requests through their internal microservices. The core insight was simple: give every request a unique trace ID, propagate that ID through every downstream call, and collect structured timing data at each hop. You end up with a tree of spans that shows you exactly where time was spent and which service caused the bottleneck.

Dapper wasn't open-source. But it inspired two open-source projects that shaped the industry:

  • Zipkin (2012, Twitter) — a distributed tracing system that implemented Dapper-style request tracking with a web UI
  • Jaeger (2016, Uber) — a more scalable take on the same idea, eventually donated to the CNCF

Both systems shared the same conceptual model: a trace is a directed acyclic graph (DAG) of spans, each representing a unit of work. Spans have a start time, duration, status, and arbitrary key-value attributes. A span can have a parent span. The root span has no parent.

In 2019, the OpenTracing and OpenCensus projects merged to form OpenTelemetry, a single vendor-neutral instrumentation framework. Today, OpenTelemetry (OTEL) is the de facto standard for distributed tracing across the industry. If you're using Datadog, Honeycomb, New Relic, or Grafana Tempo, OTEL is almost certainly how your data gets in.

The Anatomy of a Span

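A span is the atomic unit of observability. Here's a simplified, OTEL-style span rendered as JSON. The field names are shortened for readability: OTLP itself calls the operation field name and records start and end timestamps rather than a duration. Duration values in this article's examples are in milliseconds.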

{
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "parentSpanId": "00f067aa0ba902b6",
  "operationName": "api.search",
  "startTime": "2026-05-19T03:12:44.000Z",
  "duration": 342,
  "status": { "code": "OK" },
  "attributes": {
    "http.method": "GET",
    "http.url": "https://api.example.com/search",
    "http.status_code": 200,
    "service.name": "search-service"
  }
}

The parentSpanId is the link that turns a flat list of spans into a tree: every span except the root records which span called it. This parent-child relationship is the foundation of causal attribution. You can walk the tree from any leaf span back to the root, following the chain of what caused what.
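The walk from a leaf back to the root can be sketched in a few lines of Python. This is a minimal illustration, assuming spans are plain dictionaries shaped like the JSON example above; path_to_root is a hypothetical helper, not part of any OTEL SDK, and the short span IDs are made up for readability.

```python
from typing import Optional


def path_to_root(spans: list[dict], span_id: str) -> list[str]:
    """Return span IDs from the given span up to the root, following parentSpanId."""
    by_id = {s["spanId"]: s for s in spans}
    path: list[str] = []
    seen: set[str] = set()
    current: Optional[str] = span_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at span {current}")
        if current not in by_id:
            # A missing parent usually means the trace fragmented across systems.
            raise KeyError(f"span {current} is missing from the trace")
        seen.add(current)
        path.append(current)
        # Treat an absent or empty parentSpanId as "this is the root".
        current = by_id[current].get("parentSpanId") or None
    return path


# Hypothetical three-span trace: root -> tool call -> HTTP request.
spans = [
    {"spanId": "a1", "parentSpanId": None, "operationName": "orchestrator.plan"},
    {"spanId": "b2", "parentSpanId": "a1", "operationName": "web_search.call"},
    {"spanId": "c3", "parentSpanId": "b2", "operationName": "http.get"},
]

print(path_to_root(spans, "c3"))  # ['c3', 'b2', 'a1']
```

The explicit cycle and missing-parent checks matter in practice: a trace that was stitched together from several collectors is not guaranteed to be a clean tree.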

The Evolution: From HTTP to LLM

For a decade, distributed tracing was almost entirely about HTTP services. The failure modes were predictable: slow database queries, network timeouts, misconfigured load balancers. Spans tracked latency and status codes.

Then large language models arrived in production systems, and the tracing model cracked.

The problem is that LLM calls aren't like HTTP calls. An HTTP call to a search API either returns results or it doesn't. An LLM call returns a token stream that might contain correct information, hallucinated information, a perfectly formatted response, or a response that ignores the instructions in the prompt. A "success" status code tells you the call completed, not whether the answer is right. You need to understand the content, not just the timing.

The early LLM tracing tools emerged from the framework ecosystem:

  • LangSmith (2023, LangChain) — first-party tracing for LangChain runs. Made it possible to view prompt inputs, model outputs, and tool call results for a LangChain agent run. Framework-specific: if you weren't using LangChain, you got nothing.
  • Langfuse (2023, open-source) — framework-agnostic LLM observability. Introduced the concept of generation spans (one per LLM call) and trace as a first-class object with user/session metadata. Crucially, Langfuse introduced the idea of evaluation scores attached to spans — a precursor to attribution.
  • Helicone, Arize Phoenix, Braintrust — various tools that emerged in 2023-2024, each with a different take on what LLM observability should look like.

Each had its own schema. A LangSmith trace and a Langfuse trace described the same run in incompatible formats. The ecosystem fragmented.

The Convergence: OTEL Gen-AI Semantic Conventions

OpenTelemetry's answer was the gen-ai semantic conventions — a standard vocabulary for LLM-related spans. The goal: any compliant instrumentation library produces spans that any compliant backend can understand.

The gen-ai semconv introduced a new set of span attributes:

Attribute | Description | Example
--- | --- | ---
gen_ai.system | The LLM provider | "openai", "anthropic"
gen_ai.request.model | The model requested | "gpt-4o", "claude-3-7-sonnet"
gen_ai.response.model | The model that actually responded | "gpt-4o-2026-03-01"
gen_ai.usage.input_tokens | Prompt token count | 1247
gen_ai.usage.output_tokens | Completion token count | 384
gen_ai.request.temperature | Temperature setting | 0.7
gen_ai.tool.name | Name of a tool call | "search_web"
gen_ai.tool.call.id | Unique tool call ID | "call_abc123"
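One practical payoff of a shared vocabulary is that aggregation code no longer needs to know which library produced a span. The sketch below sums token usage per model using only the attribute names from the table. It assumes spans are plain dictionaries with an attributes map; token_usage_by_model is a hypothetical helper, and the sample spans are invented for illustration.

```python
from collections import defaultdict


def token_usage_by_model(spans):
    """Sum input and output tokens per model across all LLM spans in a trace."""
    totals = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})
    for span in spans:
        attrs = span.get("attributes", {})
        # Prefer the model that actually responded; fall back to the requested one.
        model = attrs.get("gen_ai.response.model") or attrs.get("gen_ai.request.model")
        if model is None:
            continue  # not an LLM span (e.g. a plain HTTP call)
        totals[model]["input_tokens"] += attrs.get("gen_ai.usage.input_tokens", 0)
        totals[model]["output_tokens"] += attrs.get("gen_ai.usage.output_tokens", 0)
    return dict(totals)


# Invented sample: two LLM spans on the same model plus one non-LLM span.
sample_spans = [
    {"attributes": {"gen_ai.request.model": "gpt-4o",
                    "gen_ai.usage.input_tokens": 1247,
                    "gen_ai.usage.output_tokens": 384}},
    {"attributes": {"gen_ai.request.model": "gpt-4o",
                    "gen_ai.usage.input_tokens": 3847,
                    "gen_ai.usage.output_tokens": 212}},
    {"attributes": {"http.method": "GET"}},
]

usage = token_usage_by_model(sample_spans)
print(usage)  # {'gpt-4o': {'input_tokens': 5094, 'output_tokens': 596}}
```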

Here's what an annotated gen-ai span looks like in practice:

{
  "traceId": "7f2c9a4d1e8b3f6c9d2e5a8b1f4c7d0e",
  "spanId": "3a5f8c2d9b1e4f7a",
  "parentSpanId": "1c4e7a0d3f6b9c2e",
  "operationName": "gen_ai.chat",
  "startTime": "2026-05-19T14:22:31.100Z",
  "duration": 2341,
  "status": { "code": "OK" },
  "attributes": {
    "gen_ai.system": "anthropic",
    "gen_ai.request.model": "claude-sonnet-4-6",
    "gen_ai.response.model": "claude-sonnet-4-6",
    "gen_ai.usage.input_tokens": 3847,
    "gen_ai.usage.output_tokens": 212,
    "gen_ai.request.temperature": 0.0,
    "gen_ai.tool.name": "search_codebase",
    "gen_ai.tool.call.id": "call_7x9q2m",
    "service.name": "code-review-agent",
    "rj.span.role": "tool_call"
  },
  "events": [
    {
      "name": "gen_ai.tool.message",
      "timestamp": "2026-05-19T14:22:33.441Z",
      "attributes": {
        "gen_ai.tool.call.id": "call_7x9q2m",
        "gen_ai.event.content": "{\"files\": [], \"error\": \"index out of range\"}"
      }
    }
  ]
}

The events array is critical. This is where tool call results live — and this is where failures hide. A span with status: OK can contain a tool call result that is silently malformed, an error the agent will misinterpret, or data that will corrupt the next step in the chain. Status codes don't tell you this. The event content does.
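A content check of this kind can be sketched as follows. This is a minimal illustration, not a complete detector: it assumes spans are plain dictionaries shaped like the example above, that tool results arrive as JSON strings under gen_ai.event.content, and that a truthy "error" key in the payload signals a failure. That last convention is an assumption about the tool, and find_silent_failures is a hypothetical helper.

```python
import json


def find_silent_failures(span):
    """Return events whose payload looks broken even though the span status is OK."""
    if span.get("status", {}).get("code") != "OK":
        return []  # the span already failed visibly; nothing is hidden
    flagged = []
    for index, event in enumerate(span.get("events", [])):
        raw = event.get("attributes", {}).get("gen_ai.event.content")
        if raw is None:
            continue
        try:
            payload = json.loads(raw)
        except (TypeError, ValueError):
            flagged.append({"event_index": index, "reason": "content is not valid JSON"})
            continue
        # Assumed convention: the tool reports problems under an "error" key.
        if isinstance(payload, dict) and payload.get("error"):
            flagged.append({"event_index": index,
                            "reason": f"payload error: {payload['error']}"})
    return flagged


span = {
    "spanId": "3a5f8c2d9b1e4f7a",
    "status": {"code": "OK"},
    "events": [
        {
            "name": "gen_ai.tool.message",
            "attributes": {
                "gen_ai.tool.call.id": "call_7x9q2m",
                "gen_ai.event.content": '{"files": [], "error": "index out of range"}',
            },
        }
    ],
}

findings = find_silent_failures(span)
print(findings)  # [{'event_index': 0, 'reason': 'payload error: index out of range'}]
```

Returning the event index alongside the reason keeps the finding citable: a reviewer can jump straight to the event that carried the bad payload.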

The Timeline

Year | Event
--- | ---
2010 | Google publishes Dapper; distributed tracing enters the field
2012 | Zipkin released by Twitter, first major open-source tracer
2016 | Jaeger released by Uber; OpenTracing standard proposed
2018 | OpenCensus (Google) gains traction as alternative to OpenTracing
2019 | OpenTracing + OpenCensus merge to form OpenTelemetry (CNCF)
2021 | OTEL tracing specification reaches 1.0; becomes de facto standard
2022 | LangChain launches; LangSmith starts as internal tooling
2023 | LangSmith public launch; Langfuse open-sources; gen-ai observability category emerges
2024 | OTEL gen-ai semantic conventions (experimental); OpenLLMetry bridges OTEL and LLM calls
2025 | OTEL gen-ai semconv hits stable; tool calls become first-class spans; multi-agent tracing formalized
2026 | Span-level failure attribution enters the picture; tracing data becomes regression test input

Parent-Child Relationships in Agent Traces

What makes agent traces different from service traces isn't the span format — it's the semantics of the parent-child relationship.

In a microservice trace, the parent-child relationship represents delegation: the parent service called the child service. The relationship is structural and mechanical.

In an agent trace, the parent-child relationship represents reasoning: the orchestrator decided to invoke a sub-agent, which decided to call a tool, which returned a result that informed a reasoning step, which produced an output that became the input to the next tool call. The chain is not just computational — it's inferential.

This matters for attribution. When a multi-agent system fails, the fault isn't necessarily in the span that raised the error. The fault is often in a span that produced a subtly wrong output that cascaded through the reasoning chain until it caused an observable failure three or four hops later. Tracing gives you the chain; attribution tells you which link broke.

A five-hop chain looks like this in span ID space:

trace: a3f9b2c1...
└── span: 001  orchestrator.plan          [root]
    ├── span: 002  web_search.call        [child of 001]
    │   └── span: 003  http.get           [child of 002]
    ├── span: 004  synthesize.results     [child of 001]
    │   └── span: 005  llm.completion     [child of 004]  ← failure here
    └── span: 006  format.output          [child of 001]

Span 005 is where the LLM call hallucinates a fact. Span 006 formats the output faithfully, propagating the error, so the trace-level result is "format.output succeeded." The attribution, however, points to span 005, labeled l4_category: "Incorrect Problem Identification".
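The gap between what status codes report and what attribution cites can be made concrete with a small sketch. It assumes each span already carries a content_ok verdict from some upstream evaluator and a start_ms offset. Both fields, the timings, and the attribute_failure helper are hypothetical, and "earliest failing span" is a deliberately simple stand-in for a real attribution method.

```python
def status_failures(spans):
    """Span IDs that a status-code check alone would report."""
    return [s["id"] for s in spans if s["status"] != "OK"]


def attribute_failure(spans):
    """Cite the earliest span whose content check failed, or None if all passed."""
    failed = [s for s in spans if not s["content_ok"]]
    if not failed:
        return None
    return min(failed, key=lambda s: s["start_ms"])["id"]


# The six spans from the tree above, with invented timings and verdicts.
# Span 006 also carries the bad fact, but only because 005 produced it first.
spans = [
    {"id": "001", "name": "orchestrator.plan", "start_ms": 0, "status": "OK", "content_ok": True},
    {"id": "002", "name": "web_search.call", "start_ms": 10, "status": "OK", "content_ok": True},
    {"id": "003", "name": "http.get", "start_ms": 15, "status": "OK", "content_ok": True},
    {"id": "004", "name": "synthesize.results", "start_ms": 400, "status": "OK", "content_ok": True},
    {"id": "005", "name": "llm.completion", "start_ms": 410, "status": "OK", "content_ok": False},
    {"id": "006", "name": "format.output", "start_ms": 2800, "status": "OK", "content_ok": False},
]

print(status_failures(spans))    # []
print(attribute_failure(spans))  # 005
```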

Where Gen-AI Tracing Is Heading

Three trends are converging:

1. Structured tool calls as first-class spans. The gen-ai semconv is moving toward treating every tool call (and its result) as its own span, not just an event on the parent LLM call span. This means tool-level failure attribution becomes structurally precise — you can cite span_id: 003 instead of span_id: 002 with event_index: 1. ProveAI Origin already parses both formats; see how it handles these at /app/new.

2. Multi-agent propagation standards. When Agent A calls Agent B over HTTP or a message queue, how does the trace ID propagate? The W3C Trace Context standard handles this for HTTP, but agent-to-agent calls via function calling, tool use, or direct API don't always respect W3C headers. A trace might fragment across two systems with no shared context. The OTEL working group is actively defining multi-agent context propagation for exactly this scenario.

3. Tracing the coding agent itself. The next frontier is instrumenting the coding agent's own reasoning — not just the agent it's writing, but the agent doing the writing. When Cursor, Claude Code, or Devin makes a multi-step change to a codebase, each step in that reasoning chain is a span. If the coding agent hallucinates a dependency, misreads the test output, or makes a change that introduces a regression, those failures have a span-level signature too. The tools that capture and attribute those traces will define the next decade of AI engineering observability.
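For the second trend, the HTTP half of the problem is already well specified. A version 00 traceparent header is four lowercase hex fields joined by hyphens: version, a 32-character trace ID, a 16-character parent span ID, and trace flags. The sketch below builds and parses that format using only the standard library. The function names are hypothetical, and later header versions are deliberately not handled.

```python
import re

# Version 00 layout: 00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def build_traceparent(trace_id, parent_span_id, sampled=True):
    """Format a version 00 traceparent header value."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{parent_span_id}-{flags}"


def parse_traceparent(header):
    """Parse a version 00 traceparent header, rejecting malformed values."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        raise ValueError("malformed traceparent header")
    trace_id, parent_span_id, flags = match.groups()
    # All-zero IDs are explicitly invalid in the W3C Trace Context spec.
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        raise ValueError("traceparent contains an all-zero ID")
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


header = build_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
print(header)  # 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
print(parse_traceparent(header)["parent_span_id"])  # 00f067aa0ba902b7
```

Agent B parses the header it receives and starts its own spans with the same trace ID and the parsed span ID as parent, which is what keeps the two halves of the run in one trace.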

What This Means for You

If you're building a multi-agent system today, the practical takeaways are:

  • Use OTEL gen-ai semconv if you're instrumenting from scratch. It's the converging standard, and every major backend (Langfuse, LangSmith, Phoenix, Honeycomb) either supports OTEL-native ingest already or is about to ship it.
  • Capture event content, not just span metadata. Status codes lie. The failure is in the payload.
  • Ensure parent-child propagation across agent boundaries. If you're calling a sub-agent over HTTP, pass the W3C traceparent header. If you're calling via tool use, embed the parent span ID in the tool call metadata.
  • Treat your traces as test inputs, not just logs. Every production failure trace is a regression test waiting to be named. ProveAI Origin automates that labeling.
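When the hop between agents is a tool call rather than an HTTP request, there is no header to carry the context, so it has to travel inside the call itself. The sketch below shows one way to do that. It assumes tool calls are plain dictionaries with room for a metadata field; the trace_context key and both helper functions are hypothetical conventions, not part of any standard.

```python
import copy
import secrets

TRACE_KEY = "trace_context"  # hypothetical metadata key, agreed on by both agents


def inject_trace_context(tool_call, trace_id, parent_span_id):
    """Return a copy of the tool call that carries the caller's trace context."""
    enriched = copy.deepcopy(tool_call)
    enriched.setdefault("metadata", {})[TRACE_KEY] = {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
    }
    return enriched


def start_child_span(tool_call, name):
    """Start a span on the receiving side, linked to the caller when context is present."""
    context = tool_call.get("metadata", {}).get(TRACE_KEY)
    if context is None:
        # No context arrived: this span becomes the root of a new, disconnected trace.
        trace_id, parent_span_id = secrets.token_hex(16), None
    else:
        trace_id, parent_span_id = context["trace_id"], context["parent_span_id"]
    return {
        "traceId": trace_id,
        "spanId": secrets.token_hex(8),  # 16 hex characters
        "parentSpanId": parent_span_id,
        "operationName": name,
    }


call = {"name": "search_codebase", "arguments": {"query": "retry logic"}}
sent = inject_trace_context(call, "7f2c9a4d1e8b3f6c9d2e5a8b1f4c7d0e", "3a5f8c2d9b1e4f7a")
child = start_child_span(sent, "search_codebase.run")
orphan = start_child_span(call, "search_codebase.run")

print(child["parentSpanId"])   # 3a5f8c2d9b1e4f7a
print(orphan["parentSpanId"])  # None
```

The orphan case is the fragmentation failure in miniature: the work still runs, but its spans land in a trace that shares nothing with the caller's.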

Try It

Paste a failing agent trace at /app/new and see which span ProveAI Origin cites as the root cause — with evidence from the span's content, not just its status code. Browse the heatmap of your most-cited failure spans at /app/insights/heatmap.

Continue reading: Agent Architectures walks through how different multi-agent designs produce different trace shapes — and different failure patterns. Core Principles explains why span-level granularity is the only granularity that matters for attribution.
