For agentic developers reviewing AI-authored PRs

Snapshot the failure. Replay every change.

Span-level attribution. Snapshot suites + replay against AI-authored PRs.

Or scroll for the 6-step walkthrough — each step has a Show me trigger.

64% step@1 · ~5s P50 · $0.0006 / case · 4+ tracers · See the bench →
The walkthrough

From a failing trace to a green PR check

Each panel below shows one step of the cycle. Click Show me on any panel to pin a live-shaped sample.

  1. Trace

    Wherever your agent already ships traces, we ingest. Above your tracer — no SDK swap.

  2. Find patterns

    Semantic clustering on L1 × L4 — not just lexical. The cited cause IS the label.

  3. Make the change

    Cursor, Claude Code, Codex, or your own hands. We don't care who edits — we verify what changed.

  4. Failure

    our spike

    When a candidate patch breaks something else, the snapshot's cited cause shifts — same span, different L4. That's how we detect a regression.

  5. Test

    our spike

    Snapshot suites + K=N counterfactual replay. Every cited failure auto-labels as a regression test — no human-curated dataset.

  6. Regression check + Ship

    Framework-neutral CI gate. PR comment shows fixed vs regressed snapshots. Merge with confidence — or block before it lands.
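
The regression rule in step 4 and the fixed vs regressed split in step 6 can be sketched in a few lines. This is a minimal illustration, not the product's implementation: `Attribution`, `classify`, `gate`, the `"moved"` bucket, and the `label_a` / `label_b` category names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Attribution:
    """Cited cause for one snapshot (hypothetical shape)."""
    span_id: str
    l4: Optional[str]  # None means the replay passed and no cause was cited


def classify(baseline: Attribution, candidate: Attribution) -> str:
    """Compare a snapshot's cited cause before and after a candidate patch."""
    if candidate.l4 is None:
        return "fixed"
    if candidate.span_id != baseline.span_id:
        return "moved"  # cause now cited on a different span; needs review
    if candidate.l4 != baseline.l4:
        return "regressed"  # same span, different L4
    return "unchanged"


def gate(results: Dict[str, str]) -> bool:
    """Allow the merge only if no snapshot regressed."""
    return "regressed" not in results.values()


baseline = Attribution("span-7", "label_a")
results = {
    "snap-1": classify(baseline, Attribution("span-7", None)),
    "snap-2": classify(baseline, Attribution("span-7", "label_b")),
}
print(results, "merge allowed:", gate(results))
```

A real gate would likely also report the `"moved"` and `"unchanged"` snapshots in the PR comment rather than dropping them.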

What works today

Three primitives, one loop

Each capability is its own swappable module — versioned, pinned per attribution, and recombinable.

Ingest

Above your tracer, inside your coding agent.

  • OTEL gen-ai semconv · Flat semconv; OTLP wire normalization Sprint 18
  • LangSmith export
  • Custom JSON
  • Langfuse bridge
  • LangSmith pull bridge
  • Claude Code / Codex CLI (MCP) · Private prototype; Sprint 19 publish
  • Phoenix bridge · On the roadmap
  • Cursor extension · Coming soon
  • Devin webhook · Coming soon
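
One way to picture ingesting custom JSON next to OTEL gen-ai attributes is to flatten nested span attributes into dotted keys. The sketch below assumes that reading of "flat semconv"; the `flatten` helper and the sample payload are hypothetical, and only the `gen_ai.*` key names come from the OpenTelemetry gen-ai semantic conventions.

```python
from typing import Any, Dict


def flatten(attrs: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested span attributes into dotted keys."""
    flat: Dict[str, Any] = {}
    for key, value in attrs.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the dotted prefix along.
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


# Hypothetical custom-JSON span attributes.
nested = {
    "gen_ai": {
        "request": {"model": "example-model"},
        "usage": {"input_tokens": 120, "output_tokens": 48},
    }
}
print(flatten(nested))
```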
Attribute

Span-level cause with substring-level evidence.

  • Span-level rank-1 cause
  • L1 axis (execution / semantic)
  • L4 category (14 TRAIL labels)
  • Cited evidence quote
  • Cited, evidence-grounded verdicts — see /bench
  • Cascade preset (Fast → Standard)
  • K-consensus (K=3)
  • Multi-judge · Research
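
K-consensus can be read as a majority vote over K independent attribution runs. The sketch below assumes that interpretation with K=3; `k_consensus`, the strict-majority rule, and the `label_a` / `label_b` names are hypothetical, not the product's API.

```python
from collections import Counter
from typing import List, Optional, Tuple

Vote = Tuple[str, str]  # (span_id, l4_label)


def k_consensus(votes: List[Vote]) -> Tuple[Optional[Vote], float]:
    """Return the majority (span, label) vote and its agreement ratio."""
    if not votes:
        return None, 0.0
    winner, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    # Require a strict majority; otherwise report no consensus.
    if count * 2 <= len(votes):
        return None, agreement
    return winner, agreement


# Three hypothetical judge runs over the same trace.
votes = [("span-3", "label_a"), ("span-3", "label_a"), ("span-5", "label_b")]
print(k_consensus(votes))
```

Returning the agreement ratio alongside the winner lets a caller tell a 3-of-3 verdict from a 2-of-3 one.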
Replay

Auto-labeled snapshots from production failures.

  • K=N counterfactual replay
  • Snapshot suites
  • Verify-fix (patch → replay → score)
  • GitHub Action CI gate
  • Auto-PR on verified fix · Experiment; webhook idempotence Sprint 18
  • Drift detection from prod
  • Time-travel replay · On the roadmap
  • Simulation env · On the roadmap
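
The verify-fix loop (patch → replay → score) can be sketched as K replays of the same snapshot with and without the patch, followed by a significance check. The one-sided Fisher exact test here is a stand-in chosen for illustration; the page does not say which test is used. `pass_count`, `fisher_one_sided`, `verify_fix`, and the `alpha` threshold are hypothetical names.

```python
import math
from typing import Callable, Dict


def pass_count(run_once: Callable[[], bool], k: int) -> int:
    """Replay the snapshot k times and count passing runs."""
    return sum(1 for _ in range(k) if run_once())


def fisher_one_sided(patched_pass: int, baseline_pass: int, k: int) -> float:
    """P-value for seeing at least this many patched passes by chance."""
    total_pass = patched_pass + baseline_pass
    upper = min(k, total_pass)
    # Hypergeometric upper tail over the 2 x 2 pass/fail table.
    tail = sum(
        math.comb(total_pass, x) * math.comb(2 * k - total_pass, k - x)
        for x in range(patched_pass, upper + 1)
    )
    return tail / math.comb(2 * k, k)


def verify_fix(baseline_run: Callable[[], bool],
               patched_run: Callable[[], bool],
               k: int = 5, alpha: float = 0.05) -> Dict[str, object]:
    """Score a patch by replaying the snapshot k times on each side."""
    b = pass_count(baseline_run, k)
    p = pass_count(patched_run, k)
    p_value = fisher_one_sided(p, b, k)
    return {
        "baseline_pass_rate": b / k,
        "patched_pass_rate": p / k,
        "p_value": p_value,
        "verified": p > b and p_value < alpha,
    }


# Stand-in replays: the baseline always fails, the patched agent always passes.
print(verify_fix(lambda: False, lambda: True, k=5))
```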
Vs. the field

What only we do today

Capabilities checked against the public-facing product surface of each vendor as of 2026-05-19.

Capability · Engine · Braintrust · Promptfoo · RJ
Span-level attribution
Auto-labeled snapshots from prod traces
Counterfactual replay + significance
Cross-tracer ingest · partial
Verify selected GitHub PRs against snapshot suites · org-scale idempotence Sprint 18 · partial
Framework lock-in · LangCh · own · none · none

Comparisons are best-effort from public surfaces. Mistakes → tell us.

The bench

We rank the models, not just the claims

Hand every model the same trace and the same prompt at temperature 0, then measure who finds the true root cause — at what cost and how fast. Preliminary, with confidence intervals shown.
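
As a rough sketch of the measurement, step@1 can be taken as the share of traces where a model's rank-1 span matches the labeled root cause, reported with a confidence interval. The Wilson score interval is one common choice for small samples and is an assumption here, not the bench's documented method; `step_at_1` and `wilson_interval` are hypothetical helpers and the span ids are made up.

```python
import math
from typing import List, Tuple


def step_at_1(predicted: List[str], truth: List[str]) -> float:
    """Share of traces where the rank-1 predicted span is the true cause."""
    hits = sum(p == t for p, t in zip(predicted, truth))
    return hits / len(truth)


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Made-up predictions for four traces.
predicted = ["span-2", "span-5", "span-1", "span-9"]
truth = ["span-2", "span-4", "span-1", "span-9"]
print(step_at_1(predicted, truth), wilson_interval(3, 4))
```

On a small corpus the interval is wide, which is why the bench is marked preliminary.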

Cost vs quality

step@1 accuracy against price per call — find the frontier.

Fastest models

median and p95 latency per attribution, lower is faster.

Trace length

how cost and latency scale from 3 to 60-span traces.

Failure categories

which categories the judges reach for across the corpus.

See the bench · Read the methodology · Preliminary · n=22 corpus
On the roadmap

What we're still building

In active development
  • Reliable PR auto-comments at org scale (webhook idempotence)
  • Cursor agent-mode extension
  • On-prem / Docker-compose deploy
  • Devin webhook integration
On the roadmap
  • Phoenix tracer bridge
  • Time-travel replay (rewind any span)
  • Simulation env for agent-vs-agent runs
  • Public CLI + API-key issuance
On the bench
  • Multi-judge ensembles (research)
  • Team accounts + SSO
  • Usage metering + billing

Paste a trace. See the cause.

No SDK. No instrumentation change. From paste to a rank-1 cited span — seconds, not minutes.