For agentic developers reviewing AI-authored PRs

Snapshot the failure. Replay every change.

Span-level attribution. Snapshot suites + replay against AI-authored PRs.

Try it now See the bench Try the GAIA sample

Or scroll for the 6-step walkthrough — each step has a Show me trigger.

64% step@1 · ~5s P50 · $0.0006 / case · 4+ tracers · See the bench →

The walkthrough

From a failing trace to a green PR check

Each panel below shows one step of the cycle. Click Show me on any panel to pin a live-shaped sample.

01
Trace
Wherever your agent already ships traces, we ingest. Above your tracer — no SDK swap.
02
Find patterns
Semantic clustering on L1 × L4 — not just lexical. The cited cause IS the label.
03
Make the change
Cursor, Claude Code, Codex, or your own hands. We don't care who edits — we verify what changed.
04
Failure
our spike
When a candidate patch breaks something else, the snapshot's cited cause shifts — same span, different L4. That's how we detect a regression.
05
Test
our spike
Snapshot suites + K=N counterfactual replay. Every cited failure auto-labels as a regression test — no human-curated dataset.
06
Regression check + Ship
Framework-neutral CI gate. PR comment shows fixed vs regressed snapshots. Merge with confidence — or block before it lands.
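The regression check in steps 04 and 06 can be pictured as a diff over cited causes. Below is a minimal sketch in Python, assuming a hypothetical `Attribution` record (cited span id plus L4 label, with `None` meaning the replay no longer fails). None of these names come from the product's API, and treating every other change as "unchanged" is a simplification for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Attribution:
    """Hypothetical record: the rank-1 cited cause for one snapshot replay."""
    span_id: str
    l4_label: Optional[str]  # None means the replay no longer fails


def classify(before: Attribution, after: Attribution) -> str:
    """Compare the cited cause before and after a candidate patch."""
    if after.l4_label is None:
        # The failure is gone (or there never was one).
        return "fixed" if before.l4_label is not None else "unchanged"
    if before.l4_label is None:
        return "regressed"  # was passing, now fails
    if before.span_id == after.span_id and before.l4_label != after.l4_label:
        return "regressed"  # same span, different L4: the cited cause shifted
    return "unchanged"


def check_suite(baseline: Dict[str, Attribution],
                candidate: Dict[str, Attribution]) -> Dict[str, str]:
    """Classify every snapshot in the suite by its id."""
    return {sid: classify(baseline[sid], candidate[sid]) for sid in baseline}


def gate(results: Dict[str, str]) -> bool:
    """CI gate: allow the merge only when no snapshot regressed."""
    return "regressed" not in results.values()
```

A PR comment could then list the snapshot ids grouped by `fixed` and `regressed`, and the check would block whenever `gate` returns `False`.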

What works today

Three primitives, one loop

Each capability is its own swappable module — versioned, pinned per attribution, and recombinable.

Ingest

Above your tracer, inside your coding agent.

OTEL gen-ai semconv (flat semconv; OTLP wire normalization: Sprint 18)
LangSmith export
Custom JSON
Langfuse bridge
LangSmith pull bridge
Claude Code / Codex CLI (MCP): private prototype, Sprint 19 publish
Phoenix bridge: on the roadmap
Cursor extension: coming soon
Devin webhook: coming soon

Attribute

Span-level cause with substring-level evidence.

Span-level rank-1 cause
L1 axis (execution / semantic)
L4 category (14 TRAIL labels)
Cited evidence quote
Cited, evidence-grounded verdicts — see /bench
Cascade preset (Fast → Standard)
K-consensus (K=3)
Multi-judge: research
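K-consensus can be read as running the same attribution K times and keeping a verdict only when enough runs agree. Here is a minimal sketch, assuming a strict-majority rule and a hypothetical `(span_id, l4_label)` verdict tuple. The page does not state the actual agreement rule, so treat both as illustrative.

```python
from collections import Counter
from typing import List, Optional, Tuple

Verdict = Tuple[str, str]  # (span_id, l4_label), a hypothetical shape


def k_consensus(verdicts: List[Verdict],
                min_agree: Optional[int] = None) -> Optional[Verdict]:
    """Return the verdict shared by enough of the K runs, else None.

    By default "enough" is a strict majority (2 of 3 when K=3).
    """
    if not verdicts:
        return None
    k = len(verdicts)
    threshold = min_agree if min_agree is not None else k // 2 + 1
    verdict, count = Counter(verdicts).most_common(1)[0]
    return verdict if count >= threshold else None
```

With K=3, two runs citing the same span and label are enough to return that verdict; three different answers return `None`, which a caller could treat as "no confident attribution".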

Replay

Auto-labeled snapshots from production failures.

K=N counterfactual replay
Snapshot suites
Verify-fix (patch → replay → score)
GitHub Action CI gate
Auto-PR on verified fix: experiment, webhook idempotence Sprint 18
Drift detection from prod
Time-travel replay: on the roadmap
Simulation env: on the roadmap
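One way to read verify-fix (patch → replay → score) together with K=N counterfactual replay: replay the snapshot N times without the patch and N times with it, then ask whether the drop in failures is larger than chance. The sketch below uses a one-sided Fisher's exact test, chosen here for illustration because it behaves well at small N. The page does not name the test the product uses, and `verify_fix` with its `alpha` threshold is hypothetical.

```python
from math import comb


def fisher_one_sided(fail_before: int, n_before: int,
                     fail_after: int, n_after: int) -> float:
    """P-value for "the patch reduces the failure rate".

    Probability, assuming the patch has no effect, of seeing fail_after
    or fewer failures among the patched replays (hypergeometric tail).
    """
    total = n_before + n_after
    total_fail = fail_before + fail_after
    total_pass = total - total_fail
    denom = comb(total, n_after)
    lowest = max(0, n_after - total_pass)  # fewest failures possible
    p = 0.0
    for x in range(lowest, fail_after + 1):
        p += comb(total_fail, x) * comb(total_pass, n_after - x) / denom
    return p


def verify_fix(fail_before: int, n_before: int,
               fail_after: int, n_after: int, alpha: float = 0.05) -> bool:
    """Hypothetical scoring rule: accept the fix when p is below alpha."""
    return fisher_one_sided(fail_before, n_before, fail_after, n_after) < alpha
```

For example, a snapshot that fails 5 of 5 baseline replays and 0 of 5 patched replays would be accepted, while 3 of 5 failures on both sides would not.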

Vs. the field

What only we do today

Capabilities checked against the public-facing product surface of each vendor as of 2026-05-19.

Capability	Engine	Braintrust	Promptfoo	RJ
Span-level attribution
Auto-labeled snapshots from prod traces
Counterfactual replay + significance
Cross-tracer ingest		partial
Verify selected GitHub PRs against snapshot suites (org-scale idempotence: Sprint 18)				partial
Framework lock-in	LangCh	own	none	none

Comparisons are best-effort from public surfaces. Mistakes → tell us.

The bench

We rank the models, not just the claims

Hand every model the same trace and the same prompt at temperature 0, then measure who finds the true root cause — at what cost and how fast. Preliminary, with confidence intervals shown.
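To make the metric concrete, here is a minimal sketch, assuming step@1 means "the rank-1 cited span equals the labeled root-cause span" and using a Wilson score interval for the confidence band. Both the definition and the interval choice are assumptions for illustration; the methodology page is the authority on what the bench actually computes.

```python
from math import sqrt
from typing import List, Tuple


def step_at_1(predicted: List[str], truth: List[str]) -> float:
    """Fraction of cases whose rank-1 cited span matches the labeled span."""
    if not truth or len(predicted) != len(truth):
        raise ValueError("predicted and truth must be non-empty and equal length")
    hits = sum(p == t for p, t in zip(predicted, truth))
    return hits / len(truth)


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

On a small corpus the interval is wide, which is a reason to show it next to the point estimate rather than report accuracy alone.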

Cost vs quality

step@1 accuracy against price per call — find the frontier.

Fastest models

Median and p95 latency per attribution; lower is faster.

Trace length

How cost and latency scale from 3-span to 60-span traces.

Failure categories

Which categories the judges reach for across the corpus.

See the bench · Read the methodology
Preliminary · n=22 corpus

On the roadmap

What we're still building

In active development

Reliable PR auto-comments at org scale (webhook idempotence)
Cursor agent-mode extension
On-prem / Docker-compose deploy
Devin webhook integration

On the roadmap

Phoenix tracer bridge
Time-travel replay (rewind any span)
Simulation env for agent-vs-agent runs
Public CLI + API-key issuance

On the bench

Multi-judge ensembles (research)
Team accounts + SSO
Usage metering + billing

Paste a trace. See the cause.

No SDK. No instrumentation change. From paste to a rank-1 cited span — seconds, not minutes.
