Snapshot the failure. Replay every change.
Span-level attribution. Snapshot suites + replay against AI-authored PRs.
Or scroll for the 6-step walkthrough — each step has a Show me trigger.
From a failing trace to a green PR check
Each panel below shows one step of the cycle. Click Show me on any panel to pin a live-shaped sample.
Three primitives, one loop
Each capability is its own swappable module — versioned, pinned per attribution, and recombinable.
What only we do today
Capabilities checked against the public-facing product surface of each vendor as of 2026-05-19.
| Capability | Engine | Braintrust | Promptfoo | RJ |
|---|---|---|---|---|
| Span-level attribution | | | | |
| Auto-labeled snapshots from prod traces | | | | |
| Counterfactual replay + significance | | | | |
| Cross-tracer ingest | partial | | | |
| Verify selected GitHub PRs against snapshot suites (org-scale idempotence: Sprint 18) | partial | | | |
| Framework lock-in | LangCh | own | none | none |
Comparisons are best-effort, based on public surfaces. Spotted a mistake? Tell us.
We rank the models, not just the claims
Hand every model the same trace and the same prompt at temperature 0, then measure who finds the true root cause — at what cost and how fast. Preliminary, with confidence intervals shown.
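To make the scoring concrete, here is a minimal sketch of how a step@1 score and its confidence interval could be computed. It assumes step@1 means "the model's rank-1 span matches the labeled root-cause span" and uses a Wilson score interval; the span ids, the sample data, and both function names are hypothetical, not the product's actual implementation.

```python
import math


def step_at_1(predicted_spans, true_spans):
    """Count traces where the rank-1 predicted span equals the labeled root cause."""
    hits = sum(p == t for p, t in zip(predicted_spans, true_spans))
    return hits, len(true_spans)


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% when z=1.96)."""
    if n == 0:
        raise ValueError("need at least one trace")
    p = hits / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical data: rank-1 span id per trace vs. the labeled root cause.
predicted = ["s3", "s7", "s2", "s9", "s1"]
truth = ["s3", "s7", "s4", "s9", "s1"]

hits, n = step_at_1(predicted, truth)
low, high = wilson_interval(hits, n)
print(f"step@1 = {hits / n:.2f} (95% CI {low:.2f} to {high:.2f})")
```

The Wilson interval is one reasonable choice for small corpora because it stays inside [0, 1] even when accuracy is near either extreme.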
Cost vs quality
step@1 accuracy against price per call — find the frontier.
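One way to read "the frontier" is as a Pareto frontier: the models that no other model beats on both price and accuracy. The sketch below assumes that reading; the model names, prices, and accuracies are made-up placeholders.

```python
def pareto_frontier(models):
    """Return names of models not dominated on (lower cost, higher accuracy)."""
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; on a cost tie, try the more accurate first.
    for name, cost, accuracy in sorted(models, key=lambda m: (m[1], -m[2])):
        # A model joins the frontier only if it beats every cheaper model on accuracy.
        if accuracy > best_accuracy:
            frontier.append(name)
            best_accuracy = accuracy
    return frontier


# Hypothetical (name, price per call in USD, step@1 accuracy) tuples.
models = [
    ("model-a", 0.002, 0.61),
    ("model-b", 0.010, 0.74),
    ("model-c", 0.012, 0.70),
    ("model-d", 0.030, 0.81),
]
print(pareto_frontier(models))
```

Here the hypothetical `model-c` drops out because `model-b` is both cheaper and more accurate.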
Fastest models
median and p95 latency per attribution; lower is faster.
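For readers who want the two latency numbers spelled out, this is a small sketch of median and p95 over per-attribution latencies. It assumes a nearest-rank definition of p95 (other definitions interpolate and give slightly different values), and the latency samples are invented.

```python
import math
import statistics


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical per-attribution latencies in milliseconds.
latencies_ms = [820, 910, 760, 1340, 880, 2950, 900, 1010, 870, 940]

median_ms = statistics.median(latencies_ms)
p95_ms = percentile_nearest_rank(latencies_ms, 95)
print(f"median = {median_ms} ms, p95 = {p95_ms} ms")
```

Reporting both matters because a single slow outlier barely moves the median but dominates p95.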
Trace length
how cost and latency scale from 3-span to 60-span traces.
Failure categories
which categories the judges reach for across the corpus.
What we're still building
Paste a trace. See the cause.
No SDK. No instrumentation change. From paste to a rank-1 cited span — seconds, not minutes.