How ProveAI Origin works
You give us a multi-agent trace. We find which span caused the failure and tell you why, with cited evidence, in roughly 2–7 seconds. We then suggest a fix and let you save the result as a snapshot so the same regression can't silently ship next sprint.
Three journeys: acute debug (one trace, root cause now) · dev loop (snapshot then replay your change) · CI gate (GitHub Action runs suites on every PR). See /docs/journeys for the deep dive.
Learn
Long-form articles on agent architectures, common failure modes, and where the frontier of failure attribution lies.
7 articles →
Cookbook
Step-by-step guides — wire RJ into your GitHub Action, bridge LangSmith traces, build a snapshot suite for regression gating.
12 guides →
The pipeline
Four stages, all swappable; two more close the loop.
Every module is versioned and pinned per attribution (algoVersions JSONB). Re-run the same trace against a future compressor / judge / repair version anytime to compare.
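To illustrate what per-attribution pinning makes possible, the sketch below diffs the pinned versions of two runs of the same trace. The module keys (compressor, judge, repair) come from the paragraph above; the version strings, the dictionary shape, and the `diff_versions` helper are assumptions for illustration, not the actual algoVersions schema.

```python
from typing import Dict, Optional, Tuple

# Hypothetical shape: module name -> pinned version string.
AlgoVersions = Dict[str, str]


def diff_versions(
    old: AlgoVersions, new: AlgoVersions
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {module: (old_version, new_version)} for every module whose pin differs."""
    changed = {}
    for module in sorted(set(old) | set(new)):
        before, after = old.get(module), new.get(module)
        if before != after:
            changed[module] = (before, after)
    return changed


# Made-up version strings for two runs of the same trace.
original_run = {"compressor": "v1", "judge": "v2", "repair": "v0"}
rerun = {"compressor": "v1", "judge": "v3", "repair": "v0"}

print(diff_versions(original_run, rerun))  # {'judge': ('v2', 'v3')}
```

If the verdict changes between the two runs, a diff like this narrows the cause to the modules whose pins moved.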
Coding agents (Claude Code, Codex CLI) can drive the same pipeline through @runtime-judgement/mcp-server: three tools (verify_change, attribute_trace, suggest_snapshot) map onto the same API routes the UI uses.
The verdict
Two labels, picked from a fixed vocabulary.
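A verdict can be pictured as a small record. The sketch below assumes the two labels are the L1 axis (execution vs semantic) and the L4 category described in the attribution feature further down this page, alongside the evidence quote and confidence. The class and field names are hypothetical, and the 14 TRAIL labels are not enumerated here, so the L4 category stays a plain string.

```python
from dataclasses import dataclass
from enum import Enum


class L1Axis(Enum):
    """The two L1 values named on this page."""
    EXECUTION = "execution"
    SEMANTIC = "semantic"


@dataclass(frozen=True)
class Verdict:
    span_id: str         # hypothetical identifier of the cited span
    l1_axis: L1Axis
    l4_category: str     # one of the 14 TRAIL labels; not enumerated here
    evidence_quote: str
    confidence: float    # assumed to lie in [0, 1]

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


# Placeholder values only; nothing here is real product output.
verdict = Verdict(
    span_id="span-17",
    l1_axis=L1Axis("semantic"),
    l4_category="<TRAIL label>",
    evidence_quote="<quoted text from the cited span>",
    confidence=0.82,
)
```

Modelling the L1 axis as an enum means an out-of-vocabulary label fails at construction time rather than reaching a dashboard.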
Pipeline presets
Pick a cost/accuracy tradeoff per attribution. Switchable on /app/new.
API
The UI is one consumer of these endpoints. The same routes back the GitHub Action and the snapshot suite runner.
All routes are auth-gated by a Clerk session cookie. The MCP server and GitHub Action use a Bearer token via repo secret today; first-class API-key issuance for headless callers (CLI, CI, third-party tooling) is on the roadmap.
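For a headless caller, the Bearer-token path might look like the sketch below. It only builds the request and never sends it. The base URL, the route path, the payload shape, and the `RJ_TOKEN` variable name are all placeholders, since the real route names are not listed here.

```python
import json
import os
import urllib.request

# Hypothetical values: the real base URL and route names are not given on this page.
BASE_URL = "https://example.invalid"
ATTRIBUTE_PATH = "/api/attribute"


def build_attribution_request(trace: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST carrying the trace and a Bearer token."""
    return urllib.request.Request(
        url=BASE_URL + ATTRIBUTE_PATH,
        data=json.dumps(trace).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# In CI the token would come from a repo secret exposed as an environment variable.
# RJ_TOKEN is a made-up name used only for this sketch.
token = os.environ.get("RJ_TOKEN", "dummy-token")
request = build_attribution_request({"spans": []}, token)
print(request.get_method(), request.full_url)
```

Keeping the token in an environment variable rather than in the workflow file matches the repo-secret approach described above.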
Limits + retention
What we ship today
The full loop — paste → attribute → snapshot → replay → repair → verify → share / PR.
- Attribution: span-level root-cause with L1 axis (execution vs semantic), L4 category (14 TRAIL labels), evidence quote, confidence. Cited, evidence-grounded verdicts; see /bench for measured step@1 with confidence intervals.
- Counterfactual replay: perturb the cited span + re-run; verdict shifts surface as PASSED / CHANGED-INTENTIONAL / CHANGED-UNEXPECTED.
- K-consensus: K=3 self-consistency vote shipped as the Consistent preset; Cascade auto-escalates Fast→Standard on low confidence.
- Snapshots + suites: save an attribution as a baseline, replay on demand, group into a suite, accept the new run to roll the baseline forward.
- Repair + verify-fix: repair-v0 proposes a prose or prompt-patch fix; verify-fix replays the patched trace and reports whether the cited cause shifts.
- GitHub Action + PR comment: run a snapshot suite on every PR; verdict posts as a PR comment with cited cause and a link to the result page.
- Cross-tracer ingest: OTEL gen-ai semconv, LangSmith export, our custom JSON. Auto-detected.
- Share / open as issue: one-click public share link, Slack Block-Kit readout, markdown copy, or a GitHub-Issue body. Same evidence the judge cited, formatted for each channel.
- Insights + patterns: recurring-failure clustering by L1/L4 + cited-span pattern; cost trend, latency trend, distribution dashboards, and most-cited-spans heatmap at /app/insights.
- Onboarding tour + docs: first-time tour resumes through each step as you progress; canonical journeys at /docs/journeys; contextual help popovers across the feature surfaces.
- Auto-PR on verified fix: when verify-fix passes a confidence threshold on a snapshot suite, RJ opens a GitHub PR with the patch, the cited cause, and a link back to the result page. Per-suite opt-in.
- Drift detection from prod: every new trace from a pull bridge is matched against the nearest snapshot; deltas auto-create regression runs that surface on /app/snapshots and /app/repairs.
- Tracer pull bridges: first-class pull from Langfuse and LangSmith. Paste is still the fastest path; the bridges are for hands-off continuous ingest from prod.
- MCP server for Claude Code / Codex CLI: register @runtime-judgement/mcp-server and your coding agent gets three tools: rj.verify_change, rj.attribute_trace, and rj.suggest_snapshot. Inner-loop verification without leaving the agent.
- Snapshot suggestion engine: after a verified attribution, RJ surfaces “save this as a snapshot” with the cited-cause baseline pre-filled so the same regression can't silently ship next sprint.
- Repairs dashboard: every proposed fix, prose or prompt-patch, accumulates on /app/repairs with status filters, confidence, and a link back to the attribution it came from.
- Attribution feedback: thumbs / R3 signal on every cited cause feeds the judge-calibration loop, the first signal of “did we get the right span” beyond accuracy on the research bench.
- Sub-2s p50: attribution stream now hits sub-2s p50 on a 60-span trace. Static imports replaced per-request dynamic loads in the orchestrator; the SSE connection flushes a connected event before auth runs, killing perceived TTFB on cold start.
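The K-consensus and Cascade items in the list above can be sketched as follows. This is a minimal illustration, not the shipped implementation: the judge signature, the 0.7 confidence threshold, and the stub judges are all assumptions.

```python
from collections import Counter
from typing import Callable, Tuple

# Hypothetical judge signature: takes a trace, returns (cited_span_id, confidence).
Judge = Callable[[dict], Tuple[str, float]]


def k_consensus(judge: Judge, trace: dict, k: int = 3) -> Tuple[str, float]:
    """Run the judge k times; return the most-voted span and its vote share.

    Ties go to the span that was voted for first (Counter preserves insertion order).
    """
    votes = Counter(judge(trace)[0] for _ in range(k))
    span_id, count = votes.most_common(1)[0]
    return span_id, count / k


def cascade(
    fast: Judge, standard: Judge, trace: dict, threshold: float = 0.7
) -> Tuple[str, float, str]:
    """Try the cheaper judge first; escalate when its confidence is below threshold."""
    span_id, confidence = fast(trace)
    if confidence >= threshold:
        return span_id, confidence, "fast"
    span_id, confidence = standard(trace)
    return span_id, confidence, "standard"


# Stub judges standing in for real model calls.
samples = iter(["span-4", "span-9", "span-4"])
winner, share = k_consensus(lambda trace: (next(samples), 0.6), {"spans": []})
print(winner)  # span-4 (two of three votes)

result = cascade(
    fast=lambda trace: ("span-2", 0.4),      # low confidence, so the cascade escalates
    standard=lambda trace: ("span-4", 0.9),
    trace={"spans": []},
)
print(result)  # ('span-4', 0.9, 'standard')
```

The vote share doubles as a rough agreement signal: a 2-of-3 result is weaker evidence than a unanimous one, even when the winning span is the same.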
What we deliberately don't do (yet)
- PR-verifier GitHub App: one-click install on any repo, runs the snapshot suite the moment a PR opens, and posts the cited cause + verify-fix verdict back to the diff. Coming soon.
- Cursor agent-mode extension: same three tools as the MCP server (rj.verify_change, rj.attribute_trace, rj.suggest_snapshot), wired to Cursor's composer. Coming soon.
- On-prem / Docker-compose deploy: dockerized attributor + Postgres + worker for regulated industries. Banked against the first design-partner ask; on the roadmap.
- Devin webhook integration: Devin posts the final trace; RJ runs attribution + verify-fix and replies on the same Devin session. Coming soon.
- Phoenix tracer bridge: Langfuse + LangSmith pull bridges ship today; Phoenix is on the roadmap.
- Time-travel replay + simulation env: rewind any span and replay forward with different perturbations; agent-vs-agent simulation runs for stress tests. On the roadmap.
- Public CLI / API keys: the MCP server and GitHub Action use a Bearer token via repo secret today. First-class API-key issuance for third-party tooling is on the roadmap.
- Team accounts / SSO: single-user workspaces today.
- Billing: usage metering is coming soon (no migration when billing turns on). Talk to us about paid plans until then.