How ProveAI Origin works

You give us a multi-agent trace. We find which span caused the failure, tell you why — in ~2–7 seconds, with cited evidence — suggest a fix, then let you save it as a snapshot so the same regression can't silently ship next sprint.

Three journeys: acute debug (one trace, root cause now) · dev loop (snapshot then replay your change) · CI gate (GitHub Action runs suites on every PR). See /docs/journeys for the deep dive.

Learn

Long-form articles on agent architectures, common failure modes, and where the frontier of failure attribution lies.

7 articles →

Cookbook

Step-by-step guides — wire RJ into your GitHub Action, bridge LangSmith traces, build a snapshot suite for regression gating.

12 guides →

The pipeline

Four stages, all swappable; three more close the loop.

Stage 1

Parse

OTEL gen-ai semconv / LangSmith export / our custom JSON. Auto-detected. Output: a normalized span DAG persisted alongside the trace.

Stage 2

Compress (class-6)

Walks parent-chain + n-gram dependency closure from the failure span. Drops 90%+ of spans, keeps a ~6KB causal subgraph. Cached per trace.
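A minimal sketch of the closure walk, not the shipped compressor. It assumes each span is a dict with an `id`, an optional `parent_id`, and an optional `depends_on` list; all three field names are hypothetical, and `depends_on` stands in for whatever the n-gram dependency matching produces.

```python
from collections import deque

def causal_subgraph(spans, failure_id):
    """Return the ids of spans reachable from the failure span
    by following parent links and dependency links."""
    by_id = {s["id"]: s for s in spans}
    keep = set()
    queue = deque([failure_id])
    while queue:
        span_id = queue.popleft()
        if span_id in keep or span_id not in by_id:
            continue  # already visited, or a dangling reference
        keep.add(span_id)
        span = by_id[span_id]
        if span.get("parent_id") is not None:
            queue.append(span["parent_id"])       # parent chain
        queue.extend(span.get("depends_on", []))  # dependency closure
    return keep

# Hypothetical five-span trace; "unrelated" has no path to the failure.
spans = [
    {"id": "root", "parent_id": None},
    {"id": "plan", "parent_id": "root"},
    {"id": "search", "parent_id": "plan"},
    {"id": "unrelated", "parent_id": "root"},
    {"id": "answer", "parent_id": "plan", "depends_on": ["search"]},
]
kept = causal_subgraph(spans, "answer")
```

Everything outside `kept` is what a compressor of this shape would drop before the judge sees the trace.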

Stage 3

Judge

GLM-4.6 (Standard, default), Q3-Next 80B (Fast), Q3-Next 80B with K=3 consensus (Consistent), or a Fast→Standard cascade on low confidence. Returns the L1 axis, the L4 category, cited spans, and evidence.

Stage 4

Aggregate

Passthrough for K=1; self-consistency vote for K=3 (Consistent preset). Returns rank-1 cited span + confidence.
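One way to picture the vote, as a sketch under assumptions: each judge run is a dict with a `span_id` and a `confidence` (hypothetical field names), ties are broken by mean judge confidence, and the returned confidence is the winning vote share. The real aggregator may weight votes differently.

```python
from collections import Counter

def aggregate(verdicts):
    """Passthrough for a single verdict; majority vote otherwise."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    if len(verdicts) == 1:
        return verdicts[0]
    votes = Counter(v["span_id"] for v in verdicts)
    top = max(votes.values())
    tied = [sid for sid, n in votes.items() if n == top]

    def mean_confidence(span_id):
        scores = [v["confidence"] for v in verdicts if v["span_id"] == span_id]
        return sum(scores) / len(scores)

    # Assumed tie-break: prefer the span whose voters were most confident.
    winner = max(tied, key=mean_confidence)
    return {"span_id": winner, "confidence": top / len(verdicts)}

runs = [
    {"span_id": "span-7", "confidence": 0.9},
    {"span_id": "span-7", "confidence": 0.6},
    {"span_id": "span-3", "confidence": 0.8},
]
result = aggregate(runs)
```

Here two of three runs cite `span-7`, so it is returned as the rank-1 span with a vote share of two thirds.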

Stage 5

Snapshot + replay

Save an attribution as a baseline. Re-run produces PASSED / CHANGED-INTENTIONAL / CHANGED-UNEXPECTED. Group snapshots into suites for CI.
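The three replay outcomes can be sketched as a comparison between the baseline and the re-run. This is an illustration only: the compared fields (`cited_span`, `l1`, `l4`) and the `change_expected` flag are assumptions, since how a change gets marked as intentional is not described here.

```python
def classify_replay(baseline, rerun, change_expected=False):
    """Compare a replayed attribution against its snapshot baseline."""
    keys = ("cited_span", "l1", "l4")  # hypothetical comparison fields
    if all(baseline[k] == rerun[k] for k in keys):
        return "PASSED"
    # Assumption: the caller declares whether a verdict shift was intended.
    return "CHANGED-INTENTIONAL" if change_expected else "CHANGED-UNEXPECTED"

baseline = {"cited_span": "span-7", "l1": "execution", "l4": "Formatting Errors"}
same = dict(baseline)
shifted = {"cited_span": "span-3", "l1": "semantic", "l4": "Goal Deviation"}

outcomes = [
    classify_replay(baseline, same),
    classify_replay(baseline, shifted, change_expected=True),
    classify_replay(baseline, shifted),
]
```

A CI gate built on this shape would fail the suite only on `CHANGED-UNEXPECTED`.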

Stage 6

Repair + verify

Repair-v0 proposes a prose or prompt-patch fix. Verify-fix replays the patched trace and shows whether the cited cause shifts — proof, not vibes.

Stage 7

Ship + watch

Auto-PR opens with the patch + cited cause when verify-fix passes the confidence threshold. Drift detection on the Langfuse / LangSmith pull bridges flags new prod traces that no longer match the nearest snapshot — they become new regression runs on /app/repairs.

Every module is versioned and pinned per attribution (algoVersions JSONB). Re-run the same trace against a future compressor / judge / repair version anytime to compare.

Coding agents (Claude Code, Codex CLI) can drive the same pipeline through @runtime-judgement/mcp-server: three tools (verify_change, attribute_trace, suggest_snapshot) map onto the same API routes the UI uses.

The verdict

Two labels, picked from a fixed vocabulary.

L1 axis

execution: Plan was right, execution went wrong (tool broke, output malformed, format off)

semantic: Reasoning was wrong (wrong plan, wrong info, wrong instruction-following)

L4 category (one of 14, from TRAIL)

Formatting Errors · Instruction Non-compliance · Goal Deviation · Resource Abuse · Tool-related · Language-only · Task Orchestration · Tool Selection Errors · Context Handling Failures · Poor Information Retrieval · Incorrect Problem Identification · Tool Output Misinterpretation · Tool Definition Issues · Incorrect Memory Usage
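Because both labels come from a fixed vocabulary, a client can validate a verdict before storing it. The sketch below lists the two L1 values (`execution`, `semantic`) and the 14 L4 categories named above; the `validate_verdict` helper is hypothetical, and the exact strings the API returns may differ in casing.

```python
L1_AXES = ("execution", "semantic")

L4_CATEGORIES = (
    "Formatting Errors",
    "Instruction Non-compliance",
    "Goal Deviation",
    "Resource Abuse",
    "Tool-related",
    "Language-only",
    "Task Orchestration",
    "Tool Selection Errors",
    "Context Handling Failures",
    "Poor Information Retrieval",
    "Incorrect Problem Identification",
    "Tool Output Misinterpretation",
    "Tool Definition Issues",
    "Incorrect Memory Usage",
)

def validate_verdict(l1, l4):
    """Raise ValueError if either label falls outside the fixed vocabulary."""
    if l1 not in L1_AXES:
        raise ValueError(f"unknown L1 axis: {l1!r}")
    if l4 not in L4_CATEGORIES:
        raise ValueError(f"unknown L4 category: {l4!r}")
    return l1, l4

checked = validate_verdict("execution", "Tool Output Misinterpretation")
```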

Pipeline presets

Pick cost/accuracy tradeoff per attribution. Switchable on /app/new.

Fast

Q3-Next 80B (MoE) single-shot. High-volume triage, dev smoke, CI gate where latency matters.

$0.0002·~2s

Standard

GLM-4.6 (constellation default) single-shot. See /bench for measured step@1 with confidence intervals.

$0.0006·~5s

Consistent

Q3-Next 80B with K=3 consensus + self-consistency vote. Higher confidence on ambiguous traces.

$0.0006·~3s

Cascade

Fast tier first; auto-escalates to Standard only when judge confidence is low. Saves cost on the easy traces, pays the Standard-tier premium on the hard ones.

$0.0004·~5s
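The Cascade preset's escalation rule can be sketched as follows. The `0.7` threshold, the judge callables, and the `tier` field are all assumptions for illustration; the actual confidence cutoff is not stated here.

```python
def run_cascade(trace, fast_judge, standard_judge, threshold=0.7):
    """Try the Fast tier first; escalate only on low confidence."""
    verdict = fast_judge(trace)
    if verdict["confidence"] >= threshold:
        return {**verdict, "tier": "fast"}
    return {**standard_judge(trace), "tier": "standard"}

# Stub judges standing in for the real model calls.
def stub_fast(trace):
    return {"span_id": "span-2", "confidence": trace["fast_conf"]}

def stub_standard(trace):
    return {"span_id": "span-5", "confidence": 0.9}

easy = run_cascade({"fast_conf": 0.85}, stub_fast, stub_standard)
hard = run_cascade({"fast_conf": 0.40}, stub_fast, stub_standard)
```

The easy trace stops at the Fast tier and pays only its cost; the hard trace pays for both calls, which is why the blended price sits between the Fast and Standard presets.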

API

The UI is one consumer of these endpoints. The same routes back the GitHub Action and the snapshot suite runner.

Upload a trace

POST /api/traces
Content-Type: application/json

{ "trace_id": "...", "spans": [...] }
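A sketch of building this request from a headless caller with the Python standard library. The host name and token are placeholders, and sending the token as an `Authorization: Bearer` header is an assumption about how the Bearer token is presented; the request is constructed but not sent.

```python
import json
import urllib.request

BASE_URL = "https://rj.example.invalid"  # placeholder host, not the real one

def build_upload_request(trace, token):
    """Build (but do not send) a POST /api/traces request."""
    body = json.dumps(trace).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/api/traces",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed header form
        },
    )

trace = {"trace_id": "t-1", "spans": []}
req = build_upload_request(trace, "test-token")
# To send it: urllib.request.urlopen(req)
```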

Run attribution (streaming SSE — per-stage progress + cost ticker)

POST /api/attributions/stream

{ "traceId": "...", "pipeline": "standard", "label": "..." }
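On the client side, the stream can be read with a small server-sent-events parser. The sketch below follows the standard SSE line format (`event:` and `data:` fields, blank line as separator); the event names and payload fields in the sample (`stage`, `costUsd`) are invented for illustration and may not match what the endpoint emits.

```python
import json

def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:  # blank line dispatches the buffered event
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            data.append(value[1:] if value.startswith(" ") else value)

# Hypothetical stream: a connection event followed by one progress tick.
sample = [
    "event: connected\n",
    "data: {}\n",
    "\n",
    "event: stage\n",
    'data: {"stage": "parse", "costUsd": 0.0}\n',
    "\n",
]
events = [(name, json.loads(payload)) for name, payload in parse_sse(sample)]
```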

Run attribution (one-shot, no stream)

POST /api/attributions
GET /api/attributions/[id]

Save attribution as a snapshot; group into a suite

POST /api/snapshots          { "attributionId": "...", "name": "...", "suiteName?": "..." }
POST /api/snapshots/[id]/replay        (re-runs the baseline trace)
POST /api/snapshot-suites/[id]/run     (replays every snapshot in the suite)
POST /api/snapshot-runs/[id]/accept    (promote new run to baseline)

Verify a suggested fix: apply patch as perturbation, replay

POST /api/attributions/[id]/verify-fix

{ "fixId": "..." }   →   { verdict, fixed, regressed, unchanged }

Share — public link, Slack block-kit, GitHub-issue body, markdown

POST /api/shares       { "attributionId": "..." }
GET  /api/shares/[id]/slack       (Block Kit JSON for webhook)
GET  /api/shares/[id]/readout     (rich plain-text readout)
GET  /api/shares/[id]/markdown    (GitHub-Issue-ready markdown)

All routes are auth-gated. The UI authenticates with a Clerk session cookie; the MCP server and GitHub Action use a Bearer token stored as a repo secret today. First-class API-key issuance for headless callers (CLI, CI, third-party tooling) is on the roadmap.

Limits + retention

Free tier: 100 traces / month
Trace retention: 7 days
Attribution retention: 90 days
Max trace size (inline): 4 MB
Max trace size (Blob): 200 MB
Share link expiry: 30 days
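A client can check the two trace size limits before uploading. This sketch assumes 1 MB means 1024 × 1024 bytes and that the limits apply to the serialized JSON body; neither is confirmed here, and `upload_mode` is a hypothetical helper.

```python
import json

INLINE_LIMIT_BYTES = 4 * 1024 * 1024    # 4 MB inline (binary MB assumed)
BLOB_LIMIT_BYTES = 200 * 1024 * 1024    # 200 MB via Blob

def upload_mode(size_bytes):
    """Pick an upload path from the serialized trace size."""
    if size_bytes <= INLINE_LIMIT_BYTES:
        return "inline"
    if size_bytes <= BLOB_LIMIT_BYTES:
        return "blob"
    raise ValueError(f"trace is {size_bytes} bytes, above the 200 MB limit")

trace = {"trace_id": "t-1", "spans": []}
size = len(json.dumps(trace).encode("utf-8"))
mode = upload_mode(size)
```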

What we ship today

The full loop — paste → attribute → snapshot → replay → repair → verify → share / PR.

— Attribution: span-level root-cause with L1 axis (execution vs semantic), L4 category (14 TRAIL labels), evidence quote, confidence. Cited, evidence-grounded verdicts — see /bench for measured step@1 with confidence intervals.
— Counterfactual replay: perturb the cited span + re-run; verdict shifts surface as PASSED / CHANGED-INTENTIONAL / CHANGED-UNEXPECTED.
— K-consensus: K=3 self-consistency vote shipped as the Consistent preset; Cascade auto-escalates Fast→Standard on low confidence.
— Snapshots + suites: save an attribution as a baseline, replay on demand, group into a suite, accept the new run to roll the baseline forward.
— Repair + verify-fix: repair-v0 proposes a prose or prompt-patch fix; verify-fix replays the patched trace and reports whether the cited cause shifts.
— GitHub Action + PR comment: run a snapshot suite on every PR; verdict posts as a PR comment with cited cause and a link to the result page.
— Cross-tracer ingest: OTEL gen-ai semconv, LangSmith export, our custom JSON. Auto-detected.
— Share / open as issue: one-click public share link, Slack Block-Kit readout, markdown copy, or a GitHub-Issue body. Same evidence the judge cited, formatted for each channel.
— Insights + patterns: recurring-failure clustering by L1/L4 + cited-span pattern; cost trend, latency trend, distribution dashboards, and most-cited-spans heatmap at /app/insights.
— Onboarding tour + docs: first-time tour resumes through each step as you progress; canonical journeys at /docs/journeys; contextual help popovers across the feature surfaces.
— Auto-PR on verified fix: when verify-fix passes a confidence threshold on a snapshot suite, RJ opens a GitHub PR with the patch, the cited cause, and a link back to the result page. Per-suite opt-in.
— Drift detection from prod: every new trace from a pull bridge is matched against the nearest snapshot; deltas auto-create regression runs that surface on /app/snapshots and /app/repairs.
— Tracer pull bridges: first-class pull from Langfuse and LangSmith. Paste is still the fastest path; the bridges are for hands-off continuous ingest from prod.
— MCP server for Claude Code / Codex CLI: register @runtime-judgement/mcp-server and your coding agent gets three tools: rj.verify_change, rj.attribute_trace, and rj.suggest_snapshot. Inner-loop verification without leaving the agent.
— Snapshot suggestion engine: after a verified attribution, RJ surfaces “save this as a snapshot” with the cited-cause baseline pre-filled so the same regression can't silently ship next sprint.
— Repairs dashboard: every proposed fix — prose or prompt-patch — accumulates on /app/repairs with status filters, confidence, and a link back to the attribution it came from.
— Attribution feedback: thumbs / R3 signal on every cited cause feeds the judge-calibration loop — first signal of “did we get the right span” beyond accuracy on the research bench.
— Sub-2s p50: attribution stream now hits sub-2s p50 on a 60-span trace. Static imports replaced per-request dynamic loads in the orchestrator; the SSE connection flushes a connected event before auth runs, killing perceived TTFB on cold start.

What we deliberately don't do (yet)

— PR-verifier GitHub App: one-click install on any repo, runs the snapshot suite the moment a PR opens, and posts the cited cause + verify-fix verdict back to the diff. Coming soon.
— Cursor agent-mode extension: same three tools as the MCP server (rj.verify_change, rj.attribute_trace, rj.suggest_snapshot), wired to Cursor's composer. Coming soon.
— On-prem / Docker-compose deploy: dockerized attributor + Postgres + worker for regulated industries. Banked against the first design-partner ask; on the roadmap.
— Devin webhook integration: Devin posts the final trace; RJ runs attribution + verify-fix and replies on the same Devin session. Coming soon.
— Phoenix tracer bridge: Langfuse + LangSmith pull bridges ship today; Phoenix is on the roadmap.
— Time-travel replay + simulation env: rewind any span and replay forward with different perturbations; agent-vs-agent simulation runs for stress tests. On the roadmap.
— Public CLI / API keys: the MCP server and GitHub Action use a Bearer token via repo secret today. First-class API-key issuance for third-party tooling is on the roadmap.
— Team accounts / SSO: single-user workspaces today.
— Billing: usage metering is coming soon (no migration when billing turns on). Talk to us about paid plans until then.

Try it with a sample trace