Use RJ from Claude Code via MCP
Register the RJ MCP server in Claude Code or Codex CLI and get span-level failure attribution without leaving your coding session.
The normal RJ loop takes you out of your editor: paste a trace into the UI, read the attribution, write a fix, come back to CI to see if it passed. The MCP server collapses that loop to three tool calls without leaving your coding session.
Register @runtime-judgement/mcp-server and Claude Code (or Codex CLI) gains three tools:
- rj.attribute_trace: ingest a failing trace and get the cited span plus L1/L4 verdict in the same response
- rj.suggest_snapshot: lock the verdict as a regression baseline
- rj.verify_change: run a snapshot suite against the current pipeline; block the commit if it regresses
This guide covers setup, the two most useful workflows, and the pitfalls that trip people up first.
Read Agentic coding and failures if you want context on why inner-loop verification matters before you invest in setting this up.
Step 1: Install the server
The server ships as an npm binary. The fastest path is npx — no global install required:
npx @runtime-judgement/mcp-server
If you want it permanently on PATH:
pnpm add -g @runtime-judgement/mcp-server
What you should see: The server starts and prints a connected line to stderr. It uses stdio transport, so when invoked by an agent it pipes JSON-RPC over stdin/stdout — you won't see this directly during normal use.
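To make the stdio exchange concrete, here is a minimal sketch of the JSON-RPC 2.0 message an MCP client writes to the server's stdin when it invokes a tool. The tools/call method and the name/arguments parameters come from the MCP specification; the helper name buildToolCall and the suite ID are hypothetical and only illustrate the shape.

```typescript
// Shape of a JSON-RPC 2.0 request for the MCP "tools/call" method.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper: builds the message a client would send for one tool call.
function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>,
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// Illustrative values only; the suite ID is a placeholder, not a real ULID.
const request = buildToolCall(1, "rj.verify_change", { suiteId: "01HZEXAMPLE" });

// With stdio transport, each message travels as one newline-terminated line of JSON.
const wireLine = JSON.stringify(request) + "\n";
```

Your agent builds and sends these messages for you; the sketch is only useful when you are debugging the server by hand.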
Step 2: Get your API credentials
The server authenticates every tool call against RJ using a Bearer token.
Go to /app/settings and generate a token. Tokens are prefixed rj_live_.
Tokens appear under API Keys on the settings page. They are shown once — copy the value before closing the dialog.
You also need your RJ deployment URL:
RJ_API_URL=https://runtime-judgement-app.vercel.app
RJ_API_KEY=rj_live_...
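A quick pre-flight check can catch a missing or mistyped credential before the agent ever spawns the server. The sketch below is not part of the RJ package; checkRjConfig is a hypothetical helper, and the only rules it applies are the ones stated above (both variables are required, and tokens start with rj_live_) plus an assumed requirement that the URL uses http or https.

```typescript
// Hypothetical pre-flight check. Returns a list of problems; empty means OK.
function checkRjConfig(env: Record<string, string | undefined>): string[] {
  const problems: string[] = [];
  const url = env["RJ_API_URL"];
  const key = env["RJ_API_KEY"];

  if (!url) {
    problems.push("RJ_API_URL is not set");
  } else if (!/^https?:\/\//.test(url)) {
    // Assumption: the deployment URL is a plain http(s) URL.
    problems.push("RJ_API_URL should start with http:// or https://");
  }

  if (!key) {
    problems.push("RJ_API_KEY is not set");
  } else if (!key.startsWith("rj_live_")) {
    problems.push("RJ_API_KEY should start with rj_live_");
  }

  return problems;
}
```

In a Node script you would pass process.env as the argument and print any problems before launching your agent.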
Step 3: Register with your coding agent
Claude Code
Edit ~/.config/claude-code/mcp_servers.json (create it if it doesn't exist):
{
"mcpServers": {
"runtime-judgement": {
"command": "npx",
"args": ["@runtime-judgement/mcp-server"],
"env": {
"RJ_API_URL": "https://runtime-judgement-app.vercel.app",
"RJ_API_KEY": "rj_live_..."
}
}
}
}
Restart Claude Code after saving. The server spawns as a child process on the first tool call.
Codex CLI
In ~/.config/codex/mcp_servers.toml:
[mcp_servers.runtime-judgement]
command = "npx"
args = ["@runtime-judgement/mcp-server"]
env = { RJ_API_URL = "https://runtime-judgement-app.vercel.app", RJ_API_KEY = "rj_live_..." }
Aider
In .aider.conf.yml (project root or home directory):
mcp_servers:
- name: runtime-judgement
command: ["npx", "@runtime-judgement/mcp-server"]
env:
RJ_API_URL: https://runtime-judgement-app.vercel.app
RJ_API_KEY: rj_live_...
What you should see: After registering, ask your agent: "What MCP tools do you have available?" It should list rj.verify_change, rj.attribute_trace, and rj.suggest_snapshot.
The three tools
rj.attribute_trace
Takes a raw trace JSON and returns a cited cause, L1/L4 verdict, and suggested fix — the same pipeline the web UI runs. Use this when a test fails and you have the trace but don't know which span to look at.
Required arguments:
- trace: raw trace JSON (OTEL gen-ai, LangSmith export, or RJ custom JSON)
- errorSpanId: the span where the user-visible failure surfaced
Optional arguments:
- errorDescription: a sentence describing what went wrong (improves judge precision on ambiguous traces)
- errorEvidence: a verbatim quote from the failure output
- pipeline: preset override, one of "q72-k1" (Standard, default), "q3-k1" (Fast), or "q3-k3" (Consistent)
Returns:
{
attributionId: string // use this in rj.suggest_snapshot
l1: { axis: string; confidence: number }
l4: { category: string; confidence: number }
citedSpans: string[] // span IDs that caused the failure
explanation: string // prose causal summary
suggestedFix: string // proposed fix from repair-v0
}
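If you consume this tool from your own scripts, the argument and return shapes above can be written down as TypeScript types. This is a sketch transcribed from the lists above, not an official type export from the package: the type names, the summarize helper, and the sample values (including the placeholder attributionId) are hypothetical, and trace is typed as unknown because the guide does not say whether it is passed as a string or an object.

```typescript
type PipelinePreset = "q72-k1" | "q3-k1" | "q3-k3";

interface AttributeTraceArgs {
  trace: unknown;             // raw trace JSON; exact encoding is an assumption
  errorSpanId: string;        // span where the user-visible failure surfaced
  errorDescription?: string;
  errorEvidence?: string;
  pipeline?: PipelinePreset;  // defaults to "q72-k1" (Standard) per the guide
}

interface AttributeTraceResult {
  attributionId: string;
  l1: { axis: string; confidence: number };
  l4: { category: string; confidence: number };
  citedSpans: string[];
  explanation: string;
  suggestedFix: string;
}

// Hypothetical helper: one-line summary of a verdict for logs or commit notes.
function summarize(result: AttributeTraceResult): string {
  const spans = result.citedSpans.join(", ");
  return `${result.l4.category} (L1: ${result.l1.axis}) at [${spans}]`;
}

// Illustrative values only; none of these come from a real attribution.
const example: AttributeTraceResult = {
  attributionId: "01HZEXAMPLE",
  l1: { axis: "semantic", confidence: 0.87 },
  l4: { category: "Tool Selection Errors", confidence: 0.87 },
  citedSpans: ["orchestrate-tool-selection"],
  explanation: "Stale tool descriptions in the context window.",
  suggestedFix: "Refresh tool descriptions on every invocation.",
};
```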
rj.suggest_snapshot
Locks an attribution verdict as a regression baseline. Call this immediately after rj.attribute_trace on any failure you want to guard against going forward.
Required arguments:
- attributionId: the ULID from rj.attribute_trace
- name: descriptive snapshot name (unique per account)
Optional arguments:
- description: longer description for the suite UI
- suiteName: adds the snapshot to this suite (lazy-creates it); defaults to "Unfiled"
Returns:
{
snapshotId: string
suiteId?: string
nextStep: string // hint pointing the agent toward rj.verify_change
}
Common errors:
- hint: "name_conflict" means the name is already taken; pick a more specific one.
- hint: "attribution_not_found" means the attributionId is wrong or belongs to a different user.
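The two hints call for different recoveries: a name conflict can be retried with a more specific name, while a missing attribution cannot. The sketch below shows one way a script might encode that. It assumes only that the hint string is available from the error; recoverFromSnapshotError and the renaming scheme (appending the cited span) are hypothetical.

```typescript
type SnapshotHint = "name_conflict" | "attribution_not_found";

interface Recovery {
  retry: boolean;
  name?: string;   // replacement snapshot name, set only when retrying
  advice: string;
}

// Hypothetical helper: decide how to recover from a failed rj.suggest_snapshot call.
function recoverFromSnapshotError(
  hint: SnapshotHint,
  name: string,
  citedSpan: string,
): Recovery {
  switch (hint) {
    case "name_conflict":
      // Make the name more specific by appending the cited span (one possible scheme).
      return {
        retry: true,
        name: `${name} [${citedSpan}]`,
        advice: "Retry with the more specific name.",
      };
    case "attribution_not_found":
      // Retrying with the same ID cannot succeed.
      return {
        retry: false,
        advice: "Check the attributionId and that RJ_API_KEY belongs to the same user.",
      };
  }
}
```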
rj.verify_change
Runs a snapshot suite against the current pipeline and returns a verdict. Call this before every commit on a patch that touches traced code.
Required arguments:
- suiteId: snapshot suite ULID (visible in the URL at /app/snapshot-suites/[id])
Optional arguments:
- tags: run only snapshots with these tags
- perturbation: forward-compat hint about what the agent changed (not used in v0.1 routing, reserved for v0.2)
Returns:
{
verdict: "pass" | "regression" | "drift" | "error" | "empty"
counts: { total, passed, changedIntentional, changedUnexpected, skipped, errored }
outcomes: [{ snapshotId, status, citedSpanIds?, message? }]
runIds: string[]
}
The verdict field is the decision: pass → proceed, regression → stop and inspect outcomes, drift → confirm intentionality with the user.
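The decision rule can be written as a small lookup. The mapping for pass, regression, and drift follows the sentence above; the guide does not say what to do on error or empty, so this sketch makes the assumption that both should halt the commit until a human has looked. The names Action and actionFor are hypothetical.

```typescript
type Verdict = "pass" | "regression" | "drift" | "error" | "empty";
type Action = "commit" | "inspect_outcomes" | "confirm_with_user" | "halt";

// Hypothetical helper: map a rj.verify_change verdict to the agent's next step.
function actionFor(verdict: Verdict): Action {
  switch (verdict) {
    case "pass":
      return "commit";
    case "regression":
      return "inspect_outcomes";
    case "drift":
      return "confirm_with_user";
    case "error":
    case "empty":
      // Assumption: neither verdict proves the change is safe, so do not commit.
      return "halt";
  }
}
```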
Use case 1: Attribute a failing trace from a Langfuse export
You're debugging a multi-agent pipeline. Tests pass, but a prod trace from Langfuse shows the orchestrator calling the wrong tool. You have the trace JSON.
Ask Claude Code:
I have a failing trace from Langfuse. The orchestrator picked
"web-search" when it should have picked "internal-kb". The
failure surfaced at span "orchestrate-tool-selection".
Here's the trace:
<paste JSON>
Call rj.attribute_trace with errorSpanId="orchestrate-tool-selection"
and errorDescription="orchestrator selected web-search instead of internal-kb".
What you should see: Claude calls rj.attribute_trace and returns something like:
Attribution complete:
- Cited span: orchestrate-tool-selection
- L1: semantic (the reasoning was wrong, not the execution)
- L4: Tool Selection Errors
- Confidence: 0.87
- Explanation: The orchestrator's context window contained stale tool descriptions.
The "internal-kb" tool's description was truncated, leaving the model with
insufficient signal to prefer it over "web-search".
- Suggested fix: Refresh tool descriptions on every invocation rather than
caching them across sessions.
Then immediately lock it in:
Save that attribution as a snapshot named
"orchestrator picks web-search over internal-kb — stale tool descriptions"
and add it to my "prod-incidents" suite.
Claude calls rj.suggest_snapshot. The snapshot is now in /app/snapshot-suites and will be replayed on every future rj.verify_change call against that suite.
Use case 2: Verify a fix against a production-incident suite
You've patched the tool-description caching logic. Before you commit, verify the fix doesn't break the locked-in behaviour:
Before I commit this patch, run rj.verify_change against
suite 01HZSUITE... and tell me the verdict.
What you should see: Claude calls rj.verify_change and returns:
Suite run complete (suite: prod-incidents):
- Verdict: pass
- Total: 6 snapshots
- Passed: 6
- Changed-intentional: 0
- Changed-unexpected: 0
Run IDs: 01HZRUN1, 01HZRUN2, ...
If the verdict is regression:
Suite run complete:
- Verdict: regression
- Total: 6
- Passed: 5
- Changed-unexpected: 1
Regressed snapshot: "memory-retrieval-returns-wrong-session"
The cited span shifted from "memory-lookup" to "session-init".
The fix appears to have moved the failure upstream rather than resolved it.
Claude returns the specific outcomes entry that regressed, including the new cited span. You can deep-link directly to https://runtime-judgement-app.vercel.app/app/snapshot-suites using the run IDs to inspect the full result.
Fix the regression, run again. The loop terminates when verdict=pass.
The full inner loop
┌──────────────────────────────────────┐
│ agent observes failing test / trace  │
└──────────────────────────────────────┘
                   │
                   ▼
          rj.attribute_trace
                   │
      ┌────────────▼────────────┐
      │  cited span + L1/L4 +   │
      │  suggested fix          │
      └────────────┬────────────┘
                   │
          rj.suggest_snapshot
                   │
      ┌────────────▼────────────┐
      │  snapshot locked in     │
      │  suite: prod-incidents  │
      └────────────┬────────────┘
                   │
agent writes patch based on suggestedFix
                   │
                   ▼
           rj.verify_change
                   │
  ┌────────────────┼─────────────────┐
 pass            drift          regression
  │                │                 │
commit          confirm        fix + re-loop
                w/ user
This mirrors what a human does on the web UI — compressed into three tool calls with no context switching.
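The fix-and-verify part of the loop can be sketched as a bounded retry. This is an illustration of the control flow only, not how the agent or the MCP server is implemented: the tool calls are abstracted as injected callbacks, shown as synchronous for brevity (real MCP calls are asynchronous and would be awaited), and runVerifyLoop with its maxAttempts budget is a hypothetical addition to stop an agent from looping forever.

```typescript
type Verdict = "pass" | "regression" | "drift" | "error" | "empty";

interface LoopStep {
  attempt: number;
  verdict: Verdict;
}

// Hypothetical driver: verify, and on regression apply another fix, up to maxAttempts.
function runVerifyLoop(
  verify: () => Verdict,                // stands in for rj.verify_change
  applyFix: (attempt: number) => void,  // stands in for the agent writing a patch
  maxAttempts = 3,
): LoopStep[] {
  const history: LoopStep[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const verdict = verify();
    history.push({ attempt, verdict });
    // Only a regression is retried; pass ends the loop, and every other
    // verdict is left for a human to resolve.
    if (verdict !== "regression") break;
    if (attempt < maxAttempts) applyFix(attempt);
  }
  return history;
}
```

The returned history makes it easy to report how many attempts a fix needed, or to surface the final verdict when the budget runs out.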
Pitfalls
API token in the MCP config
The env block in mcp_servers.json is stored in plaintext on disk. Don't commit that file to version control. Add it to .gitignore. For team environments, use a secrets manager or environment variable injection at agent launch time rather than hardcoding in the config file.
If Claude Code is running in a sandboxed or remote execution environment, set RJ_API_KEY and RJ_API_URL as environment secrets in the environment's configuration rather than in the config file — the MCP server inherits whatever env it's launched in.
Network egress from Claude Code's sandbox
Claude Code running on code.claude.com operates in a managed remote environment with outbound network access governed by the environment's network policy. The MCP server needs to reach runtime-judgement-app.vercel.app (or your RJ_API_URL). If the tool calls time out silently, check that the environment's egress policy allows outbound HTTPS to the RJ host.
Suite ID not found
If rj.verify_change returns verdict=empty, either the suite ULID is wrong or the suite has no snapshots. Get the ULID from the suite's URL under /app/snapshot-suites (it looks like 01HZABC...). If the suite exists but is empty, add at least one snapshot via rj.suggest_snapshot first.
Large traces
rj.attribute_trace accepts up to 4 MB of inline JSON. For larger traces, use the POST /api/traces endpoint directly with a Blob reference (up to 200 MB). The MCP server's inline path covers the vast majority of real traces.
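To decide which path a trace needs, measure its encoded size before calling the tool. The sketch below assumes that "4 MB" and "200 MB" mean binary megabytes (multiples of 1024 × 1024 bytes) and that the limit applies to the UTF-8 encoded JSON; both are assumptions, and the helper names are hypothetical.

```typescript
// Assumption: limits are interpreted as MiB of UTF-8 encoded JSON.
const INLINE_LIMIT_BYTES = 4 * 1024 * 1024;
const BLOB_LIMIT_BYTES = 200 * 1024 * 1024;

type IngestPath = "inline" | "blob_upload" | "too_large";

// Size of a trace as it would be sent: strings as-is, objects serialized to JSON.
function traceByteLength(trace: unknown): number {
  const text = typeof trace === "string" ? trace : JSON.stringify(trace);
  return new TextEncoder().encode(text).length;
}

// Hypothetical helper: pick the ingestion path for a trace of the given size.
function chooseIngestPath(bytes: number): IngestPath {
  if (bytes <= INLINE_LIMIT_BYTES) return "inline";       // rj.attribute_trace
  if (bytes <= BLOB_LIMIT_BYTES) return "blob_upload";    // POST /api/traces
  return "too_large";
}
```

Counting bytes rather than string length matters for traces with non-ASCII content, where one character can occupy several bytes.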
Concurrent calls
The MCP server is stateless — each tool call is a fully independent HTTP round-trip to the RJ API. Multiple agents or agent threads can call it concurrently without coordination.
What next
- Gate the same suite in CI: Wire RJ into GitHub Actions. The suiteId you pass to rj.verify_change is the same one the GitHub Action runs.
- Build your first snapshot suite: Build a snapshot suite for regression gating, the detailed guide on paste-to-suite, including suite architecture recommendations
- Understand the architecture behind the failures your agents produce: Agent architectures and where they fail
- Where the frontier is moving: Where the frontier lies