cookbook · intermediate · 9 min read · Updated 2026-05-20

Use RJ from Claude Code via MCP

Register the RJ MCP server in Claude Code or Codex CLI and get span-level failure attribution without leaving your coding session.

The normal RJ loop takes you out of your editor: paste a trace into the UI, read the attribution, write a fix, come back to CI to see if it passed. The MCP server collapses that loop to three tool calls without leaving your coding session.

Register @runtime-judgement/mcp-server and Claude Code (or Codex CLI) gains three tools:

  • rj.attribute_trace — ingest a failing trace and get cited span + L1/L4 verdict in the same response
  • rj.suggest_snapshot — lock the verdict as a regression baseline
  • rj.verify_change — run a snapshot suite against the current pipeline; block commit if it regresses

This guide covers setup, the two most useful workflows, and the pitfalls that trip people up first.

Read Agentic coding and failures if you want context on why inner-loop verification matters before you invest in setting this up.


Step 1: Install the server

The server ships as an npm binary. The fastest path is npx — no global install required:

npx @runtime-judgement/mcp-server

If you want it permanently on PATH:

pnpm add -g @runtime-judgement/mcp-server

What you should see: The server starts and prints a connected line to stderr. It uses stdio transport, so when invoked by an agent it pipes JSON-RPC over stdin/stdout — you won't see this directly during normal use.
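
To make the transport concrete, here is a minimal sketch of the kind of message an agent writes to the server's stdin. It assumes the standard MCP tools/call request shape and newline-delimited framing; the frameToolCall helper and the example tag value are hypothetical and not part of the RJ package.

```typescript
// Shape of a JSON-RPC 2.0 request as used by MCP's tools/call method.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper: serialize one tool call as a single line of JSON,
// which is how the stdio transport separates messages.
function frameToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>,
): string {
  const request: ToolCallRequest = {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
  return JSON.stringify(request) + "\n";
}

// The tag value is made up for illustration.
const line = frameToolCall(1, "rj.verify_change", { tags: ["checkout"] });
console.log(line.trim());
```

In normal use the agent builds and sends these messages for you; the sketch is only useful when debugging the server by hand.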


Step 2: Get your API credentials

The server authenticates every tool call against RJ using a Bearer token.

Go to /app/settings and generate a token under API Keys. Tokens are prefixed rj_live_ and are shown only once, so copy the value before closing the dialog.

You also need your RJ deployment URL:

RJ_API_URL=https://runtime-judgement-app.vercel.app
RJ_API_KEY=rj_live_...
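
Before wiring these values into an agent, a quick sanity check can catch the most common mistakes. The sketch below is a hypothetical helper, not part of the RJ server; it assumes the URL should use https and relies only on the rj_live_ prefix described above.

```typescript
interface RjEnv {
  RJ_API_URL?: string;
  RJ_API_KEY?: string;
}

// Returns a list of problems; an empty list means the values look usable.
// This checks shape only. It cannot tell whether the token is still valid.
function checkRjEnv(env: RjEnv): string[] {
  const problems: string[] = [];
  if (!env.RJ_API_URL) {
    problems.push("RJ_API_URL is not set");
  } else if (!env.RJ_API_URL.startsWith("https://")) {
    // Assumption: the deployment is served over HTTPS.
    problems.push("RJ_API_URL should start with https://");
  }
  if (!env.RJ_API_KEY) {
    problems.push("RJ_API_KEY is not set");
  } else if (!env.RJ_API_KEY.startsWith("rj_live_")) {
    problems.push("RJ_API_KEY should start with rj_live_");
  }
  return problems;
}

console.log(checkRjEnv({ RJ_API_URL: "http://localhost:3000" }));
```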

Step 3: Register with your coding agent

Claude Code

Edit ~/.config/claude-code/mcp_servers.json (create it if it doesn't exist):

{
  "mcpServers": {
    "runtime-judgement": {
      "command": "npx",
      "args": ["@runtime-judgement/mcp-server"],
      "env": {
        "RJ_API_URL": "https://runtime-judgement-app.vercel.app",
        "RJ_API_KEY": "rj_live_..."
      }
    }
  }
}

Restart Claude Code after saving. The server spawns as a child process on the first tool call.

Codex CLI

In ~/.config/codex/mcp_servers.toml:

[mcp_servers.runtime-judgement]
command = "npx"
args = ["@runtime-judgement/mcp-server"]
env = { RJ_API_URL = "https://runtime-judgement-app.vercel.app", RJ_API_KEY = "rj_live_..." }

Aider

In .aider.conf.yml (project root or home directory):

mcp_servers:
  - name: runtime-judgement
    command: ["npx", "@runtime-judgement/mcp-server"]
    env:
      RJ_API_URL: https://runtime-judgement-app.vercel.app
      RJ_API_KEY: rj_live_...

What you should see: After registering, ask your agent: "What MCP tools do you have available?" It should list rj.verify_change, rj.attribute_trace, and rj.suggest_snapshot.
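
If you script this check, for example in an onboarding test, the comparison is a simple set difference. The sketch below assumes you already have the tool names the agent reported as a string array; how you obtain that list depends on your agent, and missingTools is a hypothetical helper.

```typescript
const EXPECTED_TOOLS = [
  "rj.attribute_trace",
  "rj.suggest_snapshot",
  "rj.verify_change",
];

// Returns the RJ tools that are absent from the agent's reported tool list.
function missingTools(listed: string[]): string[] {
  const seen = new Set(listed);
  return EXPECTED_TOOLS.filter((tool) => !seen.has(tool));
}

// A partially registered server would show up like this:
console.log(missingTools(["rj.attribute_trace", "some.other_tool"]));
```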


The three tools

rj.attribute_trace

Takes a raw trace JSON and returns a cited cause, L1/L4 verdict, and suggested fix — the same pipeline the web UI runs. Use this when a test fails and you have the trace but don't know which span to look at.

Required arguments:

  • trace — raw trace JSON (OTEL gen-ai, LangSmith export, or RJ custom JSON)
  • errorSpanId — the span where the user-visible failure surfaced

Optional arguments:

  • errorDescription — a sentence describing what went wrong (improves judge precision on ambiguous traces)
  • errorEvidence — a verbatim quote from the failure output
  • pipeline — preset override: "q72-k1" (Standard, default), "q3-k1" (Fast), "q3-k3" (Consistent)

Returns:

{
  attributionId: string        // use this in rj.suggest_snapshot
  l1: { axis: string; confidence: number }
  l4: { category: string; confidence: number }
  citedSpans: string[]         // span IDs that caused the failure
  explanation: string          // prose causal summary
  suggestedFix: string         // proposed fix from repair-v0
}
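
As a rough illustration of consuming this result in your own tooling, the sketch below types the return shape and condenses it into one line. The summarize helper and its 0.5 confidence threshold are assumptions made for the example, not RJ recommendations, and the ID is a placeholder.

```typescript
// Field names follow the return shape listed above.
interface AttributionResult {
  attributionId: string;
  l1: { axis: string; confidence: number };
  l4: { category: string; confidence: number };
  citedSpans: string[];
  explanation: string;
  suggestedFix: string;
}

// Hypothetical helper: one-line summary that flags low-confidence verdicts.
function summarize(result: AttributionResult, minConfidence = 0.5): string {
  const weakest = Math.min(result.l1.confidence, result.l4.confidence);
  const flag = weakest < minConfidence ? " (low confidence, review manually)" : "";
  const spans = result.citedSpans.join(", ");
  return `${spans}: ${result.l1.axis} / ${result.l4.category}${flag}`;
}

const example: AttributionResult = {
  attributionId: "01HZEXAMPLE", // placeholder, not a real ULID
  l1: { axis: "semantic", confidence: 0.87 },
  l4: { category: "Tool Selection Errors", confidence: 0.87 },
  citedSpans: ["orchestrate-tool-selection"],
  explanation: "...",
  suggestedFix: "...",
};

console.log(summarize(example));
// logs: orchestrate-tool-selection: semantic / Tool Selection Errors
```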

rj.suggest_snapshot

Locks an attribution verdict as a regression baseline. Call this immediately after rj.attribute_trace on any failure you want to guard against going forward.

Required arguments:

  • attributionId — the ULID from rj.attribute_trace
  • name — descriptive snapshot name (unique per account)

Optional arguments:

  • description — longer description for the suite UI
  • suiteName — adds to this suite (lazy-creates it); defaults to "Unfiled"

Returns:

{
  snapshotId: string
  suiteId?: string
  nextStep: string  // hint pointing the agent toward rj.verify_change
}

Common errors: hint: "name_conflict" means the name is already taken — pick a more specific one. hint: "attribution_not_found" means the attributionId is wrong or belongs to a different user.
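
A caller can branch on these two hints mechanically. The following sketch shows one possible recovery policy; the recoverFromHint helper, its return shape, and the idea of appending a suffix to the name are illustrative assumptions, not behaviour of the RJ server.

```typescript
type SnapshotHint = "name_conflict" | "attribution_not_found";

interface Recovery {
  retry: boolean;
  name: string;
  reason: string;
}

// Hypothetical policy: retry name conflicts with a more specific name,
// and stop on a missing attribution because retrying cannot fix it.
function recoverFromHint(hint: SnapshotHint, name: string, detail: string): Recovery {
  switch (hint) {
    case "name_conflict":
      return { retry: true, name: `${name} (${detail})`, reason: "name already taken" };
    case "attribution_not_found":
      return {
        retry: false,
        name,
        reason: "check the attributionId and which user the token belongs to",
      };
  }
}

console.log(recoverFromHint("name_conflict", "orchestrator picks web-search", "stale cache"));
```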

rj.verify_change

Runs a snapshot suite against the current pipeline and returns a verdict. Call this before every commit on a patch that touches traced code.

Required arguments: the ULID of the snapshot suite to run, as shown in the URL at /app/snapshot-suites.

Optional arguments:

  • tags — run only snapshots with these tags
  • perturbation — forward-compat hint about what the agent changed (not used in v0.1 routing, reserved for v0.2)

Returns:

{
  verdict: "pass" | "regression" | "drift" | "error" | "empty"
  counts: { total, passed, changedIntentional, changedUnexpected, skipped, errored }
  outcomes: [{ snapshotId, status, citedSpanIds?, message? }]
  runIds: string[]
}

The verdict field is the decision: pass → proceed, regression → stop and inspect outcomes, drift → confirm intentionality with the user.
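
The decision table can be written down as a small function, which is handy in a pre-commit hook. In this sketch the action names are invented for illustration, the handling of empty follows the Pitfalls section, and the handling of error is an assumption.

```typescript
type Verdict = "pass" | "regression" | "drift" | "error" | "empty";
type Action = "commit" | "inspect-outcomes" | "ask-user" | "check-suite" | "retry-or-report";

// Maps each verdict to the next step. Only "pass" allows a commit.
function nextAction(verdict: Verdict): Action {
  switch (verdict) {
    case "pass":
      return "commit";
    case "regression":
      return "inspect-outcomes";
    case "drift":
      return "ask-user";
    case "empty":
      // Wrong suite ID or a suite with no snapshots (see Pitfalls).
      return "check-suite";
    case "error":
      // Assumption: treat a failed run as "not verified", never as a pass.
      return "retry-or-report";
  }
}

const allVerdicts: Verdict[] = ["pass", "regression", "drift", "error", "empty"];
for (const v of allVerdicts) {
  console.log(`${v} -> ${nextAction(v)}`);
}
```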


Use case 1: Attribute a failing trace from a Langfuse export

You're debugging a multi-agent pipeline. Tests pass, but a prod trace from Langfuse shows the orchestrator calling the wrong tool. You have the trace JSON.

Ask Claude Code:

I have a failing trace from Langfuse. The orchestrator picked
"web-search" when it should have picked "internal-kb". The
failure surfaced at span "orchestrate-tool-selection".

Here's the trace:
<paste JSON>

Call rj.attribute_trace with errorSpanId="orchestrate-tool-selection"
and errorDescription="orchestrator selected web-search instead of internal-kb".

What you should see: Claude calls rj.attribute_trace and returns something like:

Attribution complete:
- Cited span: orchestrate-tool-selection
- L1: semantic (the reasoning was wrong, not the execution)
- L4: Tool Selection Errors
- Confidence: 0.87
- Explanation: The orchestrator's context window contained stale tool descriptions.
  The "internal-kb" tool's description was truncated, leaving the model with
  insufficient signal to prefer it over "web-search".
- Suggested fix: Refresh tool descriptions on every invocation rather than
  caching them across sessions.

Then immediately lock it in:

Save that attribution as a snapshot named
"orchestrator picks web-search over internal-kb — stale tool descriptions"
and add it to my "prod-incidents" suite.

Claude calls rj.suggest_snapshot. The snapshot is now in /app/snapshot-suites and will be replayed on every future rj.verify_change call against that suite.


Use case 2: Verify a fix against a production-incident suite

You've patched the tool-description caching logic. Before you commit, verify the fix doesn't break the locked-in behaviour:

Before I commit this patch, run rj.verify_change against
suite 01HZSUITE... and tell me the verdict.

What you should see: Claude calls rj.verify_change and returns:

Suite run complete (suite: prod-incidents):
- Verdict: pass
- Total: 6 snapshots
- Passed: 6
- Changed-intentional: 0
- Changed-unexpected: 0
Run IDs: 01HZRUN1, 01HZRUN2, ...

If the verdict is regression:

Suite run complete:
- Verdict: regression
- Total: 6
- Passed: 5
- Changed-unexpected: 1

Regressed snapshot: "memory-retrieval-returns-wrong-session"
The cited span shifted from "memory-lookup" to "session-init".
The fix appears to have moved the failure upstream rather than resolved it.

Claude returns the specific outcomes entry that regressed, including the new cited span. You can deep-link directly to https://runtime-judgement-app.vercel.app/app/snapshot-suites using the run IDs to inspect the full result.

Fix the regression, run again. The loop terminates when verdict=pass.
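
The fix-and-rerun loop can be sketched as follows. Both verify and applyFix are injected stand-ins (verify would wrap an rj.verify_change call, applyFix is whatever patches the code), and the attempt cap is an assumption added so the loop cannot run forever. The loop deliberately stops on drift, error, and empty, since those need a human decision.

```typescript
type Verdict = "pass" | "regression" | "drift" | "error" | "empty";

interface VerifyResult {
  verdict: Verdict;
  runIds: string[];
}

// Re-runs verification after each fix until the verdict is no longer
// "regression" or the attempt budget is spent.
async function verifyUntilPass(
  verify: () => Promise<VerifyResult>,
  applyFix: (result: VerifyResult) => Promise<void>,
  maxAttempts = 3,
): Promise<{ passed: boolean; attempts: number; last: VerifyResult }> {
  let last = await verify();
  let attempts = 1;
  while (last.verdict === "regression" && attempts < maxAttempts) {
    await applyFix(last);
    last = await verify();
    attempts++;
  }
  return { passed: last.verdict === "pass", attempts, last };
}

// Example with stubbed functions: the first run regresses, the second passes.
const scripted: Verdict[] = ["regression", "pass"];
const demo = verifyUntilPass(
  async () => ({ verdict: scripted.shift() ?? "error", runIds: [] }),
  async () => {
    /* patch the code here */
  },
);
demo.then((r) => console.log(r.passed, r.attempts)); // logs: true 2
```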


The full inner loop

                 ┌──────────────────────────────────────┐
                 │ agent observes failing test / trace  │
                 └──────────────────────────────────────┘
                                   │
                                   ▼
                         rj.attribute_trace
                                   │
                      ┌────────────▼────────────┐
                      │ cited span + L1/L4 +    │
                      │ suggested fix           │
                      └────────────┬────────────┘
                                   │
                         rj.suggest_snapshot
                                   │
                      ┌────────────▼────────────┐
                      │ snapshot locked in      │
                      │ suite: prod-incidents   │
                      └────────────┬────────────┘
                                   │
                   agent writes patch based on suggestedFix
                                   │
                                   ▼
                          rj.verify_change
                                   │
                  ┌────────────────┼─────────────────┐
               pass             drift            regression
                  │                │                  │
              commit          confirm          fix + re-loop
                               w/ user

This mirrors what a human does on the web UI — compressed into three tool calls with no context switching.


Pitfalls

API token in the MCP config

The env block in mcp_servers.json is stored in plaintext on disk. Don't commit that file to version control. Add it to .gitignore. For team environments, use a secrets manager or environment variable injection at agent launch time rather than hardcoding in the config file.

If Claude Code is running in a sandboxed or remote execution environment, set RJ_API_KEY and RJ_API_URL as environment secrets in the environment's configuration rather than in the config file — the MCP server inherits whatever env it's launched in.
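
One way to enforce this is a pre-commit check that scans a parsed config for values that look like real tokens. The sketch below is a hypothetical helper; the pattern assumes tokens continue with an alphanumeric character after rj_live_, which also means the rj_live_... placeholder used in this guide is not flagged.

```typescript
// Recursively collects the paths of string values that look like RJ tokens.
function findHardcodedTokens(value: unknown, path = "$"): string[] {
  if (typeof value === "string") {
    // Assumption about token shape: "rj_live_" followed by a letter or digit.
    return /rj_live_[A-Za-z0-9]/.test(value) ? [path] : [];
  }
  let found: string[] = [];
  if (Array.isArray(value)) {
    value.forEach((item, i) => {
      found = found.concat(findHardcodedTokens(item, `${path}[${i}]`));
    });
  } else if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    for (const key of Object.keys(record)) {
      found = found.concat(findHardcodedTokens(record[key], `${path}.${key}`));
    }
  }
  return found;
}

const config = {
  mcpServers: {
    "runtime-judgement": {
      command: "npx",
      args: ["@runtime-judgement/mcp-server"],
      env: { RJ_API_KEY: "rj_live_abc123" }, // made-up token for the example
    },
  },
};

console.log(findHardcodedTokens(config));
// logs: [ '$.mcpServers.runtime-judgement.env.RJ_API_KEY' ]
```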

Network egress from Claude Code's sandbox

Claude Code running on code.claude.com operates in a managed remote environment with outbound network access governed by the environment's network policy. The MCP server needs to reach runtime-judgement-app.vercel.app (or your RJ_API_URL). If the tool calls time out silently, check that the environment's egress policy allows outbound HTTPS to the RJ host.

Suite ID not found

If rj.verify_change returns verdict=empty, either the suite ULID is wrong or the suite has no snapshots. Get the ULID from the URL at /app/snapshot-suites (it looks like 01HZABC...). If the suite is new, add at least one snapshot via rj.suggest_snapshot first.

Large traces

rj.attribute_trace accepts up to 4 MB of inline JSON. For larger traces, use the POST /api/traces endpoint directly with a Blob reference (up to 200 MB). The MCP server's inline path covers the vast majority of real traces.
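
If you handle traces programmatically, you can measure the serialized size before deciding which path to use. This is a minimal sketch: chooseUploadPath is a hypothetical helper, and it assumes "4 MB" means 4 × 1024 × 1024 bytes of UTF-8 JSON, which the text above does not specify.

```typescript
// Assumption: the inline limit is 4 MiB of UTF-8 encoded JSON.
const INLINE_LIMIT_BYTES = 4 * 1024 * 1024;

// Serializes the trace and reports whether it fits the inline path.
function chooseUploadPath(trace: unknown): { path: "inline" | "blob"; bytes: number } {
  const bytes = new TextEncoder().encode(JSON.stringify(trace)).length;
  return { path: bytes <= INLINE_LIMIT_BYTES ? "inline" : "blob", bytes };
}

console.log(chooseUploadPath({ spans: [] }));
// logs: { path: 'inline', bytes: 12 }
```

Measuring encoded bytes rather than string length matters for traces that contain non-ASCII text, where one character can take several bytes.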

Concurrent calls

The MCP server is stateless — each tool call is a fully independent HTTP round-trip to the RJ API. Multiple agents or agent threads can call it concurrently without coordination.
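
Because calls are independent, a script that verifies several suites can simply fan them out. In the sketch below, verifySuite is an injected stand-in for a function that issues one rj.verify_change call for one suite; the suite IDs are placeholders.

```typescript
type Verdict = "pass" | "regression" | "drift" | "error" | "empty";

// Runs all suite verifications concurrently and keys the verdicts by suite ID.
async function verifyAll(
  suiteIds: string[],
  verifySuite: (suiteId: string) => Promise<Verdict>,
): Promise<Record<string, Verdict>> {
  const verdicts = await Promise.all(suiteIds.map((id) => verifySuite(id)));
  const bySuite: Record<string, Verdict> = {};
  suiteIds.forEach((id, i) => {
    bySuite[id] = verdicts[i];
  });
  return bySuite;
}

// Stubbed example: one suite passes, the other regresses.
const stubVerify = async (id: string): Promise<Verdict> =>
  id === "suite-a" ? "pass" : "regression";

const allRuns = verifyAll(["suite-a", "suite-b"], stubVerify);
allRuns.then((result) => console.log(result));
```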

