Build a snapshot suite for regression gating

The core RJ loop is six steps: paste a failing trace → get attribution → save as snapshot → group into a suite → replay on every change → gate in CI. After the first pass through the loop, adding the next failure takes under two minutes.

This guide walks the full loop from a raw trace to a working CI gate. By the end you'll have a suite that runs on every PR and blocks merge the moment the same failure pattern resurfaces.

Before you start: if you're not sure what "spans" and "verdicts" mean in the RJ sense, read What are traces and why do they matter. If you want context on why multi-agent systems fail the way they do, Common failure modes in multi-agent systems maps to the L4 taxonomy you'll see in the attribution UI.

The full picture

Here's what you're building:

failing trace
      │
      ▼
  /app/new  ──→  attribution  ──→  snapshot
                                       │
                                       ▼
                               /app/snapshot-suites
                                       │
                                  suite run
                                       │
                          ┌────────────┼────────────┐
                       PASSED    CHANGED-         CHANGED-
                                INTENTIONAL      UNEXPECTED
                                                     │
                                               blocks PR ←── GitHub Action

Each verdict bucket means something specific:

PASSED — the cited span and failure category match the baseline exactly.
CHANGED-INTENTIONAL — something shifted, but the judge classified it as an expected drift (e.g. a refactor that moved logic without changing behaviour). Review and accept.
CHANGED-UNEXPECTED — the failure pattern changed in a way the judge didn't expect. Stop and inspect.

Step 1: Paste the failing trace

Go to /app/new.

Paste your trace in one of three formats:

OTEL gen-ai semconv (JSON)

{
  "resourceSpans": [
    {
      "resource": { "attributes": [] },
      "scopeSpans": [
        {
          "spans": [
            {
              "traceId": "abc123...",
              "spanId": "span001",
              "name": "search-tool",
              "startTimeUnixNano": "1716105600000000000",
              "endTimeUnixNano": "1716105601000000000",
              "status": { "code": 2, "message": "tool returned empty results" },
              "attributes": [
                { "key": "tool.name", "value": { "stringValue": "search" } }
              ]
            }
          ]
        }
      ]
    }
  ]
}

LangSmith export (JSON array of runs)

[
  {
    "id": "run-uuid",
    "name": "my-chain",
    "run_type": "chain",
    "inputs": { "question": "..." },
    "outputs": {},
    "error": "ValueError: search returned no results",
    "start_time": "2026-05-15T10:00:00Z",
    "end_time": "2026-05-15T10:00:03Z",
    "parent_run_id": null
  }
]

Custom JSON (RJ's own format — /docs has the schema)

RJ auto-detects the format. You don't need to specify it.
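If you want to sanity-check a trace locally before pasting it, the top-level JSON shape is enough to tell the three formats apart. The sketch below is an illustration only, not RJ's actual detector: the function name `detect_trace_format` and the returned labels are hypothetical, and the checks rely solely on the shapes shown in the examples above.

```python
import json


def detect_trace_format(raw: str) -> str:
    """Guess the trace format from the top-level JSON shape (hypothetical helper)."""
    data = json.loads(raw)

    # OTEL exports are an object with a top-level "resourceSpans" array.
    if isinstance(data, dict) and "resourceSpans" in data:
        return "otel"

    # LangSmith exports are a non-empty array of run objects carrying "run_type".
    if isinstance(data, list) and data and all(
        isinstance(run, dict) and "run_type" in run for run in data
    ):
        return "langsmith"

    # Anything else is assumed to be custom JSON.
    return "custom"
```

A `json.JSONDecodeError` from this helper is a quick signal that the paste was truncated, which is worth catching before the trace reaches the judge.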

What you should see: The trace ingests immediately and the attribution pipeline begins. An SSE stream shows progress through the four stages: Parse → Compress → Judge → Aggregate. On the Standard pipeline preset with a 60-span trace, expect under 7 seconds from paste to verdict. Switch to Fast (top-right preset selector) for triage in under 2 seconds when you only need the cited span name.

Step 2: Review the attribution

The result page at /app/attributions/[id] shows:

Cited span — the specific span the judge pinpointed as root cause
L1 axis — execution (plan was right, something broke running it) or semantic (the reasoning itself was wrong)
L4 category — one of 14 TRAIL labels, e.g. Tool-related, Context Handling Failures, Incorrect Problem Identification
Evidence quote — the verbatim excerpt from the cited span that supports the verdict
Explanation — a short prose summary of the causal chain
Suggested fix — a prose or prompt-patch suggestion from repair-v0
Confidence — how certain the judge is (used by the Cascade preset to decide whether to escalate)

What you should see: A single highlighted span in the trace DAG. If the evidence quote doesn't look right, use the thumbs-down signal to flag it — that feeds the judge-calibration loop.

If the attribution looks wrong — wrong cited span, wrong category — check:

Is the trace complete? A truncated trace missing parent spans will cause the compressor to miss the causal chain.
Did you describe the failure? Adding a text description at /app/new (the "What went wrong?" field) gives the judge a prior and meaningfully improves accuracy on ambiguous traces.

Step 3: Save the attribution as a snapshot

Click Save as snapshot on the attribution result page. Give it a name that describes the failure precisely:

search-tool returns empty — planner proceeds instead of retrying

Good snapshot names are:

Specific about the cited span (search-tool, not LLM)
Specific about the failure pattern (returns empty — planner proceeds, not bad result)
Stable enough to be meaningful six months later

What you should see: A snapshot is created at /app/snapshots/[id]. The verdict from the attribution becomes the baseline — this is what future runs are compared against. The snapshot records: the trace, the cited span, the L1 axis, the L4 category, and the confidence.
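One way to picture the baseline is as a small record plus an equality check. This is a sketch under assumptions: the class and field names are hypothetical, the trace itself is left out for brevity, and the 0 to 1 confidence range is a guess. It follows the PASSED definition given earlier (cited span and failure category match exactly), reading "failure category" as the L4 label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    """Hypothetical shape of what a snapshot records (trace omitted)."""
    cited_span: str    # e.g. "search-tool"
    l1_axis: str       # "execution" or "semantic"
    l4_category: str   # e.g. "Tool-related"
    confidence: float  # assumed to lie in 0..1


def matches_baseline(baseline: Baseline, replay: Baseline) -> bool:
    """True when cited span and L4 category match exactly, i.e. PASSED."""
    return (
        baseline.cited_span == replay.cited_span
        and baseline.l4_category == replay.l4_category
    )
```

Confidence is deliberately excluded from the comparison in this sketch, so a replay that cites the same span and category with a different confidence still counts as a match.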

You can also save via the API:

curl -X POST https://runtime-judgement-app.vercel.app/api/snapshots \
  -H "Authorization: Bearer $RJ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"attributionId": "01HZATTR...", "name": "search-tool returns empty — planner proceeds"}'

Step 4: Create a suite (or add to an existing one)

Suites are groups of snapshots that run together. One suite per logical component or feature works well — "payment-agent regressions", "search-planner regressions".

From the snapshot detail page: Click Add to suite and either type a new suite name or pick an existing one from the dropdown. RJ creates the suite lazily if it doesn't exist.

From the suites list: Go to /app/snapshot-suites, click New suite, name it, then add your snapshot from the suite detail page.

Via the API:

curl -X POST https://runtime-judgement-app.vercel.app/api/snapshots \
  -H "Authorization: Bearer $RJ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "attributionId": "01HZATTR...",
    "name": "search-tool returns empty — planner proceeds",
    "suiteName": "search-planner regressions"
  }'

The suiteName field lazy-creates the suite if it doesn't exist.
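If you script snapshot creation from Python rather than curl, the same request can be built with the standard library. This is a minimal sketch that mirrors the curl calls above; the helper name `build_snapshot_request` is hypothetical, and the endpoint, headers, and body fields are taken from those examples rather than from a full API reference.

```python
import json
import urllib.request
from typing import Optional

API_BASE = "https://runtime-judgement-app.vercel.app/api"


def build_snapshot_request(
    api_key: str,
    attribution_id: str,
    name: str,
    suite_name: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) the POST /api/snapshots request."""
    payload = {"attributionId": attribution_id, "name": name}
    if suite_name is not None:
        # Including suiteName adds the snapshot to that suite.
        payload["suiteName"] = suite_name

    return urllib.request.Request(
        f"{API_BASE}/snapshots",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Pass the result to `urllib.request.urlopen` to send it. Keeping construction separate from sending makes the payload easy to inspect in a dry run.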

What you should see: The suite appears at /app/snapshot-suites with a count of 1 (or more, if you added to an existing suite). Open the suite: the ULID in its URL (/app/snapshot-suites/[id]) is the suite-id you'll use in CI.

Step 5: Run the suite and read the verdict

From the suite detail page at /app/snapshot-suites/[id], click Run suite. RJ replays every snapshot in the suite against the current pipeline and returns a verdict for each one.

What you should see: A results table with one row per snapshot, each showing one of three statuses:

Status	What it means	What to do
`PASSED`	Verdict matches the baseline exactly	Nothing — this is the goal
`CHANGED-INTENTIONAL`	Verdict changed, but the judge ruled the change expected	Review and click Accept to roll the baseline forward
`CHANGED-UNEXPECTED`	Verdict changed in an unexpected way	Investigate — something regressed

A suite-level verdict aggregates the rows:

pass — all snapshots passed
drift — some changed but all intentional
regression — at least one unexpected change (this is what CI will block on)
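The aggregation rule above is small enough to write down. The sketch below is an illustration of the rule as described, not RJ's implementation; the function name is hypothetical, and treating an empty suite as `pass` is an assumption.

```python
from typing import Iterable


def suite_verdict(statuses: Iterable[str]) -> str:
    """Aggregate per-snapshot statuses into a suite-level verdict."""
    seen = set(statuses)

    # Any unexpected change makes the whole suite a regression.
    if "CHANGED-UNEXPECTED" in seen:
        return "regression"

    # Changes exist, but every one of them was classified as intentional.
    if "CHANGED-INTENTIONAL" in seen:
        return "drift"

    # Everything passed (or the suite is empty, by assumption).
    return "pass"
```

The ordering matters: one unexpected change outweighs any number of intentional ones.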

Accepting a drift baseline: When you refactor your agent intentionally and the snapshot's verdict shifts to CHANGED-INTENTIONAL, click Accept on that run to roll the baseline forward. The next run will compare against the new baseline. This is how you keep the suite current as the agent evolves.

Via the API:

# Run the suite
curl -X POST https://runtime-judgement-app.vercel.app/api/snapshot-suites/01HZSUITE.../run \
  -H "Authorization: Bearer $RJ_API_KEY"
# Returns: { verdict, counts: { passed, changedIntentional, changedUnexpected, ... }, runIds: [...] }

# Accept a run (promote new run to baseline)
curl -X POST https://runtime-judgement-app.vercel.app/api/snapshot-runs/01HZRUN.../accept \
  -H "Authorization: Bearer $RJ_API_KEY"
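To turn the run response into a one-line summary for logs or chat notifications, something like the following could work. It is a sketch that uses only the field names shown in the response comment above; the helper name is hypothetical, and it assumes the counts are integers and `runIds` is a list.

```python
def summarize_run(response: dict) -> str:
    """Render a suite run response as a single log line."""
    counts = response.get("counts", {})
    run_ids = response.get("runIds", [])
    return (
        f"{response['verdict']}: "
        f"{counts.get('passed', 0)} passed, "
        f"{counts.get('changedIntentional', 0)} intentional, "
        f"{counts.get('changedUnexpected', 0)} unexpected "
        f"({len(run_ids)} runs)"
    )
```

Missing count keys default to zero, so the helper keeps working if the response carries additional or fewer counters than the three named here.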

Step 6: Wire the suite into CI

Get the suite ULID from the URL: https://runtime-judgement-app.vercel.app/app/snapshot-suites/01HZSUITE...

Follow Wire RJ into GitHub Actions for the full workflow setup. In brief:

- name: ProveAI Origin suite run
  uses: runtime-judgement/rj-action@v1
  with:
    suite-id: 01HZSUITE_REPLACE_WITH_YOURS
    rj-api-token: ${{ secrets.RJ_API_TOKEN }}

This runs on every PR. When the suite returns regression, the step exits non-zero and blocks merge. When it returns pass or drift, it exits zero and the PR proceeds.
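If you call the API from your own CI script instead of the action, the gating rule reduces to a verdict-to-exit-code mapping. This is a sketch of that rule, not the action's source; the function name is hypothetical, and the specific non-zero value of 1 is an assumption.

```python
def gate_exit_code(verdict: str) -> int:
    """Map a suite-level verdict to a CI process exit code."""
    if verdict == "regression":
        return 1  # non-zero exit blocks the merge
    if verdict in ("pass", "drift"):
        return 0  # the PR proceeds
    # Fail loudly on anything unrecognised rather than silently passing.
    raise ValueError(f"unknown suite verdict: {verdict!r}")
```

In a script you would finish with `sys.exit(gate_exit_code(verdict))`. Raising on an unknown verdict is a design choice: a gate that passes on input it does not understand is not much of a gate.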

Growing the suite over time

A healthy suite grows snapshot by snapshot as real failures happen in production. The workflow:

A new failure surfaces in production (from Langfuse, LangSmith, or a manual paste)
Attribute it at /app/new
Save as snapshot and add to the relevant suite
The next CI run automatically includes it

RJ's snapshot suggestion engine (visible after each successful attribution) surfaces a pre-filled "Save this as a snapshot" card with the cited-cause baseline already populated. One click and it's in your suite.

What makes a good snapshot set:

Cover different failure categories — don't have 10 snapshots of the same Tool-related pattern
Keep the trace small — the compressor handles large traces, but smaller traces run faster and are easier to debug when they regress
Name snapshots precisely — they're regression documentation, not just test identifiers
Accept baseline drift promptly — stale baselines create false regressions and erode trust in the gate
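To check the first point, you can count how the snapshots in a suite spread across L4 categories. The sketch below assumes you have already collected each snapshot's category into a list; the function name and the 50% threshold are hypothetical choices, not RJ defaults.

```python
from collections import Counter
from typing import Iterable, List


def overrepresented_categories(
    categories: Iterable[str], max_share: float = 0.5
) -> List[str]:
    """Return categories that make up more than max_share of the suite."""
    counts = Counter(categories)
    total = sum(counts.values())
    # An empty suite has no categories, so the comprehension yields nothing.
    return sorted(
        category
        for category, n in counts.items()
        if n / total > max_share
    )
```

A non-empty result suggests pruning near-duplicate snapshots or adding coverage for other failure categories.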

Recommended suite architecture

For a system with multiple agent components:

prod-incidents/          ← failures that actually hit users
  └── search-tool-empty-response
  └── planner-goal-deviation
  └── memory-context-overflow

component-smoke/         ← one representative trace per major component
  └── search-tool-happy-path
  └── planner-tool-selection
  └── memory-retrieval

regression-gate/         ← everything that blocked a PR once
  └── <all previous incidents that got fixed>

Run prod-incidents and regression-gate on every PR with fail-on-unexpected enabled, so a regression blocks merge. Run component-smoke on every push to main in advisory mode (fail-on-unexpected: false) to catch early drift before it hits a PR.
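The same plan can be kept as data, which helps when one script drives several suites. This is a sketch only: the `SUITE_PLAN` structure and `suites_for` helper are hypothetical, and the event names follow GitHub Actions conventions (`pull_request`, `push`).

```python
from typing import List, Tuple

# Which suite runs on which CI event, and whether a regression should block.
SUITE_PLAN = {
    "prod-incidents": {"event": "pull_request", "fail_on_unexpected": True},
    "regression-gate": {"event": "pull_request", "fail_on_unexpected": True},
    "component-smoke": {"event": "push", "fail_on_unexpected": False},
}


def suites_for(event: str) -> List[Tuple[str, bool]]:
    """Return (suite name, blocking?) pairs to run for a CI event."""
    return sorted(
        (name, cfg["fail_on_unexpected"])
        for name, cfg in SUITE_PLAN.items()
        if cfg["event"] == event
    )
```

For example, `suites_for("push")` yields only component-smoke, flagged as non-blocking, which matches the advisory mode described above.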

What next

Block PRs with the suite: Wire RJ into GitHub Actions
Pull traces from LangSmith automatically: Bridge LangSmith traces into RJ — run the ingest cycle hourly so new prod failures become snapshots without manual paste
Run the suite from inside your coding agent: Use RJ from Claude Code via MCP — verify your fix before you commit
Understand the failure taxonomy your snapshots use: Core principles of failure attribution
