cookbook · beginner · 10 min read · Updated 2026-05-20

Build a snapshot suite for regression gating

Turn a single failing trace into a locked regression baseline, group it into a suite, and replay it on every change — the complete development loop.

The core RJ loop is six steps: paste a failing trace → get attribution → save as snapshot → group into a suite → replay on every change → gate in CI. After your first pass through the loop, adding the next failure takes under two minutes.

This guide walks the full loop from a raw trace to a working CI gate. By the end you'll have a suite that runs on every PR and blocks merge the moment the same failure pattern resurfaces.

Before you start: if you're not sure what "spans" and "verdicts" mean in the RJ sense, read What are traces and why do they matter. If you want context on why multi-agent systems fail the way they do, Common failure modes in multi-agent systems maps to the L4 taxonomy you'll see in the attribution UI.


The full picture

Here's what you're building:

failing trace
      │
      ▼
  /app/new  ──→  attribution  ──→  snapshot
                                       │
                                       ▼
                               /app/snapshot-suites
                                       │
                                  suite run
                                       │
                          ┌────────────┼────────────┐
                       PASSED    CHANGED-         CHANGED-
                                INTENTIONAL      UNEXPECTED
                                                     │
                                               blocks PR ←── GitHub Action

Each verdict bucket means something specific:

  • PASSED — the cited span and failure category match the baseline exactly.
  • CHANGED-INTENTIONAL — something shifted, but the judge classified it as an expected drift (e.g. a refactor that moved logic without changing behaviour). Review and accept.
  • CHANGED-UNEXPECTED — the failure pattern changed in a way the judge didn't expect. Stop and inspect.

Step 1: Paste the failing trace

Go to /app/new.

Paste your trace in one of three formats:

OTEL gen-ai semconv (JSON)

{
  "resourceSpans": [
    {
      "resource": { "attributes": [] },
      "scopeSpans": [
        {
          "spans": [
            {
              "traceId": "abc123...",
              "spanId": "span001",
              "name": "search-tool",
              "startTimeUnixNano": "1716105600000000000",
              "endTimeUnixNano": "1716105601000000000",
              "status": { "code": 2, "message": "tool returned empty results" },
              "attributes": [
                { "key": "tool.name", "value": { "stringValue": "search" } }
              ]
            }
          ]
        }
      ]
    }
  ]
}

LangSmith export (JSON array of runs)

[
  {
    "id": "run-uuid",
    "name": "my-chain",
    "run_type": "chain",
    "inputs": { "question": "..." },
    "outputs": {},
    "error": "ValueError: search returned no results",
    "start_time": "2026-05-15T10:00:00Z",
    "end_time": "2026-05-15T10:00:03Z",
    "parent_run_id": null
  }
]

Custom JSON (RJ's own format — /docs has the schema)

RJ auto-detects the format. You don't need to specify it.
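
RJ's detection logic isn't documented in this guide. As a rough sketch of how the three formats can be told apart from their top-level shape, assuming the samples above are representative (the function name `guess_trace_format` and the labels it returns are hypothetical, not part of RJ):

```python
import json


def guess_trace_format(raw: str) -> str:
    """Guess a trace format from its top-level JSON shape.

    Hypothetical helper for illustration only; not RJ's actual detector.
    """
    data = json.loads(raw)
    # OTEL exports are an object with a top-level "resourceSpans" array.
    if isinstance(data, dict) and "resourceSpans" in data:
        return "otel"
    # LangSmith exports are a non-empty array of run objects with "run_type".
    if isinstance(data, list) and data and all(
        isinstance(run, dict) and "run_type" in run for run in data
    ):
        return "langsmith"
    # Anything else is treated as custom JSON.
    return "custom"
```

A check like this is handy before pasting: if your export doesn't land in the bucket you expect, the file is probably wrapped or truncated.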

What you should see: The trace ingests immediately and the attribution pipeline begins. An SSE stream shows progress through the four stages: Parse → Compress → Judge → Aggregate. On the Standard pipeline preset with a 60-span trace, expect under 7 seconds from paste to verdict. Switch to Fast (top-right preset selector) for triage in under 2 seconds when you only need the name of the cited span.


Step 2: Review the attribution

The result page at /app/attributions/[id] shows:

  • Cited span — the specific span the judge pinpointed as root cause
  • L1 axis — execution (plan was right, something broke running it) or semantic (the reasoning itself was wrong)
  • L4 category — one of 14 TRAIL labels, e.g. Tool-related, Context Handling Failures, Incorrect Problem Identification
  • Evidence quote — the verbatim excerpt from the cited span that supports the verdict
  • Explanation — a short prose summary of the causal chain
  • Suggested fix — a prose or prompt-patch suggestion from repair-v0
  • Confidence — how certain the judge is (used by Cascade preset to decide whether to escalate)

What you should see: A single highlighted span in the trace DAG. If the evidence quote doesn't look right, use the thumbs-down signal to flag it — that feeds the judge-calibration loop.

If the attribution looks wrong — wrong cited span, wrong category — check:

  1. Is the trace complete? A truncated trace missing parent spans will cause the compressor to miss the causal chain.
  2. Did you describe the failure? Adding a text description at /app/new (the "What went wrong?" field) gives the judge a prior and meaningfully improves accuracy on ambiguous traces.
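
The first check can be done locally before you paste. Here is a minimal sketch for a LangSmith-style export like the one in Step 1: it lists runs whose parent_run_id points at a run that isn't in the file. The function name `find_orphan_runs` is hypothetical, and the sketch assumes every run has an "id" field.

```python
def find_orphan_runs(runs: list[dict]) -> list[str]:
    """Return ids of runs whose parent_run_id is not present in the export."""
    known_ids = {run["id"] for run in runs}
    return [
        run["id"]
        for run in runs
        # Root runs have parent_run_id set to None and are not orphans.
        if run.get("parent_run_id") is not None
        and run["parent_run_id"] not in known_ids
    ]
```

An empty result means every non-root run has its parent in the export. Any ids it returns mark where the causal chain is cut.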

Step 3: Save the attribution as a snapshot

Click Save as snapshot on the attribution result page. Give it a name that describes the failure precisely:

search-tool returns empty — planner proceeds instead of retrying

Good snapshot names are:

  • Specific about the cited span (search-tool, not LLM)
  • Specific about the failure pattern (returns empty — planner proceeds, not bad result)
  • Stable enough to be meaningful six months later

What you should see: A snapshot is created at /app/snapshots/[id]. The verdict from the attribution becomes the baseline — this is what future runs are compared against. The snapshot records: the trace, the cited span, the L1 axis, the L4 category, and the confidence.
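
One way to picture the baseline and the PASSED rule (cited span and failure category match the baseline exactly) is the sketch below. The class and field names are illustrative, not RJ's schema, and the trace itself is left out. Whether a change counts as intentional or unexpected is the judge's call, so that part isn't modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Verdict fields a snapshot baseline records (names are illustrative)."""
    cited_span: str
    l1_axis: str       # "execution" or "semantic"
    l4_category: str   # one of the TRAIL labels
    confidence: float


def matches_baseline(baseline: Verdict, replay: Verdict) -> bool:
    """True when the cited span and failure category both match the baseline."""
    return (
        baseline.cited_span == replay.cited_span
        and baseline.l4_category == replay.l4_category
    )
```

Confidence is deliberately ignored in the comparison here; that is an assumption of the sketch, based on the PASSED definition above.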

You can also save via the API:

curl -X POST https://runtime-judgement-app.vercel.app/api/snapshots \
  -H "Authorization: Bearer $RJ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"attributionId": "01HZATTR...", "name": "search-tool returns empty — planner proceeds"}'

Step 4: Create a suite (or add to an existing one)

Suites are groups of snapshots that run together. One suite per logical component or feature works well — "payment-agent regressions", "search-planner regressions".

From the snapshot detail page: Click Add to suite and either type a new suite name or pick an existing one from the dropdown. RJ creates the suite lazily if it doesn't exist.

From the suites list: Go to /app/snapshot-suites, click New suite, name it, then add your snapshot from the suite detail page.

Via the API, pass suiteName when you save the snapshot:

curl -X POST https://runtime-judgement-app.vercel.app/api/snapshots \
  -H "Authorization: Bearer $RJ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "attributionId": "01HZATTR...",
    "name": "search-tool returns empty — planner proceeds",
    "suiteName": "search-planner regressions"
  }'

The suiteName field lazy-creates the suite if it doesn't exist.
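
If you'd rather script this from Python, the same call can be sketched with the standard library. The endpoint, headers, and JSON fields are copied from the curl examples; the function name `build_snapshot_request` is hypothetical. The sketch only builds the request and doesn't send it.

```python
import json
import urllib.request
from typing import Optional

API_BASE = "https://runtime-judgement-app.vercel.app/api"


def build_snapshot_request(
    api_key: str,
    attribution_id: str,
    name: str,
    suite_name: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) the POST /api/snapshots request."""
    payload = {"attributionId": attribution_id, "name": name}
    if suite_name is not None:
        # Mirrors the suiteName field from the curl example.
        payload["suiteName"] = suite_name
    return urllib.request.Request(
        f"{API_BASE}/snapshots",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. The response shape isn't covered in this guide, so the sketch leaves it unparsed.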

What you should see: The suite appears at /app/snapshot-suites with a count of 1 (or more, if you added to an existing suite). The ULID in the URL is the suite-id you'll use in CI.


Step 5: Run the suite and read the verdict

From the suite detail page at /app/snapshot-suites/[id], click Run suite. RJ replays every snapshot in the suite against the current pipeline and returns a verdict for each one.

What you should see: A results table with one row per snapshot, each showing one of three statuses:

Status              | What it means                                           | What to do
PASSED              | Verdict matches the baseline exactly                    | Nothing (this is the goal)
CHANGED-INTENTIONAL | Verdict changed, but the judge ruled the change expected | Review and click Accept to roll the baseline forward
CHANGED-UNEXPECTED  | Verdict changed in an unexpected way                    | Investigate: something regressed

A suite-level verdict aggregates the rows:

  • pass — all snapshots passed
  • drift — some changed but all intentional
  • regression — at least one unexpected change (this is what CI will block on)
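
The aggregation rule above is small enough to write down. This is a sketch of the rule as described here, not RJ's code, and the function name `suite_verdict` is hypothetical:

```python
def suite_verdict(statuses: list[str]) -> str:
    """Collapse per-snapshot statuses into a suite-level verdict."""
    if "CHANGED-UNEXPECTED" in statuses:
        return "regression"  # at least one unexpected change
    if "CHANGED-INTENTIONAL" in statuses:
        return "drift"       # some changed, all of them intentional
    return "pass"            # every snapshot passed (also the result for an empty list)
```

Note the ordering: a single unexpected change outranks any number of intentional ones.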

Accepting a drift baseline: When you refactor your agent intentionally and the snapshot's verdict shifts to CHANGED-INTENTIONAL, click Accept on that run to roll the baseline forward. The next run will compare against the new baseline. This is how you keep the suite current as the agent evolves.

Via the API:

# Run the suite
curl -X POST https://runtime-judgement-app.vercel.app/api/snapshot-suites/01HZSUITE.../run \
  -H "Authorization: Bearer $RJ_API_KEY"
# Returns: { verdict, counts: { passed, changedIntentional, changedUnexpected, ... }, runIds: [...] }

# Accept a run (promote new run to baseline)
curl -X POST https://runtime-judgement-app.vercel.app/api/snapshot-runs/01HZRUN.../accept \
  -H "Authorization: Bearer $RJ_API_KEY"
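
To log the run result in a script, you could summarise the response body like this. The sketch assumes the fields named in the comment above (verdict, counts.passed, counts.changedIntentional, counts.changedUnexpected) and assumes the counts are integers. The function name `summarize_run` is hypothetical.

```python
import json


def summarize_run(response_body: str) -> str:
    """Format a one-line summary of a suite run response."""
    body = json.loads(response_body)
    counts = body.get("counts", {})
    return (
        f"{body['verdict']}: "
        f"{counts.get('passed', 0)} passed, "
        f"{counts.get('changedIntentional', 0)} intentional, "
        f"{counts.get('changedUnexpected', 0)} unexpected"
    )
```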

Step 6: Wire the suite into CI

Get the suite ULID from the URL: https://runtime-judgement-app.vercel.app/app/snapshot-suites/01HZSUITE...

Follow Wire RJ into GitHub Actions for the full workflow setup. In brief:

- name: ProveAI Origin suite run
  uses: runtime-judgement/rj-action@v1
  with:
    suite-id: 01HZSUITE_REPLACE_WITH_YOURS
    rj-api-token: ${{ secrets.RJ_API_TOKEN }}

This runs on every PR. When the suite returns regression, the step exits non-zero and blocks merge. When it returns pass or drift, it exits zero and the PR proceeds.
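
If you call the API from your own CI script instead of the action, the same gating behaviour can be sketched as a verdict-to-exit-code mapping. The specific codes 1 and 2 are assumptions; this guide only says the step exits non-zero on regression.

```python
def exit_code_for(verdict: str) -> int:
    """Map a suite verdict to a CI exit code, following the behaviour described above."""
    if verdict == "regression":
        return 1  # block the merge
    if verdict in ("pass", "drift"):
        return 0  # let the PR proceed
    # Unknown verdicts fail closed; this is a choice made for the sketch.
    return 2
```

A script would end with `sys.exit(exit_code_for(verdict))` after reading the verdict from the run response.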


Growing the suite over time

A healthy suite grows snapshot by snapshot as real failures happen in production. The workflow:

  1. A new failure surfaces in production (from Langfuse, LangSmith, or a manual paste)
  2. Attribute it at /app/new
  3. Save as snapshot and add to the relevant suite
  4. The next CI run automatically includes it

RJ's snapshot suggestion engine (visible after each successful attribution) surfaces a pre-filled "Save this as a snapshot" card with the cited-cause baseline already populated. One click and it's in your suite.

What makes a good snapshot set:

  • Cover different failure categories — don't have 10 snapshots of the same Tool-related pattern
  • Keep the trace small — the compressor handles large traces, but smaller traces run faster and are easier to debug when they regress
  • Name snapshots precisely — they're regression documentation, not just test identifiers
  • Accept baseline drift promptly — stale baselines create false regressions and erode trust in the gate
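
For the first point, a quick coverage check can flag suites dominated by one category. This is a minimal sketch: the 50% threshold is arbitrary, the function name `overrepresented_categories` is hypothetical, and you'd need to collect each snapshot's L4 category yourself.

```python
from collections import Counter


def overrepresented_categories(
    categories: list[str], max_share: float = 0.5
) -> list[str]:
    """Return L4 categories that make up more than max_share of the suite."""
    if not categories:
        return []
    counts = Counter(categories)
    total = len(categories)
    return sorted(cat for cat, n in counts.items() if n / total > max_share)
```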

Recommended suite architecture

For a system with multiple agent components:

prod-incidents/          ← failures that actually hit users
  └── search-tool-empty-response
  └── planner-goal-deviation
  └── memory-context-overflow

component-smoke/         ← one representative trace per major component
  └── search-tool-happy-path
  └── planner-tool-selection
  └── memory-retrieval

regression-gate/         ← everything that blocked a PR once
  └── <all previous incidents that got fixed>

Run prod-incidents and regression-gate on every PR (fail-on-unexpected). Run component-smoke on every push to main (advisory mode, fail-on-unexpected: false) to catch early drift before it hits a PR.
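
The schedule above can be written down as data, which makes it easy to see which suites actually gate a merge. The event keys and the `SUITE_PLAN` and `blocking_suites` names are hypothetical; the suite names and fail-on-unexpected settings come from the layout above.

```python
SUITE_PLAN = {
    # Gate every PR on real incidents and past blockers.
    "pull_request": [
        {"suite": "prod-incidents", "fail_on_unexpected": True},
        {"suite": "regression-gate", "fail_on_unexpected": True},
    ],
    # Advisory smoke run on pushes to main.
    "push_main": [
        {"suite": "component-smoke", "fail_on_unexpected": False},
    ],
}


def blocking_suites(event: str) -> list[str]:
    """Return the suites that can fail the build for a given CI event."""
    return [
        entry["suite"]
        for entry in SUITE_PLAN.get(event, [])
        if entry["fail_on_unexpected"]
    ]
```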

