Build a snapshot suite for regression gating
Turn a single failing trace into a locked regression baseline, group it into a suite, and replay it on every change — the complete development loop.
The core RJ loop is six steps: paste a failing trace → get attribution → save as snapshot → group into a suite → replay on every change → gate in CI. After you've completed the loop once, adding the next failure takes under two minutes.
This guide walks the full loop from a raw trace to a working CI gate. By the end you'll have a suite that runs on every PR and blocks merge the moment the same failure pattern resurfaces.
Before you start: if you're not sure what "spans" and "verdicts" mean in the RJ sense, read What are traces and why do they matter. If you want context on why multi-agent systems fail the way they do, Common failure modes in multi-agent systems maps to the L4 taxonomy you'll see in the attribution UI.
The full picture
Here's what you're building:
failing trace
     │
     ▼
/app/new ──→ attribution ──→ snapshot
                                │
                                ▼
                       /app/snapshot-suites
                                │
                            suite run
                                │
                   ┌────────────┼────────────┐
                PASSED      CHANGED-      CHANGED-
                           INTENTIONAL   UNEXPECTED
                                              │
                          blocks PR ←── GitHub Action
Each verdict bucket means something specific:
- PASSED — the cited span and failure category match the baseline exactly.
- CHANGED-INTENTIONAL — something shifted, but the judge classified it as an expected drift (e.g. a refactor that moved logic without changing behaviour). Review and accept.
- CHANGED-UNEXPECTED — the failure pattern changed in a way the judge didn't expect. Stop and inspect.
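As a rough illustration of how the three buckets relate, the sketch below compares a replayed verdict with its baseline. The `Verdict` type and the `judged_expected` flag are hypothetical names for this example. How RJ's judge actually decides that a drift is expected is not described in this guide, so the sketch takes that decision as an input.

```python
from typing import NamedTuple


class Verdict(NamedTuple):
    """Hypothetical minimal verdict: the two fields that PASSED compares."""
    cited_span: str
    l4_category: str


def bucket(baseline: Verdict, replay: Verdict, judged_expected: bool) -> str:
    """Map a replayed verdict to one of the three buckets.

    `judged_expected` stands in for the judge's own drift classification,
    which this sketch accepts as an argument rather than computing.
    """
    if replay == baseline:
        return "PASSED"
    return "CHANGED-INTENTIONAL" if judged_expected else "CHANGED-UNEXPECTED"
```

With a baseline of `Verdict("search-tool", "Tool-related")`, a replay that cites a different span lands in one of the two CHANGED buckets depending only on the flag.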
Step 1: Paste the failing trace
Go to /app/new.
Paste your trace in one of three formats:
OTEL gen-ai semconv (JSON)
{
  "resourceSpans": [
    {
      "resource": { "attributes": [] },
      "scopeSpans": [
        {
          "spans": [
            {
              "traceId": "abc123...",
              "spanId": "span001",
              "name": "search-tool",
              "startTimeUnixNano": "1716105600000000000",
              "endTimeUnixNano": "1716105601000000000",
              "status": { "code": 2, "message": "tool returned empty results" },
              "attributes": [
                { "key": "tool.name", "value": { "stringValue": "search" } }
              ]
            }
          ]
        }
      ]
    }
  ]
}
LangSmith export (JSON array of runs)
[
  {
    "id": "run-uuid",
    "name": "my-chain",
    "run_type": "chain",
    "inputs": { "question": "..." },
    "outputs": {},
    "error": "ValueError: search returned no results",
    "start_time": "2026-05-15T10:00:00Z",
    "end_time": "2026-05-15T10:00:03Z",
    "parent_run_id": null
  }
]
Custom JSON (RJ's own format — /docs has the schema)
RJ auto-detects the format. You don't need to specify it.
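RJ's detection logic is not documented here. Purely as an assumption, one plausible approach is to look at the top-level JSON shape of the two samples above. The function name and the returned labels are made up for this sketch.

```python
import json


def detect_format(raw: str) -> str:
    """Guess a trace format from its top-level JSON shape (illustrative only)."""
    data = json.loads(raw)
    # OTEL exports are an object with a top-level "resourceSpans" array.
    if isinstance(data, dict) and "resourceSpans" in data:
        return "otel"
    # LangSmith exports are a non-empty array of run objects carrying "run_type".
    if isinstance(data, list) and data and all(
        isinstance(run, dict) and "run_type" in run for run in data
    ):
        return "langsmith"
    # Anything else is treated as the custom format in this sketch.
    return "custom"
```

Checking a format locally like this can be a quick way to confirm that an export is well-formed JSON before pasting it.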
What you should see: The trace ingests immediately and the attribution pipeline begins. An SSE stream shows progress through the four stages: Parse → Compress → Judge → Aggregate. With the Standard pipeline preset and a 60-span trace, expect a verdict in under 7 seconds from paste. Switch to Fast (top-right preset selector) for triage in under 2 seconds when you just want the span name.
Step 2: Review the attribution
The result page at /app/attributions/[id] shows:
- Cited span: the specific span the judge pinpointed as root cause
- L1 axis: execution (plan was right, something broke running it) or semantic (the reasoning itself was wrong)
- L4 category: one of 14 TRAIL labels, e.g. Tool-related, Context Handling Failures, Incorrect Problem Identification
- Evidence quote: the verbatim excerpt from the cited span that supports the verdict
- Explanation: a short prose summary of the causal chain
- Suggested fix: a prose or prompt-patch suggestion from repair-v0
- Confidence: how certain the judge is (used by Cascade preset to decide whether to escalate)
What you should see: A single highlighted span in the trace DAG. If the evidence quote doesn't look right, use the thumbs-down signal to flag it — that feeds the judge-calibration loop.
If the attribution looks wrong — wrong cited span, wrong category — check:
- Is the trace complete? A truncated trace missing parent spans will cause the compressor to miss the causal chain.
- Did you describe the failure? Adding a text description at /app/new (the "What went wrong?" field) gives the judge a prior and meaningfully improves accuracy on ambiguous traces.
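The completeness check can be done locally before re-pasting. The sketch below assumes a LangSmith-style export, using the `id` and `parent_run_id` fields from the sample in Step 1. The function name is hypothetical, and this is not something RJ provides.

```python
def missing_parents(runs: list[dict]) -> set[str]:
    """Return parent IDs that some run references but the export does not contain."""
    known_ids = {run["id"] for run in runs}
    return {
        run["parent_run_id"]
        for run in runs
        if run.get("parent_run_id") is not None
        and run["parent_run_id"] not in known_ids
    }
```

A non-empty result suggests the export was truncated and that parent spans are missing from the trace.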
Step 3: Save the attribution as a snapshot
Click Save as snapshot on the attribution result page. Give it a name that describes the failure precisely:
search-tool returns empty — planner proceeds instead of retrying
Good snapshot names are:
- Specific about the cited span (search-tool, not LLM)
- Specific about the failure pattern (planner proceeds instead of retrying, not bad result)
- Stable enough to be meaningful six months later
What you should see: A snapshot is created at /app/snapshots/[id]. The verdict from the attribution becomes the baseline — this is what future runs are compared against. The snapshot records: the trace, the cited span, the L1 axis, the L4 category, and the confidence.
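To make the recorded fields concrete, here is a hypothetical data model for a baseline. The class name, field names, and the 0 to 1 confidence scale are assumptions for illustration, not RJ's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotBaseline:
    """Hypothetical record of the fields a snapshot stores."""
    trace: str          # the raw trace payload as pasted
    cited_span: str     # span the judge identified as root cause
    l1_axis: str        # "execution" or "semantic"
    l4_category: str    # one of the TRAIL labels
    confidence: float   # judge certainty; a 0..1 scale is assumed here

    def __post_init__(self) -> None:
        # Reject axis values other than the two named in this guide.
        if self.l1_axis not in ("execution", "semantic"):
            raise ValueError(f"unknown L1 axis: {self.l1_axis!r}")
```

Freezing the dataclass mirrors the idea that a baseline stays fixed until a run is explicitly accepted.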
You can also save via the API:
curl -X POST https://runtime-judgement-app.vercel.app/api/snapshots \
-H "Authorization: Bearer $RJ_API_KEY" \
-H "Content-Type: application/json" \
-d '{"attributionId": "01HZATTR...", "name": "search-tool returns empty — planner proceeds"}'
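The same call can be made from a script. This is a minimal sketch using only the Python standard library. The endpoint, headers, and JSON fields are copied from the curl command above, while the helper name is made up. It builds the request without sending it, so it can be inspected first.

```python
import json
import urllib.request

API_BASE = "https://runtime-judgement-app.vercel.app/api"


def build_snapshot_request(
    attribution_id: str, name: str, api_key: str
) -> urllib.request.Request:
    """Build (but do not send) the POST /api/snapshots request."""
    body = json.dumps({"attributionId": attribution_id, "name": name}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/snapshots",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` sends it. The response format is not described in this guide, so the sketch stops short of parsing it.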
Step 4: Create a suite (or add to an existing one)
Suites are groups of snapshots that run together. One suite per logical component or feature works well — "payment-agent regressions", "search-planner regressions".
From the snapshot detail page: Click Add to suite and either type a new suite name or pick an existing one from the dropdown. RJ creates the suite lazily if it doesn't exist.
From the suites list: Go to /app/snapshot-suites, click New suite, name it, then add your snapshot from the suite detail page.
Via the API:
curl -X POST https://runtime-judgement-app.vercel.app/api/snapshots \
-H "Authorization: Bearer $RJ_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"attributionId": "01HZATTR...",
"name": "search-tool returns empty — planner proceeds",
"suiteName": "search-planner regressions"
}'
The suiteName field lazy-creates the suite if it doesn't exist.
What you should see: The suite appears at /app/snapshot-suites with a count of 1 (or more, if you added to an existing suite). The ULID in the URL is the suite-id you'll use in CI.
Step 5: Run the suite and read the verdict
From the suite detail page at /app/snapshot-suites/[id], click Run suite. RJ replays every snapshot in the suite against the current pipeline and returns a verdict for each one.
What you should see: A results table with one row per snapshot, each showing one of three statuses:
| Status | What it means | What to do |
|---|---|---|
| PASSED | Verdict matches the baseline exactly | Nothing; this is the goal |
| CHANGED-INTENTIONAL | Verdict changed, but the judge ruled the change expected | Review and click Accept to roll the baseline forward |
| CHANGED-UNEXPECTED | Verdict changed in an unexpected way | Investigate; something regressed |
A suite-level verdict aggregates the rows:
- pass: all snapshots passed
- drift: some changed but all intentional
- regression: at least one unexpected change (this is what CI will block on)
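The aggregation rule is simple enough to sketch in a few lines. The function name is hypothetical, and treating an empty suite as a pass is an assumption, since this guide does not say how an empty suite is reported.

```python
def suite_verdict(statuses: list[str]) -> str:
    """Aggregate per-snapshot statuses into pass / drift / regression."""
    if "CHANGED-UNEXPECTED" in statuses:
        return "regression"   # at least one unexpected change
    if "CHANGED-INTENTIONAL" in statuses:
        return "drift"        # some changed, all of them intentional
    return "pass"             # every snapshot matched its baseline
```

The order of the checks matters: a single unexpected change outweighs any number of intentional ones.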
Accepting a drift baseline: When you refactor your agent intentionally and the snapshot's verdict shifts to CHANGED-INTENTIONAL, click Accept on that run to roll the baseline forward. The next run will compare against the new baseline. This is how you keep the suite current as the agent evolves.
Via the API:
# Run the suite
curl -X POST https://runtime-judgement-app.vercel.app/api/snapshot-suites/01HZSUITE.../run \
-H "Authorization: Bearer $RJ_API_KEY"
# Returns: { verdict, counts: { passed, changedIntentional, changedUnexpected, ... }, runIds: [...] }
# Accept a run (promote new run to baseline)
curl -X POST https://runtime-judgement-app.vercel.app/api/snapshot-runs/01HZRUN.../accept \
-H "Authorization: Bearer $RJ_API_KEY"
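When scripting around the run endpoint, it helps to turn the response into a one-line summary. The sketch below relies only on the fields named in the comment above (`verdict`, `counts`, `runIds`). Any count key that is absent is treated as zero, which is an assumption, and the function name is made up.

```python
import json


def summarise_run(response_body: str) -> str:
    """Format a suite-run response body as a one-line summary."""
    result = json.loads(response_body)
    counts = result.get("counts", {})
    return (
        f"{result['verdict']}: "
        f"{counts.get('passed', 0)} passed, "
        f"{counts.get('changedIntentional', 0)} intentional, "
        f"{counts.get('changedUnexpected', 0)} unexpected "
        f"({len(result.get('runIds', []))} runs)"
    )
```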
Step 6: Wire the suite into CI
Get the suite ULID from the URL: https://runtime-judgement-app.vercel.app/app/snapshot-suites/01HZSUITE...
Follow Wire RJ into GitHub Actions for the full workflow setup. In brief:
- name: ProveAI Origin suite run
  uses: runtime-judgement/rj-action@v1
  with:
    suite-id: 01HZSUITE_REPLACE_WITH_YOURS
    rj-api-token: ${{ secrets.RJ_API_TOKEN }}
This runs on every PR. When the suite returns regression, the step exits non-zero and blocks merge. When it returns pass or drift, it exits zero and the PR proceeds.
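If you call the API from your own CI script instead of the action, the gating rule could look like the sketch below. The exit code 1 is an assumption, since the text only says "non-zero". The `fail_on_unexpected` parameter is a hypothetical stand-in modelled on the action's `fail-on-unexpected` setting mentioned under Recommended suite architecture.

```python
def gate_exit_code(suite_verdict: str, fail_on_unexpected: bool = True) -> int:
    """Return the exit code a CI step would use for a suite-level verdict."""
    if suite_verdict not in ("pass", "drift", "regression"):
        raise ValueError(f"unexpected suite verdict: {suite_verdict!r}")
    if suite_verdict == "regression" and fail_on_unexpected:
        return 1  # non-zero exit blocks the merge
    return 0      # pass, drift, or advisory mode: the PR proceeds
```

A wrapper script would pass the result to `sys.exit` so that the CI runner sees the failure.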
Growing the suite over time
A healthy suite grows snapshot by snapshot as real failures happen in production. The workflow:
- A new failure surfaces in production (from Langfuse, LangSmith, or a manual paste)
- Attribute it at /app/new
- Save as snapshot and add to the relevant suite
- The next CI run automatically includes it
RJ's snapshot suggestion engine (visible after each successful attribution) surfaces a pre-filled "Save this as a snapshot" card with the cited-cause baseline already populated. One click and it's in your suite.
What makes a good snapshot set:
- Cover different failure categories: don't have 10 snapshots of the same Tool-related pattern
- Keep the trace small: the compressor handles large traces, but smaller traces run faster and are easier to debug when they regress
- Name snapshots precisely: they're regression documentation, not just test identifiers
- Accept baseline drift promptly: stale baselines create false regressions and erode trust in the gate
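One way to check the first point is to count snapshots per L4 category. This sketch is not an RJ feature: the function name is hypothetical and the 50% threshold is an arbitrary choice for illustration.

```python
from collections import Counter


def overrepresented(categories: list[str], max_share: float = 0.5) -> list[str]:
    """Return L4 categories that make up more than `max_share` of a suite."""
    counts = Counter(categories)
    total = len(categories)
    # An empty suite has no categories, so the comprehension yields nothing.
    return sorted(cat for cat, n in counts.items() if n / total > max_share)
```

Feeding it the category of each snapshot in a suite gives a quick signal that the suite leans too heavily on one failure pattern.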
Recommended suite architecture
For a system with multiple agent components:
prod-incidents/        ← failures that actually hit users
├── search-tool-empty-response
├── planner-goal-deviation
└── memory-context-overflow
component-smoke/       ← one representative trace per major component
├── search-tool-happy-path
├── planner-tool-selection
└── memory-retrieval
regression-gate/       ← everything that blocked a PR once
└── <all previous incidents that got fixed>
Run prod-incidents and regression-gate on every PR (fail-on-unexpected). Run component-smoke on every push to main (advisory mode, fail-on-unexpected: false) to catch early drift before it hits a PR.
What next
- Block PRs with the suite: Wire RJ into GitHub Actions
- Pull traces from LangSmith automatically: Bridge LangSmith traces into RJ — run the ingest cycle hourly so new prod failures become snapshots without manual paste
- Run the suite from inside your coding agent: Use RJ from Claude Code via MCP — verify your fix before you commit
- Understand the failure taxonomy your snapshots use: Core principles of failure attribution