Counterfactual replay and pipeline comparison
Predict the verdict diff for a different pipeline version, prompt SHA, or commit — without actually re-executing the suite. Power-user time-travel for runs.
You just got a regression on PR #243. Before you bisect, you want to ask: "would this have regressed if I'd been on the previous pipeline version? Or with the previous prompt? Or against the base commit?" The Counterfactual replay surface answers in under a second — heuristically today, with real re-execution coming alongside the orchestrator stitch.
This applies P6 of the AI-native principles ADR at the run level: every saved run becomes a reproducible artifact on which you can swap any single dimension.
When to use it
- Triaging a regression. Did my new pipeline cause the flip, or was the underlying behaviour already broken?
- Cost-quality frontier exploration. Would the cheaper pipeline have passed all the same tests? Run a counterfactual against the candidate before adopting it.
- Audit trail. Before accepting a changed baseline as known-good drift, check that the change correlates with a swap dimension rather than random noise.
When NOT to use it: anything that needs the real model output. The Sprint 14 service is dryRun-only — it predicts the verdict diff from a small set of well-characterised heuristics (pipeline replay history, cited-span overlap, commit line keys). Real re-execution lands when the run orchestrator stitches in (search [REAL-REPLAY-SWAP-IN] in lib/runs/counterfactual.ts).
Path 1: From the Runs tab
- Open /app/agents/[id]/runs for the affected agent.
- Click any run to expand it. The action row at the bottom carries a Counterfactual replay ▾ button.
- Pick a dimension: pipeline, prompt, or commit. A submenu shows the available values (pipeline versions you've used, recent prompt SHAs, recent commit SHAs).
- Select a value. A toast appears with the templated summary, e.g. "Swapping pipeline to q72-k1 would recover 2 tests."
The picker UI lives at components/counterfactual-menu.tsx; it's the same component the LiveRunCard and Failure-cluster drilldown will use in Sprint 13+.
Path 2: Programmatic via POST /api/runs/counterfactual
Same primitive, exposed as an HTTP/JSON surface for scripting, CI dashboards, or your own UI:
curl -s -X POST https://runtime-judgement-app.vercel.app/api/runs/counterfactual \
  -H "Authorization: Bearer $RJ_AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "baseRun": {
      "id": "01HZRUN...",
      "agentId": "01HZAGENT...",
      "triggerKind": "pr",
      "pipelineVersion": "q72-k0.9",
      "promptSha": "abc123",
      "commitSha": "def456",
      "startedAt": "2026-05-26T17:30:00Z",
      "verdict": "CHANGED-UNEXPECTED",
      "perTest": [
        {"testName": "search-tool returns empty", "verdict": "CHANGED-UNEXPECTED"},
        {"testName": "planner-goal-deviation", "verdict": "PASSED"}
      ]
    },
    "swap": {"kind": "pipeline", "value": "q72-k1"},
    "dryRun": true
  }' | jq
The response carries the full UI-friendly envelope:
{
  "syntheticRun": { ... },
  "diff": {
    "summary": "Swapping pipeline to q72-k1 would recover 1 test.",
    "flippedTests": [
      {"testName": "search-tool returns empty", "before": "FAILED", "after": "PASSED"}
    ],
    "unchangedCount": 1,
    "pipelineVersionChange": {"from": "q72-k0.9", "to": "q72-k1"},
    "promptShaChange": null,
    "commitShaChange": null
  },
  "isDryRun": true,
  "costDelta": 0.0023,
  "latencyDelta": 410,
  "confidence": 0.78
}
confidence reflects how well the predictor matches measured replay behaviour: 0.78 for pipeline swaps (best — we have replay history for named pipeline versions), 0.62 for prompt swaps, 0.55 for commit swaps. isDryRun: true means you got a heuristic prediction, not a real re-execution.
dryRun: false is accepted in the body for forward compatibility, but the underlying service still ignores it today — the route surfaces this via isDryRun so a client always knows what it got back.
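One way a client could act on this is to branch on the isDryRun value it receives rather than on the flag it sent. The sketch below assumes only the response fields shown in the example envelope; `classifyResult` and `wasDowngraded` are hypothetical helper names, not part of the RJ codebase.

```typescript
// Minimal response shape, inferred from the example envelope above.
// The real route may return more fields; only these are read here.
interface CounterfactualResponse {
  isDryRun: boolean;
  confidence: number;
  diff: { summary: string };
}

type ResultKind = "heuristic-prediction" | "real-execution";

// Trust the flag the server returned, not the one the client sent.
function classifyResult(response: CounterfactualResponse): ResultKind {
  return response.isDryRun ? "heuristic-prediction" : "real-execution";
}

// True when the caller asked for real re-execution but received a prediction.
function wasDowngraded(requestedDryRun: boolean, response: CounterfactualResponse): boolean {
  return !requestedDryRun && response.isDryRun;
}

// Example: the client sent dryRun: false and the service answered heuristically.
const example: CounterfactualResponse = {
  isDryRun: true,
  confidence: 0.78,
  diff: { summary: "Swapping pipeline to q72-k1 would recover 1 test." },
};
const kind = classifyResult(example);            // "heuristic-prediction"
const downgraded = wasDowngraded(false, example); // true
```

A dashboard could use `wasDowngraded` to label a result as "predicted" even when the user asked for a real run.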
Path 3: Pure programmatic via lib/runs/counterfactual.ts
If you're working inside the RJ codebase or a Node script with direct access to the service:
import {
  counterfactualReplay,
  type RunSummary,
} from "@/lib/runs/counterfactual"

const baseRun: RunSummary = {
  runId: "01HZRUN...",
  pipelineVersion: "q72-k0.9",
  promptSha: "abc123",
  commitSha: "def456",
  perTest: [
    { testId: "search-tool returns empty", verdict: "FAILED" },
    { testId: "planner-goal-deviation", verdict: "PASSED" },
  ],
  costUsd: 0,
  latencyMs: 0,
}

const result = counterfactualReplay({
  baseRun,
  swap: { kind: "pipeline", value: "q72-k1" },
  dryRun: true,
})
// result.diff.flippedTests, result.confidence, result.costDelta, ...
The pure service has no I/O — it's fully synchronous, runs in microseconds, and is safe to call from a hot path. The route (app/api/runs/counterfactual/route.ts) is the only file that adds auth + the UI ↔ service shape adapter.
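Because the service is synchronous and free of I/O, a script could sweep several candidate pipelines in a plain loop. The sketch below takes the replay function as a parameter so it stays self-contained; inside the RJ codebase you would presumably pass `counterfactualReplay`. The `flippedTests` entry shape is an assumption borrowed from the HTTP response example, and `sweepPipelines`, `BaseRun`, and `ReplayResult` are hypothetical names.

```typescript
// Service-side shapes, trimmed to what the sweep reads. Field names follow the
// examples in this guide; the real types may differ.
interface BaseRun {
  runId: string;
  pipelineVersion: string;
  promptSha: string;
  commitSha: string;
  perTest: { testId: string; verdict: string }[];
  costUsd: number;
  latencyMs: number;
}
interface ReplayResult {
  diff: { flippedTests: { before: string; after: string }[] };
  costDelta: number;
  confidence: number;
}
type ReplayFn = (input: {
  baseRun: BaseRun;
  swap: { kind: "pipeline"; value: string };
  dryRun: boolean;
}) => ReplayResult;

interface SweepRow {
  pipeline: string;
  recovered: number;
  regressed: number;
  costDelta: number;
  confidence: number;
}

// Run one counterfactual per candidate pipeline and rank by net recovered tests.
function sweepPipelines(baseRun: BaseRun, candidates: string[], replay: ReplayFn): SweepRow[] {
  const rows = candidates.map((pipeline) => {
    const result = replay({ baseRun, swap: { kind: "pipeline", value: pipeline }, dryRun: true });
    const flips = result.diff.flippedTests;
    return {
      pipeline,
      recovered: flips.filter((f) => f.before !== "PASSED" && f.after === "PASSED").length,
      regressed: flips.filter((f) => f.before === "PASSED" && f.after !== "PASSED").length,
      costDelta: result.costDelta,
      confidence: result.confidence,
    };
  });
  // Highest net gain first; ties go to the cheaper candidate.
  return rows.sort(
    (a, b) => (b.recovered - b.regressed) - (a.recovered - a.regressed) || a.costDelta - b.costDelta,
  );
}
```

Rows with `regressed > 0` are the ones to inspect before adopting a cheaper pipeline; the ranking is still a heuristic prediction until real re-execution is available.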
Pipeline comparison vs counterfactual replay
Two related-but-distinct surfaces:
| Surface | What it answers | Where |
|---|---|---|
| Counterfactual replay (this guide) | "What would this one run look like with a different pipeline/prompt/commit?" | Per-run, on demand |
| Two-pipeline replay on history | "If we'd been on pipeline B for the last 30 days, how would the suite have scored?" | Pipeline tab, time-travel slider (P6) |
The two-pipeline view replays a whole window of runs against a candidate pipeline; counterfactual replay is the inner loop for one specific case. Use the per-run view when triaging; use the pipeline view when evaluating a candidate before adopting it.
Pitfalls
The confidence isn't 1.0
It can't be — until real re-execution lands, the predictor is a heuristic. Treat confidence < 0.65 as a hint, not a verdict. The orchestrator stitch (Sprint 15+) will swap the heuristic for real model calls and bump confidence to 1.0 for executed runs.
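The threshold above can be encoded as a small triage helper. This is a minimal sketch: `trustLevel` and its labels are hypothetical, and the 0.65 cut-off is simply the rule of thumb from this section.

```typescript
type Trust = "executed" | "prediction" | "hint";

// Cut-off taken from the guidance above; adjust to your own risk tolerance.
const HINT_THRESHOLD = 0.65;

function trustLevel(confidence: number): Trust {
  if (confidence >= 1.0) return "executed"; // expected only once real re-execution lands
  if (confidence < HINT_THRESHOLD) return "hint";
  return "prediction";
}

// Per-dimension confidences quoted earlier in this guide.
const byDimension = {
  pipeline: trustLevel(0.78), // "prediction"
  prompt: trustLevel(0.62),   // "hint"
  commit: trustLevel(0.55),   // "hint"
};
```

Under this rule, only pipeline swaps currently clear the bar; prompt and commit swaps are leads to follow up, not verdicts.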
Counterfactual against a deleted pipeline version
The cost/latency heuristic falls back to PIPELINE_DEFAULT when the swap value isn't a known pipeline version. The verdict prediction still works (it doesn't depend on the cost lookup), but the costDelta reads as zero.
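A caller that displays costDelta may want to distinguish "no cost change" from "cost unknown". The sketch below assumes the caller already holds a list of known pipeline versions (for example, the values offered by the picker); `costDeltaIsMeaningful` and `formatCostDelta` are hypothetical helpers, not RJ APIs.

```typescript
// Hypothetical guard: the caller supplies the pipeline versions it knows about.
function costDeltaIsMeaningful(swapValue: string, knownVersions: readonly string[]): boolean {
  return knownVersions.includes(swapValue);
}

// Render costDelta with an explicit sign, or flag it when the fallback applied.
function formatCostDelta(costDelta: number, swapValue: string, knownVersions: readonly string[]): string {
  if (!costDeltaIsMeaningful(swapValue, knownVersions)) {
    return "n/a (unknown pipeline version)";
  }
  const sign = costDelta >= 0 ? "+" : "-";
  return `${sign}${Math.abs(costDelta).toFixed(4)}`;
}

const known = ["q72-k0.9", "q72-k1"];
const shown = formatCostDelta(0.0023, "q72-k1", known);       // "+0.0023"
const hidden = formatCostDelta(0, "some-deleted-version", known); // "n/a (unknown pipeline version)"
```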
Ownership errors
The route's Zod schema accepts the UI RunSummary shape, but baseRun.agentId is cross-checked against getAgentForUser(). A mismatch returns HTTP 403 even if the body is otherwise well-formed.
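A client can turn that 403 into an actionable message instead of a generic failure. This is a rough sketch: `interpretStatus` and `CallOutcome` are hypothetical, and only the 403 behaviour comes from this guide; every other non-2xx status is passed through uninterpreted.

```typescript
type CallOutcome =
  | { kind: "ok" }
  | { kind: "not-owner"; hint: string }
  | { kind: "other-error"; status: number };

// Map the HTTP status of POST /api/runs/counterfactual to a client-side outcome.
function interpretStatus(status: number, agentId: string): CallOutcome {
  if (status >= 200 && status < 300) return { kind: "ok" };
  if (status === 403) {
    // Per the pitfall above: the body can be well-formed and still be rejected.
    return {
      kind: "not-owner",
      hint: `baseRun.agentId ${agentId} is probably not owned by the authenticated user`,
    };
  }
  return { kind: "other-error", status };
}

const outcome = interpretStatus(403, "01HZAGENT...");
```

Checking `outcome.kind === "not-owner"` before retrying avoids resubmitting a body that will keep failing the ownership check.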
Next steps
- Lock in the verdict you just predicted: Build a snapshot suite for regression gating
- Run the same counterfactual from the terminal: Use the rj CLI (write tools land in Sprint 15+ alongside the audit-log tier)
- Why time-travel is a first-class concept: What's an AI-native agent page? — P6 of the principles
- The ADR that motivated time-travel as a default: dev-docs/strategy/2026-05-26-adr-ai-native-principles.md