Agentic Coding and Why It Can't Catch Its Own Failures

Agentic coding tools are the fastest-growing category in software development right now. Cursor's agent mode, Devin, Aider, Claude Code, Codex CLI, GitHub Copilot Workspace — these are not autocomplete tools. They're multi-step agents that read files, run tests, make changes, observe results, and iterate until a task is done.

They're also, structurally, incapable of catching their own failures.

This isn't a criticism of the individual tools. It's a description of an architectural gap that every one of them shares — and that no amount of making the individual tool smarter can fully close. This article explains what that gap is, why it exists, and what the missing layer looks like.

What Agentic Coding Actually Is

Let's be precise. "Agentic coding" means a system where an AI model:

1. Receives a task: "fix the bug where the summarizer hallucinates legal citations"
2. Plans a multi-step approach: read the codebase, identify the relevant prompt, understand the failure, propose a change
3. Executes the steps: opens files, reads test output, modifies prompts, runs tests
4. Observes the results: sees whether tests pass, whether the failure is still present
5. Iterates: repeats until the task is complete or the agent gives up

Every tool in this space follows this loop. The vocabulary differs (Devin uses "shell commands and web browsing"; Claude Code uses "tools: Read, Write, Bash, Edit"; Cursor agent mode uses "composer with codebase indexing") but the pattern is the same: plan, act, observe, iterate.
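The loop can be written down in a few lines. This is a minimal sketch, not any particular tool's implementation: `plan` and `act` are hypothetical callables standing in for the model call and the tool execution, and `Observation` is an invented record of what the agent sees after acting.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Observation:
    """What the agent sees after acting, e.g. a test run summary (hypothetical shape)."""
    tests_passed: bool
    detail: str = ""


@dataclass
class AgentRun:
    """Record of one task attempt; all names here are illustrative."""
    task: str
    steps: List[str] = field(default_factory=list)
    done: bool = False


def run_agent(
    task: str,
    plan: Callable[[str, List[str]], str],
    act: Callable[[str], Observation],
    max_iterations: int = 5,
) -> AgentRun:
    """Plan, act, observe, and iterate until tests pass or the budget runs out."""
    run = AgentRun(task=task)
    for _ in range(max_iterations):
        step = plan(task, run.steps)   # plan the next step from the history so far
        observation = act(step)        # execute it (edit a file, run the tests)
        run.steps.append(step)         # remember what was tried in this session
        if observation.tests_passed:   # observe: the only stop signal is "tests pass"
            run.done = True
            break
    return run
```

Note that the only exit condition in this sketch is the observation the agent itself collects, which is the limitation the rest of this article is about.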

The better tools in this category are impressively capable. They can navigate complex codebases, understand the interplay between components, and generate fixes that a junior engineer would not. On the GAIA benchmark, which evaluates general AI assistants on real-world tasks, the best agentic systems have been reported at roughly 50-60% task completion on hard tasks that require multi-step web research and reasoning. Leaderboard numbers change frequently, so treat that figure as indicative rather than current.

The problem is not capability. The problem is verification.

The Four Structural Reasons Coding Agents Can't Verify Themselves

1. Stateless Per-Session Reasoning

Every coding agent session starts fresh. The agent has no memory of previous sessions unless you explicitly provide it. This means: if the same bug recurs two sprints later, the agent approaches it without any knowledge of what the previous fix tried, what regressions it introduced, or what alternative approaches were evaluated and rejected.

This isn't a memory limitation that can be solved by adding a longer context window. It's a verification problem: the agent has no baseline to compare against. It can read the current codebase and run the current tests. It cannot compare the current behavior against the behavior of two weeks ago, because it has no representation of "behavior two weeks ago."

The agent's verification loop is: "did I introduce any new test failures?" That's a necessary check, but it's not sufficient. A fix can pass all tests and still:

Regress on production traces that the test suite doesn't cover
Introduce a new failure mode that the existing tests don't exercise
Fix the symptom (the test case) while leaving the root cause intact

2. No Replay Against Prior Baselines

The agent can run your test suite. It cannot replay your production failure traces.

This distinction matters enormously. Your test suite tests what you thought to test. Production traces capture what actually happened — the inputs your users actually provided, the edge cases that emerged in the wild, the failure modes you didn't anticipate.

If your summarizer agent hallucinated a legal citation in three production runs last week, those three traces are the ground truth for whether your fix works. The unit test for "does the summarizer return a string" is not.

Out of the box, coding agents have no native access to your production failure traces as a verification target. The fixes they propose are verified against your test suite (if one is configured), not against the failure that motivated the fix in the first place.
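To make the difference concrete, here is a small sketch of what replaying recorded traces could look like for the citation example. Everything in it is an assumption for illustration: the `ProductionTrace` shape, the toy citation pattern, and the `summarize` callable that stands in for the system under test.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class ProductionTrace:
    """A recorded failing run. The fields are assumptions for illustration."""
    trace_id: str
    source_text: str   # the document the user actually submitted


# Toy pattern for citations shaped like "410 U.S. 113"; real citation formats vary widely.
CITATION = re.compile(r"\b\d+ U\.S\. \d+\b")


def hallucinated_citations(source_text: str, summary: str) -> Set[str]:
    """Citations that appear in the summary but not in the source document."""
    return set(CITATION.findall(summary)) - set(CITATION.findall(source_text))


def replay(traces: List[ProductionTrace], summarize: Callable[[str], str]) -> List[str]:
    """Re-run each recorded input and return the ids of traces that still fail."""
    return [
        t.trace_id
        for t in traces
        if hallucinated_citations(t.source_text, summarize(t.source_text))
    ]
```

A summarizer that invents a citation still returns a string, so a "returns a string" unit test passes while `replay` reports every affected trace.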

3. No Span-Level Attribution of Which Step Went Wrong

When a multi-step agent task fails, the failure has a location: it happened in a specific span, at a specific step, in the agent's execution. A planning step reasoned incorrectly. A tool call returned a malformed result. A context update dropped critical information.

Coding agents observe the final outcome — the test passed or failed, the code compiled or didn't — but they don't have a structured attribution of which step in their own multi-step execution produced the error. They see the end-of-task result, not the causal chain that led to it.

This creates a systematic debugging blind spot. Consider a coding agent that:

1. Reads prompts/summarizer.md correctly
2. Identifies the relevant prompt section correctly
3. Generates an edit that looks correct in isolation
4. Misapplies the edit to the wrong line number (off-by-one in its file understanding)
5. Runs tests, which pass because the test doesn't exercise the affected code path
6. Reports: "fix applied successfully"

The agent saw "tests pass" and concluded "fix successful." But step 4 was wrong. Without span-level attribution of the agent's own execution, neither the agent nor the engineer can identify where in the 6-step process the error occurred.
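One way to picture span-level attribution is to record each step with a check on that step's own output, then look for the earliest step whose check fails. The sketch below is hypothetical throughout: the `Span` record, the per-step validators, and the step names are invented to mirror the six steps above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Span:
    """One step of the agent's execution plus a check on that step's own output."""
    name: str
    output: object
    check: Callable[[object], bool]   # hypothetical per-step validator


def first_failing_span(spans: List[Span]) -> Optional[str]:
    """Return the name of the earliest span whose output fails its own check."""
    for span in spans:
        if not span.check(span.output):
            return span.name
    return None


intended_line = 47
spans = [
    Span("read_prompt_file", "prompts/summarizer.md", lambda out: out.endswith(".md")),
    Span("locate_section", intended_line, lambda out: out == intended_line),
    Span("generate_edit", "new instruction text", lambda out: bool(out)),
    Span("apply_edit", 48, lambda out: out == intended_line),   # the off-by-one
    Span("run_tests", "passed", lambda out: out == "passed"),
    Span("report", "fix applied successfully", lambda out: bool(out)),
]

print(first_failing_span(spans))   # prints "apply_edit"
```

The final two spans look healthy on their own, which is why an outcome-only view reports success while the per-step view points at the edit application.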

4. No Regression-Test Feedback Loop

The core of software quality is the feedback loop: make a change, run the regression tests, see if anything broke. For traditional code, this loop runs in seconds. For agentic code — code that controls how an AI agent behaves — the regression tests are the production failure traces, and there's no standard mechanism for running them before merge.

When you change a prompt, a tool definition, or an orchestration pattern, the only way to verify the change is to run it against the failure cases that motivated the change. But those failure cases are in your production traces, not in your test suite. And "running a trace" against a code change is not something any CI system does out of the box.

The result: every change to an agentic system ships without verified regression coverage. The developer hopes the fix works. The product manager waits for the next production failure to find out.
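If trace replay did exist as a CI step, the merge gate on top of it would be small. The following is a sketch under that assumption: the result dictionary's keys and the `min_fixed` threshold are illustrative choices, not a description of any existing CI system.

```python
from typing import Dict


def gate(result: Dict[str, int], min_fixed: int = 1) -> int:
    """Map a replay result to a CI exit code: 0 allows the merge, 1 blocks it."""
    if result.get("regressed", 0) > 0:
        return 1   # any regression on a recorded failure case blocks the merge
    if result.get("fixed", 0) < min_fixed:
        return 1   # the change did not fix what it set out to fix
    return 0
```

A pipeline script could pass the return value to `sys.exit`, so that a regressing change fails the build the same way a failing unit test does.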

What the Missing Layer Looks Like

The gap is not inside the coding agent — it's between the coding agent and the production trace.

What's needed:

Production failure traces captured and attributed — every production failure turned into a named, cited, attributed record. This is what ProveAI Origin's ingestion and attribution pipeline does.
Attribution saved as a baseline — the attribution (cited span + L1 axis + L4 category + evidence) saved as a snapshot that can be re-run. This is the snapshot primitive.
Snapshot replay available as a verification target — when a coding agent proposes a change, the snapshot suite runs against the patched code and returns {fixed: N, regressed: M, unchanged: K} with cited evidence for each.
The coding agent receives the verification result before committing — and uses it to decide whether to proceed, pivot, or surface the conflict to the engineer.

This is the inner loop position. ProveAI Origin's MCP server (@runtime-judgement/mcp-server) implements exactly this: three tools (verify_change, attribute_trace, suggest_snapshot) that a coding agent can call via Claude Code's MCP integration before committing a change.
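The shape of the data involved can be sketched as follows. This is not ProveAI Origin's schema or the MCP server's API: the `Snapshot` fields are guesses based on the description above, and `tally_replay` is a hypothetical helper that only shows how a {fixed, regressed, unchanged} result could be derived from a baseline.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Snapshot:
    """A saved attribution. Field names are assumptions, not a real schema."""
    snapshot_id: str
    trace_input: str
    cited_span: str             # where in the trace the failure was attributed
    category: str               # e.g. "Context Handling Failure"
    failing_at_baseline: bool   # outcome recorded when the snapshot was last saved


def classify(snapshot: Snapshot, fails_now: bool) -> str:
    """Compare the baseline outcome with the outcome under the patched code."""
    if snapshot.failing_at_baseline and not fails_now:
        return "fixed"
    if not snapshot.failing_at_baseline and fails_now:
        return "regressed"
    return "unchanged"


def tally_replay(
    suite: List[Snapshot],
    fails_with_patch: Callable[[Snapshot], bool],
) -> Dict[str, int]:
    """Replay every snapshot against the patch and count the three outcomes."""
    tally = Counter(classify(s, fails_with_patch(s)) for s in suite)
    return {key: tally.get(key, 0) for key in ("fixed", "regressed", "unchanged")}
```

Storing the baseline outcome on the snapshot is the design choice that matters here: it is what lets a fresh session distinguish "still broken" from "newly broken".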

The concrete workflow:

Engineer → Claude Code: "fix the hallucination in the summarizer agent"

Claude Code:
1. Reads codebase, identifies prompts/summarizer.md
2. Proposes change to line 47
3. Calls rj.verify_change(patch, suite="summarizer-regressions")

ProveAI Origin:
1. Replays 12 snapshot traces with the patch applied
2. Returns: {fixed: 8, regressed: 2, unchanged: 2, evidence: [...]}

Claude Code:
"The patch fixes 8/12 cases but regresses on snapshot #7 (Context Handling Failure)
and snapshot #11 (Instruction Non-compliance). The regression in #7 is caused by
line 47's new instruction conflicting with the constraint on line 23. Trying a
different approach."

→ Revised patch proposed
→ rj.verify_change called again
→ {fixed: 10, regressed: 0, unchanged: 2}
→ Commit created

The engineer never saw a bad commit. The coding agent self-corrected using the snapshot suite as the verification oracle.
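The self-correction step in that workflow reduces to a loop over candidate patches with the verification result as the acceptance test. In this sketch, `verify_change` is a plain callable supplied by the caller; it is a hypothetical stand-in for the tool call shown above, and its return shape is assumed from the example output.

```python
from typing import Callable, Dict, Iterable, Optional

VerifyResult = Dict[str, int]


def first_safe_patch(
    candidates: Iterable[str],
    verify_change: Callable[[str], VerifyResult],
) -> Optional[str]:
    """Return the first candidate patch that fixes something and regresses nothing."""
    for patch in candidates:
        result = verify_change(patch)   # hypothetical stand-in for the verification call
        if result["regressed"] == 0 and result["fixed"] > 0:
            return patch                # verified: safe to commit
    return None                         # nothing verified: surface the conflict to the engineer
```

Returning `None` instead of the least-bad patch keeps the decision with the engineer when no candidate is clean.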

Why This Matters Now

Agentic coding is accelerating. The tools are getting better. Tasks that took a senior engineer two hours are taking a coding agent ten minutes. But the failure rate on subtle, non-obvious bugs — exactly the kind of bugs that agentic changes introduce — is not going down at the same rate as the task-completion rate is going up.

The risk profile is changing. Instead of one engineer making one careful change with known regression coverage, you have a coding agent making dozens of changes rapidly, each change without verified regression coverage. The velocity gain is real. The verification gap is also real.

Every team running agentic coding at scale will hit this gap eventually. The symptoms:

"We fixed a bug and introduced two more"
"The agent's change passed CI but broke something in production we didn't have a test for"
"We can't tell if this sprint's changes improved or degraded the agent's behavior on the cases that matter"

These are not signs of a bad coding agent. They're signs of a missing verification layer.

The Position of ProveAI Origin

ProveAI Origin does not replace the coding agent. It sits between the coding agent and the production runtime — as the verification oracle the coding agent calls before committing, and as the failure attribution layer that captures and labels production failures into a snapshot suite the agent can work against.

Three surfaces that specifically address this gap:

/app/new — paste a failing production trace, get span-level attribution in seconds. The attribution becomes the named test case.
/app/repairs — the repair + verification dashboard. Every proposed fix is replayed against the snapshot suite; you see {fixed, regressed, unchanged} before the fix ships.
@runtime-judgement/mcp-server — the integration point for coding agents. verify_change maps directly onto the repair verification pipeline.

The coding agent writes the fix. ProveAI Origin verifies it. The engineer reviews the verified result.

Try It

If you have a failing production trace from a multi-agent system, upload it at /app/new. The attribution becomes the first snapshot in your regression suite. Every subsequent change to that system can be verified against it — by you, by your CI pipeline, or by the coding agent that made the change.

Continue reading: "Where the Frontier Lies" covers what comes after the basic verification loop: counterfactual replay, automated repair generation, and self-improving pipelines. "Common Failure Modes" catalogs the specific failure types that agentic coding changes most commonly introduce.
