foundations · 6 min read · Updated 2026-05-20

The Core Principles of ProveAI Origin

Seven design principles — not features, not capabilities, but the beliefs that determine how ProveAI Origin makes hard product decisions.

Products accumulate features. Principles accumulate conviction.

This article is not a feature list. It's an account of the beliefs that drive our architectural decisions — the claims that determine what we build, what we refuse to build, and where we draw the line when two reasonable approaches conflict.

There are seven of them.


1. Cited Cause IS the Label

In traditional testing, you write a test case and a pass/fail criterion. Someone decides what counts as correct. That decision is manual, it's expensive, and it drifts as the system changes.

Our claim: when an AI judge identifies the span that caused a multi-agent failure — and cites the specific evidence from that span — the attribution itself is the test label. No human needs to label it. No curator needs to decide whether the attribution is "good enough." The judge's cited cause is the ground truth for that failure case.

This is not a convenience argument. It's a structural one. Attribution-as-label is what makes it possible to turn every production failure into a regression test automatically. If you had to curate each attribution manually, the economics would collapse — you'd curate the obvious failures and skip the subtle ones, which are exactly the failures that matter.

The corollary: our judge must be precise enough that its cited causes are worth trusting. A 95% step@1 accuracy on the TRAIL benchmark is not a vanity metric — it's the threshold below which "cited cause = label" breaks down. If the judge is wrong 20% of the time, the snapshot suite is corrupted with false attributions, and the regression test value disappears. Accuracy is not a quality-of-life feature. It's a prerequisite for the whole system working.
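A minimal sketch of attribution-as-label, and of why judge accuracy bounds the value of the suite. The `Attribution` and `Snapshot` types and their field names are hypothetical illustrations, not ProveAI Origin's actual schema, and the arithmetic assumes step@1 accuracy can be read as the fraction of cited causes that are correct.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    """What the judge returns for one failing trace (hypothetical shape)."""
    trace_id: str
    span_id: str    # the span cited as the cause
    category: str   # failure category, e.g. a TRAIL taxonomy label
    evidence: str   # text quoted from the cited span


@dataclass(frozen=True)
class Snapshot:
    """A regression test case whose label is the cited cause (hypothetical shape)."""
    trace_id: str
    expected_span_id: str
    expected_category: str


def to_snapshot(attr: Attribution) -> Snapshot:
    # The cited cause is copied straight into the label: no curation step.
    return Snapshot(attr.trace_id, attr.span_id, attr.category)


def expected_false_labels(n_snapshots: int, step_at_1: float) -> float:
    # Expected number of snapshots whose label is a wrong attribution,
    # assuming each attribution is correct with probability step_at_1.
    return n_snapshots * (1.0 - step_at_1)
```

Under these assumptions, a suite of 200 snapshots carries about 10 mislabeled cases in expectation at 95% accuracy and about 40 at 80%.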


2. Span-Level Granularity Is Not Negotiable

Trace-level verdicts ("this trace failed") are not actionable. A failing trace without a cited span is a bug report without a stack trace — you know something is wrong, but you can't locate the problem.

The entire observability industry knows this for HTTP systems. A slow API call is useless information without knowing which service call was slow and why. Distributed tracing solved this by making the service call (the span) the unit of analysis. Multi-agent debugging requires the same move: the failing span is the unit, not the failing trace.

We didn't design around this principle — we discovered it empirically. Early versions of ProveAI Origin returned trace-level verdicts. Engineers couldn't act on them. The first time we returned a cited span with quoted evidence, the response was qualitatively different: "now I know what to fix." The granularity wasn't a nice-to-have. It was the prerequisite for any action at all.
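The difference between the two kinds of verdict can be made concrete with a small check. This is an illustrative sketch only: `Span`, `Verdict`, and `locates_failure` are hypothetical names, and "quoted evidence" is modeled simply as a substring of the cited span's output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    name: str
    output: str


@dataclass
class Verdict:
    trace_id: str
    failed: bool
    cited_span_id: Optional[str] = None  # absent in a trace-level verdict
    evidence: Optional[str] = None       # text quoted from the cited span


def locates_failure(verdict: Verdict, spans: list[Span]) -> bool:
    """True only if a failing verdict cites a real span and quotes text found in it."""
    if not verdict.failed or verdict.cited_span_id is None or not verdict.evidence:
        return False
    by_id = {s.span_id: s for s in spans}
    span = by_id.get(verdict.cited_span_id)
    return span is not None and verdict.evidence in span.output
```

A verdict of the form `Verdict("t1", True)` says only that the trace failed and returns `False` here; one that also names a span and quotes its output returns `True`.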

This principle has a concrete implication for what we don't build: trace-level pattern clustering that doesn't attribute to spans. Engine (LangSmith's failure analysis product) clusters failures by topic — "you have a lot of failures about topic X." That's useful for pattern detection. It's not useful for repair. We don't compete there because we're optimizing for a different outcome: not "understand the distribution" but "locate the failure and fix it."


3. Framework-Agnostic Ingest Is the Only Defensible Position

We sit above the trace. We don't own the tracer.

This is a deliberate architectural choice, not a limitation. The LLM observability ecosystem has at least six major tracing backends (LangSmith, Langfuse, Phoenix, Helicone, Honeycomb, Grafana Tempo with OTEL) and is adding more. Every team has already chosen one. If we required them to swap their tracer for ours, we'd lose every design partner who has existing tooling — which is every design partner.

More importantly: the failure shape doesn't change with the framework. A Context Handling Failure in a LangGraph trace looks like a Context Handling Failure in a Langfuse trace. The TRAIL taxonomy is framework-agnostic because the failure mechanisms are framework-agnostic. Our attribution quality should be identical regardless of how the trace was generated.

The practical constraint: we maintain parsers for OTEL gen-ai semconv, LangSmith's export format, Langfuse's trace format, and a flexible custom JSON format. Every new tracer requires a new parser. This is ongoing work. We do it because the alternative — pick one tracer and lock in — would contradict the framework-agnostic position that makes us useful to teams who aren't using that tracer.
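One common way to structure this kind of multi-format ingest is a parser registry that maps every input onto one normalized span type. The sketch below assumes that design; the two input shapes (`format-a`, `format-b`) and all field names are invented for illustration and do not reproduce the OTEL, LangSmith, or Langfuse formats.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class NormalizedSpan:
    span_id: str
    parent_id: Optional[str]
    name: str


Parser = Callable[[dict], list[NormalizedSpan]]
PARSERS: dict[str, Parser] = {}


def register(fmt: str) -> Callable[[Parser], Parser]:
    """Decorator that adds a parser to the registry under a format name."""
    def wrap(fn: Parser) -> Parser:
        PARSERS[fmt] = fn
        return fn
    return wrap


@register("format-a")
def parse_format_a(payload: dict) -> list[NormalizedSpan]:
    # Hypothetical shape: {"spans": [{"id": ..., "parent": ..., "name": ...}]}
    return [NormalizedSpan(s["id"], s.get("parent"), s["name"]) for s in payload["spans"]]


@register("format-b")
def parse_format_b(payload: dict) -> list[NormalizedSpan]:
    # Hypothetical shape: {"events": [{"span_id": ..., "parent_span_id": ..., "operation": ...}]}
    return [
        NormalizedSpan(e["span_id"], e.get("parent_span_id"), e["operation"])
        for e in payload["events"]
    ]


def ingest(fmt: str, payload: dict) -> list[NormalizedSpan]:
    """Dispatch to the parser for `fmt`; downstream code sees only NormalizedSpan."""
    if fmt not in PARSERS:
        raise ValueError(f"no parser registered for format {fmt!r}")
    return PARSERS[fmt](payload)
```

With this shape, supporting a new tracer means adding one parser function, and the attribution step never needs to know which backend produced the trace.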


4. Replay Is Verification

Saying a fix "should work" is speculation. Replaying the failing trace with the fix applied is evidence.

This is the principle that separates ProveAI Origin from a sophisticated debugging tool and makes it a verification system. Debugging finds the problem. Verification proves the fix. These are different activities that require different primitives.

The replay primitive has four steps: take a snapshot (the original attributed failure), apply a patch (the proposed fix), re-run the attribution pipeline against the patched system, and compare the result against the snapshot baseline. If the cited cause changes, meaning the repair shifts the attribution away from the original failing span, the fix had causal impact. If the cited cause stays the same, the fix was cosmetic.
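The comparison step can be sketched in a few lines. Everything here is a hypothetical illustration: `Snapshot` and `had_causal_impact` are invented names, and `attribute_patched` stands in for "re-run the attribution pipeline against the patched system". Treating "no cause cited after the patch" as a change is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Snapshot:
    trace_id: str
    cited_span_id: str  # the originally attributed failing span


def had_causal_impact(
    snapshot: Snapshot,
    attribute_patched: Callable[[str], Optional[str]],
) -> bool:
    """Replay one snapshot against the patched system and compare cited causes.

    `attribute_patched` takes a trace id and returns the span id cited after
    the patch, or None if no cause is cited any more.
    """
    new_cited = attribute_patched(snapshot.trace_id)
    # Same cited span: the fix was cosmetic. Any other result: the attribution moved.
    return new_cited != snapshot.cited_span_id
```

A caller would pass a function that applies the patch and re-runs the judge; the result is `False` when the same span is still cited and `True` otherwise.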

This principle has a hard dependency: you can only replay a trace if the trace is deterministic enough to produce consistent attributions. Non-deterministic agents (temperature > 0, stochastic tool responses) produce attribution noise that makes replay results unreliable. We're explicit about this: if your agent is non-deterministic, your replay results will have variance, and you should interpret them with appropriate caution. We don't pretend the variance away.
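One way to make that variance visible, offered as a sketch rather than a description of the product, is to repeat the replay several times and report how often the cited cause moves. `replay_stability` and its return shape are hypothetical; `replay_fn` stands in for a single replay that returns the cited span id.

```python
from collections import Counter
from typing import Callable


def replay_stability(
    baseline_span_id: str,
    replay_fn: Callable[[], str],
    runs: int = 10,
) -> dict:
    """Run the replay `runs` times and summarize how stable the attribution is."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    cited = [replay_fn() for _ in range(runs)]
    counts = Counter(cited)
    shifted = sum(1 for span_id in cited if span_id != baseline_span_id)
    return {
        "runs": runs,
        "shift_rate": shifted / runs,     # fraction of runs citing a different span
        "distinct_causes": len(counts),   # 1 means the attribution was stable
        "counts": dict(counts),
    }
```

A deterministic agent should give a shift rate of exactly 0.0 or 1.0 with one distinct cause; values in between are the attribution noise described above and call for a cautious reading of any single replay.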


5. CI as the Merge Gate

The snapshot suite is only valuable if it runs before code ships.

A snapshot that you check manually, once a week, catches the failures you happened to remember to check. A snapshot suite that runs on every PR checks every change against every recorded failure, every time. These are not equivalent. The CI position is the only position where the snapshot suite has full value.

This means: our GitHub Action is a first-class product surface, not a developer experience add-on. The experience of setting up the Action, configuring the suite, and interpreting the check result needs to be smooth enough that engineers actually do it — not just know they could.

The principle also defines a clear success metric for the CI integration: are engineers blocking PRs on snapshot failures? If yes, the gate is working. If engineers are merging with failing snapshots (overriding the check), the gate is producing noise — false positives or low-confidence attributions that erode trust in the system. We track this as a product health signal.
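The health signal described here reduces to a simple ratio. The sketch below assumes a hypothetical `PullRequest` record with two booleans; how those fields would be populated from a CI provider is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PullRequest:
    number: int
    snapshot_check_failed: bool
    merged: bool


def override_rate(prs: list[PullRequest]) -> Optional[float]:
    """Fraction of PRs with a failing snapshot check that were merged anyway.

    Returns None when no PR had a failing check, since the rate is undefined.
    """
    failing = [pr for pr in prs if pr.snapshot_check_failed]
    if not failing:
        return None
    overridden = sum(1 for pr in failing if pr.merged)
    return overridden / len(failing)
```

A rate near zero means failing snapshots are blocking merges; a rising rate suggests engineers have stopped trusting the check.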


6. Above the Tracer, Inside the Coding Agent

ProveAI Origin does not replace your observability stack. It sits above it.

Your tracer (Langfuse, LangSmith, Honeycomb) collects and stores your traces. ProveAI Origin reads those traces and adds a layer of attributed understanding. These are different jobs. We don't want to own the trace collection path: that would put us in competition with every backend that teams are already invested in, and we'd lose. We do want to own the attribution layer: what the traces mean, which span caused which failure, and whether a proposed fix changes the causal picture.

"Inside the coding agent" is the other side of this principle. The MCP server (@runtime-judgement/mcp-server) puts ProveAI Origin's verification API inside the coding agent's tool loop. The agent calls verify_change before committing. This is not a UI that the engineer clicks — it's a tool that the coding agent invokes. The integration point is the agent's toolset, not the engineer's browser.

Together, these two positions define the product surface: we sit above the tracer (reading and attributing traces from whatever backend you use) and inside the coding agent (providing verification APIs that the agent can call autonomously). The engineer is the beneficiary, not the primary caller.
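For a sense of what "a tool that the coding agent invokes" looks like on the wire, MCP tool invocations are JSON-RPC 2.0 requests with the method `tools/call`. The sketch below builds such a request for `verify_change`. The argument names (`snapshot_id`, `diff`) are assumptions made for illustration; the tool's real input schema is not given in this article.

```python
import json


def build_verify_change_request(request_id: int, arguments: dict) -> dict:
    """Build an MCP `tools/call` request for the verify_change tool.

    The envelope follows the MCP tool-call convention; the contents of
    `arguments` are hypothetical.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "verify_change", "arguments": arguments},
    }


if __name__ == "__main__":
    request = build_verify_change_request(
        1,
        {"snapshot_id": "snap-123", "diff": "--- a/agent.py\n+++ b/agent.py\n"},
    )
    print(json.dumps(request, indent=2))
```

In practice the coding agent's MCP client constructs this message itself; the point of the sketch is that nothing in the exchange requires a human in a browser.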


7. Honesty Over Completeness

If a capability isn't shipped, we say so.

This sounds obvious. It's harder than it sounds. The pressure to describe a roadmap item as a shipped feature, to round up an 80% solution to 100%, to downplay a known limitation — these are constant pressures in product development, and the first time you give in to them, your documentation becomes unreliable and engineers stop trusting it.

Our rule: if something is in progress, it has a "Sprint 12" badge or a "TODO: verify" comment. If something is speculative, we describe it as speculative. If a capability has known limitations (non-deterministic agents, attribution noise at high temperatures), we document the limitations before we document the capability.

The TRAIL paper citation isn't a decoration — it's a commitment that the taxonomy we use is externally defined and independently verifiable, not a proprietary label set we invented and own. Our accuracy claims (64–93% step@1 by preset, measured on /bench with confidence intervals and n counts) are tied to in-repo measurements against real traces. You can verify them. We expect you to.
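To check an accuracy figure that is reported with an n count, a reader can compute a confidence interval for the proportion. The sketch below uses the Wilson score interval, a standard choice for binomial proportions; whether /bench uses this particular interval is not stated here, and the counts in the usage note are illustrative, not measured results.

```python
import math


def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= correct <= n:
        raise ValueError("correct must be between 0 and n")
    p = correct / n
    z2 = z * z
    denom = 1.0 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return (center - half_width, center + half_width)
```

For example, 45 correct out of 50 gives an interval of roughly 0.79 to 0.96, while 450 out of 500 gives a much narrower one around the same 90% point estimate, which is why the n count matters as much as the headline number.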

This principle shapes how we write documentation, how we write product copy, and how we communicate what ProveAI Origin is versus what we're building toward. The things we're building toward are in Where the Frontier Lies. What's available today is at /app/new.


The Throughline

Read across these seven principles and one idea connects them all: reliability through specificity.

Cited spans, not trace-level verdicts. Framework-agnostic parsers, not tracer lock-in. Replay evidence, not speculation about fixes. CI gates, not manual snapshot checks. Honest capability claims, not marketing completeness.

The specificity is not a stylistic preference. It's the architecture of trust. Every engineering team that adopts ProveAI Origin is betting that the attribution is accurate enough to trust as a regression test label, that the replay is faithful enough to trust as verification, and that the CI check is reliable enough to trust as a merge gate. That trust is built or destroyed one specific, honest, verifiable claim at a time.

Try It

The fastest way to test the principles is to use the product. Paste a failing trace at /app/new — you'll see span-level attribution with quoted evidence within seconds. If the attribution is wrong, tell us. That feedback directly improves the judge. See the verification loop in action at /app/repairs.

Continue reading: What Are Agent Traces? provides the technical foundation for why span-level granularity matters — the history and anatomy of distributed tracing. Where the Frontier Lies describes the research directions these principles point toward.
