Bench methodology

How the model bench is run and scored

The model bench asks one question: handed an identical agent-failure trace and an identical prompt, which LLM points at the true root-cause span — at what cost and how fast? These are preliminary results on a small labeled corpus. Everything below is what we did to keep them honest.

The fairness invariant

Every model receives the same compressed trace and the same canonical Q72 judge prompt, at temperature 0, with no per-model tuning. The prompt is the one ProveAI Origin ships in production — loaded from a single source, never hand-edited per family. A model that does badly does badly on the prompt we actually use; we do not coax better answers out of one vendor than another. This is the single most important property of the bench: it is a comparison of models on a fixed task, not a prompt-engineering contest.

The corpus

Twenty-two labeled cases, in two groups. Seventeen are hand-authored sample-suite snapshots where the true root-cause span is known and is deliberately distinct from the obvious error symptom — these are the cases that can discriminate a model that reasons about cause from one that just echoes the symptom. Five are real GAIA agent traces (11–60 spans) used to exercise the long-trace cost, latency and format axes; they carry no root-cause gold distinct from the anchor, so they do not contribute to the accuracy number. Traces span 3 to 60 spans. Of the seventeen sample-suite cases, fourteen survive the step@1 exclusion rules below with a checkable gold; the other three collapse to a degenerate anchor or a name collision.

The deterministic anchor

Each case hands the judge one “anchor” span — the symptom, the place the failure surfaced. The anchor is resolved deterministically and identically for every model: no paid or non-deterministic model pass runs during anchor selection (the detector is forced down its structural path), so the starting point is the same byte-for-byte regardless of which model is being scored. The judge's job is to get from the symptom to the cause.

step@1, and why the discriminating corpus is small

step@1 is the fraction of cases where the model's single top citation is the true root-cause span. We exclude a case from step@1 (while keeping it for the cost/latency/length axes) in three situations, because in each the metric cannot discriminate:

Degenerate anchor. The symptom span already is the root cause. The judge is handed the anchor, so it could “score” just by repeating it: no signal.
Name collision. The gold span's name is shared by another kept span (see the next section on why names matter). A name-shaped citation then cannot prove the model meant the gold rather than its twin, so the case is uncheckable.
No gold. The GAIA traces have no root-cause label distinct from the anchor.

What remains is a small discriminating set — fourteen cases. That is not many, and we do not pretend otherwise: every step@1 figure is shown with its Wilson 95% interval and its n, and at this sample size those intervals are wide. A one- or two-point gap between models is almost always a statistical tie. The cost, latency and trace-length axes, by contrast, span all 22 cases × every model and are far more precise.
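The exclusion rules above can be sketched as a small filter in front of the accuracy count. This is a minimal illustration, not the bench's actual code: the `Case` fields, the reason strings and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Case:
    # All field names here are assumptions made for this sketch.
    case_id: str
    anchor_id: str              # symptom span handed to the judge
    gold_id: Optional[str]      # true root-cause span, None if unlabeled
    span_names: Dict[str, str]  # kept span ID -> span name


def exclusion_reason(case: Case) -> Optional[str]:
    """Return why a case cannot be scored on step@1, or None if it can."""
    if case.gold_id is None:
        return "no_gold"
    if case.gold_id == case.anchor_id:
        return "degenerate_anchor"
    gold_name = case.span_names[case.gold_id]
    if list(case.span_names.values()).count(gold_name) > 1:
        return "name_collision"
    return None


def step_at_1(cases: List[Case], top_citation: Dict[str, str]) -> Tuple[int, int]:
    """Return (correct, n) over the checkable cases only."""
    correct = 0
    n = 0
    for case in cases:
        if exclusion_reason(case) is not None:
            continue  # still usable for cost/latency, just not for accuracy
        n += 1
        cited = top_citation.get(case.case_id)
        # A citation counts if it names the gold span by ID or by name.
        if cited in (case.gold_id, case.span_names[case.gold_id]):
            correct += 1
    return correct, n
```

Returning the pair (correct, n) rather than a bare ratio keeps the sample size attached to the figure, which is what the interval calculation needs.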

Finding 1 — the judge cites span names, not IDs

The compressed trace the judge reads labels every span by its name (e.g. [process_refund]), never its internal ID — the renderer is a faithful port of the production extractor, and changing it would be a production change, out of scope for a bench. The prompt asks the model to “cite the span_ids”, but no IDs appear in what it sees, so models reasonably cite names. Scoring therefore counts a citation correct if it matches the gold span's ID or its name. Comparing a name-shaped answer against an ID-shaped gold — which an earlier internal matrix did — would score every model at zero and tell you nothing. This fix is applied identically to all models.
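A minimal sketch of the ID-or-name match, to make the scoring rule concrete. The function names are hypothetical, and two details are assumptions of this sketch rather than documented bench behavior: stripping the square brackets the renderer puts around names, and comparing case-sensitively.

```python
def normalize_citation(raw: str) -> str:
    """Trim whitespace and surrounding square brackets (assumed convention)."""
    return raw.strip().strip("[]").strip()


def citation_is_correct(citation: str, gold_id: str, gold_name: str) -> bool:
    """Accept the citation if it matches the gold span by ID or by name."""
    cited = normalize_citation(citation)
    return cited == gold_id or cited == gold_name
```

Under this rule both `"[process_refund]"` and the span's internal ID would score as correct against a gold span named `process_refund`, while a name-versus-ID-only comparison would reject the first.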

Finding 2 — fair recovery of trailing-prose JSON

Some models emit a valid JSON verdict and then keep talking (a **Rationale:** … paragraph after the closing fence, for instance). A strict parse throws on the trailing text, which would log a correct answer as a parse failure and unfairly tank that model. The bench extracts the first balanced JSON object and parses that, recovering the verdict. This is applied uniformly to every model — it is normalization that rescues a correct answer, the opposite of per-model tuning. The production parser is untouched; the recovery lives only in the bench. (In practice this recovered a run's worth of correct verdicts from a model whose only sin was being chatty after the JSON.)
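One way to implement this kind of recovery is to try decoding from each opening brace and keep the first object that parses. The sketch below uses the standard library's `json.JSONDecoder.raw_decode`, which stops at the end of the first complete value and ignores what follows. The function name and the verdict field names in the usage note are hypothetical, and the bench's own helper may work differently.

```python
import json
from typing import Optional


def first_json_object(text: str) -> Optional[dict]:
    """Return the first parseable JSON object in text, or None if there is none."""
    decoder = json.JSONDecoder()
    pos = text.find("{")
    while pos != -1:
        try:
            # raw_decode parses one value starting at pos and tolerates trailing text.
            obj, _end = decoder.raw_decode(text, pos)
            return obj
        except json.JSONDecodeError:
            # Not a valid object at this brace; try the next one.
            pos = text.find("{", pos + 1)
    return None
```

Given a reply such as `{"root_cause": "load_order"}` followed by a rationale paragraph, this returns the verdict dictionary and leaves the prose behind. A reply with no valid object returns None, which the caller can then count as a parse failure.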

Statistics & error accounting

Every published proportion (step@1, L1-axis agreement, valid-citation rate) carries a Wilson 95% score interval, the right small-n choice because it does not fall off the [0,1] edges the way a normal approximation does. Latency is summarized as median and p95, never a mean, because LLM latency is right-skewed and one slow call would drag a mean. Infrastructure errors (timeouts, HTTP 429 rate limits, and max_tokens truncation that returns empty content) are counted together with JSON parse failures in their own column and never folded into quality. A model is never penalized on accuracy for an error that was the provider's fault, and a flaky provider is never flattered by dropping its failures silently. (Concretely: a reasoning model that spent its entire token budget on a hidden thinking trace and returned empty content was being scored as 0% until we recognized that as a truncation artifact and moved it to the error column, out of the quality average.)
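For reference, the Wilson score interval and the latency summary can be computed in a few lines. This is a sketch under stated assumptions: the function names are hypothetical, and the p95 uses the nearest-rank definition, which may differ from the percentile method the bench uses.

```python
import math
import statistics
from typing import List, Tuple

Z_95 = 1.959963984540054  # two-sided 95% normal quantile


def wilson_interval(successes: int, n: int, z: float = Z_95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)  # no data: the interval is uninformative
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def latency_summary(latencies_s: List[float]) -> Tuple[float, float]:
    """Return (median, p95) using the nearest-rank percentile."""
    ordered = sorted(latencies_s)
    rank = -(-95 * len(ordered) // 100)  # ceil(0.95 * n) in integer arithmetic
    return (statistics.median(ordered), ordered[max(rank, 1) - 1])
```

For a hypothetical 7 correct out of 14, `wilson_interval(7, 14)` gives roughly 0.27 to 0.73, which shows how wide the intervals are at this sample size.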

Cost

Per-call cost uses each model's posted input/output token prices as of the date shown on the bench, multiplied by the actual tokens each call consumed. We report mean cost per call and, where a model got any case right, cost-per-correct-attribution (total spend on the discriminating calls ÷ number it got right). The whole sweep runs under a hard budget cap with durable, resumable per-call writes.
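The arithmetic is simple enough to state directly. In this sketch the function names are hypothetical, and prices are assumed to be quoted in dollars per million tokens, which is common among providers but not stated above.

```python
from typing import List, Optional


def call_cost(input_tokens: int, output_tokens: int,
              price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """Cost of one call, with prices given per million tokens (assumed unit)."""
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000


def cost_per_correct(costs: List[float], correct: List[bool]) -> Optional[float]:
    """Total spend on the discriminating calls divided by the number correct.

    Returns None when nothing was correct, since the ratio is undefined.
    """
    n_correct = sum(correct)
    if n_correct == 0:
        return None
    return sum(costs) / n_correct
```

Note that the numerator is the spend on all discriminating calls, including the wrong ones, so a model that is cheap per call but rarely right can still come out expensive per correct attribution.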

Serving stack & provider disclosure

For open-weights models the model is not the whole story — the serving stack is part of the system under test. Quantization (FP8 vs FP16), batching and speculative decoding, max_tokens defaults and truncation, request routing, and free-tier rate limits all shift a model's measured quality or latency. So the bench publishes the serving provider per model, and numbers that are only comparable within one stack — latency above all — must not be read across providers as if the model alone explained them.

When a model is starved by its provider (rate-limited or timing out), we do not let that masquerade as low quality. We re-home it to a healthier provider we already hold a key for, pin that provider's live price, and document the switch. Two cases from the current run: gpt-oss-120b returned HTTP 429 on 11 of 12 calls on one provider's free tier, so it was moved to a router that served a clean 22 of 22. Kimi K2.6, a slow reasoning model, timed out past the shared 60-second per-call budget on both providers we tried, so rather than score it off a single lucky call it is shown at its true n (one call) and left out of the cost/quality plot. Raising the timeout for that one model would break the fairness invariant, so we disclose the limitation instead.

Combined pipeline vs single-shot

The second comparison on the bench asks the product question: does running a trace through the full ProveAI Origin pipeline (parse → compress → judge) beat handing the same model the same canonical prompt over the raw, uncompressed trace? It is an ablation: within each pair the model and the prompt are held fixed and the only thing that changes is the trace representation, so a difference can be attributed to compression alone. This is a tighter version of the fairness invariant. The live arms run on the same n=14 discriminating set, scored with the same step@1 helper, so an arm's accuracy here is directly comparable to a model's on the leaderboard.

Read the result honestly. The headline, a frontier model gaining accuracy from compression, has overlapping Wilson intervals at n=14, so the surface marks it suggestive, not significant. And a cheap open model plus the pipeline is far cheaper than a frontier single-shot but not as accurate: we render cost and accuracy together so “Nx cheaper” is never shown without the quality gap beside it. Compression is a quality lever for a capable model and a capability unlock for long traces, not a way to make a weak model frontier-grade.

The compression ratios shown are deterministic and free: they are computed offline over the whole corpus from a chars / 4 token proxy (labeled as such; the live arms carry real tokenizer counts). The strongest claim they support is a capability one: a small number of real agent traces exceed a representative 200K-token context window in raw form, so a naive single-shot cannot read them at all, while the pipeline compresses them by two orders of magnitude to fit. The 200K line is a reference point, not a claim about any one model's exact window.
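A sketch of the proxy arithmetic follows. The function names are hypothetical, and rounding the proxy up to a whole token is an assumption of this sketch. The 200K default mirrors the reference line above and is not any particular model's limit.

```python
import math

REFERENCE_WINDOW_TOKENS = 200_000  # reference point only, not a real model's window


def proxy_tokens(text: str) -> int:
    """Estimate token count as characters / 4, rounded up (assumed rounding)."""
    return math.ceil(len(text) / 4)


def compression_ratio(raw: str, compressed: str) -> float:
    """How many times smaller the compressed trace is, by the token proxy."""
    return proxy_tokens(raw) / max(1, proxy_tokens(compressed))


def fits_window(text: str, window: int = REFERENCE_WINDOW_TOKENS) -> bool:
    """True if the proxy estimate is within the reference context window."""
    return proxy_tokens(text) <= window
```

For example, a raw trace of one million characters estimates to 250,000 proxy tokens and fails `fits_window`, while a 10,000-character compressed form estimates to 2,500 tokens, a ratio of 100.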

What we are NOT claiming yet

We have no production usage data, so anything that needs real traffic is deferred: failure prevalence over time, failures by use case, by industry, by framework, by horizon length. The failure-category distribution on the bench is the shape of this corpus, not of failures in the wild, and is descriptive rather than scored (the gold category vocabulary differs from the judge's). When real usage arrives, those time-series and segment cuts can be added — but not before they would be honest.

Reproducibility

The runner writes a durable per-call record (one JSON line per call, including the raw model response) as it goes. The analyzer re-derives every published number from those raw records using the same scoring helpers the runner used — so the figures on the bench are reproducible from the evidence, not trusted from a cached field that could drift.
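The write-then-re-derive pattern might look like the sketch below. Every name in it is hypothetical, including the record fields (`model`, `case_id`, `raw_response`, `error`), and the in-memory buffer stands in for a real file on disk.

```python
import io
import json
from typing import Callable, Dict, Iterable, List, Set, TextIO, Tuple


def append_record(fh: TextIO, record: Dict) -> None:
    """Write one call as one JSON line and flush so it survives a crash."""
    fh.write(json.dumps(record) + "\n")
    fh.flush()


def load_records(lines: Iterable[str]) -> List[Dict]:
    """Read back every non-empty JSON line."""
    return [json.loads(line) for line in lines if line.strip()]


def completed_calls(records: List[Dict]) -> Set[Tuple[str, str]]:
    """(model, case) pairs already on disk, so a resumed run can skip them."""
    return {(r["model"], r["case_id"]) for r in records}


def rederive_counts(records: List[Dict],
                    is_correct: Callable[[Dict], bool]) -> Tuple[int, int]:
    """Recompute (correct, n) from raw records, ignoring errored calls."""
    scored = [r for r in records if r.get("error") is None]
    return (sum(1 for r in scored if is_correct(r)), len(scored))


# Usage: an in-memory buffer stands in for the per-call log file.
log = io.StringIO()
append_record(log, {"model": "m1", "case_id": "A",
                    "raw_response": "load_order", "error": None})
append_record(log, {"model": "m1", "case_id": "B",
                    "raw_response": "", "error": "timeout"})
log.seek(0)
records = load_records(log)
```

Because scoring runs over `raw_response` rather than a stored verdict, a fix to a scoring helper changes the published number on the next analysis pass without re-running any model calls.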