Model bench

Preliminary · n=22 corpus · n=14 discriminating

Which model is best at finding the failure?

Every model is handed the same compressed trace and the same canonical Q72 prompt at temperature 0 — no per-model tuning. We measure who points at the true root-cause span (step@1), at what cost, and how fast. Four honest comparisons, not one ranked winner.

models: 17
calls: 374
errored/parse-fail: 21/0
spend: $1.6530
prices as of: 2026-06-01
run: 2026-06-02 (live)

How to read this. step@1 is measured only on the small set of cases where the symptom span differs from the true cause (the discriminating corpus, n=14), so its confidence intervals are wide; we show them rather than hide them. Cost, latency and trace-length axes span every case × model and are far more precise. Failure categories are descriptive only: the gold vocabulary differs from the judge's, so they are not scored. ProveAI Origin runs a constellation of models (a precision tier, a fast/standard tier, and a cost-optimised tier), and this bench is the research that decides which model fills each role.
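To make the width of those intervals concrete, here is a minimal sketch of the Wilson score interval that the chart caption names. The function name `wilson_interval` and the reading of 79% as 11 correct out of 14 are assumptions for illustration, not the bench's own code.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    centre = (p + z2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


# Assumption: 79% at n=14 corresponds to 11 correct calls.
low, high = wilson_interval(11, 14)
print(f"{11 / 14:.0%} ({low:.0%} to {high:.0%})")
```

With only 14 cases the interval spans roughly 40 percentage points, which is why neighbouring rows in the results table are hard to separate.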

Cost vs quality

The core trade-off: root-cause accuracy (step@1) against the price of a single attribution call. Up-and-to-the-left is the frontier — high accuracy, low cost. Cheap open models cluster surprisingly close to the frontier. Cost is first-class here, not a footnote: attribution runs on every failing trace, so call price compounds across context management, troubleshooting and remediation — a cheap model that stays near the frontier is a structural advantage, and it's why we run a constellation rather than one model.
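One way to read the frontier off the numbers is a simple Pareto filter over (cost, accuracy) pairs. The sketch below uses the point estimates from the results table and ignores the confidence intervals, so it is an illustration of the idea rather than the bench's selection rule; `pareto_frontier` is a hypothetical helper.

```python
# (model, step@1, mean $ per call), copied from the results table.
# Kimi K2.6 is left out because it has only one scored call.
RESULTS = [
    ("Claude Opus 4.8", 0.79, 0.012),
    ("GLM-4.6", 0.64, 0.00053),
    ("Claude Haiku 4.5", 0.64, 0.00247),
    ("Claude Sonnet 4.5", 0.64, 0.00523),
    ("Gemini 3.5 Flash", 0.64, 0.011),
    ("Gemini 3.1 Pro", 0.64, 0.016),
    ("DeepSeek-V3.1", 0.57, 0.00037),
    ("GPT-4.1 mini", 0.57, 0.00055),
    ("GPT-4.1", 0.57, 0.00273),
    ("Llama-4 Maverick", 0.50, 0.00033),
    ("Qwen3-235B-A22B Instruct", 0.43, 0.00019),
    ("GPT-5.5", 0.43, 0.018),
    ("GPT-5 mini", 0.36, 0.00198),
    ("GPT-4o", 0.36, 0.00339),
    ("Qwen2.5-72B", 0.29, 0.00044),
    ("gpt-oss-120b", 0.21, 0.00012),
]


def pareto_frontier(rows):
    """Return models that no cheaper-or-equal model matches or beats on accuracy."""
    frontier = []
    best_acc = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, most accurate first.
    for name, acc, cost in sorted(rows, key=lambda r: (r[2], -r[1])):
        if acc > best_acc:
            frontier.append(name)
            best_acc = acc
    return frontier


print(pareto_frontier(RESULTS))
```

On these point estimates, five of the six frontier models cost well under a tenth of a cent per call, which matches the observation that cheap open models sit close to the frontier.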

Figure: step@1 root-cause accuracy vs mean cost per call (log scale, $0.00012 to $0.018) · whiskers are Wilson 95% intervals

Not plotted (fewer than 2 scored discriminating calls, usually caused by provider rate-limiting rather than model quality): Kimi K2.6 (n=1, 20 errored). See the table below for its raw figures.

Model	Provider	step@1	95% CI · n	L1 axis (95% CI · n)	valid cite (95% CI · n)	med ms	p95 ms	$/call	$/correct	err/pf
Claude Opus 4.8	openrouter	79%	52–92% · n=14	76% (53–90% · n=17)	91% (72–97% · n=22)	1247	1664	$0.012	$0.012	0/0
GLM-4.6	deepinfra	64%	39–84% · n=14	76% (53–90% · n=17)	91% (72–97% · n=22)	4241	9612	$0.00053	$0.00062	0/0
Claude Haiku 4.5	openrouter	64%	39–84% · n=14	82% (59–94% · n=17)	82% (61–93% · n=22)	1006	1706	$0.00247	$0.00291	0/0
Claude Sonnet 4.5	openrouter	64%	39–84% · n=14	82% (59–94% · n=17)	91% (72–97% · n=22)	1677	2142	$0.00523	$0.00657	0/0
Gemini 3.5 Flash	openrouter	64%	39–84% · n=14	76% (53–90% · n=17)	95% (78–99% · n=22)	3206	3709	$0.011	$0.015	0/0
Gemini 3.1 Pro	openrouter	64%	39–84% · n=14	88% (66–97% · n=17)	91% (72–97% · n=22)	4928	6321	$0.016	$0.022	0/0
DeepSeek-V3.1	deepinfra	57%	33–79% · n=14	59% (36–78% · n=17)	95% (78–99% · n=22)	6489	10672	$0.00037	$0.00050	0/0
GPT-4.1 mini	openai	57%	33–79% · n=14	65% (41–83% · n=17)	91% (72–97% · n=22)	1724	2229	$0.00055	$0.00073	0/0
GPT-4.1	openai	57%	33–79% · n=14	82% (59–94% · n=17)	95% (78–99% · n=22)	1263	3057	$0.00273	$0.00368	0/0
Llama-4 Maverick	deepinfra	50%	27–73% · n=14	71% (47–87% · n=17)	86% (67–95% · n=22)	1429	1911	$0.00033	$0.00051	0/0
Qwen3-235B-A22B Instruct	deepinfra	43%	21–67% · n=14	47% (26–69% · n=17)	91% (72–97% · n=22)	4662	8823	$0.00019	$0.00034	0/0
GPT-5.5	openrouter	43%	21–67% · n=14	71% (47–87% · n=17)	100% (85–100% · n=22)	439	553	$0.018	$0.036	0/0
GPT-5 mini	openrouter	36%	16–61% · n=14	82% (59–94% · n=17)	91% (72–97% · n=22)	365	622	$0.00198	$0.00513	0/0
GPT-4o	openai	36%	16–61% · n=14	65% (41–83% · n=17)	95% (78–99% · n=22)	978	1130	$0.00339	$0.00741	0/0
Qwen2.5-72B	deepinfra	29%	12–55% · n=14	53% (31–74% · n=17)	95% (78–99% · n=22)	3115	3614	$0.00044	$0.00112	0/0
gpt-oss-120b	openrouter	21%	8–48% · n=14	59% (36–78% · n=17)	86% (65–95% · n=21)	1131	3271	$0.00012	$0.00050	1/0
Kimi K2.6	openrouter	0%	0–79% · n=1	100% (34–100% · n=2)	100% (34–100% · n=2)	823	971	$0.00088	—	20/0

Combined pipeline vs single-shot

The product thesis under test: does running a trace through the full RJ pipeline (parse → compress → judge) beat handing the same model the same prompt over the raw trace? Within each pair the model is held fixed — only the trace representation changes — so any difference is compression alone. Preliminary at n=14: intervals are wide and we flag where they overlap.

qwen2.5-72b

compression 0pp step@1

RJ pipeline · compressed trace: 21% (8–48% · n=14)

Naive single-shot · raw trace: 21% (8–48% · n=14)

$/call: $0.00032 vs $0.00042
prompt tokens: 713 vs 952

No accuracy change for this model — but compression is still cheaper per call. Compression is not a substitute for model capability.

claude-opus-4-8

compression +21pp step@1

RJ pipeline · compressed trace: 93% (69–99% · n=14)

Naive single-shot · raw trace: 71% (45–88% · n=14)

$/call: $0.00940 vs $0.011
prompt tokens: 1111 vs 1423

Compression improves accuracy and cuts cost for this model. At n=14 the two Wilson intervals overlap — suggestive, not yet significant.

Cheap + compressed vs frontier single-shot

The production candidate (RJ pipeline on a cheap open model) against an expensive frontier model reading the raw trace. Cost and accuracy shown together — a cheaper call that is also less accurate is a trade-off, not a free win.

vs frontier-naive	candidate step@1	frontier step@1	Δ step@1	cost
naive-opus	21%	71%	-50pp	34× cheaper
naive-gpt55	21%	50%	-29pp	45× cheaper
naive-gemini	21%	69%	-48pp	48× cheaper

Read this honestly: the cheap candidate is dramatically cheaper but also less accurate on this corpus. The win this run actually supports is the one above: compression helps the strong model on both accuracy and cost, although at n=14 the accuracy gain is suggestive rather than significant. Pairing compression with a frontier model, not a weak one, is where the proposition holds.

How much compression?

deterministic · chars/4 token proxy

median: 1.75× smaller trace
max: 290.63× (longest trace)
don't fit raw: 2/22 (exceed 200K context)

The capability claim. 2 of 22 real traces exceed a 200K-token context window in raw form — a naive single-shot literally cannot read them. RJ compresses them to fit: gaia/0140b3f657eddf76ca82f72c49ac8e58 (290.63×), gaia/01c5727165fc43899b3b594b9bef5f19 (275.92×). The aggregate corpus reduction is 85.85×.
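The figures above rest on a chars/4 token proxy. A minimal sketch of that arithmetic follows; the helper names, the floor division, and the synthetic strings are assumptions for illustration, and the proxy is not a real tokenizer.

```python
CONTEXT_LIMIT_TOKENS = 200_000  # the context window quoted in the text


def approx_tokens(text: str) -> int:
    """Deterministic chars/4 proxy (rounding down is an assumption)."""
    return len(text) // 4


def compression_ratio(raw: str, compressed: str) -> float:
    """How many times smaller the compressed trace is, in proxy tokens."""
    return approx_tokens(raw) / max(1, approx_tokens(compressed))


def fits_context(text: str, limit: int = CONTEXT_LIMIT_TOKENS) -> bool:
    return approx_tokens(text) <= limit


# Synthetic stand-ins for a raw and a compressed trace.
raw_trace = "x" * 1_000_000   # about 250,000 proxy tokens
compressed_trace = "x" * 4_000  # about 1,000 proxy tokens

print(fits_context(raw_trace), fits_context(compressed_trace))
print(f"{compression_ratio(raw_trace, compressed_trace):.2f}x")
```

Because the proxy depends only on character counts, the ratios are reproducible across runs, at the cost of differing from any specific model's true token count.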

Arm	Trace	Provider	step@1	95% CI · n	$/call	$/correct	med ms	err/pf
rj-opus	compressed	openrouter	93%	69–99% · n=14	$0.00940	$0.010	1283	0/0
naive-opus	raw	openrouter	71%	45–88% · n=14	$0.011	$0.015	1306	0/0
naive-gemini	raw	openrouter	69%	42–87% · n=13	$0.016	$0.024	3384	1/0
naive-gpt55	raw	openrouter	50%	27–73% · n=14	$0.015	$0.029	435	0/0
rj-qwen	compressed	deepinfra	21%	8–48% · n=14	$0.00032	$0.00150	3831	0/0
naive-qwen	raw	deepinfra	21%	8–48% · n=14	$0.00042	$0.00195	5424	0/0
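The $/correct column in this arm table is consistent with dividing mean cost per call by step@1, and the "× cheaper" figures with a ratio of per-call costs. The sketch below shows that arithmetic under those assumptions; the bench's exact formula is not stated, the function names are hypothetical, and the published figures presumably use unrounded inputs.

```python
from typing import Optional


def cost_per_correct(cost_per_call: float, step_at_1: float) -> Optional[float]:
    """Expected spend to obtain one correct attribution; None if accuracy is zero."""
    if step_at_1 <= 0:
        return None
    return cost_per_call / step_at_1


def cost_multiple(expensive_per_call: float, cheap_per_call: float) -> float:
    """How many times cheaper the cheap arm is per call."""
    return expensive_per_call / cheap_per_call


# rj-opus: $0.00940 per call; 93% is read here as 13 correct out of 14 (an assumption).
print(cost_per_correct(0.00940, 13 / 14))
# naive-opus ($0.011) against rj-qwen ($0.00032), using the rounded table values.
print(cost_multiple(0.011, 0.00032))
```

Reading cost through $/correct makes the trade-off explicit: rj-qwen is the cheapest arm per call, but its low accuracy narrows the gap once cost is normalised by correct answers.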

Fastest models

Wall-clock latency per attribution call, median and p95. This axis spans every case, so it is precise even where step@1 is not. Watch the spread between median and p95 — some models have a heavy slow tail.
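A minimal sketch of how a median and p95 could be derived from per-call timings follows. The nearest-rank percentile method and the sample latencies are assumptions; the bench does not state which percentile definition it uses.

```python
import statistics


def latency_summary(latencies_ms: list[float]) -> tuple[float, float]:
    """Return (median, p95) using the nearest-rank definition of p95."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Nearest rank: ceil(0.95 * n), computed with integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return statistics.median(ordered), ordered[rank - 1]


# Hypothetical samples: 20 calls from 100 ms to 2000 ms in 100 ms steps.
samples = [float(ms) for ms in range(100, 2100, 100)]
median_ms, p95_ms = latency_summary(samples)
print(median_ms, p95_ms)
```

A large gap between the two numbers signals the heavy slow tail mentioned above, even when the median looks competitive.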

Median and p95 latency in ms over successfully-scored calls, sorted by median (lower is faster)

Model	median ms	p95 ms
GPT-5 mini	365	622
GPT-5.5	439	553
Kimi K2.6	823	971
GPT-4o	978	1130
Claude Haiku 4.5	1006	1706
gpt-oss-120b	1131	3271
Claude Opus 4.8	1247	1664
GPT-4.1	1263	3057
Llama-4 Maverick	1429	1911
Claude Sonnet 4.5	1677	2142
GPT-4.1 mini	1724	2229
Qwen2.5-72B	3115	3614
Gemini 3.5 Flash	3206	3709
GLM-4.6	4241	9612
Qwen3-235B-A22B Instruct	4662	8823
Gemini 3.1 Pro	4928	6321
DeepSeek-V3.1	6489	10672

Performance vs trace length

How cost and latency scale as traces get longer. The >20-span bucket is the GAIA agent-trace corpus, which has no root-cause gold distinct from the anchor — so it measures the long-trace cost/latency axis, not accuracy.

Trace size	cases	calls	step@1 (pooled)	mean $/call	median latency (model range)
≤5 spans	17	289 (15 errored)	52% · n=225	$0.00426	378–6143ms
6–20 spans	3	51 (4 errored)	— (no gold)	$0.00652	441–7117ms
>20 spans	2	34 (2 errored)	— (no gold)	$0.00690	328–13126ms
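The bucketing in this table can be sketched as a small function over span counts. The bucket boundaries come from the table; the function name and the sample span counts are hypothetical. The call counts are also consistent with cases × 17 models (for example 17 × 17 = 289), though that relationship is inferred rather than stated.

```python
from collections import Counter

N_MODELS = 17  # models in this run


def span_bucket(span_count: int) -> str:
    """Map a trace's span count to the size bucket used in the table."""
    if span_count <= 5:
        return "≤5 spans"
    if span_count <= 20:
        return "6–20 spans"
    return ">20 spans"


# Hypothetical span counts for six traces.
cases_per_bucket = Counter(span_bucket(n) for n in [2, 5, 6, 20, 21, 400])

# Expected calls per bucket if every model sees every case once.
calls_per_bucket = {b: c * N_MODELS for b, c in cases_per_bucket.items()}
print(calls_per_bucket)
```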

Failure categories

Which failure categories the judges assign across the labeled corpus, and how often. Descriptive only — we have no production usage data yet, so this is the shape of the corpus, not failures in the wild.

How often each L4 category is assigned across the labeled corpus (descriptive, not scored)

Category	count	share
Tool Output Misinterpretation	98	28%
Tool-related	70	20%
Incorrect Problem Identification	55	16%
Poor Information Retrieval	50	14%
Context Handling Failures	16	5%
Resource Abuse	14	4%
Formatting Errors	12	3%
Instruction Non-compliance	10	3%
Tool Selection Errors	9	3%
Goal Deviation	6	2%
Tool Definition Issues	6	2%
Task Orchestration	5	1%
Language-only	1	0%