Model bench

Preliminary · n=22 corpus · n=14 discriminating

Which model is best at finding the failure?

Every model is handed the same compressed trace and the same canonical Q72 prompt at temperature 0 — no per-model tuning. We measure who points at the true root-cause span (step@1), at what cost, and how fast. Four honest comparisons, not one ranked winner.

models 17 · calls 374 · errored/parse-fail 21/0 · spend $1.6530 · prices as of 2026-06-01 · run 2026-06-02 (live)

How to read this. step@1 is measured only on the small set of cases where the symptom span differs from the true cause (the discriminating corpus, n=14), so its confidence intervals are wide — we show them rather than hide them. Cost, latency and trace-length axes span every case × model and are far more precise. Failure categories are descriptive (the gold vocabulary differs from the judge's, so they are not scored). ProveAI Origin runs a constellation of models — a precision tier, a fast/standard tier, and a cost-optimised tier — and this bench is the research that decides which model fills each role. Full methodology →
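To make the width of those intervals concrete, here is a minimal sketch of a Wilson score interval, the interval type the charts on this page name. The count of 11 correct out of 14 is an inference from the rounded "79% · n=14" figure, not a number printed on the page, and `wilson_interval` is an illustrative helper rather than the bench's own code.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the edges.
    return max(0.0, centre - half), min(1.0, centre + half)

# Assumed count: 11 of 14 discriminating cases correct (rounds to 79%).
lo, hi = wilson_interval(11, 14)
print(f"{lo:.0%} to {hi:.0%}")  # 52% to 92%
```

With only 14 cases, even a top score carries an interval about 40 points wide, which is why nearby models cannot be separated on step@1 alone.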

01

Cost vs quality

The core trade-off: root-cause accuracy (step@1) against the price of a single attribution call. Up-and-to-the-left is the frontier — high accuracy, low cost. Cheap open models cluster surprisingly close to the frontier. Cost is first-class here, not a footnote: attribution runs on every failing trace, so call price compounds across context management, troubleshooting and remediation — a cheap model that stays near the frontier is a structural advantage, and it's why we run a constellation rather than one model.
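As an illustration of what "the frontier" means here, the sketch below picks out the models that no other model beats on both axes at once, using the point estimates ($/call, step@1) from the results table on this page. It ignores the confidence intervals entirely, so treat the output as a reading aid and not as a ranking. Kimi K2.6 is left out because it has a single scored call. `pareto_frontier` is a hypothetical helper name.

```python
# (model, mean $/call, step@1 %) copied from the results table on this page.
POINTS = [
    ("Claude Opus 4.8", 0.012, 79),
    ("GLM-4.6", 0.00053, 64),
    ("Claude Haiku 4.5", 0.00247, 64),
    ("Claude Sonnet 4.5", 0.00523, 64),
    ("Gemini 3.5 Flash", 0.011, 64),
    ("Gemini 3.1 Pro", 0.016, 64),
    ("DeepSeek-V3.1", 0.00037, 57),
    ("GPT-4.1 mini", 0.00055, 57),
    ("GPT-4.1", 0.00273, 57),
    ("Llama-4 Maverick", 0.00033, 50),
    ("Qwen3-235B-A22B Instruct", 0.00019, 43),
    ("GPT-5.5", 0.018, 43),
    ("GPT-5 mini", 0.00198, 36),
    ("GPT-4o", 0.00339, 36),
    ("Qwen2.5-72B", 0.00044, 29),
    ("gpt-oss-120b", 0.00012, 21),
]

def pareto_frontier(points):
    """Return models that are not dominated on (lower cost, higher accuracy)."""
    frontier = []
    best_acc = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, most accurate first.
    for name, cost, acc in sorted(points, key=lambda p: (p[1], -p[2])):
        if acc > best_acc:  # strictly better than everything cheaper
            frontier.append(name)
            best_acc = acc
    return frontier

frontier = pareto_frontier(POINTS)
print(frontier)
```

On these point estimates the frontier runs from gpt-oss-120b through Qwen3-235B-A22B Instruct, Llama-4 Maverick, DeepSeek-V3.1 and GLM-4.6 up to Claude Opus 4.8. The other models at 64% cost more than GLM-4.6 for the same point estimate.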

Figure: Cost vs quality · step@1 root-cause accuracy vs mean cost per call (x-axis: cost / call, log scale, $0.00012 to $0.018) · whiskers are Wilson 95% intervals

Not plotted (fewer than 2 scored discriminating calls, usually because of provider rate-limiting rather than a quality result): Kimi K2.6 (n=1, 20 errored). See the table below for its raw figures.

| Model | Provider | step@1 | 95% CI · n | L1 axis (95% CI · n) | valid cite (95% CI · n) | med ms | p95 ms | $/call | $/correct | err/pf |
|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.8 | openrouter | 79% | 52–92% · n=14 | 76% (53–90% · n=17) | 91% (72–97% · n=22) | 1247 | 1664 | $0.012 | $0.012 | 0/0 |
| GLM-4.6 | deepinfra | 64% | 39–84% · n=14 | 76% (53–90% · n=17) | 91% (72–97% · n=22) | 4241 | 9612 | $0.00053 | $0.00062 | 0/0 |
| Claude Haiku 4.5 | openrouter | 64% | 39–84% · n=14 | 82% (59–94% · n=17) | 82% (61–93% · n=22) | 1006 | 1706 | $0.00247 | $0.00291 | 0/0 |
| Claude Sonnet 4.5 | openrouter | 64% | 39–84% · n=14 | 82% (59–94% · n=17) | 91% (72–97% · n=22) | 1677 | 2142 | $0.00523 | $0.00657 | 0/0 |
| Gemini 3.5 Flash | openrouter | 64% | 39–84% · n=14 | 76% (53–90% · n=17) | 95% (78–99% · n=22) | 3206 | 3709 | $0.011 | $0.015 | 0/0 |
| Gemini 3.1 Pro | openrouter | 64% | 39–84% · n=14 | 88% (66–97% · n=17) | 91% (72–97% · n=22) | 4928 | 6321 | $0.016 | $0.022 | 0/0 |
| DeepSeek-V3.1 | deepinfra | 57% | 33–79% · n=14 | 59% (36–78% · n=17) | 95% (78–99% · n=22) | 6489 | 10672 | $0.00037 | $0.00050 | 0/0 |
| GPT-4.1 mini | openai | 57% | 33–79% · n=14 | 65% (41–83% · n=17) | 91% (72–97% · n=22) | 1724 | 2229 | $0.00055 | $0.00073 | 0/0 |
| GPT-4.1 | openai | 57% | 33–79% · n=14 | 82% (59–94% · n=17) | 95% (78–99% · n=22) | 1263 | 3057 | $0.00273 | $0.00368 | 0/0 |
| Llama-4 Maverick | deepinfra | 50% | 27–73% · n=14 | 71% (47–87% · n=17) | 86% (67–95% · n=22) | 1429 | 1911 | $0.00033 | $0.00051 | 0/0 |
| Qwen3-235B-A22B Instruct | deepinfra | 43% | 21–67% · n=14 | 47% (26–69% · n=17) | 91% (72–97% · n=22) | 4662 | 8823 | $0.00019 | $0.00034 | 0/0 |
| GPT-5.5 | openrouter | 43% | 21–67% · n=14 | 71% (47–87% · n=17) | 100% (85–100% · n=22) | 439 | 553 | $0.018 | $0.036 | 0/0 |
| GPT-5 mini | openrouter | 36% | 16–61% · n=14 | 82% (59–94% · n=17) | 91% (72–97% · n=22) | 365 | 622 | $0.00198 | $0.00513 | 0/0 |
| GPT-4o | openai | 36% | 16–61% · n=14 | 65% (41–83% · n=17) | 95% (78–99% · n=22) | 978 | 1130 | $0.00339 | $0.00741 | 0/0 |
| Qwen2.5-72B | deepinfra | 29% | 12–55% · n=14 | 53% (31–74% · n=17) | 95% (78–99% · n=22) | 3115 | 3614 | $0.00044 | $0.00112 | 0/0 |
| gpt-oss-120b | openrouter | 21% | 8–48% · n=14 | 59% (36–78% · n=17) | 86% (65–95% · n=21) | 1131 | 3271 | $0.00012 | $0.00050 | 1/0 |
| Kimi K2.6 | openrouter | 0% | 0–79% · n=1 | 100% (34–100% · n=2) | 100% (34–100% · n=2) | 823 | 971 | $0.00088 | | 20/0 |

02

Combined pipeline vs single-shot

The product thesis under test: does running a trace through the full RJ pipeline (parse → compress → judge) beat handing the same model the same prompt over the raw trace? Within each pair the model is held fixed and only the trace representation changes, so any difference is attributable to compression alone. Preliminary at n=14: intervals are wide and we flag where they overlap.

qwen2.5-72b

compression: 0pp step@1
RJ pipeline · compressed trace: 21% (8–48% · n=14)
Naive single-shot · raw trace: 21% (8–48% · n=14)
$/call: $0.00032 vs $0.00042 (1.3× cheaper)
prompt tokens: 713 vs 952

No accuracy change for this model — but compression is still cheaper per call. Compression is not a substitute for model capability.

claude-opus-4-8

compression: +21pp step@1
RJ pipeline · compressed trace: 93% (69–99% · n=14)
Naive single-shot · raw trace: 71% (45–88% · n=14)
$/call: $0.00940 vs $0.011 (1.2× cheaper)
prompt tokens: 1111 vs 1423

Compression improves accuracy and cuts cost for this model. At n=14 the two Wilson intervals overlap — suggestive, not yet significant.
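The overlap statement can be checked with a few lines. This is a sketch under two assumptions: that the rounded 93% and 71% correspond to 13 and 10 correct out of 14, and that the intervals are standard Wilson score intervals at z = 1.96. Comparing intervals for overlap is a rough screen, not a formal significance test.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Assumed counts behind the rounded percentages (not printed on the page).
compressed = wilson_interval(13, 14)  # "93%"
raw = wilson_interval(10, 14)         # "71%"
delta_pp = (13 / 14 - 10 / 14) * 100

print(f"delta = +{delta_pp:.0f}pp, overlap = {intervals_overlap(compressed, raw)}")
```

Under these assumptions the lower bound of the compressed arm (about 69%) sits below the upper bound of the raw arm (about 88%), so the intervals overlap even though the point estimates differ by roughly 21 points.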

Cheap + compressed vs frontier single-shot

The production candidate (RJ pipeline on a cheap open model) against an expensive frontier model reading the raw trace. Cost and accuracy shown together — a cheaper call that is also less accurate is a trade-off, not a free win.

| vs frontier-naive | candidate step@1 | frontier step@1 | Δ step@1 | cost |
|---|---|---|---|---|
| naive-opus | 21% | 71% | -50pp | 34× cheaper |
| naive-gpt55 | 21% | 50% | -29pp | 45× cheaper |
| naive-gemini | 21% | 69% | -48pp | 48× cheaper |

Read this honestly: the cheap candidate is dramatically cheaper but also less accurate on this corpus. The win this run actually proves is the one above — compression helps the strong model both ways. Pairing compression with a frontier model, not a weak one, is where the proposition holds.

How much compression?

deterministic · chars/4 token proxy
median: 1.75× smaller trace
max: 290.63× (longest trace)
don't fit raw: 2/22 exceed 200K context

The capability claim. 2 of 22 real traces exceed a 200K-token context window in raw form — a naive single-shot literally cannot read them. RJ compresses them to fit: gaia/0140b3f657eddf76ca82f72c49ac8e58 (290.63×), gaia/01c5727165fc43899b3b594b9bef5f19 (275.92×). The aggregate corpus reduction is 85.85×.
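The gap between a small median ratio and a large aggregate ratio is easier to see with a toy example. The sketch below uses the chars/4 token proxy and the 200K limit named above; the trace sizes are invented, and defining the aggregate as total raw tokens over total compressed tokens is an assumption, since the page does not spell out the formula.

```python
import statistics

CONTEXT_LIMIT_TOKENS = 200_000  # context window quoted in the text

def approx_tokens(text: str) -> int:
    """chars/4 token proxy: a deterministic estimate, not a real tokenizer."""
    return len(text) // 4

# Hypothetical (raw, compressed) trace pairs; sizes are made up for illustration.
pairs = [
    ("x" * 4_000, "x" * 2_400),
    ("x" * 6_000, "x" * 3_200),
    ("x" * 1_200_000, "x" * 4_800),  # one very long agent trace
]

raw_tokens = [approx_tokens(raw) for raw, _ in pairs]
small_tokens = [approx_tokens(small) for _, small in pairs]

per_trace = [r / s for r, s in zip(raw_tokens, small_tokens)]
median_ratio = statistics.median(per_trace)
aggregate_ratio = sum(raw_tokens) / sum(small_tokens)  # assumed definition
too_long_raw = sum(t > CONTEXT_LIMIT_TOKENS for t in raw_tokens)

print(f"median {median_ratio:.2f}x, aggregate {aggregate_ratio:.2f}x, "
      f"{too_long_raw}/{len(pairs)} do not fit raw")
```

A single very long trace dominates the totals, so the aggregate ratio can be far above the median. That is consistent with a median of 1.75× sitting alongside an aggregate of 85.85× on the real corpus.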

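| Arm | Trace | Provider | step@1 | 95% CI · n | $/call | $/correct | med ms | err/pf |
|---|---|---|---|---|---|---|---|---|
| rj-opus | compressed | openrouter | 93% | 69–99% · n=14 | $0.00940 | $0.010 | 1283 | 0/0 |
| naive-opus | raw | openrouter | 71% | 45–88% · n=14 | $0.011 | $0.015 | 1306 | 0/0 |
| naive-gemini | raw | openrouter | 69% | 42–87% · n=13 | $0.016 | $0.024 | 3384 | 1/0 |
| naive-gpt55 | raw | openrouter | 50% | 27–73% · n=14 | $0.015 | $0.029 | 435 | 0/0 |
| rj-qwen | compressed | deepinfra | 21% | 8–48% · n=14 | $0.00032 | $0.00150 | 3831 | 0/0 |
| naive-qwen | raw | deepinfra | 21% | 8–48% · n=14 | $0.00042 | $0.00195 | 5424 | 0/0 |
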
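For this table, the $/correct column looks consistent with spend on scored calls divided by the number of correct answers. That reading is an assumption: the page does not define the column, and the correct-call counts below are inferred from the rounded step@1 percentages. Because the inputs are rounded, the estimates only approximately match the reported figures.

```python
# arm: ($/call from the table, correct calls, scored calls).
# Correct-call counts are inferred from rounded percentages: an assumption.
ARMS = {
    "rj-opus": (0.00940, 13, 14),
    "naive-opus": (0.011, 10, 14),
    "naive-gemini": (0.016, 9, 13),
    "naive-gpt55": (0.015, 7, 14),
    "rj-qwen": (0.00032, 3, 14),
    "naive-qwen": (0.00042, 3, 14),
}

# $/correct as printed in the table, for comparison.
REPORTED = {
    "rj-opus": 0.010,
    "naive-opus": 0.015,
    "naive-gemini": 0.024,
    "naive-gpt55": 0.029,
    "rj-qwen": 0.00150,
    "naive-qwen": 0.00195,
}

def cost_per_correct(cost_per_call: float, correct: int, scored: int):
    """Spend on scored calls divided by the number of correct answers."""
    if correct == 0:
        return None  # undefined when nothing was answered correctly
    return cost_per_call * scored / correct

for arm, (cost, correct, scored) in ARMS.items():
    estimate = cost_per_correct(cost, correct, scored)
    print(f"{arm:13s} estimated ${estimate:.5f}  reported ${REPORTED[arm]:.5f}")
```

Read this way, a cheap model with low accuracy loses part of its price advantage: rj-qwen is roughly 29× cheaper per call than rj-opus but only about 7× cheaper per correct answer.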
03

Fastest models

Wall-clock latency per attribution call, median and p95. This axis spans every case, so it is precise even where step@1 is not. Watch the spread between median and p95 — some models have a heavy slow tail.

Figure: Fastest models · median latency (bar) and p95 (marker), in ms; lower is faster

GPT-5 mini: 365 / 622
GPT-5.5: 439 / 553
Kimi K2.6: 823 / 971
GPT-4o: 978 / 1130
Claude Haiku 4.5: 1006 / 1706
gpt-oss-120b: 1131 / 3271
Claude Opus 4.8: 1247 / 1664
GPT-4.1: 1263 / 3057
Llama-4 Maverick: 1429 / 1911
Claude Sonnet 4.5: 1677 / 2142
GPT-4.1 mini: 1724 / 2229
Qwen2.5-72B: 3115 / 3614
Gemini 3.5 Flash: 3206 / 3709
GLM-4.6: 4241 / 9612
Qwen3-235B-A22B Instruct: 4662 / 8823
Gemini 3.1 Pro: 4928 / 6321
DeepSeek-V3.1: 6489 / 10672

Marker line = p95. Right-hand figures are median / p95 ms over successfully-scored calls.
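A small sketch of why the median and p95 can diverge. The latency samples are hypothetical, and the nearest-rank percentile used here is an assumption: the page does not say which percentile method the bench uses.

```python
import math
import statistics

def p95_nearest_rank(samples: list[float]) -> float:
    """95th percentile by the nearest-rank method (one of several conventions)."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical per-call latencies in ms: mostly fast, with one slow outlier.
latencies_ms = [350, 360, 365, 370, 380, 400, 410, 420, 600, 2900]

median_ms = statistics.median(latencies_ms)
p95_ms = p95_nearest_rank(latencies_ms)
print(f"median {median_ms:.0f} ms, p95 {p95_ms} ms")
```

One slow call out of ten barely moves the median but sets the p95, which is the pattern to look for in rows such as gpt-oss-120b (1131 / 3271) or GPT-4.1 (1263 / 3057).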

04

Performance vs trace length

How cost and latency scale as traces get longer. The >20-span bucket is the GAIA agent-trace corpus, which has no root-cause gold distinct from the anchor — so it measures the long-trace cost/latency axis, not accuracy.

| Trace size | cases | calls | step@1 (pooled) | mean $/call | median latency (model range) |
|---|---|---|---|---|---|
| ≤5 spans | 17 | 289 (15 errored) | 52% · n=225 | $0.00426 | 378–6143ms |
| 6–20 spans | 3 | 51 (4 errored) | (no gold) | $0.00652 | 441–7117ms |
| >20 spans | 2 | 34 (2 errored) | (no gold) | $0.00690 | 328–13126ms |

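The bucket boundaries and call counts can be cross-checked with a short sketch. `size_bucket` is a hypothetical helper. The case counts per bucket are read from the table above, and the model count of 17 comes from the page header.

```python
def size_bucket(span_count: int) -> str:
    """Map a trace's span count to the size buckets used in the table."""
    if span_count <= 5:
        return "≤5 spans"
    if span_count <= 20:
        return "6–20 spans"
    return ">20 spans"

MODELS = 17  # from the page header
CASES_PER_BUCKET = {"≤5 spans": 17, "6–20 spans": 3, ">20 spans": 2}

# Every case is sent to every model once, so calls = cases × models.
calls_per_bucket = {bucket: cases * MODELS for bucket, cases in CASES_PER_BUCKET.items()}
total_calls = sum(calls_per_bucket.values())

print(calls_per_bucket, total_calls)
```

The per-bucket call counts (289, 51, 34) sum to the 374 calls reported at the top of the page. The two larger buckets rest on only five cases in total, so their cost and latency figures describe a handful of traces.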
05

Failure categories

Which failure categories the judges assign across the labeled corpus, and how often. Descriptive only — we have no production usage data yet, so this is the shape of the corpus, not failures in the wild.

Figure: Failure categories · how often each L4 category is assigned across the labeled corpus (descriptive, not scored)

Tool Output Misinterpretation: 98 · 28%
Tool-related: 70 · 20%
Incorrect Problem Identification: 55 · 16%
Poor Information Retrieval: 50 · 14%
Context Handling Failures: 16 · 5%
Resource Abuse: 14 · 4%
Formatting Errors: 12 · 3%
Instruction Non-compliance: 10 · 3%
Tool Selection Errors: 9 · 3%
Goal Deviation: 6 · 2%
Tool Definition Issues: 6 · 2%
Task Orchestration: 5 · 1%
Language-only: 1 · 0%
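The percentages above can be reproduced from the counts. This sketch assumes each share is the category count divided by the sum of all listed counts, rounded to the nearest whole percent. That assumption matches every figure shown, but the page does not state the denominator.

```python
# Category counts as listed above.
COUNTS = {
    "Tool Output Misinterpretation": 98,
    "Tool-related": 70,
    "Incorrect Problem Identification": 55,
    "Poor Information Retrieval": 50,
    "Context Handling Failures": 16,
    "Resource Abuse": 14,
    "Formatting Errors": 12,
    "Instruction Non-compliance": 10,
    "Tool Selection Errors": 9,
    "Goal Deviation": 6,
    "Tool Definition Issues": 6,
    "Task Orchestration": 5,
    "Language-only": 1,
}

total = sum(COUNTS.values())
# Share of all category assignments, rounded to a whole percent.
shares = {name: round(100 * n / total) for name, n in COUNTS.items()}

for name, n in COUNTS.items():
    print(f"{name}: {n} · {shares[name]}%")
```

The counts sum to 352 assignments, and the top four categories account for 273 of them, a little over three quarters of the total.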