Model bench
Preliminary · n=22 corpus · n=14 discriminating
Which model is best at finding the failure?
Every model is handed the same compressed trace and the same canonical Q72 prompt at temperature 0 — no per-model tuning. We measure who points at the true root-cause span (step@1), at what cost, and how fast. Four honest comparisons, not one ranked winner.
- models: 17
- calls: 374
- errored/parse-fail: 21/0
- spend: $1.6530
- prices as of: 2026-06-01
- run: 2026-06-02 (live)
How to read this. step@1 is measured only on the small set of cases where the symptom span differs from the true cause (the discriminating corpus, n=14), so its confidence intervals are wide; we show them rather than hide them. Cost, latency and trace-length axes span every case × model and are far more precise. Failure categories are descriptive (the gold vocabulary differs from the judge's, so they are not scored). ProveAI Origin runs a constellation of models (a precision tier, a fast/standard tier, and a cost-optimised tier), and this bench is the research that decides which model fills each role.
Cost vs quality
The core trade-off: root-cause accuracy (step@1) against the price of a single attribution call. Up-and-to-the-left is the frontier — high accuracy, low cost. Cheap open models cluster surprisingly close to the frontier. Cost is first-class here, not a footnote: attribution runs on every failing trace, so call price compounds across context management, troubleshooting and remediation — a cheap model that stays near the frontier is a structural advantage, and it's why we run a constellation rather than one model.
Chart: step@1 root-cause accuracy vs mean cost per call · whiskers are Wilson 95% intervals.
Not plotted (fewer than 2 scored discriminating calls, usually provider rate-limiting, not a quality result): Kimi K2.6 (n=1, 20 errored). See the table below for its raw figures.
| Model | Provider | step@1 | 95% CI · n | L1 axis | valid cite | med ms | p95 ms | $/call | $/correct | err/pf |
|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.8 | openrouter | 79% | 52–92% · n=14 | 76% (53–90%) · n=17 | 91% (72–97%) · n=22 | 1247 | 1664 | $0.012 | $0.012 | 0/0 |
| GLM-4.6 | deepinfra | 64% | 39–84% · n=14 | 76% (53–90%) · n=17 | 91% (72–97%) · n=22 | 4241 | 9612 | $0.00053 | $0.00062 | 0/0 |
| Claude Haiku 4.5 | openrouter | 64% | 39–84% · n=14 | 82% (59–94%) · n=17 | 82% (61–93%) · n=22 | 1006 | 1706 | $0.00247 | $0.00291 | 0/0 |
| Claude Sonnet 4.5 | openrouter | 64% | 39–84% · n=14 | 82% (59–94%) · n=17 | 91% (72–97%) · n=22 | 1677 | 2142 | $0.00523 | $0.00657 | 0/0 |
| Gemini 3.5 Flash | openrouter | 64% | 39–84% · n=14 | 76% (53–90%) · n=17 | 95% (78–99%) · n=22 | 3206 | 3709 | $0.011 | $0.015 | 0/0 |
| Gemini 3.1 Pro | openrouter | 64% | 39–84% · n=14 | 88% (66–97%) · n=17 | 91% (72–97%) · n=22 | 4928 | 6321 | $0.016 | $0.022 | 0/0 |
| DeepSeek-V3.1 | deepinfra | 57% | 33–79% · n=14 | 59% (36–78%) · n=17 | 95% (78–99%) · n=22 | 6489 | 10672 | $0.00037 | $0.00050 | 0/0 |
| GPT-4.1 mini | openai | 57% | 33–79% · n=14 | 65% (41–83%) · n=17 | 91% (72–97%) · n=22 | 1724 | 2229 | $0.00055 | $0.00073 | 0/0 |
| GPT-4.1 | openai | 57% | 33–79% · n=14 | 82% (59–94%) · n=17 | 95% (78–99%) · n=22 | 1263 | 3057 | $0.00273 | $0.00368 | 0/0 |
| Llama-4 Maverick | deepinfra | 50% | 27–73% · n=14 | 71% (47–87%) · n=17 | 86% (67–95%) · n=22 | 1429 | 1911 | $0.00033 | $0.00051 | 0/0 |
| Qwen3-235B-A22B Instruct | deepinfra | 43% | 21–67% · n=14 | 47% (26–69%) · n=17 | 91% (72–97%) · n=22 | 4662 | 8823 | $0.00019 | $0.00034 | 0/0 |
| GPT-5.5 | openrouter | 43% | 21–67% · n=14 | 71% (47–87%) · n=17 | 100% (85–100%) · n=22 | 439 | 553 | $0.018 | $0.036 | 0/0 |
| GPT-5 mini | openrouter | 36% | 16–61% · n=14 | 82% (59–94%) · n=17 | 91% (72–97%) · n=22 | 365 | 622 | $0.00198 | $0.00513 | 0/0 |
| GPT-4o | openai | 36% | 16–61% · n=14 | 65% (41–83%) · n=17 | 95% (78–99%) · n=22 | 978 | 1130 | $0.00339 | $0.00741 | 0/0 |
| Qwen2.5-72B | deepinfra | 29% | 12–55% · n=14 | 53% (31–74%) · n=17 | 95% (78–99%) · n=22 | 3115 | 3614 | $0.00044 | $0.00112 | 0/0 |
| gpt-oss-120b | openrouter | 21% | 8–48% · n=14 | 59% (36–78%) · n=17 | 86% (65–95%) · n=21 | 1131 | 3271 | $0.00012 | $0.00050 | 1/0 |
| Kimi K2.6 | openrouter | 0% | 0–79% · n=1 | 100% (34–100%) · n=2 | 100% (34–100%) · n=2 | 823 | 971 | $0.00088 | — | 20/0 |
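The Wilson 95% intervals reported in the table can be reproduced with the standard Wilson score formula. The sketch below is illustrative only: the function name `wilson_interval` is hypothetical, and the count of 11 correct out of 14 is an assumption inferred from the 79% shown for the top row, not a figure published on this page.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Assumed count: 11/14 rounds to the 79% reported for the top row.
low, high = wilson_interval(11, 14)
print(f"step@1 {11 / 14:.0%}, interval {low:.0%} to {high:.0%}")
```

With n=14 the interval spans roughly 40 percentage points, which is why neighbouring rows in the table cannot be ranked against each other with confidence.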
Combined pipeline vs single-shot
The product thesis under test: does running a trace through the full RJ pipeline (parse → compress → judge) beat handing the same model the same prompt over the raw trace? Within each pair the model is held fixed — only the trace representation changes — so any difference is compression alone. Preliminary at n=14: intervals are wide and we flag where they overlap.
qwen2.5-72b
compression 0pp step@1
- $/call: $0.00032 vs $0.00042, compressed vs raw (1.3× cheaper)
- prompt tokens: 713 vs 952, compressed vs raw
No accuracy change for this model, but compression is still cheaper per call. Compression is not a substitute for model capability.
claude-opus-4-8
compression +21pp step@1
- $/call: $0.00940 vs $0.011, compressed vs raw (1.2× cheaper)
- prompt tokens: 1111 vs 1423, compressed vs raw
Compression improves accuracy and cuts cost for this model. At n=14 the two Wilson intervals overlap, so the result is suggestive, not yet significant.
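A paired comparison like the ones above boils down to three numbers: the step@1 difference in percentage points, the cost ratio, and whether the two intervals overlap. The sketch below shows one way to compute them. The `Arm` class and `compare` function are hypothetical names; the intervals and costs are the published figures for the compressed and raw Opus arms, while the counts 13/14 and 10/14 are assumptions inferred from the reported 93% and 71%.

```python
from dataclasses import dataclass


@dataclass
class Arm:
    name: str
    step_at_1: float      # fraction correct on the discriminating set
    ci_low: float         # lower bound of the 95% interval
    ci_high: float        # upper bound of the 95% interval
    cost_per_call: float  # USD


def compare(compressed: Arm, raw: Arm) -> dict:
    """Summarise a compressed-vs-raw pair where the model is held fixed."""
    delta_pp = (compressed.step_at_1 - raw.step_at_1) * 100
    overlap = compressed.ci_low <= raw.ci_high and raw.ci_low <= compressed.ci_high
    return {
        "delta_pp": delta_pp,
        "cost_ratio": raw.cost_per_call / compressed.cost_per_call,
        "intervals_overlap": overlap,
    }


# Assumed counts (13/14 and 10/14); intervals and costs as published.
rj_opus = Arm("rj-opus", 13 / 14, 0.69, 0.99, 0.00940)
naive_opus = Arm("naive-opus", 10 / 14, 0.45, 0.88, 0.011)

result = compare(rj_opus, naive_opus)
print(f"{result['delta_pp']:+.0f}pp, {result['cost_ratio']:.1f}x cheaper, "
      f"overlap={result['intervals_overlap']}")
```

Interval overlap is a conservative screen rather than a formal test. A paired test over the same 14 cases would be the natural next step once the corpus grows.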
Cheap + compressed vs frontier single-shot
The production candidate (RJ pipeline on a cheap open model) against an expensive frontier model reading the raw trace. Cost and accuracy shown together — a cheaper call that is also less accurate is a trade-off, not a free win.
| vs frontier-naive | candidate step@1 | frontier step@1 | Δ step@1 | cost |
|---|---|---|---|---|
| naive-opus | 21% | 71% | -50pp | 34× cheaper |
| naive-gpt55 | 21% | 50% | -29pp | 45× cheaper |
| naive-gemini | 21% | 69% | -48pp | 48× cheaper |
Read this honestly: the cheap candidate is dramatically cheaper but also less accurate on this corpus. The result this run actually supports is the paired comparison above: on the strong model, compression raised step@1 and lowered cost, although at n=14 the intervals overlap. Pairing compression with a frontier model, not a weak one, is where the proposition holds.
How much compression?
deterministic · chars/4 token proxy
- median: 1.75× smaller trace
- max: 290.63× (longest trace)
- don't fit raw: 2/22 exceed 200K context
The capability claim. 2 of 22 real traces exceed a 200K-token context window in raw form — a naive single-shot literally cannot read them. RJ compresses them to fit: gaia/0140b3f657eddf76ca82f72c49ac8e58 (290.63×), gaia/01c5727165fc43899b3b594b9bef5f19 (275.92×). The aggregate corpus reduction is 85.85×.
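The compression figures rest on a chars/4 token proxy and a 200K-token context limit. A minimal sketch of that bookkeeping follows, assuming the proxy is floor division of the character count by 4 (the page does not say how it rounds). The names `approx_tokens` and `compression_report` are hypothetical, and the strings are synthetic stand-ins, not corpus traces.

```python
def approx_tokens(text: str) -> int:
    """Deterministic token proxy: one token per 4 characters (assumed floor)."""
    return len(text) // 4


def compression_report(raw: str, compressed: str, context_limit: int = 200_000) -> dict:
    """Compare a raw trace with its compressed form under the chars/4 proxy."""
    raw_tokens = approx_tokens(raw)
    compressed_tokens = max(1, approx_tokens(compressed))  # avoid division by zero
    return {
        "raw_tokens": raw_tokens,
        "compressed_tokens": compressed_tokens,
        "ratio": raw_tokens / compressed_tokens,
        "fits_raw": raw_tokens <= context_limit,
        "fits_compressed": compressed_tokens <= context_limit,
    }


# Synthetic example: a 1,000,000-character trace reduced to 4,000 characters.
report = compression_report("x" * 1_000_000, "x" * 4_000)
print(report)
```

A case counts toward "don't fit raw" when `fits_raw` is false. The chars/4 proxy only approximates a real tokenizer, so cases near the limit would need checking against the actual model tokenizer.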
| Arm | Trace | Provider | step@1 | 95% CI · n | $/call | $/correct | med ms | err/pf |
|---|---|---|---|---|---|---|---|---|
| rj-opus | compressed | openrouter | 93% | 69–99% · n=14 | $0.00940 | $0.010 | 1283 | 0/0 |
| naive-opus | raw | openrouter | 71% | 45–88% · n=14 | $0.011 | $0.015 | 1306 | 0/0 |
| naive-gemini | raw | openrouter | 69% | 42–87% · n=13 | $0.016 | $0.024 | 3384 | 1/0 |
| naive-gpt55 | raw | openrouter | 50% | 27–73% · n=14 | $0.015 | $0.029 | 435 | 0/0 |
| rj-qwen | compressed | deepinfra | 21% | 8–48% · n=14 | $0.00032 | $0.00150 | 3831 | 0/0 |
| naive-qwen | raw | deepinfra | 21% | 8–48% · n=14 | $0.00042 | $0.00195 | 5424 | 0/0 |
Fastest models
Wall-clock latency per attribution call, median and p95. This axis spans every case, so it is precise even where step@1 is not. Watch the spread between median and p95 — some models have a heavy slow tail.
Chart: median latency (bar) and p95 (marker line), in ms; lower is faster. Right-hand figures are median / p95 ms over successfully-scored calls.
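The median and p95 figures can be derived from per-call latency samples as sketched below. This assumes a nearest-rank percentile, since the page does not state which percentile method it uses. The function name `latency_summary` and the sample values are hypothetical.

```python
import statistics


def latency_summary(samples_ms: list[float]) -> dict:
    """Median and nearest-rank p95 over successfully scored calls."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    n = len(ordered)
    # Nearest-rank p95: smallest rank r with r >= 0.95 * n, in integer arithmetic.
    rank = (95 * n + 99) // 100
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[rank - 1],
    }


# Hypothetical samples with one slow outlier, to show a heavy tail.
summary = latency_summary([400, 420, 435, 450, 2000])
print(summary)
```

With few samples per model, a nearest-rank p95 is effectively the maximum, so a single slow call can dominate the tail figure.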
Performance vs trace length
How cost and latency scale as traces get longer. The >20-span bucket is the GAIA agent-trace corpus, which has no root-cause gold distinct from the anchor — so it measures the long-trace cost/latency axis, not accuracy.
| Trace size | cases | calls | step@1 (pooled) | mean $/call | median latency (model range) |
|---|---|---|---|---|---|
| ≤5 spans | 17 | 289 (15 errored) | 52% · n=225 | $0.00426 | 378–6143ms |
| 6–20 spans | 3 | 51 (4 errored) | — (no gold) | $0.00652 | 441–7117ms |
| >20 spans | 2 | 34 (2 errored) | — (no gold) | $0.00690 | 328–13126ms |
Failure categories
Which failure categories the judges assign across the labeled corpus, and how often. Descriptive only — we have no production usage data yet, so this is the shape of the corpus, not failures in the wild.