Open Weight Bench

Needle in a Haystack

Three sub-benchmarks in one chat (prefill only once): corpus summary · needle retrieval (10 hidden facts in 120k tokens — uniform across all models) · 6 comprehension questions about the book (4 factual answers + 2 hallucination traps).

Task & test logic in detail
Task: In a German book corpus (with embedded source code), 10 synthetic facts are hidden at evenly distributed depths (5%–95%). The model must retrieve all of them.

Flow: THREE turns in the same chat context (prefill only once).
Turn 1 (corpus summary): the model receives the long corpus and summarises it in 3-5 sentences. This forces real processing of the text.
Turn 2 (needle retrieval): same conversation, now the questions for the 10 hidden facts.
Turn 3 (comprehension + hallucination traps): 6 questions about the book, 4 factual plus 2 traps where the answer is NOT in the text; the model should recognise this rather than fabricate.

Stages: the default mode runs ONE uniform stage for all models at 120k tokens; models without sufficient max_context are skipped at this stage. `niah_deep` additionally runs 32k / 64k / 200k for a full heatmap.

Score weighting: summary 20% + needle retrieval 50% + comprehension/hallucination resistance 30% (see the sketch below).

Why models fail: sliding-window attention (Gemma 4) sees only the last 1-2k tokens sharply; reasoning models hit the token limit before answering; Q4 KV-cache quantisation measurably degrades recall at long contexts; and on the hallucination traps the helpfulness bias lures models into plausible-sounding inventions.
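
The depth placement and score weighting above are mechanical enough to pin down. A minimal sketch, assuming linear spacing between the first and last depth and sub-scores normalised to [0, 1]; the function names are illustrative, not the bench's actual code.

```python
# Illustrative sketch of depth placement and score weighting as described
# above; names and normalisation are assumptions, not the bench's real code.

def needle_depths(n: int = 10, lo: float = 0.05, hi: float = 0.95) -> list[float]:
    """Evenly distributed insertion depths from 5% to 95% of the corpus."""
    step = (hi - lo) / (n - 1)
    return [round(lo + i * step, 2) for i in range(n)]

def total_score(summary: float, retrieval: float, comprehension: float) -> float:
    """Weighted bench score: 20% summary + 50% retrieval + 30% comprehension."""
    return 0.20 * summary + 0.50 * retrieval + 0.30 * comprehension

print(needle_depths())             # [0.05, 0.15, 0.25, ..., 0.95]
print(total_score(1.0, 0.9, 0.8))  # 0.89
```
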
Prompt
TURN 1 (User):
The following section contains a long text mixing German narrative and source code.

===== TEXT BEGIN =====
<corpus with embedded needles, 32k–200k tokens depending on stage>
===== TEXT END =====

Summarise the text in 3-5 sentences. Mention the main characters, setting and key themes.

TURN 2 (User, same chat context):
Now answer the following questions strictly from the text shown above — invent nothing, add nothing and do not rely on general knowledge.

Questions:
1. <question for needle 1>
2. <question for needle 2>
...

Answer as a numbered list 1., 2., 3. with one short sentence each.
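
To make the prefill-once flow concrete: a minimal sketch of the three turns against an OpenAI-compatible local endpoint. The base URL, API key, model name and prompt strings are placeholders, not the bench's actual harness.

```python
# Sketch of the three-turn flow against an OpenAI-compatible local endpoint.
# Base URL, API key, model name and prompts are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
MODEL = "local-model"  # placeholder

def ask(messages: list[dict]) -> str:
    """Send the running conversation and return the assistant's reply."""
    reply = client.chat.completions.create(model=MODEL, messages=messages)
    return reply.choices[0].message.content

corpus = "<corpus with embedded needles>"  # the real corpus is ~120k tokens
turn1 = ("The following section contains a long text mixing German narrative and source code.\n"
         f"===== TEXT BEGIN =====\n{corpus}\n===== TEXT END =====\n"
         "Summarise the text in 3-5 sentences.")
turn2 = "Now answer the following questions strictly from the text shown above: ..."
turn3 = "Answer the following 6 questions about the book: ..."

# Each turn appends to the SAME message list, so a server with prompt
# caching only prefills the long corpus once.
messages = [{"role": "user", "content": turn1}]
for turn in (turn2, turn3):
    messages.append({"role": "assistant", "content": ask(messages)})
    messages.append({"role": "user", "content": turn})
answers = ask(messages)  # turn-3 answers (comprehension + traps)
```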

Wall-time vs. quality

X = wall-time for this bench · Y = score (0–100%). The optimum is top-left: fast and good. RAM estimate for 64k context: 4 GB system + model weights + max(2 GB, 40% of weights) for KV cache (see the sketch below the chart).

Colour = vendor · Number = total parameters (B) · dense / MoE

[Scatter plot: wall-time 0–1200 s (X) vs. score 0–100% (Y); each point labelled with its total parameter count in B.]
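
The RAM rule of thumb is easy to sanity-check. A minimal sketch; the function name and example weight size are illustrative.

```python
# RAM estimate for 64k context, as quoted above:
# 4 GB system + model weights + max(2 GB, 40% of weights) for KV cache.
def ram_estimate_gb(weights_gb: float) -> float:
    return 4.0 + weights_gb + max(2.0, 0.40 * weights_gb)

print(ram_estimate_gb(16.0))  # 16 GB of weights -> 26.4 GB total
```
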
Models in this bench
26 models
  1. qwen3.6-27b gguf 4bit · 97% · 1175s · 13 t/s · 27 GB
  2. qwen3.6-35b-a3b gguf 4bit · 96% · 316s · 49 t/s · 33 GB
  3. gemma-4-e4b gguf 4bit · 96% · 205s · 55 t/s · 12 GB
  4. qwen3-coder-next mlx 4bit · 90% · 166s · 45 t/s · 67 GB
  5. gpt-oss-20b mlx 4bit · 90% · 432s · 46 t/s · 20 GB
  6. gemma-4-26b-a4b gguf 4bit · 90% · 279s · 55 t/s · 27 GB
  7. gemma-3-27b mlx 4bit · 88% · 463s · 19 t/s · 26 GB
  8. qwen3.5-35b-a3b gguf 4bit · 88% · 306s · 48 t/s · 33 GB
  9. qwen3.5-4b gguf 4bit · 87% · 395s · 43 t/s · 9 GB
  10. qwen3.5-9b gguf 4bit · 86% · 566s · 35 t/s · 13 GB
  11. gemma-3-12b mlx 4bit · 86% · 227s · 34 t/s · 15 GB
  12. qwen3-vl-8b mlx 4bit · 85% · 1055s · 23 t/s · 12 GB
  13. qwen3-4b-2507 mlx 4bit · 84% · 283s · 26 t/s · 8 GB
  14. gpt-oss-120b gguf 4bit · 81% · 287s · 41 t/s · 87 GB
  15. ministral-3-14b-reasoning gguf 4bit · 77% · 1068s · 20 t/s · 16 GB
  16. gemma-4-e2b gguf 4bit · 76% · 153s · 67 t/s · 10 GB
  17. qwen3-8b mlx 4bit · 76% · 88s · 48 t/s · 10 GB
  18. nemotron-3-nano mlx 4bit · 69% · 379s · 86 t/s · 27 GB
  19. qwen3.5-2b gguf 4bit · 65% · 61s · 111 t/s · 8 GB
  20. qwen3.5-9b-mlx mlx 4bit · 63% · 471s · 54 t/s · 12 GB
  21. granite-4-h-tiny gguf 4bit · 62% · 65s · 94 t/s · 10 GB
  22. nemotron-3-nano-omni gguf 4bit · 58% · 263s · 67 t/s · 38 GB
  23. gemma-3-4b mlx 4bit · 54% · 70s · 93 t/s · 9 GB
  24. nemotron-3-nano-4b gguf 4bit · 38% · 138s · 65 t/s · 9 GB
  25. lfm2.5-1.2b mlx 8bit · 10% · 32s · 125 t/s · 7 GB
  26. lfm2-24b-a2b mlx 4bit · 10% · 271s · 69 t/s · 22 GB