Needle in a Haystack
Three sub-benchmarks in one chat (prefill only once): corpus summary · needle retrieval (10 hidden facts in 120k tokens — uniform across all models) · 6 comprehension questions about the book (4 factual answers + 2 hallucination traps).
Task & test logic in detail
Task: In a German book corpus (with embedded source code) 10 synthetic facts are hidden at evenly distributed depths (5% – 95%). The model must retrieve all of them.
Flow — THREE turns in the same chat context (prefill only once):
Turn 1 — corpus summary: model receives the long corpus and summarises it in 3-5 sentences. Forces real processing of the text.
Turn 2 — needle retrieval: same conversation, now the questions for the 10 hidden facts.
Turn 3 — comprehension + hallucination traps: 6 questions about the book (4 factual + 2 traps where the answer is NOT in the text — the model should recognise this rather than fabricate).
Default mode runs ONE uniform stage for all models: 120k tokens. Models without sufficient max_context are skipped at this stage. `niah_deep` additionally runs 32k / 64k / 200k for a full heatmap.
Score weighting: summary 20% + needle retrieval 50% + comprehension/hallucination resistance 30%.
Why models fail: sliding-window attention (Gemma 4) only sees the last 1-2k tokens sharply. Reasoning models hit the token limit before answering. Q4 KV cache measurably degrades recall at long contexts. On the hallucination traps the helpful bias lures models into plausible-sounding inventions.
Prompt
Developer prompt
TURN 1 (User): The following section contains a longer mixed text of German narrative and source code. ===== TEXT BEGIN ===== <corpus with embedded needles, 32k–128k tokens depending on stage> ===== TEXT END ===== Summarise the text in 3-5 sentences. Mention the main characters, setting and key themes. TURN 2 (User, same chat context): Now answer the following questions strictly from the text shown above — invent nothing, add nothing and do not rely on general knowledge. Questions: 1. <question for needle 1> 2. <question for needle 2> ... Answer as a numbered list 1., 2., 3. with one short sentence each.
Wall-time vs. quality
X = wall-time for this bench · Y = score (0–100 %) in this bench. Optimum is top-left — fast and good. RAM estimate for 64k context: 4 GB system + model weights + max(2 GB, 40% of weights) for KV cache.
Colour = vendor · Number = total parameters (B) dense MoE
Models in this bench
26 visible
- 1. qwen3.6-27b gguf 4bit 97% · 1175s · 13 t/s · 27 GB
- 2. qwen3.6-35b-a3b gguf 4bit 96% · 316s · 49 t/s · 33 GB
- 3. gemma-4-e4b gguf 4bit 96% · 205s · 55 t/s · 12 GB
- 4. qwen3-coder-next mlx 4bit 90% · 166s · 45 t/s · 67 GB
- 5. gpt-oss-20b mlx 4bit 90% · 432s · 46 t/s · 20 GB
- 6. gemma-4-26b-a4b gguf 4bit 90% · 279s · 55 t/s · 27 GB
- 7. gemma-3-27b mlx 4bit 88% · 463s · 19 t/s · 26 GB
- 8. qwen3.5-35b-a3b gguf 4bit 88% · 306s · 48 t/s · 33 GB
- 9. qwen3.5-4b gguf 4bit 87% · 395s · 43 t/s · 9 GB
- 10. qwen3.5-9b gguf 4bit 86% · 566s · 35 t/s · 13 GB
- 11. gemma-3-12b mlx 4bit 86% · 227s · 34 t/s · 15 GB
- 12. qwen3-vl-8b mlx 4bit 85% · 1055s · 23 t/s · 12 GB
- 13. qwen3-4b-2507 mlx 4bit 84% · 283s · 26 t/s · 8 GB
- 14. gpt-oss-120b gguf 4bit 81% · 287s · 41 t/s · 87 GB
- 15. ministral-3-14b-reasoning gguf 4bit 77% · 1068s · 20 t/s · 16 GB
- 16. gemma-4-e2b gguf 4bit 76% · 153s · 67 t/s · 10 GB
- 17. qwen3-8b mlx 4bit 76% · 88s · 48 t/s · 10 GB
- 18. nemotron-3-nano mlx 4bit 69% · 379s · 86 t/s · 27 GB
- 19. qwen3.5-2b gguf 4bit 65% · 61s · 111 t/s · 8 GB
- 20. qwen3.5-9b-mlx mlx 4bit 63% · 471s · 54 t/s · 12 GB
- 21. granite-4-h-tiny gguf 4bit 62% · 65s · 94 t/s · 10 GB
- 22. nemotron-3-nano-omni gguf 4bit 58% · 263s · 67 t/s · 38 GB
- 23. gemma-3-4b mlx 4bit 54% · 70s · 93 t/s · 9 GB
- 24. nemotron-3-nano-4b gguf 4bit 38% · 138s · 65 t/s · 9 GB
- 25. lfm2.5-1.2b mlx 8bit 10% · 32s · 125 t/s · 7 GB
- 26. lfm2-24b-a2b mlx 4bit 10% · 271s · 69 t/s · 22 GB
| Model | Vendor | Quant | Ctx | Released | RAM | tok/s | Tokens | Wall | Score |
|---|---|---|---|---|---|---|---|---|---|
| qwen3.6-27b | qwen | gguf 4bit | 256k | 2026-04-21 | 16.3 GB | 13 | 7244 | 1175.5 s | 97% |
| qwen3.6-35b-a3b | qwen | gguf 4bit | 256k | 2026-04-15 | 20.6 GB | 49 | 7550 | 316.1 s | 96% |
| gemma-4-e4b | gguf 4bit | 128k | 2026-03-02 | 5.9 GB | 55 | 3207 | 205.3 s | 96% | |
| qwen3-coder-next | qwen | mlx 4bit | 256k | 2026-01-30 | 45.3 GB | 45 | 1029 | 165.7 s | 90% |
| gpt-oss-20b | openai | mlx 4bit | 128k | 2025-08-04 | 11.3 GB | 46 | 5633 | 431.9 s | 90% |
| gemma-4-26b-a4b | gguf 4bit | 256k | 2026-03-12 | 16.8 GB | 55 | 5921 | 278.9 s | 90% | |
| gemma-3-27b | mlx 4bit | 128k | 2025-03-01 | 15.7 GB | 19 | 681 | 463.3 s | 88% | |
| qwen3.5-35b-a3b | qwen | gguf 4bit | 256k | 2026-02-24 | 20.6 GB | 48 | 6804 | 305.5 s | 88% |
| qwen3.5-4b | lmstudio-community | gguf 4bit | 256k | 2026-03-02 | 3.2 GB | 43 | 10211 | 395.1 s | 87% |
| qwen3.5-9b | qwen | gguf 4bit | 256k | 2026-02-27 | 6.1 GB | 35 | 12476 | 566.1 s | 86% |
| gemma-3-12b | mlx 4bit | 128k | 2025-03-01 | 7.5 GB | 34 | 740 | 227.1 s | 86% | |
| qwen3-vl-8b | qwen | mlx 4bit | 256k | 2025-10-11 | 5.4 GB | 23 | 710 | 1055.4 s | 85% |
| qwen3-4b-2507 | qwen | mlx 4bit | 256k | 2025-08-06 | 2.1 GB | 26 | 711 | 283.0 s | 84% |
| gpt-oss-120b | openai | gguf 4bit | 128k | 2025-08-04 | 59.0 GB | 41 | 2204 | 286.9 s | 81% |
| ministral-3-14b-reasoning | mistralai | gguf 4bit | 256k | 2025-10-31 | 8.5 GB | 20 | 11312 | 1068.5 s | 77% preliminary |
| gemma-4-e2b | gguf 4bit | 128k | 2026-03-02 | 4.1 GB | 67 | 2910 | 153.4 s | 76% | |
| qwen3-8b | qwen | mlx 4bit | 40k | 2025-04-27 | 4.3 GB | 48 | 2007 | 88.0 s | 76% |
| nemotron-3-nano | nvidia | mlx 4bit | 256k | 2025-12-15 | 16.6 GB | 86 | 11228 | 379.0 s | 69% |
| qwen3.5-2b | lmstudio-community | gguf 4bit | 256k | 2026-03-02 | 1.8 GB | 111 | 825 | 60.7 s | 65% |
| qwen3.5-9b-mlx | mlx-community | mlx 4bit | 256k | 2026-02-27 | 5.6 GB | 54 | 17912 | 471.0 s | 63% |
| granite-4-h-tiny | ibm | gguf 4bit | 1024k | 2025-10-02 | 3.9 GB | 94 | 861 | 65.5 s | 62% |
| nemotron-3-nano-omni | nvidia | gguf 4bit | 256k | 2026-04-20 | 24.3 GB | 67 | 10160 | 262.9 s | 58% |
| gemma-3-4b | mlx 4bit | 128k | 2025-02-20 | 2.8 GB | 93 | 658 | 69.6 s | 54% | |
| nemotron-3-nano-4b | nvidia | gguf 4bit | 1024k | 2026-03-07 | 2.6 GB | 65 | 2214 | 138.5 s | 38% |
| lfm2.5-1.2b | liquid | mlx 8bit | 125k | 2026-01-06 | 1.2 GB | 125 | 96 | 31.6 s | 10% |
| lfm2-24b-a2b | liquid | mlx 4bit | 125k | 2026-02-24 | 12.5 GB | 69 | 13997 | 270.8 s | 10% |
| glm-4.6v-flash | zai-org | mlx 4bit | 128k | 2025-12-07 | — | 27 | 7999 | 610.3 s | — |
| qwen3-coder-30b | qwen | mlx 4bit | 256k | 2025-07-31 | 16.0 GB | 21 | 360 | 521.5 s | — |
| qwen3.5-27b-claude-4.6-opus-distilled-mlx | mlx-community | mlx 4bit | 256k | 2026-03-04 | 14.1 GB | 18 | 1971 | 1020.5 s | — |
| glm-4.5-air-mlx | lmstudio-community | mlx 4bit | 128k | 2025-07-28 | 56.0 GB | 0 | — | 0.0 s | — |
| gemma-4-26b-a4b | gguf 8bit | 256k | 2026-03-12 | — | 0 | — | 0.0 s | — | |
| gemma-4-31b | gguf 4bit | 256k | 2026-03-12 | 18.5 GB | 0 | — | 0.0 s | — | |
| gemma-4-31b | gguf 8bit | 256k | 2026-03-12 | — | 0 | — | 0.0 s | — | |
| gemma-4-e2b | gguf 8bit | 128k | 2026-03-02 | — | 0 | — | 0.0 s | — | |
| gemma-4-e4b | gguf 8bit | 128k | 2026-03-02 | — | 0 | — | 0.0 s | — | |
| nemotron-3-nano-omni | nvidia | gguf 8bit | 256k | 2026-04-20 | — | 0 | — | 0.0 s | — |
| ouro-2.6b | mlx-community | mlx 4bit | 64k | 2025-11-09 | — | 0 | — | 0.0 s | — |
| qwen3.5-122b-a10b | lmstudio-community | gguf 4bit | 256k | 2026-02-24 | 70.0 GB | 0 | — | 0.0 s | — |
| qwen3-4b-thinking-2507 | qwen | mlx 4bit | 256k | 2025-08-05 | 2.1 GB | 0 | — | 0.0 s | — |
| qwen3.5-9b | qwen | gguf 8bit | 256k | 2026-02-27 | 9.7 GB | 0 | — | 0.0 s | — |
| qwen3.6-27b | qwen | gguf 8bit | 256k | 2026-04-21 | — | 0 | — | 0.0 s | — |
| qwen3.6-35b-a3b | qwen | gguf 8bit | 256k | 2026-04-15 | — | 0 | — | 0.0 s | — |
| glm-4.7-flash | zai-org | mlx 8bit | 198k | 2026-01-19 | 29.7 GB | 0 | — | 0.0 s | — |
| glm-4.7-flash | zai-org | mlx 4bit | 198k | 2026-01-19 | 16.9 GB | 0 | — | 0.0 s | — |
Click a row to open the model detail page. Hover shows available render previews. Column headers are sortable.