Needle in a Haystack

Three sub-benchmarks in one chat (prefill only once): corpus summary · needle retrieval (10 hidden facts in 120k tokens — uniform across all models) · 6 comprehension questions about the book (4 factual answers + 2 hallucination traps).

Task & test logic in detail

Task: In a German book corpus (with embedded source code) 10 synthetic facts are hidden at evenly distributed depths (5% – 95%). The model must retrieve all of them. Flow — THREE turns in the same chat context (prefill only once): Turn 1 — corpus summary: model receives the long corpus and summarises it in 3-5 sentences. Forces real processing of the text. Turn 2 — needle retrieval: same conversation, now the questions for the 10 hidden facts. Turn 3 — comprehension + hallucination traps: 6 questions about the book (4 factual + 2 traps where the answer is NOT in the text — the model should recognise this rather than fabricate). Default mode runs ONE uniform stage for all models: 120k tokens. Models without sufficient max_context are skipped at this stage. `niah_deep` additionally runs 32k / 64k / 200k for a full heatmap. Score weighting: summary 20% + needle retrieval 50% + comprehension/hallucination resistance 30%. Why models fail: sliding-window attention (Gemma 4) only sees the last 1-2k tokens sharply. Reasoning models hit the token limit before answering. Q4 KV cache measurably degrades recall at long contexts. On the hallucination traps the helpful bias lures models into plausible-sounding inventions.

Prompt

Developer prompt

TURN 1 (User):
The following section contains a longer mixed text of German narrative and source code.

===== TEXT BEGIN =====
<corpus with embedded needles, 32k–128k tokens depending on stage>
===== TEXT END =====

Summarise the text in 3-5 sentences. Mention the main characters, setting and key themes.

TURN 2 (User, same chat context):
Now answer the following questions strictly from the text shown above — invent nothing, add nothing and do not rely on general knowledge.

Questions:
1. <question for needle 1>
2. <question for needle 2>
...

Answer as a numbered list 1., 2., 3. with one short sentence each.

Wall-time vs. quality

Max RAM

X = wall-time for this bench · Y = score (0–100 %) in this bench. Optimum is top-left — fast and good. RAM estimate for 64k context: 4 GB system + model weights + max(2 GB, 40% of weights) for KV cache.

Colour = vendor · Number = total parameters (B) dense MoE

Models in this bench

26 visible

1. qwen3.6-27b gguf 4bit 97% · 1175s · 13 t/s · 27 GB
2. qwen3.6-35b-a3b gguf 4bit 96% · 316s · 49 t/s · 33 GB
3. gemma-4-e4b gguf 4bit 96% · 205s · 55 t/s · 12 GB
4. qwen3-coder-next mlx 4bit 90% · 166s · 45 t/s · 67 GB
5. gpt-oss-20b mlx 4bit 90% · 432s · 46 t/s · 20 GB
6. gemma-4-26b-a4b gguf 4bit 90% · 279s · 55 t/s · 27 GB
7. gemma-3-27b mlx 4bit 88% · 463s · 19 t/s · 26 GB
8. qwen3.5-35b-a3b gguf 4bit 88% · 306s · 48 t/s · 33 GB
9. qwen3.5-4b gguf 4bit 87% · 395s · 43 t/s · 9 GB
10. qwen3.5-9b gguf 4bit 86% · 566s · 35 t/s · 13 GB
11. gemma-3-12b mlx 4bit 86% · 227s · 34 t/s · 15 GB
12. qwen3-vl-8b mlx 4bit 85% · 1055s · 23 t/s · 12 GB
13. qwen3-4b-2507 mlx 4bit 84% · 283s · 26 t/s · 8 GB
14. gpt-oss-120b gguf 4bit 81% · 287s · 41 t/s · 87 GB
15. ministral-3-14b-reasoning gguf 4bit 77% · 1068s · 20 t/s · 16 GB
16. gemma-4-e2b gguf 4bit 76% · 153s · 67 t/s · 10 GB
17. qwen3-8b mlx 4bit 76% · 88s · 48 t/s · 10 GB
18. nemotron-3-nano mlx 4bit 69% · 379s · 86 t/s · 27 GB
19. qwen3.5-2b gguf 4bit 65% · 61s · 111 t/s · 8 GB
20. qwen3.5-9b-mlx mlx 4bit 63% · 471s · 54 t/s · 12 GB
21. granite-4-h-tiny gguf 4bit 62% · 65s · 94 t/s · 10 GB
22. nemotron-3-nano-omni gguf 4bit 58% · 263s · 67 t/s · 38 GB
23. gemma-3-4b mlx 4bit 54% · 70s · 93 t/s · 9 GB
24. nemotron-3-nano-4b gguf 4bit 38% · 138s · 65 t/s · 9 GB
25. lfm2.5-1.2b mlx 8bit 10% · 32s · 125 t/s · 7 GB
26. lfm2-24b-a2b mlx 4bit 10% · 271s · 69 t/s · 22 GB

Model	Vendor	Quant	Ctx	Released	RAM	tok/s	Tokens	Wall	Score
qwen3.6-27b	qwen	gguf 4bit	256k	2026-04-21	16.3 GB	13	7244	1175.5 s	97%
qwen3.6-35b-a3b	qwen	gguf 4bit	256k	2026-04-15	20.6 GB	49	7550	316.1 s	96%
gemma-4-e4b	google	gguf 4bit	128k	2026-03-02	5.9 GB	55	3207	205.3 s	96%
qwen3-coder-next	qwen	mlx 4bit	256k	2026-01-30	45.3 GB	45	1029	165.7 s	90%
gpt-oss-20b	openai	mlx 4bit	128k	2025-08-04	11.3 GB	46	5633	431.9 s	90%
gemma-4-26b-a4b	google	gguf 4bit	256k	2026-03-12	16.8 GB	55	5921	278.9 s	90%
gemma-3-27b	google	mlx 4bit	128k	2025-03-01	15.7 GB	19	681	463.3 s	88%
qwen3.5-35b-a3b	qwen	gguf 4bit	256k	2026-02-24	20.6 GB	48	6804	305.5 s	88%
qwen3.5-4b	lmstudio-community	gguf 4bit	256k	2026-03-02	3.2 GB	43	10211	395.1 s	87%
qwen3.5-9b	qwen	gguf 4bit	256k	2026-02-27	6.1 GB	35	12476	566.1 s	86%
gemma-3-12b	google	mlx 4bit	128k	2025-03-01	7.5 GB	34	740	227.1 s	86%
qwen3-vl-8b	qwen	mlx 4bit	256k	2025-10-11	5.4 GB	23	710	1055.4 s	85%
qwen3-4b-2507	qwen	mlx 4bit	256k	2025-08-06	2.1 GB	26	711	283.0 s	84%
gpt-oss-120b	openai	gguf 4bit	128k	2025-08-04	59.0 GB	41	2204	286.9 s	81%
ministral-3-14b-reasoning	mistralai	gguf 4bit	256k	2025-10-31	8.5 GB	20	11312	1068.5 s	77% preliminary
gemma-4-e2b	google	gguf 4bit	128k	2026-03-02	4.1 GB	67	2910	153.4 s	76%
qwen3-8b	qwen	mlx 4bit	40k	2025-04-27	4.3 GB	48	2007	88.0 s	76%
nemotron-3-nano	nvidia	mlx 4bit	256k	2025-12-15	16.6 GB	86	11228	379.0 s	69%
qwen3.5-2b	lmstudio-community	gguf 4bit	256k	2026-03-02	1.8 GB	111	825	60.7 s	65%
qwen3.5-9b-mlx	mlx-community	mlx 4bit	256k	2026-02-27	5.6 GB	54	17912	471.0 s	63%
granite-4-h-tiny	ibm	gguf 4bit	1024k	2025-10-02	3.9 GB	94	861	65.5 s	62%
nemotron-3-nano-omni	nvidia	gguf 4bit	256k	2026-04-20	24.3 GB	67	10160	262.9 s	58%
gemma-3-4b	google	mlx 4bit	128k	2025-02-20	2.8 GB	93	658	69.6 s	54%
nemotron-3-nano-4b	nvidia	gguf 4bit	1024k	2026-03-07	2.6 GB	65	2214	138.5 s	38%
lfm2.5-1.2b	liquid	mlx 8bit	125k	2026-01-06	1.2 GB	125	96	31.6 s	10%
lfm2-24b-a2b	liquid	mlx 4bit	125k	2026-02-24	12.5 GB	69	13997	270.8 s	10%
glm-4.6v-flash	zai-org	mlx 4bit	128k	2025-12-07	—	27	7999	610.3 s	—
qwen3-coder-30b	qwen	mlx 4bit	256k	2025-07-31	16.0 GB	21	360	521.5 s	—
qwen3.5-27b-claude-4.6-opus-distilled-mlx	mlx-community	mlx 4bit	256k	2026-03-04	14.1 GB	18	1971	1020.5 s	—
glm-4.5-air-mlx	lmstudio-community	mlx 4bit	128k	2025-07-28	56.0 GB	0	—	0.0 s	—
gemma-4-26b-a4b	google	gguf 8bit	256k	2026-03-12	—	0	—	0.0 s	—
gemma-4-31b	google	gguf 4bit	256k	2026-03-12	18.5 GB	0	—	0.0 s	—
gemma-4-31b	google	gguf 8bit	256k	2026-03-12	—	0	—	0.0 s	—
gemma-4-e2b	google	gguf 8bit	128k	2026-03-02	—	0	—	0.0 s	—
gemma-4-e4b	google	gguf 8bit	128k	2026-03-02	—	0	—	0.0 s	—
nemotron-3-nano-omni	nvidia	gguf 8bit	256k	2026-04-20	—	0	—	0.0 s	—
ouro-2.6b	mlx-community	mlx 4bit	64k	2025-11-09	—	0	—	0.0 s	—
qwen3.5-122b-a10b	lmstudio-community	gguf 4bit	256k	2026-02-24	70.0 GB	0	—	0.0 s	—
qwen3-4b-thinking-2507	qwen	mlx 4bit	256k	2025-08-05	2.1 GB	0	—	0.0 s	—
qwen3.5-9b	qwen	gguf 8bit	256k	2026-02-27	9.7 GB	0	—	0.0 s	—
qwen3.6-27b	qwen	gguf 8bit	256k	2026-04-21	—	0	—	0.0 s	—
qwen3.6-35b-a3b	qwen	gguf 8bit	256k	2026-04-15	—	0	—	0.0 s	—
glm-4.7-flash	zai-org	mlx 8bit	198k	2026-01-19	29.7 GB	0	—	0.0 s	—
glm-4.7-flash	zai-org	mlx 4bit	198k	2026-01-19	16.9 GB	0	—	0.0 s	—

Click a row to open the model detail page. Hover shows available render previews. Column headers are sortable.