Tool Use

Seven OpenAI-style function-calling scenarios across three difficulty tiers (easy / medium / hard). Toolset: list_files, read_file, apply_diff, get_weather. The model must pick tools, supply correct arguments, chain multiple calls and synthesise a final answer.

Task & test logic in detail

Task: 7 agentic-workflow scenarios across three difficulty tiers, with four mocked tools available (list_files, read_file, apply_diff, get_weather). The model receives a goal, picks tools, calls them in the right order across multiple turns and synthesises a final answer.

Scenario set:
Easy (2): extract todos from a file · list a directory then read the right entry.
Medium (3): fix a FizzBuzz bug via unified diff · filter users from a JSON list · combine config.json with a weather mock for an alert.
Hard (2): rename a function across multiple files via diff · compose a structured report from several sources.

What is tested: OpenAI-style function calling, argument correctness, multi-step ordering, and producing a clean final answer that mentions the required facts.

Why models fail: models without 'tool_use' capability ignore the schemas. Weak models pick wrong tools or emit malformed arguments. Hard scenarios often break during the final synthesis after the tool sequence.
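To make "OpenAI-style function calling" and "malformed arguments" concrete, here is a minimal sketch of how the four tools might be declared and dispatched against mocks. Only the tool names come from the bench description. The parameter names (path, diff, city), the mocked file contents, the weather value and the helper names make_tool and dispatch are all assumptions for illustration, not the bench's actual harness.

```python
import json


def make_tool(name, description, properties, required):
    """Build one entry in the OpenAI 'tools' list format."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Tool names are from the bench; parameter names are assumed.
TOOLS = [
    make_tool("list_files", "List entries in a directory.",
              {"path": {"type": "string"}}, ["path"]),
    make_tool("read_file", "Return the contents of a file.",
              {"path": {"type": "string"}}, ["path"]),
    make_tool("apply_diff", "Structurally check a unified diff.",
              {"diff": {"type": "string"}}, ["diff"]),
    make_tool("get_weather", "Return mocked weather for a city.",
              {"city": {"type": "string"}}, ["city"]),
]

# Invented mock data, standing in for the bench's fixtures.
MOCK_FILES = {"notes/todo.md": "- [ ] write tests\n- [x] ship v1\n"}


def dispatch(tool_call):
    """Run one tool call shaped like an OpenAI tool_calls entry."""
    name = tool_call["function"]["name"]
    try:
        args = json.loads(tool_call["function"]["arguments"])
    except json.JSONDecodeError:
        return {"error": "malformed arguments"}
    if not isinstance(args, dict):
        return {"error": "malformed arguments"}

    schema = next(
        (t["function"] for t in TOOLS if t["function"]["name"] == name), None
    )
    if schema is None:
        return {"error": f"unknown tool: {name}"}
    missing = [k for k in schema["parameters"]["required"] if k not in args]
    if missing:
        return {"error": f"missing arguments: {missing}"}

    if name == "list_files":
        prefix = args["path"].rstrip("/") + "/"
        entries = sorted(p[len(prefix):] for p in MOCK_FILES if p.startswith(prefix))
        return {"entries": entries}
    if name == "read_file":
        content = MOCK_FILES.get(args["path"])
        return {"content": content} if content is not None else {"error": "not found"}
    if name == "apply_diff":
        diff = args["diff"]
        # Very rough structural check: file headers plus at least one hunk.
        valid = diff.startswith("--- ") and "\n+++ " in diff and "\n@@ " in diff
        return {"valid": valid}
    # get_weather: fixed mock value
    return {"city": args["city"], "temp_c": 21}
```

In a sketch like this, the "list a directory then read the right entry" scenario corresponds to a list_files call followed by a read_file call on one of the returned entries. A wrong tool name or an arguments string that is not valid JSON surfaces as an error result rather than a crash.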

Prompt

System prompt

You are working in an agentic IDE (comparable to Cursor/Claude Code). You have tools for file access, diff editing, directory listing, and a weather mock. Plan in multiple steps: first investigate, then act, and verify at the end. Call a tool whenever you need information; never guess the contents of files. If the task calls for a diff, pass it to apply_diff to check it structurally. If the task calls for a JSON object as the answer, respond EXCLUSIVELY with the JSON object, without Markdown fences and without any leading or trailing text. Otherwise, keep the final answer brief and refer to the tool results.
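The JSON-only rule in the system prompt is easy to check mechanically. The following is a sketch of one possible check, not the bench's actual scoring code. The function name is_bare_json_object is hypothetical, and tolerating surrounding whitespace is an assumption.

```python
import json


def is_bare_json_object(answer: str) -> bool:
    """True if the answer is exactly one JSON object with no fences or extra text."""
    text = answer.strip()  # assumption: surrounding whitespace is tolerated
    if not (text.startswith("{") and text.endswith("}")):
        return False
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False
```

Under this check, an answer wrapped in Markdown fences or preceded by a sentence such as "Here is the result:" is rejected, even if the embedded JSON is valid.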

Wall-time vs. quality (scatter plot)

X = wall-time for this bench · Y = score (0–100 %) in this bench. Optimum is top-left: fast and good. Colour = vendor · Number = total parameters (B) · Legend: dense, MoE.

Max RAM: estimate for 64k context: 4 GB system + model weights + max(2 GB, 40% of weights) for KV cache.
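The RAM estimate described above is simple arithmetic and can be written as a short function. This is a sketch of the stated formula only. The function name estimate_max_ram_gb and the system_gb parameter are introduced here for illustration.

```python
def estimate_max_ram_gb(weights_gb: float, system_gb: float = 4.0) -> float:
    """Estimated peak RAM at 64k context, per the formula stated for this bench."""
    # KV cache is budgeted at 40% of the weights, but never below 2 GB.
    kv_cache_gb = max(2.0, 0.4 * weights_gb)
    return system_gb + weights_gb + kv_cache_gb
```

For example, the 4.1 GB listed in the table for gemma-4-e2b gguf 4bit gives 4 + 4.1 + 2 = 10.1 GB, in line with the 10 GB shown for that entry in the model list. The 70.0 GB of qwen3.5-122b-a10b gives 4 + 70 + 28 = 102 GB.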

Models in this bench

44 visible

1. gemma-4-e2b gguf 4bit 100% · 51s · 132 t/s · 10 GB
2. gemma-4-e2b gguf 8bit 100% · 78s · 113 t/s · 12 GB
3. gemma-4-e4b gguf 4bit 100% · 80s · 89 t/s · 12 GB
4. qwen3-8b mlx 4bit 100% · 120s · 88 t/s · 10 GB
5. nemotron-3-nano-omni gguf 4bit 100% · 202s · 87 t/s · 38 GB
6. gemma-4-26b-a4b gguf 8bit 100% · 132s · 75 t/s · 41 GB
7. qwen3.6-35b-a3b gguf 8bit 100% · 70s · 70 t/s · 53 GB
8. gemma-4-e4b gguf 8bit 100% · 94s · 68 t/s · 16 GB
9. glm-4.5-air-mlx mlx 4bit 100% · 146s · 43 t/s · 82 GB
10. qwen3.6-27b gguf 4bit 100% · 179s · 22 t/s · 27 GB
11. gemma-4-31b gguf 4bit 100% · 249s · 21 t/s · 30 GB
12. qwen3.5-122b-a10b gguf 4bit 100% · 266s · 12 t/s · 102 GB
13. qwen3.5-27b-claude-4.6-opus-distilled-mlx mlx 4bit 100% · 651s · 9 t/s · 24 GB
14. gpt-oss-20b mlx 4bit 98% · 99s · 114 t/s · 20 GB
15. glm-4.7-flash mlx 4bit 98% · 53s · 85 t/s · 28 GB
16. qwen3-coder-next mlx 4bit 98% · 38s · 74 t/s · 67 GB
17. gemma-4-26b-a4b gguf 4bit 98% · 37s · 94 t/s · 27 GB
18. nemotron-3-nano mlx 4bit 96% · 137s · 132 t/s · 27 GB
19. qwen3-vl-30b mlx 4bit 96% · 51s · 91 t/s · 28 GB
20. qwen3-vl-8b mlx 4bit 96% · 106s · 84 t/s · 12 GB
21. qwen3.5-35b-a3b gguf 4bit 96% · 35s · 81 t/s · 33 GB
22. qwen3-coder-30b mlx 4bit 95% · 33s · 109 t/s · 26 GB
23. qwen3.6-35b-a3b gguf 4bit 95% · 55s · 83 t/s · 33 GB
24. qwen3-30b-a3b-2507 mlx 4bit 94% · 22s · 107 t/s · 26 GB
25. glm-4.7-flash mlx 8bit 94% · 51s · 57 t/s · 46 GB
26. gpt-oss-120b gguf 4bit 92% · 71s · 82 t/s · 87 GB
27. qwen3-4b-2507 mlx 4bit 89% · 26s · 150 t/s · 8 GB
28. nemotron-3-nano-4b gguf 4bit 89% · 109s · 86 t/s · 9 GB
29. nemotron-3-nano-omni gguf 8bit 89% · 244s · 77 t/s · 50 GB
30. qwen3-4b-thinking-2507 mlx 4bit 89% · 422s · 49 t/s · 8 GB
31. devstral-small-2-2512 mlx 4bit 89% · 64s · 37 t/s · 22 GB
32. nemotron-3-super gguf 4bit 89% · 364s · 33 t/s · 116 GB
33. granite-4-h-tiny gguf 4bit 89% · 32s · 118 t/s · 10 GB
34. glm-4.6v-flash mlx 4bit 86% · 134s · 67 t/s · 13 GB
35. qwen3.5-9b-mlx mlx 4bit 82% · 81s · 89 t/s · 12 GB
36. ministral-3-14b-reasoning gguf 4bit 82% · 52s · 54 t/s · 16 GB
37. seed-oss-36b mlx 4bit 76% · 835s · 21 t/s · 31 GB
38. qwen3.5-4b gguf 4bit 75% · 42s · 88 t/s · 9 GB
39. qwen3.5-2b gguf 4bit 75% · 31s · 166 t/s · 8 GB
40. qwen3.5-9b gguf 4bit 70% · 161s · 61 t/s · 13 GB
41. qwen3.5-9b gguf 8bit 69% · 69s · 45 t/s · 18 GB
42. llama-3.3-70b gguf 4bit 50% · 122s · 11 t/s · 59 GB
43. lfm2-24b-a2b mlx 4bit 48% · 22s · 154 t/s · 22 GB
44. lfm2.5-1.2b mlx 8bit 5% · 10s · 287 t/s · 7 GB

Model	Vendor	Quant	Context	Released	RAM (weights)	tok/s	Tokens	Wall time	Score
gemma-4-e2b	google	gguf 4bit	128k	2026-03-02	4.1 GB	132	6154	50.8 s	100%
gemma-4-e2b	google	gguf 8bit	128k	2026-03-02	5.5 GB	113	8225	77.9 s	100%
gemma-4-e4b	google	gguf 4bit	128k	2026-03-02	5.9 GB	89	6672	80.1 s	100%
qwen3-8b	qwen	mlx 4bit	40k	2025-04-27	4.3 GB	88	9664	120.0 s	100%
nemotron-3-nano-omni	nvidia	gguf 4bit	256k	2026-04-20	24.3 GB	87	16547	202.2 s	100%
gemma-4-26b-a4b	google	gguf 8bit	256k	2026-03-12	26.1 GB	75	8835	131.7 s	100%
qwen3.6-35b-a3b	qwen	gguf 8bit	256k	2026-04-15	35.2 GB	70	3994	69.5 s	100%
gemma-4-e4b	google	gguf 8bit	128k	2026-03-02	8.4 GB	68	5971	93.9 s	100%
glm-4.5-air-mlx	lmstudio-community	mlx 4bit	128k	2025-07-28	56.0 GB	43	4074	145.5 s	100%
qwen3.6-27b	qwen	gguf 4bit	256k	2026-04-21	16.3 GB	22	3040	179.4 s	100%
gemma-4-31b	google	gguf 4bit	256k	2026-03-12	18.5 GB	21	4124	249.2 s	100%
qwen3.5-122b-a10b	lmstudio-community	gguf 4bit	256k	2026-02-24	70.0 GB	12	1888	265.9 s	100%
qwen3.5-27b-claude-4.6-opus-distilled-mlx	mlx-community	mlx 4bit	256k	2026-03-04	14.1 GB	9	3986	650.6 s	100%
gpt-oss-20b	openai	mlx 4bit	128k	2025-08-04	11.3 GB	114	8636	98.5 s	98%
glm-4.7-flash	zai-org	mlx 4bit	198k	2026-01-19	16.9 GB	85	2885	53.4 s	98%
qwen3-coder-next	qwen	mlx 4bit	256k	2026-01-30	45.3 GB	74	1519	38.0 s	98%
gemma-4-26b-a4b	google	gguf 4bit	256k	2026-03-12	16.8 GB	94	2737	36.8 s	98%
nemotron-3-nano	nvidia	mlx 4bit	256k	2025-12-15	16.6 GB	132	14908	136.7 s	96%
qwen3-vl-30b	qwen	mlx 4bit	256k	2025-10-04	17.0 GB	91	1840	51.4 s	96%
qwen3-vl-8b	qwen	mlx 4bit	256k	2025-10-11	5.4 GB	84	3981	106.2 s	96%
qwen3.5-35b-a3b	qwen	gguf 4bit	256k	2026-02-24	20.6 GB	81	2090	34.6 s	96%
qwen3-coder-30b	qwen	mlx 4bit	256k	2025-07-31	16.0 GB	109	1304	33.0 s	95%
qwen3.6-35b-a3b	qwen	gguf 4bit	256k	2026-04-15	20.6 GB	83	3831	55.4 s	95%
qwen3-30b-a3b-2507	qwen	mlx 4bit	256k	2025-07-21	16.0 GB	107	1675	22.3 s	94%
glm-4.7-flash	zai-org	mlx 8bit	198k	2026-01-19	29.7 GB	57	1250	51.1 s	94%
gpt-oss-120b	openai	gguf 4bit	128k	2025-08-04	59.0 GB	82	4241	70.9 s	92%
qwen3-4b-2507	qwen	mlx 4bit	256k	2025-08-06	2.1 GB	150	1754	26.4 s	89%
nemotron-3-nano-4b	nvidia	gguf 4bit	1024k	2026-03-07	2.6 GB	86	8477	108.8 s	89%
nemotron-3-nano-omni	nvidia	gguf 8bit	256k	2026-04-20	32.8 GB	77	17645	244.0 s	89%
qwen3-4b-thinking-2507	qwen	mlx 4bit	256k	2025-08-05	2.1 GB	49	20242	422.2 s	89%
devstral-small-2-2512	mistralai	mlx 4bit	384k	2025-12-09	13.2 GB	37	1317	63.9 s	89%
nemotron-3-super	nvidia	gguf 4bit	1024k	2026-03-10	80.1 GB	33	10677	364.2 s	89%
granite-4-h-tiny	ibm	gguf 4bit	1024k	2025-10-02	3.9 GB	118	3070	32.1 s	89%
glm-4.6v-flash	zai-org	mlx 4bit	128k	2025-12-07	6.6 GB	67	5103	133.9 s	86%
qwen3.5-9b-mlx	mlx-community	mlx 4bit	256k	2026-02-27	5.6 GB	89	4672	80.6 s	82%
ministral-3-14b-reasoning	mistralai	gguf 4bit	256k	2025-10-31	8.5 GB	54	1950	51.5 s	82%
seed-oss-36b	bytedance	mlx 4bit	512k	2025-08-20	19.0 GB	21	15737	834.8 s	76%
qwen3.5-4b	lmstudio-community	gguf 4bit	256k	2026-03-02	3.2 GB	88	3101	41.6 s	75%
qwen3.5-2b	lmstudio-community	gguf 4bit	256k	2026-03-02	1.8 GB	166	4454	31.3 s	75%
qwen3.5-9b	qwen	gguf 4bit	256k	2026-02-27	6.1 GB	61	8776	160.8 s	70%
qwen3.5-9b	qwen	gguf 8bit	256k	2026-02-27	9.7 GB	45	2663	69.2 s	69%
llama-3.3-70b	meta	gguf 4bit	128k	2024-12-06	39.6 GB	11	542	121.6 s	50%
lfm2-24b-a2b	liquid	mlx 4bit	125k	2026-02-24	12.5 GB	154	947	21.6 s	48%
lfm2.5-1.2b	liquid	mlx 8bit	125k	2026-01-06	1.2 GB	287	658	9.7 s	5%
gemma-4-31b	google	gguf 8bit	256k	2026-03-12	—	0	—	0.0 s	—
qwen3.6-27b	qwen	gguf 8bit	256k	2026-04-21	—	0	—	0.0 s	—
