Tool Use
Seven OpenAI-style function-calling scenarios across three difficulty tiers (easy / medium / hard). Toolset: list_files, read_file, apply_diff, get_weather. The model must pick tools, supply correct arguments, chain multiple calls and synthesise a final answer.
Task & test logic in detail
Task: 7 agentic-workflow scenarios across three difficulty tiers, with four mocked tools available (list_files, read_file, apply_diff, get_weather). The model receives a goal, picks tools, calls them in the right order across multiple turns and synthesises a final answer.
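To make the setup concrete, here is a minimal sketch of how four mocked tools could be declared in the OpenAI-style function-calling format and dispatched locally. The parameter names (`path`, `diff`, `city`), the descriptions, the in-memory file and the canned weather value are all assumptions for illustration; the page does not show the benchmark's actual schemas or mocks.

```python
import json


def tool(name, description, properties, required):
    """Build one OpenAI-style function tool entry (JSON Schema parameters)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Hypothetical schemas: tool names come from the page, parameters are assumed.
TOOLS = [
    tool("list_files", "List entries in a directory.",
         {"path": {"type": "string"}}, ["path"]),
    tool("read_file", "Return the contents of a file.",
         {"path": {"type": "string"}}, ["path"]),
    tool("apply_diff", "Structurally check a unified diff.",
         {"diff": {"type": "string"}}, ["diff"]),
    tool("get_weather", "Return mocked weather for a city.",
         {"city": {"type": "string"}}, ["city"]),
]

# Mocked backend: a tiny in-memory file system (illustrative content only).
FILES = {"notes/todo.md": "- [ ] write tests\n- [x] fix bug\n"}


def dispatch(name, arguments_json):
    """Run one tool call. Arguments arrive as a JSON string, as in OpenAI-style calls."""
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError:
        return {"error": "malformed arguments"}
    if not isinstance(args, dict):
        return {"error": "arguments must be a JSON object"}

    if name == "list_files":
        prefix = str(args.get("path", "")).rstrip("/") + "/"
        entries = sorted(p[len(prefix):] for p in FILES if p.startswith(prefix))
        return {"entries": entries}
    if name == "read_file":
        path = args.get("path")
        if path not in FILES:
            return {"error": "file not found"}
        return {"content": FILES[path]}
    if name == "apply_diff":
        diff = str(args.get("diff", ""))
        return {"ok": "--- " in diff and "+++ " in diff and "@@" in diff}
    if name == "get_weather":
        # Canned value: a mock never touches a real weather service.
        return {"city": args.get("city"), "temp_c": 21}
    return {"error": f"unknown tool {name}"}
```

In a multi-turn run, the harness would pass `TOOLS` with each request, feed every tool call the model emits through `dispatch`, and append the result as a tool message before the next turn.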
Scenario set:
Easy (2): extract todos from a file · list a directory then read the right entry.
Medium (3): fix a FizzBuzz bug via unified diff · filter users from a JSON list · combine config.json with a weather mock for an alert.
Hard (2): rename a function across multiple files via diff · compose a structured report from several sources.
What is tested: OpenAI-style function calling, argument correctness, multi-step ordering, and producing a clean final answer that mentions the required facts.
Why models fail: models without the 'tool_use' capability ignore the tool schemas. Weaker models pick the wrong tool or emit malformed arguments. In the hard scenarios, runs often break during the final synthesis, after the tool sequence has completed.
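The checks described above (right tools, right arguments, right order, required facts in the final answer) could be scored roughly as follows. This is a sketch under assumptions, not the benchmark's scorer: the `Scenario` structure, the equal weighting of checks and the case-insensitive substring match are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    expected_calls: list   # ordered (tool_name, required_args) pairs
    required_facts: list   # substrings the final answer must mention


def calls_in_order(expected, actual):
    """True if every expected call appears in `actual` as an ordered
    subsequence and each required argument matches exactly."""
    remaining = iter(actual)
    for name, required_args in expected:
        for got_name, got_args in remaining:
            if got_name == name and all(
                got_args.get(key) == value for key, value in required_args.items()
            ):
                break  # found this expected call, move on to the next one
        else:
            return False  # ran out of actual calls before finding a match
    return True


def score(scenario, actual_calls, final_answer):
    """Fraction of checks passed: one for the call sequence, one per fact."""
    checks = [calls_in_order(scenario.expected_calls, actual_calls)]
    answer = final_answer.lower()
    checks += [fact.lower() in answer for fact in scenario.required_facts]
    return sum(checks) / len(checks)
```

Treating the call sequence as an ordered subsequence tolerates extra exploratory calls while still penalising a model that reads a file before listing the directory it was supposed to discover it in.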
Prompt
System prompt
(Translated from the German original.) You work in an agentic IDE (comparable to Cursor/Claude Code). You have tools for file access, diff editing, directory listing and a weather mock. Plan in multiple steps: first investigate, then act, and verify at the end. Call a tool when you need information; never guess the contents of files. If the task requires a diff, pass it to apply_diff to check it structurally. If the task requires a JSON object as the answer, respond EXCLUSIVELY with the JSON object, without Markdown fences and without any leading or trailing text. Otherwise keep the final answer brief and refer to the tool results.
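The system prompt asks the model to hand diffs to apply_diff for a structural check. What that check involves is not documented on this page; the sketch below shows one plausible interpretation, assuming it means "file headers are present and each hunk's line counts match its `@@` header". The function name `check_unified_diff` and the FizzBuzz sample are hypothetical.

```python
import re

# Matches hunk headers such as "@@ -1,4 +1,4 @@"; a missing count defaults to 1.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def check_unified_diff(diff):
    """Return a list of structural problems; an empty list means well-formed."""
    problems = []
    lines = diff.splitlines()
    if not any(line.startswith("--- ") for line in lines):
        problems.append("missing '--- ' header")
    if not any(line.startswith("+++ ") for line in lines):
        problems.append("missing '+++ ' header")

    i, hunks = 0, 0
    while i < len(lines):
        match = HUNK_RE.match(lines[i])
        if not match:
            i += 1
            continue
        hunks += 1
        old_left = int(match.group(2) or 1)  # old-file lines still expected
        new_left = int(match.group(4) or 1)  # new-file lines still expected
        i += 1
        while i < len(lines) and (old_left > 0 or new_left > 0):
            line = lines[i]
            if line.startswith("\\"):        # "\ No newline at end of file"
                pass
            elif line.startswith("+"):
                new_left -= 1
            elif line.startswith("-"):
                old_left -= 1
            elif line.startswith(" ") or line == "":
                old_left -= 1                # context counts on both sides
                new_left -= 1
            else:
                break                        # not a valid hunk body line
            i += 1
        if old_left != 0 or new_left != 0:
            problems.append(f"hunk {hunks}: line counts do not match header")
    if hunks == 0:
        problems.append("no hunk header found")
    return problems


# Hypothetical FizzBuzz fix, used only to exercise the checker.
SAMPLE_DIFF = "\n".join([
    "--- a/fizzbuzz.py",
    "+++ b/fizzbuzz.py",
    "@@ -1,4 +1,4 @@",
    " def fizzbuzz(n):",
    "-    if n % 5 == 0:",
    "+    if n % 15 == 0:",
    '         return "FizzBuzz"',
    "     return str(n)",
])
```

A check like this catches the common failure of a model writing a plausible-looking diff whose hunk header does not agree with its body, without needing to apply the patch to a real file.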
Wall-time vs. quality
X axis = wall-time for this bench · Y axis = score (0–100 %) in this bench. The optimum is top-left: fast and accurate. RAM estimate for a 64k context: 4 GB system + model weights + max(2 GB, 40 % of weights) for the KV cache.
Legend: colour = vendor · number = total parameters (B) · separate markers for dense and MoE models.
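The RAM estimate quoted above is simple enough to write down directly. In the sketch below the function name `estimate_ram_gb` is an assumption; the constants come from the formula as stated. Feeding it the values from the table's RAM column appears to reproduce the rounded GB figures in the model list, which suggests that the table column holds the weights size rather than the full estimate.

```python
def estimate_ram_gb(weights_gb):
    """RAM estimate for a 64k context, following the formula stated on the page:
    4 GB system + model weights + max(2 GB, 40% of weights) for the KV cache."""
    system_gb = 4.0
    kv_cache_gb = max(2.0, 0.40 * weights_gb)
    return system_gb + weights_gb + kv_cache_gb


# Weights taken from the table; the list shows the rounded result.
for name, weights_gb in [
    ("gemma-4-e2b gguf 4bit", 4.1),
    ("nemotron-3-nano-omni gguf 4bit", 24.3),
    ("glm-4.5-air-mlx mlx 4bit", 56.0),
]:
    print(f"{name}: {round(estimate_ram_gb(weights_gb))} GB")
```

For small models the 2 GB floor dominates the KV-cache term; the 40 % term only takes over once the weights exceed 5 GB.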
Models in this bench
38 visible
- 1. gemma-4-e2b gguf 4bit 100% · 51s · 132 t/s · 10 GB
- 2. gemma-4-e2b gguf 8bit 100% · 78s · 113 t/s · 12 GB
- 3. gemma-4-e4b gguf 4bit 100% · 80s · 89 t/s · 12 GB
- 4. qwen3-8b mlx 4bit 100% · 120s · 88 t/s · 10 GB
- 5. nemotron-3-nano-omni gguf 4bit 100% · 202s · 87 t/s · 38 GB
- 6. gemma-4-26b-a4b gguf 8bit 100% · 132s · 75 t/s · 41 GB
- 7. qwen3.6-35b-a3b gguf 8bit 100% · 70s · 70 t/s · 53 GB
- 8. gemma-4-e4b gguf 8bit 100% · 94s · 68 t/s · 16 GB
- 9. glm-4.5-air-mlx mlx 4bit 100% · 146s · 43 t/s · 82 GB
- 10. qwen3.6-27b gguf 4bit 100% · 179s · 22 t/s · 27 GB
- 11. gemma-4-31b gguf 4bit 100% · 249s · 21 t/s · 30 GB
- 12. qwen3.5-122b-a10b gguf 4bit 100% · 266s · 12 t/s · 102 GB
- 13. qwen3.5-27b-claude-4.6-opus-distilled-mlx mlx 4bit 100% · 651s · 9 t/s · 24 GB
- 14. gpt-oss-20b mlx 4bit 98% · 99s · 114 t/s · 20 GB
- 15. glm-4.7-flash mlx 4bit 98% · 53s · 85 t/s · 28 GB
- 16. qwen3-coder-next mlx 4bit 98% · 38s · 74 t/s · 67 GB
- 17. gemma-4-26b-a4b gguf 4bit 98% · 37s · 94 t/s · 27 GB
- 18. nemotron-3-nano mlx 4bit 96% · 137s · 132 t/s · 27 GB
- 19. qwen3-vl-8b mlx 4bit 96% · 106s · 84 t/s · 12 GB
- 20. qwen3.5-35b-a3b gguf 4bit 96% · 35s · 81 t/s · 33 GB
- 21. qwen3-coder-30b mlx 4bit 95% · 33s · 109 t/s · 26 GB
- 22. qwen3.6-35b-a3b gguf 4bit 95% · 55s · 83 t/s · 33 GB
- 23. glm-4.7-flash mlx 8bit 94% · 51s · 57 t/s · 46 GB
- 24. gpt-oss-120b gguf 4bit 92% · 71s · 82 t/s · 87 GB
- 25. qwen3-4b-2507 mlx 4bit 89% · 26s · 150 t/s · 8 GB
- 26. nemotron-3-nano-4b gguf 4bit 89% · 109s · 86 t/s · 9 GB
- 27. nemotron-3-nano-omni gguf 8bit 89% · 244s · 77 t/s · 50 GB
- 28. qwen3-4b-thinking-2507 mlx 4bit 89% · 422s · 49 t/s · 8 GB
- 29. granite-4-h-tiny gguf 4bit 89% · 32s · 118 t/s · 10 GB
- 30. glm-4.6v-flash mlx 4bit 86% · 134s · 67 t/s · 13 GB
- 31. qwen3.5-9b-mlx mlx 4bit 82% · 81s · 89 t/s · 12 GB
- 32. ministral-3-14b-reasoning gguf 4bit 82% · 52s · 54 t/s · 16 GB
- 33. qwen3.5-4b gguf 4bit 75% · 42s · 88 t/s · 9 GB
- 34. qwen3.5-2b gguf 4bit 75% · 31s · 166 t/s · 8 GB
- 35. qwen3.5-9b gguf 4bit 70% · 161s · 61 t/s · 13 GB
- 36. qwen3.5-9b gguf 8bit 69% · 69s · 45 t/s · 18 GB
- 37. lfm2-24b-a2b mlx 4bit 48% · 22s · 154 t/s · 22 GB
- 38. lfm2.5-1.2b mlx 8bit 5% · 10s · 287 t/s · 7 GB
| Model | Vendor | Quant | Ctx | Released | RAM | tok/s | Tokens | Wall | Score |
|---|---|---|---|---|---|---|---|---|---|
| gemma-4-e2b | | gguf 4bit | 128k | 2026-03-02 | 4.1 GB | 132 | 6154 | 50.8 s | 100% |
| gemma-4-e2b | | gguf 8bit | 128k | 2026-03-02 | 5.5 GB | 113 | 8225 | 77.9 s | 100% |
| gemma-4-e4b | | gguf 4bit | 128k | 2026-03-02 | 5.9 GB | 89 | 6672 | 80.1 s | 100% |
| qwen3-8b | qwen | mlx 4bit | 40k | 2025-04-27 | 4.3 GB | 88 | 9664 | 120.0 s | 100% |
| nemotron-3-nano-omni | nvidia | gguf 4bit | 256k | 2026-04-20 | 24.3 GB | 87 | 16547 | 202.2 s | 100% |
| gemma-4-26b-a4b | | gguf 8bit | 256k | 2026-03-12 | 26.1 GB | 75 | 8835 | 131.7 s | 100% |
| qwen3.6-35b-a3b | qwen | gguf 8bit | 256k | 2026-04-15 | 35.2 GB | 70 | 3994 | 69.5 s | 100% |
| gemma-4-e4b | | gguf 8bit | 128k | 2026-03-02 | 8.4 GB | 68 | 5971 | 93.9 s | 100% |
| glm-4.5-air-mlx | lmstudio-community | mlx 4bit | 128k | 2025-07-28 | 56.0 GB | 43 | 4074 | 145.5 s | 100% |
| qwen3.6-27b | qwen | gguf 4bit | 256k | 2026-04-21 | 16.3 GB | 22 | 3040 | 179.4 s | 100% |
| gemma-4-31b | | gguf 4bit | 256k | 2026-03-12 | 18.5 GB | 21 | 4124 | 249.2 s | 100% |
| qwen3.5-122b-a10b | lmstudio-community | gguf 4bit | 256k | 2026-02-24 | 70.0 GB | 12 | 1888 | 265.9 s | 100% |
| qwen3.5-27b-claude-4.6-opus-distilled-mlx | mlx-community | mlx 4bit | 256k | 2026-03-04 | 14.1 GB | 9 | 3986 | 650.6 s | 100% |
| gpt-oss-20b | openai | mlx 4bit | 128k | 2025-08-04 | 11.3 GB | 114 | 8636 | 98.5 s | 98% |
| glm-4.7-flash | zai-org | mlx 4bit | 198k | 2026-01-19 | 16.9 GB | 85 | 2885 | 53.4 s | 98% |
| qwen3-coder-next | qwen | mlx 4bit | 256k | 2026-01-30 | 45.3 GB | 74 | 1519 | 38.0 s | 98% |
| gemma-4-26b-a4b | | gguf 4bit | 256k | 2026-03-12 | 16.8 GB | 94 | 2737 | 36.8 s | 98% |
| nemotron-3-nano | nvidia | mlx 4bit | 256k | 2025-12-15 | 16.6 GB | 132 | 14908 | 136.7 s | 96% |
| qwen3-vl-8b | qwen | mlx 4bit | 256k | 2025-10-11 | 5.4 GB | 84 | 3981 | 106.2 s | 96% |
| qwen3.5-35b-a3b | qwen | gguf 4bit | 256k | 2026-02-24 | 20.6 GB | 81 | 2090 | 34.6 s | 96% |
| qwen3-coder-30b | qwen | mlx 4bit | 256k | 2025-07-31 | 16.0 GB | 109 | 1304 | 33.0 s | 95% |
| qwen3.6-35b-a3b | qwen | gguf 4bit | 256k | 2026-04-15 | 20.6 GB | 83 | 3831 | 55.4 s | 95% |
| glm-4.7-flash | zai-org | mlx 8bit | 198k | 2026-01-19 | 29.7 GB | 57 | 1250 | 51.1 s | 94% |
| gpt-oss-120b | openai | gguf 4bit | 128k | 2025-08-04 | 59.0 GB | 82 | 4241 | 70.9 s | 92% |
| qwen3-4b-2507 | qwen | mlx 4bit | 256k | 2025-08-06 | 2.1 GB | 150 | 1754 | 26.4 s | 89% |
| nemotron-3-nano-4b | nvidia | gguf 4bit | 1024k | 2026-03-07 | 2.6 GB | 86 | 8477 | 108.8 s | 89% |
| nemotron-3-nano-omni | nvidia | gguf 8bit | 256k | 2026-04-20 | 32.8 GB | 77 | 17645 | 244.0 s | 89% |
| qwen3-4b-thinking-2507 | qwen | mlx 4bit | 256k | 2025-08-05 | 2.1 GB | 49 | 20242 | 422.2 s | 89% |
| granite-4-h-tiny | ibm | gguf 4bit | 1024k | 2025-10-02 | 3.9 GB | 118 | 3070 | 32.1 s | 89% |
| glm-4.6v-flash | zai-org | mlx 4bit | 128k | 2025-12-07 | 6.6 GB | 67 | 5103 | 133.9 s | 86% |
| qwen3.5-9b-mlx | mlx-community | mlx 4bit | 256k | 2026-02-27 | 5.6 GB | 89 | 4672 | 80.6 s | 82% |
| ministral-3-14b-reasoning | mistralai | gguf 4bit | 256k | 2025-10-31 | 8.5 GB | 54 | 1950 | 51.5 s | 82% |
| qwen3.5-4b | lmstudio-community | gguf 4bit | 256k | 2026-03-02 | 3.2 GB | 88 | 3101 | 41.6 s | 75% |
| qwen3.5-2b | lmstudio-community | gguf 4bit | 256k | 2026-03-02 | 1.8 GB | 166 | 4454 | 31.3 s | 75% |
| qwen3.5-9b | qwen | gguf 4bit | 256k | 2026-02-27 | 6.1 GB | 61 | 8776 | 160.8 s | 70% |
| qwen3.5-9b | qwen | gguf 8bit | 256k | 2026-02-27 | 9.7 GB | 45 | 2663 | 69.2 s | 69% |
| lfm2-24b-a2b | liquid | mlx 4bit | 125k | 2026-02-24 | 12.5 GB | 154 | 947 | 21.6 s | 48% |
| lfm2.5-1.2b | liquid | mlx 8bit | 125k | 2026-01-06 | 1.2 GB | 287 | 658 | 9.7 s | 5% |
| gemma-4-31b | | gguf 8bit | 256k | 2026-03-12 | — | 0 | — | 0.0 s | — |
| qwen3.6-27b | qwen | gguf 8bit | 256k | 2026-04-21 | — | 0 | — | 0.0 s | — |