Open Weight Bench

Coding

Single-shot code generation (Kanban board as an HTML file). Measures how quickly and how well models solve a concrete UI task. Hovering over a model row shows a screenshot of the rendered app.

Task & test logic in detail
Task: From a ~200-word prompt the model must generate a fully functional Kanban board as a single-file HTML with drag & drop, localStorage persistence, edit/delete and a confetti animation, in a single chat without iteration. The prompt also includes a small `data-testid` contract so a Playwright test can drive the app remotely.

Three signals feed into the score:
(1) Static: a linter checks concrete constraints in the HTML (columns, Tailwind, localStorage call, no framework, no window.alert/prompt, …).
(2) Functional: Playwright runs a small CRUD sequence (create a card, delete a card with confirmation, reload and check that state persists) and checks whether any JS console errors occur during the entire flow. Drag & drop and confetti are deliberately not tested functionally (too many implementation variants).
(3) Qualitative: an LLM-as-judge rates screenshot and code (visual + code quality + render↔code consistency).

Score = mean over the available signals.

Why models fail: reasoning models burn their tokens in thinking instead of writing. Sliding-window models (Gemma 4) lose the constraints at the start of the prompt. Small models (<3B) often fail to produce coherent HTML, or ignore the data-testid contract, which makes the functional tests fail in droves.
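The rule "Score = mean over the available signals" can be illustrated with a minimal sketch. The function name `bench_score`, the use of `None` for a missing signal, and the assumption that each signal is a fraction in [0, 1] are hypothetical; the bench does not publish its actual implementation here.

```python
from statistics import mean
from typing import Optional


def bench_score(static: Optional[float],
                functional: Optional[float],
                qualitative: Optional[float]) -> Optional[float]:
    """Return the mean of the available signals, in percent.

    Assumption: each signal is a fraction in [0, 1] and a signal that
    could not be computed is passed as None. Returns None when no
    signal is available at all.
    """
    available = [s for s in (static, functional, qualitative) if s is not None]
    if not available:
        return None
    return 100.0 * mean(available)


# All three signals present.
print(bench_score(0.9, 0.5, 0.7))
# Functional run missing: the mean is taken over the remaining two signals.
print(bench_score(0.8, None, 0.6))
```

Under this reading, a missing signal does not count as zero; it simply drops out of the average.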
Prompt
System prompt
You are a careful front-end engineer.
Developer prompt
Create a fully functional Kanban board in a single HTML file using vanilla JavaScript (no frameworks like react).

Requirements:
- Columns: Backlog, In Progress, Review, Done.
- Cards must be:
  - draggable across columns,
  - editable in place,
  - persisted in localStorage (state survives reloads) - please use your own namespace,
  - deletable with a confirmation prompt.
- Each column provides an "Add card" action.
- Style with Tailwind via CDN.
- Add subtle CSS transitions and trigger a confetti animation when a card moves to "Done".
- Thoroughly comment the code.
- dont use window.alert or window.prompt to add/edit/delete cards
- if there are no cards yet, create some dummy cards
- modern and vibrant design

Stable test selectors (mandatory — these data-testid attributes are used by an automated functional test; do not omit, rename, or split them across multiple elements):
- Column containers: data-testid="column-backlog", data-testid="column-in-progress", data-testid="column-review", data-testid="column-done".
- Every "Add card" button (one per column): data-testid="add-card".
- Every card element: data-testid="card".
- Inside each card, the delete trigger: data-testid="delete-card".
- The confirm button of the delete-confirmation dialog/modal: data-testid="confirm-delete".
- The input/textarea where a new card title is typed: data-testid="card-input". Pressing Enter in this input MUST commit the new card.

As answer return the plain HTML of the working application (script and styles included)
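The static constraints named in the prompt above (column selectors, Tailwind via CDN, a localStorage call, no framework, no window.alert/prompt) lend themselves to a simple text-level check. The following is a rough sketch of what such a linter might look like, not the bench's actual linter: the function name `lint_kanban_html`, the check names, and the regular expressions are all assumptions, and the patterns are heuristics that can be fooled by comments or unusual code.

```python
import re
from typing import Dict

# data-testid values taken from the prompt's selector contract.
REQUIRED_TESTIDS = [
    "column-backlog", "column-in-progress", "column-review", "column-done",
    "add-card", "card", "delete-card", "confirm-delete", "card-input",
]

QUOTE = "[\"']"  # matches either quote character


def lint_kanban_html(html: str) -> Dict[str, bool]:
    """Heuristic static checks on a generated single-file HTML app.

    Each entry maps a (hypothetical) check name to True when it passes.
    """
    checks: Dict[str, bool] = {}

    # Test ids may be set in markup or from JS, so look for the quoted
    # value anywhere in the source rather than parsing elements.
    for testid in REQUIRED_TESTIDS:
        pattern = QUOTE + re.escape(testid) + QUOTE
        checks["testid:" + testid] = re.search(pattern, html) is not None

    # Tailwind loaded through a <script> or <link> tag.
    checks["tailwind_cdn"] = (
        re.search(r"<(?:script|link)\b[^>]*tailwind", html, re.I) is not None
    )
    # At least one localStorage read or write.
    checks["uses_localstorage"] = (
        re.search(r"localStorage\.(?:getItem|setItem)\s*\(", html) is not None
    )
    # No alert(...) or prompt(...) calls.
    checks["no_alert_or_prompt"] = (
        re.search(r"\b(?:alert|prompt)\s*\(", html) is None
    )
    # No well-known framework pulled in via <script src=...>.
    checks["no_framework"] = (
        re.search(
            r"<script\b[^>]*\bsrc=" + QUOTE + r"[^\"']*(?:react|vue|angular|svelte)",
            html,
            re.I,
        )
        is None
    )
    return checks


# Tiny example input; a real answer would be a full HTML document.
sample = """
<script src="https://cdn.tailwindcss.com"></script>
<div data-testid="column-backlog"><button data-testid="add-card">Add card</button></div>
<script>localStorage.setItem("kanban:v1", "[]");</script>
"""
result = lint_kanban_html(sample)
print([name for name, ok in result.items() if not ok])
```

Searching for the quoted test id anywhere in the source, rather than parsing elements, is a deliberate choice in this sketch: cards and dialogs are usually created from JavaScript, so their attributes rarely appear as static markup.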

Wall-time vs. quality

X = wall-time for this bench · Y = score (0–100 %) in this bench. Optimum is top-left — fast and good. RAM estimate for 64k context: 4 GB system + model weights + max(2 GB, 40% of weights) for KV cache.
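The RAM estimate stated above is plain arithmetic and can be written out as a short sketch. The function name `estimate_ram_gb` and the example weight sizes are hypothetical; only the formula itself comes from the text.

```python
def estimate_ram_gb(weights_gb: float) -> float:
    """RAM estimate for 64k context, following the formula in the text:
    4 GB system + model weights + max(2 GB, 40% of weights) for KV cache.
    """
    system_gb = 4.0
    kv_cache_gb = max(2.0, 0.4 * weights_gb)
    return system_gb + weights_gb + kv_cache_gb


# Small model: 40% of 3 GB is below 2 GB, so the 2 GB floor applies.
print(estimate_ram_gb(3.0))
# Larger model: 40% of the weights exceeds the floor and is used instead.
print(estimate_ram_gb(20.0))
```

The 2 GB floor applies whenever the weights are smaller than 5 GB, since 40% of 5 GB is exactly 2 GB.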

Colour = vendor · Number = total parameters (B) · Legend: dense, MoE

[Scatter plot: X axis = wall-time (s), 0 to 750 s; Y axis = score, 0 to 100 %; each point is labelled with the model's total parameter count in billions.]
Models in this bench
43 visible
  1. qwen3.6-35b-a3b gguf 4bit 93% · 160s · 81 t/s · 33 GB
  2. qwen3.6-27b gguf 4bit 89% · 608s · 22 t/s · 27 GB
  3. gemma-4-31b gguf 4bit 88% · 353s · 21 t/s · 30 GB
  4. qwen3.6-35b-a3b gguf 8bit 87% · 203s · 67 t/s · 53 GB
  5. qwen3-coder-next mlx 4bit 87% · 144s · 72 t/s · 67 GB
  6. gemma-4-26b-a4b gguf 8bit 86% · 118s · 72 t/s · 41 GB
  7. glm-4.5-air-mlx mlx 4bit 86% · 372s · 35 t/s · 82 GB
  8. qwen3.5-27b-claude-4.6-opus-distilled-mlx mlx 4bit 85% · 276s · 25 t/s · 24 GB
  9. qwen3.5-122b-a10b gguf 4bit 80% · 733s · 4 t/s · 102 GB
  10. qwen3-coder-30b mlx 4bit 78% · 83s · 95 t/s · 26 GB
  11. glm-4.7-flash mlx 4bit 78% · 143s · 69 t/s · 28 GB
  12. gemma-4-e2b gguf 8bit 73% · 68s · 110 t/s · 12 GB
  13. qwen3.5-9b gguf 8bit 70% · 172s · 44 t/s · 18 GB
  14. gpt-oss-120b gguf 4bit 69% · 149s · 80 t/s · 87 GB
  15. gemma-4-e4b gguf 4bit 65% · 91s · 85 t/s · 12 GB
  16. qwen3.5-35b-a3b gguf 4bit 62% · 113s · 79 t/s · 33 GB
  17. qwen3.5-9b-mlx mlx 4bit 61% · 113s · 84 t/s · 12 GB
  18. gemma-4-e2b gguf 4bit 61% · 48s · 135 t/s · 10 GB
  19. qwen2.5-coder-32b mlx 4bit 60% · 147s · 23 t/s · 28 GB
  20. qwen3-vl-30b mlx 4bit 60% · 91s · 79 t/s · 28 GB
  21. qwen3.5-2b gguf 4bit 59% · 67s · 158 t/s · 8 GB
  22. gpt-oss-20b mlx 4bit 52% · 44s · 109 t/s · 20 GB
  23. phi-4-reasoning-plus mlx 4bit 52% · 368s · 41 t/s · 15 GB
  24. gemma-3-12b mlx 4bit 51% · 85s · 56 t/s · 15 GB
  25. gemma-3-4b mlx 4bit 50% · 30s · 141 t/s · 9 GB
  26. gemma-4-e4b gguf 8bit 46% · 108s · 66 t/s · 16 GB
  27. gemma-4-26b-a4b gguf 4bit 46% · 112s · 89 t/s · 27 GB
  28. qwen3.5-9b gguf 4bit 45% · 129s · 59 t/s · 13 GB
  29. gemma-3-27b mlx 4bit 42% · 187s · 27 t/s · 26 GB
  30. qwen3.5-4b gguf 4bit 40% · 88s · 85 t/s · 9 GB
  31. nemotron-3-nano-omni gguf 8bit 39% · 140s · 78 t/s · 50 GB
  32. qwen3-8b mlx 4bit 39% · 186s · 73 t/s · 10 GB
  33. qwen3-4b-2507 mlx 4bit 39% · 32s · 135 t/s · 8 GB
  34. nemotron-3-nano-omni gguf 4bit 38% · 217s · 83 t/s · 38 GB
  35. qwen2.5-coder-14b mlx 4bit 36% · 64s · 49 t/s · 15 GB
  36. nemotron-3-nano mlx 4bit 35% · 69s · 131 t/s · 27 GB
  37. qwen3-4b-thinking-2507 mlx 4bit 33% · 113s · 113 t/s · 8 GB
  38. granite-4-h-tiny gguf 4bit 32% · 26s · 116 t/s · 10 GB
  39. nemotron-3-nano-4b gguf 4bit 30% · 64s · 84 t/s · 9 GB
  40. lfm2-24b-a2b mlx 4bit 29% · 53s · 135 t/s · 22 GB
  41. ministral-3-14b-reasoning gguf 4bit 22% · 136s · 46 t/s · 16 GB
  42. gemma-3n-e4b mlx 4bit 11% · 50s · 79 t/s · 12 GB
  43. lfm2.5-1.2b mlx 8bit 7% · 21s · 271 t/s · 7 GB