microsoft/phi-4-reasoning-plus
microsoft
14.7B · dense
mlx / 4bit
ctx 32k
released 2025-04-17
coding
Score: 52%
Static: 100%
Functional: 33%
Qualitative: 54%
What is this about? What is tested?
Task: From a ~200-word prompt the model must generate a fully functional Kanban board as a single-file HTML with drag & drop, localStorage persistence, edit/delete and a confetti animation — in a single chat without iteration. The prompt also includes a small `data-testid` contract so a Playwright test can drive the app remotely.
Three signals feed into the score:
(1) Static — a linter checks concrete constraints in the HTML (columns, Tailwind, localStorage call, no framework, no window.alert/prompt, …).
(2) Functional — Playwright runs a small CRUD sequence: create a card, delete a card with confirmation, reload — does state persist? — and checks whether any JS console errors occur during the entire flow. Drag & drop and confetti are deliberately not tested functionally (too many implementation variants).
(3) Qualitative — LLM-as-judge rates screenshot and code (visual + code quality + render↔code consistency).
Score = mean over the available signals.
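A minimal sketch of the aggregation rule stated above, assuming an unweighted arithmetic mean and assuming that a signal which could not be measured is represented as `null`. The function name `overallScore` and the example values are hypothetical; the benchmark's actual implementation and any weighting are not shown in the source.

```javascript
// Hypothetical helper: unweighted mean over the signals that are present.
// Signals that could not be measured are passed as null and skipped.
function overallScore(signals) {
  const available = Object.values(signals).filter(
    (value) => typeof value === "number" && Number.isFinite(value)
  );
  if (available.length === 0) return null; // nothing to average
  const sum = available.reduce((acc, value) => acc + value, 0);
  return sum / available.length;
}

// Illustrative values only, not taken from any real run.
const example = { static: 90, functional: 60, qualitative: null };
console.log(overallScore(example)); // 75
```

Skipping missing signals rather than counting them as zero keeps a run without a screenshot, for example, from being penalized twice.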
Why models fail: reasoning models burn their tokens in thinking instead of writing. Sliding-window models (Gemma 4) lose the constraints at the start of the prompt. Small models (<3B) often fail to produce coherent HTML — or ignore the data-testid contract, which makes the functional tests fail in droves.
Prompt
System prompt
You are a careful front-end engineer.
Developer prompt
Create a fully functional Kanban board in a single HTML file using vanilla JavaScript (no frameworks like react).
Requirements:
- Columns: Backlog, In Progress, Review, Done.
- Cards must be:
  - draggable across columns,
  - editable in place,
  - persisted in localStorage (state survives reloads) - please use your own namespace,
  - deletable with a confirmation prompt.
- Each column provides an "Add card" action.
- Style with Tailwind via CDN.
- Add subtle CSS transitions and trigger a confetti animation when a card moves to "Done".
- Thoroughly comment the code.
- dont use window.alert or window.prompt to add/edit/delete cards
- if there are no cards yet, create some dummy cards
- modern and vibrant design
Stable test selectors (mandatory — these data-testid attributes are used by an automated functional test; do not omit, rename, or split them across multiple elements):
- Column containers: data-testid="column-backlog", data-testid="column-in-progress", data-testid="column-review", data-testid="column-done".
- Every "Add card" button (one per column): data-testid="add-card".
- Every card element: data-testid="card".
- Inside each card, the delete trigger: data-testid="delete-card".
- The confirm button of the delete-confirmation dialog/modal: data-testid="confirm-delete".
- The input/textarea where a new card title is typed: data-testid="card-input". Pressing Enter in this input MUST commit the new card.
As answer return the plain HTML of the working application (script and styles included)
Screenshot of the rendered app
Qualitative · LLM-as-judge (openai/gpt-5.4)
2026-04-29T06:54:17.731302+00:00
54%
Visual (screenshot)
- board renders: 80%
- column completeness: 70%
- cards present: 50%
- ui affordances: 60%
- design quality: 35%
The board basically renders and all four areas are visible, but the layout is badly degraded: the columns are stacked vertically instead of sitting side by side as a real board. Cards are visible only in the Backlog, although according to the sample state one card should also appear in "In Progress".
Code quality (HTML/JS)
- code structure: 70%
- dom safety: 90%
- robustness: 45%
- code quality: 55%
The code is organized by function and, for card titles, mostly uses safe DOM APIs with textContent rather than unsafe innerHTML for user content. However, there are several logic errors: incorrect data-testid derivation for "In Progress", renderBoard creates duplicate card-list containers, dragover listeners are re-registered for every card, and the modal behavior is error-prone.
Render ↔ code consistency
20%
There is a clear discrepancy between code and render: the initial state contains three cards, one of them in "In Progress", but the screenshot shows only two cards, both in the Backlog. The main cause is very likely the inconsistent selector logic for column names, which prevents parts of the state from being rendered correctly.
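The selector problem named by the judge can be illustrated with a small sketch. The relevant part of the model's script is not visible in this excerpt, so the helper below is hypothetical: it shows one way to derive the `data-testid` from a column title in a single place, so that rendering and lookup cannot disagree on multi-word names such as "In Progress".

```javascript
// Hypothetical helper: turn a column title into the data-testid
// required by the prompt's selector contract.
function columnTestId(title) {
  const slug = title
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // any run of non-alphanumerics becomes one hyphen
    .replace(/^-+|-+$/g, ""); // drop leading and trailing hyphens
  return `column-${slug}`;
}

const titles = ["Backlog", "In Progress", "Review", "Done"];
console.log(titles.map(columnTestId));
// [ 'column-backlog', 'column-in-progress', 'column-review', 'column-done' ]
```

Using the same function wherever a column is queried (for example inside a `querySelector` call) avoids the case where one code path lowercases the title and another also replaces spaces.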
Static · linter checks (12/12 passed)
- ✓ Column 'Backlog'
- ✓ Column 'In Progress'
- ✓ Column 'Review'
- ✓ Column 'Done'
- ✓ HTML5 drag & drop or sortable lib
- ✓ localStorage persistence
- ✓ Tailwind via CDN
- ✓ Confetti animation on 'Done'
- ✓ Add card button
- ✓ Vanilla JS (no React/Vue/Angular/Svelte)
- ✓ No window.alert/prompt for CRUD
- ✓ Editable cards (contenteditable / input)
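Static checks of this kind can be approximated with pattern matching on the raw HTML string. The sketch below is an assumption about how such a linter might look, not the benchmark's actual code: the check labels mirror four of the entries above, while the regular expressions, the function name `runStaticChecks` and the storage key `kanban:v1` are hypothetical.

```javascript
// Hypothetical static checks: each entry is a label plus a predicate on the raw HTML.
const checks = [
  ["Tailwind via CDN", (html) => /<(?:script|link)[^>]+tailwind/i.test(html)],
  ["localStorage persistence", (html) => /localStorage\.(?:getItem|setItem)\(/.test(html)],
  ["No window.alert/prompt", (html) => !/\b(?:window\.)?(?:alert|prompt)\(/.test(html)],
  ["Vanilla JS (no framework)", (html) => !/<script[^>]+(?:react|vue|angular|svelte)/i.test(html)],
];

// Run every check and report the outcome per label.
function runStaticChecks(html) {
  return checks.map(([label, test]) => ({ label, passed: test(html) }));
}

// Tiny sample document; "kanban:v1" is an illustrative storage key.
const sample = `
<link href="https://cdn.jsdelivr.net/npm/tailwindcss@2.0/dist/tailwind.min.css" rel="stylesheet" />
<script>localStorage.setItem("kanban:v1", "[]");</script>`;

const results = runStaticChecks(sample);
const passed = results.filter((r) => r.passed).length;
console.log(`${passed}/${results.length} passed`); // 4/4 passed
```

Checks like these only confirm that a construct appears in the source, which is why a response can pass all static checks and still fail most functional tests.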
Functional · E2E tests (2/6 passed)
- ✓ Page loads with all 4 columns
- ✓ Add card via button
- ✗ Card is added exactly 1× (no double submit): card was created 7× (multiple listeners or double submit)
- ✗ Delete card via button + confirmation: Page.wait_for_selector: Timeout 2000ms exceeded. Call log: - waiting for locator("[data-testid=\"confirm-delete\"]") to be visible
- ✗ Cards survive a reload: 9 cards before reload, 4 after reload
- ✗ No JS console errors: pageerror: Failed to execute 'removeChild' on 'Node': The node to be removed is no longer a child of this node. Perhaps it was moved in a 'blur' event handler?
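The double-submit failure above is attributed to multiple listeners. A minimal sketch of that failure mode follows, using the global `EventTarget` as a stand-in for a DOM button so it runs outside a browser. The function `simulate` and its flag are hypothetical and do not reproduce the model's script; they only show how registering a new closure on every render multiplies the effect of one click.

```javascript
// Simulate three renders followed by a single click on an "Add card" button.
function simulate(rebindOnRender) {
  const button = new EventTarget(); // stand-in for the button element
  let cards = 0;
  let bound = false;
  const onClick = () => {
    cards += 1;
  };

  function render() {
    if (rebindOnRender) {
      // Bug pattern: a fresh closure per render, so listeners stack up.
      button.addEventListener("click", () => {
        cards += 1;
      });
    } else if (!bound) {
      // Fix: register the handler exactly once.
      button.addEventListener("click", onClick);
      bound = true;
    }
  }

  render();
  render();
  render(); // e.g. re-renders after earlier state changes
  button.dispatchEvent(new Event("click")); // one user click
  return cards;
}

console.log(simulate(true)); // 3
console.log(simulate(false)); // 1
```

Binding once on a stable container and dispatching on the event target (event delegation) is a common alternative that also survives re-rendering of the card list.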
Raw model response
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<!-- Ensure proper scaling on mobile devices -->
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Kanban Board</title>
<!-- Tailwind CSS CDN for styling -->
<link
href="https://cdn.jsdelivr.net/npm/tailwindcss@2.0/dist/tailwind.min.css"
rel="stylesheet"
/>
<!-- Custom styles for confetti canvas and modal -->
<style>
/* The delete confirmation modal is hidden by default */
.modal {
display: none;
}
</style>
</head>
<body class="bg-gray-50">
<!-- A full-screen canvas for confetti animation (hidden until triggered) -->
<canvas
id="confettiCanvas"
style="position: absolute; top:0; left: 0; pointer-events:none;"
></canvas>
<!-- Main board container -->
<div id="board" class="relative">
<!-- COLUMN: Backlog -->
<div
data-testid="column-backlog"
class="w-full md:w-1/2 lg:w-1/4 p-2"
>
<h2 class="text-lg font-bold mb-2">Backlog</h2>
<!-- "Add card" button with stable test selector -->
<button data-testid="add-card" class="mb-4 p-2 bg-blue-500 text-white rounded">
Add card
</button>
<!-- Container for cards will be dynamically appended here -->
<div class="card-list flex flex-col gap-2"></div>
</div>
<!-- COLUMN: In Progress -->
<div
data-testid="column-in-progress"
class="w-full md:w-1/2 lg:w-1/4 p-2"
>
<h2 class="text-lg font-bold mb-2">In Progress</h2>
<button data-testid="add-card" class="mb-4 p-2 bg-blue-500 text-white rounded">
Add card
</button>
<div class="card-list flex flex-col gap-2"></div>
</div>
<!-- COLUMN: Review -->
<div
data-testid="column-review"
class="w-full md:w-1/2 lg:w-1/4 p-2"
>
<h2 class="text-lg font-bold mb-2">Review</h2