Open Weight Bench

microsoft/phi-4-reasoning-plus

microsoft · 14.7B · dense · mlx / 4bit · ctx 32k · released 2025-04-17
Score: 52%
Static: 100%
Functional: 33%
Qualitative: 54%
tok/s: 41
Tokens: 14415
Wall: 368.4 s
RAM: 7.7 GB
What is this about? What is tested?
Task: From a ~200-word prompt, the model must generate a fully functional Kanban board as a single HTML file with drag & drop, localStorage persistence, edit/delete, and a confetti animation, all in a single chat without iteration. The prompt also includes a small `data-testid` contract so that a Playwright test can drive the app remotely.

Three signals feed into the score:
(1) Static: a linter checks concrete constraints in the HTML (columns, Tailwind, localStorage call, no framework, no window.alert/prompt, …).
(2) Functional: Playwright runs a small CRUD sequence (create a card, delete a card with confirmation, reload and check that state persists) and checks whether any JS console errors occur during the entire flow. Drag & drop and confetti are deliberately not tested functionally (too many implementation variants).
(3) Qualitative: an LLM-as-judge rates the screenshot and the code (visual quality, code quality, and render↔code consistency).
Score = mean over the available signals.

Why models fail: reasoning models burn their tokens on thinking instead of writing. Sliding-window models (Gemma 4) lose the constraints at the start of the prompt. Small models (<3B) often fail to produce coherent HTML, or they ignore the data-testid contract, which makes the functional tests fail in droves.
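To make the "Static" signal more concrete, the following is a minimal sketch of how such linter checks could look. The check names, the regular expressions, the `runStaticChecks` helper, and the `kanban-demo:cards` storage key are all assumptions made for illustration; the benchmark's actual linter rules are not shown on this page.

```javascript
// Hypothetical static checks; names and patterns are illustrative assumptions,
// not the benchmark's actual linter rules.
const STATIC_CHECKS = [
  // One check per column container required by the data-testid contract.
  ...["backlog", "in-progress", "review", "done"].map((slug) => ({
    name: `column-${slug}`,
    test: (html) => html.includes(`data-testid="column-${slug}"`),
  })),
  {
    name: "Tailwind via CDN",
    test: (html) => /<(?:script|link)[^>]*tailwindcss/i.test(html),
  },
  {
    name: "localStorage persistence",
    test: (html) => /localStorage\.(?:setItem|getItem)\(/.test(html),
  },
  {
    name: "No window.alert/prompt",
    test: (html) => !/window\.(?:alert|prompt)\s*\(/.test(html),
  },
];

// Run every check against the HTML source and count how many pass.
function runStaticChecks(html) {
  const results = STATIC_CHECKS.map(({ name, test }) => ({ name, ok: test(html) }));
  const passed = results.filter((r) => r.ok).length;
  return { passed, total: results.length, results };
}

// Tiny sample document (hypothetical) that satisfies all checks above.
const sampleHtml = `
<link href="https://cdn.jsdelivr.net/npm/tailwindcss@2.0/dist/tailwind.min.css" rel="stylesheet" />
<div data-testid="column-backlog"></div>
<div data-testid="column-in-progress"></div>
<div data-testid="column-review"></div>
<div data-testid="column-done"></div>
<script>localStorage.setItem("kanban-demo:cards", "[]");</script>
`;

const report = runStaticChecks(sampleHtml);
console.log(`${report.passed}/${report.total} passed`); // 7/7 passed
```

Pattern checks like these only confirm that something is present in the source, not that it works, which is why a response can pass every static check and still fail the functional tests.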
Prompt
System prompt
You are a careful front-end engineer.
Developer prompt
Create a fully functional Kanban board in a single HTML file using vanilla JavaScript (no frameworks like react).

Requirements:
- Columns: Backlog, In Progress, Review, Done.
- Cards must be:
  - draggable across columns,
  - editable in place,
  - persisted in localStorage (state survives reloads) - please use your own namespace,
  - deletable with a confirmation prompt.
- Each column provides an "Add card" action.
- Style with Tailwind via CDN.
- Add subtle CSS transitions and trigger a confetti animation when a card moves to "Done".
- Thoroughly comment the code.
- dont use window.alert or window.prompt to add/edit/delete cards
- if there are no cards yet, create some dummy cards
- modern and vibrant design

Stable test selectors (mandatory — these data-testid attributes are used by an automated functional test; do not omit, rename, or split them across multiple elements):
- Column containers: data-testid="column-backlog", data-testid="column-in-progress", data-testid="column-review", data-testid="column-done".
- Every "Add card" button (one per column): data-testid="add-card".
- Every card element: data-testid="card".
- Inside each card, the delete trigger: data-testid="delete-card".
- The confirm button of the delete-confirmation dialog/modal: data-testid="confirm-delete".
- The input/textarea where a new card title is typed: data-testid="card-input". Pressing Enter in this input MUST commit the new card.

As answer return the plain HTML of the working application (script and styles included)
Screenshot of the rendered app
Qualitative · LLM-as-judge (openai/gpt-5.4) 2026-04-29T06:54:17.731302+00:00
54%
Visual (screenshot)
  • board renders: 80%
  • column completeness: 70%
  • cards present: 50%
  • ui affordances: 60%
  • design quality: 35%
The board renders in principle and all four areas are visible, but the layout is badly degraded: the columns are stacked vertically instead of sitting side by side as a real board. Cards are visible only in Backlog, although according to the sample state a card should also appear in "In Progress".
Code quality (HTML/JS)
  • code structure: 70%
  • dom safety: 90%
  • robustness: 45%
  • code quality: 55%
The code is organized by function and mostly uses safe DOM APIs for card titles, with textContent rather than unsafe innerHTML for user content. However, there are several logic errors: the data-testid for "In Progress" is derived incorrectly, renderBoard creates duplicate card-list containers, dragover listeners are re-registered for every card, and the modal behavior is error-prone.
Render ↔ code consistency: 20%
There is a clear discrepancy between code and render: the initial state contains three cards, one of them in "In Progress", but the screenshot shows only two cards in Backlog. The most likely main cause is the inconsistent selector logic for column names, which prevents parts of the state from being rendered correctly.
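The judge attributes the missing "In Progress" card to an inconsistent derivation of the column selector. The faulty code itself is not visible in this excerpt, so the following is only a sketch of how a column name could be mapped to the `data-testid` values required by the prompt. The `columnTestId` helper is hypothetical, and the "naive" variant is a guess at the kind of mistake involved, not a quote from the model's output.

```javascript
// Hypothetical helper: derive the column's data-testid from its display name.
function columnTestId(columnName) {
  const slug = columnName
    .trim()
    .toLowerCase()
    .replace(/\s+/g, "-"); // "in progress" -> "in-progress"
  return `column-${slug}`;
}

const COLUMNS = ["Backlog", "In Progress", "Review", "Done"];
const ids = COLUMNS.map(columnTestId);
console.log(ids);
// [ 'column-backlog', 'column-in-progress', 'column-review', 'column-done' ]

// Assumed failure mode: lowercasing without replacing the space produces an id
// that matches no element in the contract, so cards for that column are never rendered.
const naiveId = `column-${"In Progress".toLowerCase()}`;
console.log(naiveId === "column-in-progress"); // false
```

Using one shared helper for both rendering and lookup keeps the selector logic in a single place, so single-word and multi-word column names cannot drift apart.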
Static · linter checks (12/12 passed)
  • Column 'Backlog'
  • Column 'In Progress'
  • Column 'Review'
  • Column 'Done'
  • HTML5 drag & drop or sortable lib
  • localStorage persistence
  • Tailwind via CDN
  • Confetti animation on 'Done'
  • Add card button
  • Vanilla JS (no React/Vue/Angular/Svelte)
  • No window.alert/prompt for CRUD
  • Editable cards (contenteditable / input)
Functional · E2E tests (2/6 passed)
  • Page loads with all 4 columns
  • Add card via button
  • Card is added exactly 1× (no double submit)
    card was added 7× (multiple listeners or double submit)
  • Delete card via button + confirmation
    Page.wait_for_selector: Timeout 2000ms exceeded. Call log: - waiting for locator("[data-testid=\"confirm-delete\"]") to be visible
  • Cards survive a reload
    9 cards before reload, 4 after reload
  • No JS console errors
    pageerror: Failed to execute 'removeChild' on 'Node': The node to be removed is no longer a child of this node. Perhaps it was moved in a 'blur' event handler?
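The "added 7×" failure is reported with two candidate causes, multiple listeners or a double submit. As an illustration of the first one only, the sketch below shows how re-registering a click handler on every render multiplies a single click, and how a simple guard avoids it. It uses a bare `EventTarget` as a stand-in for the "Add card" button so that it runs in Node.js as well as in a browser. The function names are hypothetical, and this is not a diagnosis of the generated code.

```javascript
// Render `renders` times, click once, and report how many cards were created.
function simulate(render, renders) {
  const button = new EventTarget(); // stand-in for the "Add card" button
  const cards = [];
  for (let i = 0; i < renders; i++) render(button, cards);
  button.dispatchEvent(new Event("click"));
  return cards.length;
}

// Problematic pattern: each render attaches a fresh closure. Because every
// arrow function is a distinct object, addEventListener cannot deduplicate them.
function renderBuggy(button, cards) {
  button.addEventListener("click", () => cards.push("New card"));
}

// Guarded pattern: remember which targets are already wired up and bind once.
const boundTargets = new WeakSet();
function renderSafe(button, cards) {
  if (boundTargets.has(button)) return;
  boundTargets.add(button);
  button.addEventListener("click", () => cards.push("New card"));
}

console.log(simulate(renderBuggy, 7)); // 7 cards from a single click
console.log(simulate(renderSafe, 7)); // 1 card from a single click
```

Event delegation, with one listener on the board that inspects `event.target`, would be another way to reach the same result without tracking bound elements.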
Raw model response
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <!-- Ensure proper scaling on mobile devices -->
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Kanban Board</title>
    <!-- Tailwind CSS CDN for styling -->
    <link
      href="https://cdn.jsdelivr.net/npm/tailwindcss@2.0/dist/tailwind.min.css"
      rel="stylesheet"
    />
    <!-- Custom styles for confetti canvas and modal -->
    <style>
      /* The delete confirmation modal is hidden by default */
      .modal {
        display: none;
      }
    </style>
  </head>
  <body class="bg-gray-50">
    <!-- A full-screen canvas for confetti animation (hidden until triggered) -->
    <canvas
      id="confettiCanvas"
      style="position: absolute; top:0; left: 0; pointer-events:none;"
    ></canvas>

    <!-- Main board container -->
    <div id="board" class="relative">
      <!-- COLUMN: Backlog -->
      <div
        data-testid="column-backlog"
        class="w-full md:w-1/2 lg:w-1/4 p-2"
      >
        <h2 class="text-lg font-bold mb-2">Backlog</h2>
        <!-- "Add card" button with stable test selector -->
        <button data-testid="add-card" class="mb-4 p-2 bg-blue-500 text-white rounded">
          Add card
        </button>
        <!-- Container for cards will be dynamically appended here -->
        <div class="card-list flex flex-col gap-2"></div>
      </div>

      <!-- COLUMN: In Progress -->
      <div
        data-testid="column-in-progress"
        class="w-full md:w-1/2 lg:w-1/4 p-2"
      >
        <h2 class="text-lg font-bold mb-2">In Progress</h2>
        <button data-testid="add-card" class="mb-4 p-2 bg-blue-500 text-white rounded">
          Add card
        </button>
        <div class="card-list flex flex-col gap-2"></div>
      </div>

      <!-- COLUMN: Review -->
      <div
        data-testid="column-review"
        class="w-full md:w-1/2 lg:w-1/4 p-2"
      >
        <h2 class="text-lg font-bold mb-2">Review</h2