Open Weight Bench

Open-weight LLMs put to the test

Local benchmarks for models running in LM Studio on a MacBook Pro M4 Max (128 GB). Six categories, each with visible results — not just numbers.

Why a MacBook Pro can run 100 B-parameter models — unified memory & bandwidth

Apple Silicon uses a unified memory architecture: the CPU, GPU and Neural Engine all share the same physical RAM pool over a single high-bandwidth fabric. There is no separate VRAM, no PCIe bottleneck, no copy step from system RAM to GPU memory before inference. Every weight tensor the GPU needs to read is already addressable in place — which is why a 128 GB Mac can hold and run a model that would normally require a workstation GPU with 80 GB+ of VRAM.

For LLM inference, throughput is dominated by how fast you can stream the weights past the compute units. That makes memory bandwidth, not raw FLOPS, the limiting factor on a Mac. The M4 Max ships with up to 546 GB/s of unified memory bandwidth (the binned 14-core variant tops out at 410 GB/s).
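As a back-of-envelope illustration of why bandwidth, not FLOPS, sets the ceiling: for a dense model, generating one token requires streaming every weight past the compute units once, so decode speed is bounded by bandwidth divided by the weight file size. A minimal sketch (the ~40 GB file size for a ~70B Q4 dense model is an illustrative assumption, not benchmark data):

```python
def est_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed for a dense model: each generated token
    must read all weights once, so tokens/s <= bandwidth / model size."""
    return bandwidth_gb_s / model_size_gb

# Assumed ~40 GB Q4 file of a ~70B dense model (illustrative only)
m4_max_ceiling = est_tokens_per_sec(546, 40)   # ≈13.7 tok/s upper bound
ddr5_ceiling = est_tokens_per_sec(102, 40)     # ≈2.6 tok/s upper bound
```

Real throughput lands below these ceilings (attention compute, KV-cache reads, and MoE routing all shift the picture), but the ratio between the two machines tracks the bandwidth ratio closely.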

Memory                            Typical bandwidth   vs. M4 Max
M4 Max unified memory             546 GB/s            1.0×
Desktop DDR5-6400, dual-channel   ~102 GB/s           ~5× slower
Desktop DDR4-3200, dual-channel   ~51 GB/s            ~10× slower

Sources: Apple newsroom — M4 Pro & M4 Max, DDR5 SDRAM (Wikipedia), Kingston DDR5 overview. Desktop figures are for two-DIMM consumer setups; servers with quad- or octo-channel memory reach much higher bandwidth.

Coding

55 models

Single-shot code generation (a Kanban board as a single HTML file). Measures how fast and how correctly models solve a concrete UI task. Hovering over a model row shows a screenshot of the rendered app.

Top score 93% · Median 78 tok/s

Vision

30 models

OCR for vision-capable models: four sub-tasks — handwritten meeting notes in three difficulty tiers (easy / medium / hard) plus an old book page set in Fraktur typeface.

Top score 98% · Median 79 tok/s

Needle in a Haystack

51 models

Three sub-benchmarks in a single chat (the long context is prefilled only once): corpus summary · needle retrieval (10 hidden facts in 120k tokens, placed identically for all models) · 6 comprehension questions about the book (4 factual answers + 2 hallucination traps).
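The needle set-up can be pictured roughly like this; a hedged sketch only, since the benchmark's actual corpus, facts, and insertion positions are not shown here:

```python
def plant_needles(corpus: str, needles: list[str]) -> str:
    """Insert each needle at an evenly spaced offset so every model
    sees the hidden facts at identical positions in the long context."""
    step = len(corpus) // (len(needles) + 1)
    parts, pos = [], 0
    for i, needle in enumerate(needles, start=1):
        cut = i * step
        parts.append(corpus[pos:cut])
        parts.append(" " + needle + " ")
        pos = cut
    parts.append(corpus[pos:])
    return "".join(parts)

haystack = "lorem " * 20_000                       # stand-in for the book text
facts = [f"Hidden fact #{n}." for n in range(1, 11)]
long_ctx = plant_needles(haystack, facts)
```

Because the offsets depend only on corpus length and needle count, every model is quizzed against the same context, which keeps retrieval scores comparable.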

Top score 100% · Median 43 tok/s

Tool Use

46 models

Seven OpenAI-style function-calling scenarios across three difficulty tiers (easy / medium / hard). Toolset: list_files, read_file, apply_diff, get_weather. The model must pick tools, supply correct arguments, chain multiple calls and synthesise a final answer.
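An OpenAI-style declaration for one of the listed tools might look like the following. This is a sketch of the common schema shape, not the benchmark's exact definition:

```python
# Tool declaration in the OpenAI "tools" format: the model is expected to
# reply with the tool name plus JSON arguments instead of free text.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}
```

Chained scenarios then feed each tool result back to the model as a tool-role message, so it can decide whether to call again (e.g. `read_file` after `list_files`) or synthesise the final answer.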

Top score 100% · Median 83 tok/s

Diagram → SVG

30 models

Photo of a hand-drawn diagram (architecture, flowchart, sequence, quadrant) → model emits inline SVG. Original and render sit side-by-side; an LLM judge rates visual fidelity.

Top score 91% · Median 76 tok/s

Hallucination

54 models

12 questions with subtle but factually false premises. Does the model invent an answer, say 'I don't know', or correct the premise?

Top score 100% · Median 80 tok/s

Overall summary across all models

Score 0–100 % per benchmark. Overall = weighted mean of the six scores: Coding counts double, Tool Use half, the rest at single weight. RAM = on-disk size of the model file (Q4 quantisation). Wall = cumulative bench time across all tasks.
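The weighting described above can be written out as a small helper; the demo scores below are made-up placeholders, not benchmark results:

```python
# Weights per the stated rule: Coding double, Tool Use half, rest single.
WEIGHTS = {
    "Coding": 2.0,
    "Tool Use": 0.5,
    "Vision": 1.0,
    "Needle in a Haystack": 1.0,
    "Diagram → SVG": 1.0,
    "Hallucination": 1.0,
}

def overall(scores: dict[str, float]) -> float:
    """Weighted mean of the six per-benchmark scores (each 0-100 %)."""
    weighted_sum = sum(WEIGHTS[k] * v for k, v in scores.items())
    return weighted_sum / sum(WEIGHTS[k] for k in scores)

demo = {k: 80.0 for k in WEIGHTS}   # placeholder: 80 % everywhere
# With equal scores the weighting cancels out and overall(demo) is 80.0;
# raising only Coding moves the overall more than raising any other column.
```

The practical effect: a model that is strong at Coding but weak at Tool Use ranks above one with the reverse profile, even at the same unweighted average.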

Cumulative wall-time vs. overall score

Colour = vendor · Number = total params (B) · RAM estimated for 64k ctx · dense / MoE · 55/55 visible
[Scatter plot: cumulative wall-time 0–3900 s on the x-axis vs. overall score 0–100 % on the y-axis; each point is labelled with the model's total parameter count in B.]
Table columns: Model · Quant · Overall · RAM · Wall total · Coding · Vision · Needle in a Haystack · Tool Use · Diagram → SVG · Hallucination

Click a row to open the model detail page. Columns are sortable.