Apple’s M5 Max 128GB Shows Record‑Breaking Performance in New Round‑2 Tests

While early benchmarks hinted modest gains, the latest round‑2 tests show Apple’s M5 Max 128 GB shattering expectations, delivering record‑breaking performance after the author retuned the methodology based on community feedback, reports indicate.

Key Facts

•Key company: Apple

Apple’s M5 Max 128 GB demonstrates a dramatic leap in prompt‑processing throughput, a metric that has been a blind spot in earlier LLM benchmarks. In the second‑round tests the author of the Reddit post r/LocalLLaMA added a dedicated “prompt processing (PP)” measurement and found the 35‑billion‑parameter Qwen 3.5 35B‑A3B Mixture‑of‑Experts (MoE) model hitting 2,845 tokens per second (tok/s) on a 512‑token prompt, and still above 2,000 tok/s at an 8 k token length. Those numbers are roughly 5.5× faster than the dense 27‑billion‑parameter Qwen 3.5 model run at the same Q6_K quantization, underscoring the synergy between the M5 Max’s 40‑core Metal GPU and its 16‑core Neural Engine (as listed in the system spec table). The author notes that PP speed was the community’s primary request, and the new data “shows the M5’s real advantage over the M4” (Reddit post).

Token‑generation (TG) performance, which remains bandwidth‑bound, also topped out at impressive levels. The same MoE model achieved 92.2 tok/s for a 128‑token generation window using the llama.cpp engine, outpacing the next best result—DeepSeek‑R1 8B at 68.2 tok/s—by more than 35 %. Even the 122‑billion‑parameter Qwen 3.5 122B‑A10B MoE model managed 41.5 tok/s, indicating that the M5 Max can sustain reasonable throughput on very large models when they are quantized to Q4_K_M. By contrast, the dense 27‑billion‑parameter Qwen 3.5 models fell between 17.1 and 24.3 tok/s depending on quantization level (Q8_0, Q6_K, or Q4_K_M), confirming that the architecture’s memory bandwidth of 614 GB/s and full‑GPU memory allocation (≈128 GB) are being leveraged most effectively by MoE workloads.

A key correction in the round‑2 analysis addresses an earlier mis‑comparison between the MLX and llama.cpp inference engines. The author re‑ran the benchmark with equivalent 4‑bit quantization (MLX 4‑bit vs. llama.cpp Q4_K_M) and found MLX delivering 31.6 tok/s, roughly 30 % faster than llama.cpp’s 24.3 tok/s on the same 15.9‑GiB Qwen 3.5 27B model. This replaces the inflated “92 % faster” claim from the first round and aligns the results with a fair, apples‑to‑apples methodology. The updated table also shows llama.cpp’s Q6_K and Q8_0 configurations lagging behind MLX, reinforcing the notion that the newer MLX stack is better optimized for Apple silicon’s unified memory architecture.

The test suite itself was tightened to meet community standards: three repetitions per test, use of the open‑source llama‑bench harness, and explicit quantization parity across all models. The author also expanded the model roster to include a 35‑billion‑parameter MoE, a 122‑billion‑parameter MoE, and a 72‑billion‑parameter dense model (Qwen 2.5 72B), providing a broader view of how the M5 Max scales. Notably, the 72‑billion model, still quantized to Q6_K, managed only 7.9 tok/s, highlighting that beyond a certain size the bandwidth ceiling of the system becomes the limiting factor despite the massive unified memory pool.

Overall, the round‑2 data paints a picture of a silicon platform that excels when the workload can exploit its high‑throughput GPU cores and Neural Engine for parallel prompt processing, while token generation remains constrained by memory bandwidth. The inclusion of MoE models—whose sparse activation patterns align well with the M5 Max’s heterogeneous compute—produces the most striking gains, suggesting that future AI software stacks on macOS will likely gravitate toward sparsity‑aware architectures to fully harness Apple’s latest silicon.

Apple’s M5 Max 128GB Shows Record‑Breaking Performance in New Round‑2 Tests

Key Facts

Sources

🏢Companies in This Story

Related Stories