Meta Unveils Fastest LLM Decode Engine for Apple Silicon, Shows Benchmark Numbers
Photo by Julio Lopez (unsplash.com/@juliolopez) on Unsplash
While most expect Apple Silicon to lag in LLM throughput, Meta’s new MetalRT engine hits 658 tokens per second on a single M4 Max—1.67× faster than llama.cpp, Runanywhere reports.
Key Facts
- Key company: Meta
Meta’s MetalRT engine, a native C++ binary that bypasses the usual abstraction layers, delivers 658 tokens per second on a single M4 Max chip, a 1.67× advantage over the widely‑used llama.cpp benchmark, Runanywhere reported. The test suite spanned four 4‑bit quantized models (Qwen3‑0.6B, Qwen3‑4B, Llama‑3.2‑3B, and LFM2.5‑1.2B) on an Apple M4 Max with 64 GB of unified memory running macOS 26.3. Across five runs per engine, MetalRT topped the decode‑throughput chart on three of the four models, confirming that a Metal‑first approach can outpace both Apple’s own MLX framework and community‑driven inference stacks.
The benchmark compared MetalRT against four competing runtimes: uzu (a production‑grade Rust engine), Apple’s mlx‑lm (the official MLX Python API), llama.cpp (the de‑facto open‑source reference), and Ollama (a Go‑based wrapper that adds a REST streaming layer). While uzu edged out MetalRT on the Llama‑3.2‑3B model with 222 tokens per second, MetalRT still posted 184 tokens per second on the same model, ahead of llama.cpp (137 t/s) though behind mlx‑lm (210 t/s). The most striking gaps appeared on the smallest model, Qwen3‑0.6B, where MetalRT’s 658 t/s eclipsed llama.cpp’s 295 t/s and Ollama’s 274 t/s, the latter suffering additional latency from its API stack. Relative speedups ranged from 1.10‑1.19× versus mlx‑lm (identical model files) to 1.35‑2.14× versus llama.cpp, and 1.41‑2.40× versus Ollama, underscoring the raw efficiency gains of a Metal‑direct pipeline.
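As a sanity check on how these multipliers are derived, the ratio of decode throughputs gives the relative speedup. A minimal sketch using the Qwen3‑0.6B figures quoted above (any differences from the reported ranges come down to rounding in the published numbers):

```python
# Decode throughputs in tokens/second on Qwen3-0.6B (M4 Max, 4-bit
# quantized), as reported in the Runanywhere benchmark above.
throughput = {
    "MetalRT": 658.0,
    "llama.cpp": 295.0,
    "Ollama": 274.0,
}

def speedup(engine_a: str, engine_b: str) -> float:
    """Relative speedup: engine_a's throughput divided by engine_b's."""
    return throughput[engine_a] / throughput[engine_b]

print(f"MetalRT vs llama.cpp: {speedup('MetalRT', 'llama.cpp'):.2f}x")  # 2.23x
print(f"MetalRT vs Ollama:    {speedup('MetalRT', 'Ollama'):.2f}x")     # 2.40x
```

The Ollama ratio (2.40×) matches the top of the range the benchmark reports for that engine.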
Beyond raw numbers, Meta positions MetalRT for latency‑critical applications. The company highlights a 6.6 ms time‑to‑first‑token (TTFT) on the Qwen3‑0.6B model, a metric that directly translates to snappier user experiences in chat interfaces, structured‑output pipelines, and tool‑calling workflows. Faster decode reduces the cumulative latency of sequential LLM calls, a boon for agent‑oriented systems that stitch together multiple model invocations for tool use or function calling. In coding assistants and voice‑driven pipelines, sub‑7 ms TTFT narrows the perceptual gap between hearing a prompt and hearing a response, potentially reshaping on‑device AI product strategies that prioritize privacy and low‑bandwidth operation.
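To see how TTFT and decode throughput combine into the cumulative latency of an agent chain, here is a back‑of‑the‑envelope sketch. The chain length (3 calls), output length (128 tokens), and the assumption that llama.cpp would match MetalRT's 6.6 ms TTFT are all hypothetical; only the 658 t/s and 295 t/s throughputs and the 6.6 ms TTFT come from the benchmark above:

```python
def call_latency_s(ttft_ms: float, decode_tps: float, out_tokens: int) -> float:
    """End-to-end latency of one generation: time-to-first-token plus
    decode time for the remaining output tokens."""
    return ttft_ms / 1000.0 + (out_tokens - 1) / decode_tps

# Hypothetical 3-call agent chain, 128 output tokens per call, using the
# Qwen3-0.6B figures reported above. The 6.6 ms TTFT is applied to both
# engines as a simplifying assumption.
metalrt_chain = 3 * call_latency_s(6.6, 658.0, 128)
llamacpp_chain = 3 * call_latency_s(6.6, 295.0, 128)
print(f"MetalRT chain:   {metalrt_chain:.3f} s")
print(f"llama.cpp chain: {llamacpp_chain:.3f} s")
```

Under these assumptions the three‑call chain finishes in roughly 0.6 s on MetalRT versus about 1.3 s at llama.cpp's throughput, which is the kind of gap users perceive in interactive tool‑calling loops.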
Meta’s push reflects a broader industry trend of extracting maximum performance from consumer‑grade silicon. While Apple’s own MLX framework already offers a native path to Apple GPUs, the Runanywhere data shows that a purpose‑built engine can still deliver 10‑19 % more decode throughput than mlx‑lm on identical model files. Competitors such as Together AI’s ATLAS speculator have demonstrated 400 % speedups by dynamically adapting workloads, but those gains rely on server‑grade hardware and cloud orchestration. MetalRT’s on‑device advantage, delivering cloud‑competitive throughput without the data ever leaving the device, could appeal to developers building privacy‑first or offline‑first AI products, a segment that VentureBeat notes is gaining momentum as enterprises seek to reduce data‑exfiltration risks.
The implications for the AI hardware ecosystem are twofold. First, the results challenge the prevailing narrative that Apple Silicon lags behind dedicated accelerators for LLM inference; Meta’s 658 t/s figure suggests that, with a stripped‑down stack, Apple’s GPUs can rival or exceed the performance of many x86‑based inference servers on comparable model sizes. Second, the benchmark reinforces the importance of software co‑design: the same 4‑bit quantized models run across MetalRT and mlx‑lm, yet the former extracts a consistent 10‑19 % speed benefit purely through engine optimizations. As AI workloads continue to migrate to the edge, vendors that can marry hardware capabilities with tightly tuned inference runtimes—whether through proprietary solutions like MetalRT or open‑source collaborations—are likely to capture the next wave of on‑device AI adoption.
Sources
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.