
Google Tests All Gemma 4 Models on MacBook, Finds Which Run On‑Device

Published by SectorHQ Editorial


While many expected Google’s new Gemma 4 suite to run effortlessly on any laptop, a recent report shows only the smaller models actually operate on‑device, with the larger variants stalling on a MacBook M4 Pro with 24 GB RAM.

Key Facts

  • Key company: Google

Google’s on‑device benchmark shows that only the two smallest Gemma 4 models, E2B (2.3 B parameters) and E4B (4.5 B), can actually run on a MacBook M4 Pro with 24 GB of RAM, while the larger 26‑B‑parameter MoE variant (A4B) and the 31‑B dense model fail to load or stall, according to a hands‑on test posted by akartit on kartit.net on April 4. The author measured raw token‑generation speed with both Ollama and the Unsloth MLX runtime, finding that Ollama delivers 95 tokens/s on E2B and 57 tokens/s on E4B, whereas Unsloth MLX lags slightly (81 and 49 tokens/s respectively) but consumes roughly 40% less memory. The 26‑B A4B model, which requires 16–18 GB at 4‑bit quantisation, drops to about 2 tokens/s when forced to swap, and the 31‑B model cannot fit into the 24‑GB memory envelope at all.
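Throughput figures of this kind are straightforward to reproduce: when streaming is disabled, Ollama’s /api/generate response includes eval_count (tokens generated) and eval_duration (in nanoseconds), from which raw generation speed falls out directly. The sketch below assumes a local Ollama server on the default port; the model tag gemma4:e2b is a hypothetical placeholder for whatever tag the release actually ships under.

```python
# Minimal throughput check against a local Ollama server.
# Assumption: "gemma4:e2b" is a hypothetical model tag; substitute
# whatever `ollama list` reports on your machine.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma4:e2b",   # hypothetical tag for the 2.3 B model
        "prompt": "Explain mixture-of-experts routing in two sentences.",
        "stream": False,         # single JSON reply including timing fields
    },
    timeout=300,
)
data = resp.json()

# eval_count = generated tokens, eval_duration = generation time in ns.
tokens_per_sec = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"{tokens_per_sec:.1f} tokens/s")
```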

The performance gap is most evident in multimodal tasks. Audio‑to‑text transcription, tested in three languages via Ollama’s OpenAI‑compatible endpoint, works reliably only on the E4B model. In English, E4B produced a perfect transcription in 1.0 seconds, complete with punctuation, while E2B took 2.8 seconds and delivered a garbled output missing words and punctuation. French and Arabic followed the same pattern: E4B rendered flawless transcriptions in 1.6 seconds and 6.0 seconds respectively, whereas E2B’s results were fragmented or nonsensical. The author notes that “only E2B and E4B support audio,” implying that the larger models lack the necessary audio pipelines or exceed memory constraints for this modality.
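The report does not reproduce the exact request payloads, but Ollama’s OpenAI‑compatible route lives at /v1/chat/completions, and OpenAI‑style clients pass audio as a base64 input_audio content part. A minimal sketch under those assumptions (the route accepting input_audio is an assumption, and gemma4:e4b is again a hypothetical tag):

```python
# Sketch of an audio-to-text request through Ollama's OpenAI-compatible
# endpoint. Assumptions: the route accepts OpenAI-style "input_audio"
# content parts, and "gemma4:e4b" is a hypothetical model tag.
import base64
import requests

with open("sample.wav", "rb") as f:   # any short WAV clip
    audio_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "gemma4:e4b",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this audio verbatim."},
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        }],
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```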

Image‑processing benchmarks also favor the smaller variants. Both E2B and E4B handle text‑plus‑image prompts within the 4‑GB and 5.5‑GB memory footprints reported by the Gemma 4 technical sheet, but the 26‑B A4B model, despite its MoE architecture that activates only 4 B parameters at inference, still requires 16–18 GB of RAM and consequently suffers severe slowdown when forced to swap. The author’s “interactive version with playable audio, live charts, and the working React app” (gemma4‑benchmark.pages.dev) demonstrates that the larger models cannot sustain real‑time interaction on a typical laptop, reinforcing Google’s own classification of the 2.3 B and 4.5 B models as “phones and edge” and the A4B MoE as “laptops.”
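For the image side, Ollama’s native /api/generate accepts base64‑encoded images alongside the prompt for multimodal models, so a text‑plus‑image probe of the kind described above can be sketched as follows (the model tag is again a hypothetical placeholder):

```python
# Sketch of a text-plus-image prompt via Ollama's native API.
# The "images" field takes base64-encoded image data for multimodal models.
import base64
import requests

with open("chart.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma4:e2b",   # hypothetical tag
        "prompt": "Describe what this image shows.",
        "images": [img_b64],
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["response"])
```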

From a business perspective, the findings suggest that Google’s claim of a “new ceiling for on‑device MoE models” may be premature for consumer‑grade hardware. While the 26‑B A4B variant shows impressive throughput on a high‑end workstation (the author cites a Strix Halo with 128 GB RAM achieving “impressive” speeds), it remains out of reach for most developers and end users who rely on laptops or thin clients. This disparity could limit the commercial appeal of Gemma 4’s larger models, especially when competitors such as Meta’s Llama 3 or open‑source alternatives from the community can be run on similar hardware with comparable or better efficiency.

Nevertheless, the successful deployment of E2B and E4B on a MacBook M4 Pro validates Google’s strategy of offering a tiered family of models that scale from “phones and edge” to “laptops.” The 4‑bit quantised weights keep memory usage modest (4 GB for E2B, 5.5 GB for E4B), allowing developers to embed multimodal capabilities (text, image, audio) directly into desktop applications without resorting to cloud inference. As enterprises continue to prioritize data privacy and low latency, the ability to run even a 4.5 B‑parameter model locally could become a differentiator, provided that software stacks like Ollama and Unsloth MLX continue to optimise performance and memory overhead.
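A back‑of‑envelope weight count makes the tiering concrete: at 4‑bit quantisation each parameter costs half a byte, and the measured footprints above run higher because they also include activations, KV cache, and the multimodal encoders.

```python
# Weight-only memory at 4-bit quantisation (0.5 bytes per parameter).
# Ignores KV cache, activations, and runtime overhead, which is why the
# measured footprints in the article run higher.
for name, params in {
    "E2B": 2.3e9,
    "E4B": 4.5e9,
    "A4B (26 B total, 4 B active)": 26e9,
    "dense 31 B": 31e9,
}.items():
    print(f"{name}: ~{params * 0.5 / 1e9:.1f} GB of weights")
```

The roughly 1.2 GB and 2.3 GB weight loads of E2B and E4B leave generous headroom in 24 GB of unified memory, whereas the 26‑B MoE (about 13 GB, since all experts must stay resident even though only 4 B are active per token) and the 31‑B dense model (about 15.5 GB) leave little room for the OS and KV cache, which matches the swapping behaviour the author observed.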

Sources

Primary source

No primary source found (coverage-based)

Other signals
  • Dev.to Machine Learning Tag
  • Reddit - r/LocalLLaMA New

Reporting based on verified sources and public filings. Sector HQ editorial standards require multi-source attribution.
