Three 30B MoE Models Benchmarked on an M1 Max: GLM‑4.7‑Flash vs Nemotron‑3‑Nano vs Qwen3‑Coder
Around 3 billion parameters fire per token in each of three 30 billion‑parameter MoE models benchmarked on a MacBook Pro M1 Max, revealing GLM‑4.7‑Flash, Nemotron‑3‑Nano and Qwen3‑Coder’s Apple‑silicon performance.
Quick Summary
- Three 30‑billion‑parameter MoE models, each activating roughly 3 billion parameters per token, were benchmarked head‑to‑head on a MacBook Pro M1 Max.
- Key company: Zhipu AI
- Also mentioned: NVIDIA, Alibaba, Qwen3‑Coder‑30B
Zhipu AI’s GLM‑4.7‑Flash, NVIDIA’s Nemotron‑3‑Nano‑30B and Alibaba’s Qwen3‑Coder‑30B were put through a head‑to‑head benchmark on a MacBook Pro M1 Max with 64 GB of unified memory, according to the author’s own test report. All three models are mixture‑of‑experts (MoE) architectures that activate roughly three billion parameters per token, a sweet spot for Apple‑silicon inference. The tests ran with `llama‑server` (build 8139), flash‑attention enabled, a context window of 4 K tokens and four parallel streams, providing a consistent environment for comparison.
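The test environment described above maps onto a llama.cpp server invocation along these lines (a sketch only: the model path is a placeholder, and flag spellings follow recent llama.cpp builds, so they may differ in build 8139):

```shell
# Sketch of the reported benchmark setup: flash attention on,
# a 4K context window, and four parallel request slots.
# Note: with -np 4, llama-server splits the total context across slots.
llama-server \
  -m ./model-q4.gguf \
  -c 4096 \
  -np 4 \
  --flash-attn \
  --port 8080
```

Running each model through the same invocation, varying only `-m`, is what makes the throughput numbers below directly comparable.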
In raw token‑per‑second throughput, Nemotron‑3‑Nano leads on prompt evaluation with an average of 136.9 tok/s, edging out Qwen3‑Coder’s 132.1 tok/s and GLM‑4.7‑Flash’s 99.4 tok/s. Generation speed tells a slightly different story: Qwen3‑Coder tops the chart at 58.5 tok/s, Nemotron follows at 43.7 tok/s, and GLM‑4.7‑Flash trails at 36.8 tok/s. The generation range for Qwen3‑Coder (57.0–60.2 tok/s) is notably tighter than the other two, indicating more consistent performance across varied prompts.
The per‑prompt breakdown reveals why the headline numbers can be misleading. For a general‑knowledge query, GLM‑4.7‑Flash pre‑fills at 54.9 tok/s but generates at 40.6 tok/s, while Nemotron‑3‑Nano pre‑fills at 113.8 tok/s and generates at 44.8 tok/s. Qwen3‑Coder’s pre‑fill sits in the middle at 75.1 tok/s but its generation jumps to 60.2 tok/s, making it feel the snappiest. Similar patterns appear in math‑reasoning, coding and networking‑explanation prompts, with Qwen3‑Coder consistently delivering the highest generation rates despite a modest pre‑fill speed.
A decisive factor is the “thinking token tax” imposed by the models’ internal reasoning modes. Both GLM‑4.7‑Flash and Nemotron‑3‑Nano are configured as “thinking” models, emitting extended chain‑of‑thought (CoT) traces before the final answer, whereas Qwen3‑Coder runs in a non‑thinking mode. The report shows GLM‑4.7‑Flash producing 2‑5× more reasoning tokens than Nemotron‑3‑Nano—for the TCP‑vs‑UDP prompt, GLM generated 1 664 tokens (4 567 characters of reasoning) versus Nemotron’s 1 101 tokens (181 characters of reasoning). Qwen3‑Coder, by contrast, emitted only 220 tokens (955 characters) for the same prompt, all content‑focused. This overhead translates directly into longer wall‑clock times: GLM‑4.7‑Flash took 47.7 seconds to answer the TCP‑vs‑UDP question, Nemotron‑3‑Nano required 26.2 seconds, while Qwen3‑Coder answered in just 3.8 seconds.
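The reported wall‑clock times can be sanity‑checked with a back‑of‑envelope calculation: total output tokens divided by each model's average generation rate (a sketch using the article's own numbers; it ignores prompt pre‑fill and sampling overhead):

```python
# Back-of-envelope check: wall-clock ≈ output tokens / generation tok/s.
# Token counts (TCP-vs-UDP prompt) and generation rates are taken from
# the benchmark figures above; pre-fill time is ignored.
models = {
    # name: (output tokens, generation tok/s)
    "GLM-4.7-Flash": (1664, 36.8),
    "Nemotron-3-Nano": (1101, 43.7),
    "Qwen3-Coder": (220, 58.5),
}

for name, (tokens, tok_per_s) in models.items():
    est_seconds = tokens / tok_per_s
    print(f"{name}: ~{est_seconds:.1f} s")
```

The estimates (about 45 s, 25 s and 3.8 s) track the reported 47.7 s, 26.2 s and 3.8 s closely, with the remaining gap plausibly down to pre‑fill time, which the division above leaves out.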
Quality-wise, all three models solved the included math problem correctly, suggesting comparable baseline competence despite speed differences. However, the report notes that GLM‑4.7‑Flash’s reasoning traces are substantially longer, which may be advantageous for applications that need transparent step‑by‑step explanations but detrimental for latency‑sensitive workloads. Nemotron‑3‑Nano offers a lighter CoT, striking a middle ground, while Qwen3‑Coder’s omission of thinking makes it the fastest but potentially less explainable.
From a deployment perspective, the models also differ in size and resource footprints. GLM‑4.7‑Flash occupies 16 GB on disk and uses about 16.9 GB of GPU VRAM when quantized to Q4_K_XL (4.68 bits per weight). Nemotron‑3‑Nano is larger at 22 GB on disk and 22 GB of VRAM (5.78 bits per weight), whereas Qwen3‑Coder is the most compact at 15 GB on disk and 15.8 GB of VRAM (IQ4_XS, 4.29 bits per weight). Licensing varies as well: GLM‑4.7‑Flash is released under MIT, Nemotron‑3‑Nano under NVIDIA’s open license, and Qwen3‑Coder under Apache 2.0, giving developers flexibility depending on commercial needs.
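The disk footprints are broadly consistent with a simple estimate of parameter count times bits per weight (a rough sketch: it assumes a flat 30B parameters for each model and ignores GGUF metadata and mixed‑precision tensors):

```python
# Rough model-file size estimate: params * bits_per_weight / 8 bytes,
# converted to GiB. Assumes a nominal flat 30B parameter count.
PARAMS = 30e9

def size_gib(bits_per_weight: float, params: float = PARAMS) -> float:
    return params * bits_per_weight / 8 / 2**30

for name, bpw in [("GLM-4.7-Flash (Q4_K_XL)", 4.68),
                  ("Nemotron-3-Nano", 5.78),
                  ("Qwen3-Coder (IQ4_XS)", 4.29)]:
    print(f"{name}: ~{size_gib(bpw):.1f} GiB")
```

The estimates land near the reported 16 GB and 15 GB for GLM‑4.7‑Flash and Qwen3‑Coder; Nemotron‑3‑Nano's reported 22 GB comes out higher than the flat‑30B estimate, suggesting its actual parameter count sits somewhat above the nominal figure.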
Overall, the benchmark underscores that on Apple‑silicon hardware, raw token throughput is only part of the story. The presence or absence of internal reasoning, quantization format, and model size all shape real‑world latency. For developers prioritizing speed and minimal memory overhead, Qwen3‑Coder appears the clear winner; for those who value explainable AI and are willing to tolerate longer response times, GLM‑4.7‑Flash offers richer CoT output. Nemotron‑3‑Nano occupies a middle niche, delivering solid speed with a lightweight reasoning mode.
Sources
No primary source found (coverage-based)
- Reddit: r/LocalLLaMA
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.