Small MoE models benchmarked on M1 Max: NVIDIA's Nemotron-3-Nano leads raw throughput, GLM-4.7-Flash trades speed for deeper reasoning
In head-to-head tests on a 64 GB M1 Max MacBook Pro, NVIDIA's Nemotron-3-Nano posted the highest raw throughput while GLM-4.7-Flash traded speed for far longer reasoning traces, reports indicate, with both ~30 B MoE models activating only ~3 B parameters per token.
Quick Summary
- NVIDIA's Nemotron-3-Nano posted the highest raw throughput in head-to-head tests on a 64 GB M1 Max MacBook Pro, while GLM-4.7-Flash traded speed for far longer reasoning traces; both ~30 B MoE models activate only ~3 B parameters per token.
- Key company: NVIDIA
- Also mentioned: Qwen3-Coder-30B, Zhipu AI
GLM-4.7-Flash's defining trait on the M1 Max is its DeepSeek-V2 MoE core combined with a "thinking" mode that prepends an extended chain-of-thought (CoT) trace to each answer. In the benchmark run on a 64 GB unified-memory MacBook Pro, the model achieved a prompt-evaluation speed of 99.4 tokens per second (tok/s) and a generation rate of 36.8 tok/s, according to the author's "Benchmarked 3 Small MoE Models" report. Those raw numbers trail Nemotron-3-Nano's 136.9 tok/s prefill and 43.7 tok/s generation, and GLM's larger reasoning trace, averaging 2-5× more tokens than Nemotron's, pushes total time-to-answer higher still. For a general-knowledge query, GLM required 15.6 seconds versus Nemotron's 6.9 seconds, a gap driven by the 2,163-character reasoning segment GLM emits before its 868-character answer (source: benchmark data).
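The timing gap follows directly from the reported rates. A minimal sketch of the arithmetic, using the benchmark's tok/s figures; the prompt length and the output-token counts (derived from the reported character counts at an assumed ~4 characters per token) are rough estimates, not numbers from the report:

```python
def time_to_answer(prompt_tokens, output_tokens, prefill_tps, gen_tps):
    """Rough time-to-answer: prompt prefill time plus token-generation time."""
    return prompt_tokens / prefill_tps + output_tokens / gen_tps

# Reported rates: GLM-4.7-Flash (99.4 prefill, 36.8 gen tok/s),
# Nemotron-3-Nano (136.9 prefill, 43.7 gen tok/s).
# Output tokens for GLM estimated from the reported 2163-char reasoning
# trace plus 868-char answer at ~4 chars/token; prompt length and
# Nemotron's output count are illustrative assumptions.
glm = time_to_answer(prompt_tokens=60, output_tokens=(2163 + 868) // 4,
                     prefill_tps=99.4, gen_tps=36.8)
nemo = time_to_answer(prompt_tokens=60, output_tokens=250,
                      prefill_tps=136.9, gen_tps=43.7)
print(f"estimated time-to-answer: GLM ~{glm:.1f}s, Nemotron ~{nemo:.1f}s")
```

The estimates land in the same ballpark as the measured 15.6 s and 6.9 s; most of GLM's wall-clock time is generation of the long reasoning trace, not prefill.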
Nemotron-3-Nano, NVIDIA's hybrid Mamba-2 + Transformer MoE, was engineered for "light" CoT, activating roughly 3.2 B of its 31.6 B parameters per token. Its prompt-prefill speed reached 136.9 tok/s, the fastest of the three at prefill, and generation hovered around 43.7 tok/s. The model's thinking phase is correspondingly modest: a math-reasoning prompt produced 482 output tokens total (213 of them reasoning) and completed in 10.8 seconds, compared with GLM's 39.5 seconds for the same task. The benchmark shows Nemotron's efficiency gains are most pronounced on prompts that demand less extensive reasoning, consistent with NVIDIA's design goal of a "light CoT" that minimizes token overhead while still delivering coherent answers (source: benchmark data).
Qwen3-Coder-30B, Alibaba's transformer-MoE model quantized to IQ4_XS, eschews a thinking mode entirely. Its inference pipeline therefore skips the reasoning phase, delivering the shortest perceived latency. Its generation speed of 58.5 tok/s is the highest of the trio, its prompt-eval speed of 132.1 tok/s sits just behind Nemotron's, and its total token counts per prompt are dramatically lower: 199 tokens for a general-knowledge query and 277 for math reasoning. Consequently, Qwen3-Coder answered the same general-knowledge prompt in just 3.3 seconds, less than half Nemotron's time and roughly a fifth of GLM's. The trade-off is the lack of explicit chain-of-thought output, which may hurt interpretability on complex tasks (source: benchmark data).
File‑size and memory footprints also differentiate the three. GLM‑4.7‑Flash occupies 16 GB on disk and consumes about 16.9 GB of the M1 Max’s unified memory when loaded with Q4_K_XL quantization (4.68 bits per weight). Nemotron‑3‑Nano is larger—22 GB on disk and roughly 22 GB of VRAM usage—reflecting its higher bits‑per‑weight (5.78 BPW) quantization. Qwen3‑Coder is the most compact, at 15 GB on disk and 15.8 GB of VRAM, thanks to the aggressive IQ4_XS (4.29 BPW) scheme. These differences matter for developers targeting Apple Silicon, where unified‑memory constraints can dictate model selection as much as raw speed (source: benchmark table).
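The disk-size figures follow almost mechanically from parameter count times bits per weight. A quick sanity check of the table; only Nemotron's 31.6 B parameter count appears in the article, so the GLM and Qwen counts below are assumptions, and real GGUF files mix precisions across tensors, so actual sizes deviate slightly:

```python
def quantized_size_gb(params_billion, bits_per_weight):
    """Approximate on-disk size of a quantized model: params x BPW / 8 bits/byte.
    (billions of params and GB cancel, so no unit conversion is needed)."""
    return params_billion * bits_per_weight / 8

models = {
    "GLM-4.7-Flash (Q4_K_XL)":    (30.0, 4.68),  # ~30 B params: assumption
    "Nemotron-3-Nano (Q5_K-ish)": (31.6, 5.78),  # param count from the article
    "Qwen3-Coder-30B (IQ4_XS)":   (30.5, 4.29),  # ~30.5 B params: assumption
}
for name, (params_b, bpw) in models.items():
    print(f"{name}: ~{quantized_size_gb(params_b, bpw):.1f} GB")
```

The estimates (~17.6, ~22.8, and ~16.4 GB) line up reasonably with the reported 16, 22, and 15 GB files, which is why, on unified memory, the quantization scheme matters as much as the parameter count.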
Overall, the head‑to‑head tests suggest that while Nemotron‑3‑Nano leads in pure throughput, GLM‑4.7‑Flash offers richer, more transparent reasoning at the cost of longer answer times, and Qwen3‑Coder delivers the fastest user‑perceived response by omitting thinking altogether. The results underscore the nuanced trade‑offs between model architecture, quantization strategy, and CoT design when deploying ~30 B MoE models on Apple’s M1 Max platform (source: benchmark report).
Sources
No primary source found (coverage-based)
- Reddit - r/LocalLLaMA
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.