Nvidia launches Nemotron‑3‑Nano, a 4B hybrid Mamba‑Attention model running locally in the browser
Nvidia unveiled Nemotron‑3‑Nano, a 4‑billion‑parameter hybrid Mamba‑Attention model that runs entirely in a browser via WebGPU, with demos showing about 75 tokens per second on an M4 Max, according to a recent report.
Key Facts
- Key company: Nvidia
Nemotron‑3‑Nano’s architecture blends a 4‑billion‑parameter Mamba recurrent block with a conventional attention stack, a design Nvidia describes as “hybrid Mamba‑Attention” to capture both long‑range dependencies and rapid token‑level reasoning (VentureBeat). The Mamba component, originally introduced for efficient state‑space modeling, processes sequences as a continuous‑time dynamical system, reducing the quadratic cost of self‑attention for very long contexts. By interleaving this with a lightweight transformer‑style attention layer, the model can retain the expressive power needed for reasoning tasks while keeping inference latency low enough for client‑side execution.
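The interleaving idea can be illustrated with a deliberately tiny sketch: a linear-time recurrent scan (a one-dimensional stand-in for a Mamba-style state-space layer) followed by a single attention readout. This is an illustration of the general pattern only, not Nvidia's actual kernels or dimensions; the coefficients `a` and `b` are arbitrary.

```javascript
// Toy sketch of a hybrid "state-space + attention" block (illustrative only;
// Nemotron-3-Nano's real architecture is far larger and multi-dimensional).

// Linear-time recurrent scan: h[t] = a*h[t-1] + b*x[t].
// One update per token, so the whole pass is O(n) in sequence length.
function ssmScan(xs, a = 0.9, b = 0.1) {
  const hs = [];
  let h = 0;
  for (const x of xs) {
    h = a * h + b * x;
    hs.push(h);
  }
  return hs;
}

// Single-query dot-product attention over scalar keys/values.
// Run once per token, this is the O(n^2) part the SSM pass avoids.
function attend(query, keys, values) {
  const scores = keys.map(k => query * k);
  const m = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - m)); // numerically stable softmax
  const z = exps.reduce((acc, e) => acc + e, 0);
  return exps.reduce((acc, e, i) => acc + (e / z) * values[i], 0);
}

// Hybrid block: cheap recurrent pass over the whole sequence, then one
// attention readout that mixes the recurrent states with the raw inputs.
function hybridBlock(xs) {
  const hs = ssmScan(xs);
  const query = hs[hs.length - 1]; // last recurrent state as the query
  return attend(query, hs, xs);
}
```

Because the softmax weights are a convex combination, `hybridBlock` returns a value inside the range of its inputs; the design point is that the expensive attention step runs once over precomputed states rather than at every position.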
The key to running the model entirely in a browser is WebGPU, the emerging web standard that exposes low‑level GPU compute to JavaScript. A community‑built demo on Hugging Face uses Transformers.js to load the model’s weight tensors into GPU buffers and execute the Mamba kernels via WebGPU’s compute shaders (report). On an Apple M4 Max chip, the demo achieves roughly 75 tokens per second — modest compared to server‑grade GPUs, but enough to show that a 4B LLM can be interactive on consumer hardware without any remote API calls. The demo’s source code, openly available on Hugging Face, shows how the model’s inference graph is partitioned between WebGPU‑accelerated matrix multiplications and CPU‑fallback operations for control flow.
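In Transformers.js, selecting WebGPU is a one-line option on the pipeline call. The sketch below shows the general shape of such a demo; the model identifier is a placeholder, not a confirmed Hugging Face repo name, and the quantization setting is an assumption.

```javascript
// Browser-side sketch using Transformers.js (runs in a WebGPU-capable browser,
// not in Node). The model id below is a hypothetical placeholder.
import { pipeline } from "@huggingface/transformers";

const generator = await pipeline(
  "text-generation",
  "your-org/nemotron-3-nano-placeholder", // placeholder, not a real repo name
  {
    device: "webgpu", // execute kernels via WebGPU compute shaders
    dtype: "q4",      // assumed 4-bit quantization to fit client GPUs
  }
);

const out = await generator("Explain state-space models in one sentence.", {
  max_new_tokens: 64,
});
console.log(out[0].generated_text);
```

The `device` and `dtype` options are standard Transformers.js parameters; everything model-specific here is illustrative.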
From a performance perspective, the hybrid approach sidesteps the memory bottlenecks that have limited earlier on‑device LLMs. Traditional transformer‑only models of comparable size often require several gigabytes of VRAM to store attention matrices, which exceeds the capacity of most integrated GPUs. By offloading the bulk of sequence processing to the Mamba block, Nemotron‑3‑Nano reduces peak memory usage to under 2 GB on the M4 Max, according to the demo’s profiling data. This memory efficiency opens the door for a broader class of devices—laptops, tablets, and even smartphones with WebGPU support—to run sophisticated language models locally, eliminating latency and privacy concerns associated with cloud inference.
Nvidia positions Nemotron‑3‑Nano as a stepping stone toward “agentic AI” that can operate at the edge, a narrative echoed in VentureBeat’s coverage of the broader Nemotron 3 family, which includes larger MoE (Mixture‑of‑Experts) variants for data‑center deployment. The Nano version’s modest size and hybrid design are intended for scenarios where on‑device responsiveness is paramount, such as real‑time code assistance, offline document summarization, or interactive tutoring applications. By exposing the model through standard web APIs, Nvidia also invites third‑party developers to embed LLM capabilities directly into web apps without negotiating API keys or handling server scaling.
The release arrives amid a wave of hybrid architectures from other vendors—IBM’s Granite 4 and the “Western Qwen” project both combine Mamba‑style state‑space layers with transformers to improve efficiency (VentureBeat). Nvidia’s contribution is notable for its early adoption of WebGPU as the execution substrate, effectively turning the browser into a lightweight inference engine. While the 75‑token‑per‑second figure will need to improve for demanding workloads, the proof‑of‑concept validates a new deployment paradigm: high‑quality language models that run entirely client‑side, preserving user data and reducing reliance on centralized inference services.
Sources
No primary source found (coverage-based)
- Reddit - r/LocalLLaMA
Reporting based on verified sources and public filings. Sector HQ editorial standards require multi-source attribution.