Gemma 4 Visual Guide Launches, Showcasing New AI Features and Capabilities
While earlier Gemma models relied on a single shared embedding matrix, Gemma 4 introduces per‑layer embeddings that lift its smallest dense models to roughly 2–4 billion effective parameters — a leap detailed in the visual guide covered by the Newsletter following DeepMind's April release.
Key Facts
- Key model family: Gemma 4 (developed by DeepMind)
Gemma 4’s most visible upgrade is the introduction of per‑layer embeddings that effectively double the parameter count of the smallest dense models. According to the DeepMind release on the AI‑focused Newsletter, the “E2B” variant now carries roughly 2 billion effective parameters, while the “E4B” scales that to about 4 billion — a stark contrast to the Gemma 3 series, which relied on a single shared embedding matrix across all layers. The per‑layer design lets each transformer block learn its own positional and token‑level nuances, a change that DeepMind’s team says improves both language understanding and multimodal reasoning without a proportional increase in FLOPs.
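The guide does not spell out the mechanics of the per‑layer design, but the idea can be sketched as a shared token embedding augmented by a compact, layer‑specific table whose output is projected up and added before each block. All class names, dimensions, and the additive combination below are illustrative assumptions, not DeepMind's published implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

class PerLayerEmbedding:
    """Hypothetical sketch: a shared embedding table plus one small
    per-layer table, so each transformer block sees its own variant
    of the token representation."""

    def __init__(self, vocab, d_model, n_layers, d_ple):
        self.shared = rng.standard_normal((vocab, d_model)) * 0.02
        # one compact table per layer, far smaller than the shared one
        self.per_layer = [rng.standard_normal((vocab, d_ple)) * 0.02
                          for _ in range(n_layers)]
        # projections lifting the compact embedding up to d_model
        self.proj = [rng.standard_normal((d_ple, d_model)) * 0.02
                     for _ in range(n_layers)]

    def layer_input(self, token_ids, layer):
        base = self.shared[token_ids]                      # shared part
        extra = self.per_layer[layer][token_ids] @ self.proj[layer]
        return base + extra                                # layer-specific part

emb = PerLayerEmbedding(vocab=1000, d_model=64, n_layers=4, d_ple=8)
ids = rng.integers(0, 1000, size=(2, 16))   # a tiny batch of token ids
h0 = emb.layer_input(ids, layer=0)
print(h0.shape)  # (2, 16, 64)
```

Because the per‑layer tables are narrow (`d_ple` ≪ `d_model`), they raise the effective parameter count without a matching rise in per‑token FLOPs, which is consistent with the guide's framing.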
Both dense models share a hybrid attention scheme that interleaves sliding‑window (local) attention with a final global attention layer. As described in the visual guide, the sliding window is set to 512 tokens for the E2B and E4B models, while the larger 26B‑parameter MoE (A4B) and 31B‑parameter dense variants use a 1,024‑token window. The local attention restricts each token's view to a fixed‑size context, dramatically cutting the quadratic cost of full‑sequence attention. The final global layer then aggregates information across the entire prompt, ensuring that long‑range dependencies are still captured. This pattern mirrors the Gemma 3 architecture but adds two refinements: the keys are forced to equal the values (K = V) for the global attention stage, and a low‑frequency‑pruned rotary positional encoding (p‑RoPE) is applied to the embeddings, which DeepMind claims reduces positional drift on very long sequences.
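The local/global split comes down to which attention mask each layer uses. A minimal sketch of the two causal masks (window size and sequence length are toy values; the real models use 512‑ or 1,024‑token windows as noted above):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal local mask: token i may attend only to the `window`
    most recent tokens, i.e. j in (i - window, i]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def global_causal_mask(seq_len: int) -> np.ndarray:
    """Full causal mask used by the final global attention layer."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return j <= i

# Toy schedule mirroring the interleaved pattern: several local layers,
# then one global layer that sees the whole prompt.
local = sliding_window_mask(8, window=3)
full = global_causal_mask(8)
print(local.sum(), full.sum())  # local mask admits far fewer pairs
```

The cost saving follows directly: a local layer scores at most `seq_len * window` query–key pairs instead of the roughly `seq_len**2 / 2` pairs of a full causal layer.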
The 26 B‑parameter “A4B” model adopts a Mixture‑of‑Experts (MoE) topology, activating only 4 billion parameters at inference time. In practice, the model routes each token through a subset of expert feed‑forward layers, keeping the compute budget comparable to the dense 4 B models while preserving the capacity benefits of a larger parameter pool. DeepMind’s guide notes that the MoE design is paired with the same 1 024‑token sliding window and global attention stack, meaning that the expert routing logic operates on the same interleaved attention schedule as the dense variants. This consistency simplifies deployment across hardware platforms, as the same inference kernel can handle both dense and MoE paths with only a switch in the routing layer.
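The routing step described above — each token activating only a subset of expert feed‑forward layers — can be illustrated with a generic top‑k gating sketch. This is a textbook MoE router, not DeepMind's actual routing logic, and all dimensions and function names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_forward(x, gate_w, experts, top_k=2):
    """Token-level top-k expert routing (illustrative sketch).
    x: (tokens, d); gate_w: (d, n_experts); experts: list of (d, d)."""
    logits = x @ gate_w                               # routing score per expert
    top = np.argsort(-logits, axis=1)[:, :top_k]      # pick k best experts
    sel = np.take_along_axis(logits, top, axis=1)
    w = np.exp(sel - sel.max(axis=1, keepdims=True))  # softmax over the
    w /= w.sum(axis=1, keepdims=True)                 # selected experts only
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                       # each token runs only
        for k in range(top_k):                        # its k chosen experts
            e = top[t, k]
            out[t] += w[t, k] * (x[t] @ experts[e])
    return out

d, n_exp = 16, 8
x = rng.standard_normal((4, d))
gate = rng.standard_normal((d, n_exp))
experts = [rng.standard_normal((d, d)) for _ in range(n_exp)]
y = moe_forward(x, gate, experts)
print(y.shape)  # (4, 16)
```

The key property matches the article's claim: compute per token scales with `top_k`, not with the total number of experts, so a large nominal parameter pool runs at a small active-parameter budget.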
All four Gemma 4 models are multimodal by default. The visual guide emphasizes that each model can ingest images of varying resolutions, process them alongside text, and even handle audio inputs on the smaller E2B/E4B configurations. Training data included a broad spectrum of image sizes, enabling the models to learn scale‑invariant visual features without the need for explicit resizing pipelines. According to the Newsletter, the multimodal capability is baked into the transformer’s token stream: image patches are projected into the same embedding space as text tokens, then passed through the interleaved attention layers. The p‑RoPE modification, originally designed for language, also improves the alignment of visual tokens across different spatial frequencies, reducing artifacts when the model reasons about fine‑grained visual detail.
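The "single token stream" idea — image patches projected into the same embedding space as text — can be sketched in a few lines. Patch size, embedding width, and the plain linear projection below are illustrative assumptions; the guide does not specify the vision encoder's internals:

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify_and_project(image, patch, w_proj):
    """Split an image into non-overlapping patches, flatten each,
    and project into the text embedding space (toy dimensions)."""
    h, w, c = image.shape
    patches = (image.reshape(h // patch, patch, w // patch, patch, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, patch * patch * c))
    return patches @ w_proj                    # (n_patches, d_model)

d_model = 32
img = rng.standard_normal((8, 8, 3))           # tiny 8x8 RGB "image"
w = rng.standard_normal((4 * 4 * 3, d_model))  # projection for 4x4 patches
vis_tokens = patchify_and_project(img, 4, w)   # -> 4 visual tokens
text_tokens = rng.standard_normal((5, d_model))
stream = np.concatenate([vis_tokens, text_tokens])  # one joint token stream
print(stream.shape)  # (9, 32)
```

Once visual and text tokens share one sequence, the interleaved local/global attention layers described earlier operate on both modalities uniformly.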
From a deployment standpoint, DeepMind highlights that the per‑layer embeddings and MoE routing impose modest memory overheads. The dense E2B/E4B models fit comfortably on a single high‑end GPU (≈ 24 GB VRAM) when using 16‑bit precision, while the 31 B dense model requires a multi‑GPU setup with tensor‑parallel sharding. The A4B MoE, despite its larger nominal size, can be run on a single GPU because only a fraction of its experts are active per token. This flexibility gives developers a clear path to select a model that matches their hardware constraints without sacrificing multimodal functionality—a key selling point that DeepMind underscores throughout the guide.
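The VRAM figures above follow from simple arithmetic: at 16‑bit precision each parameter costs 2 bytes, so weights alone for a 4B model need about 7.5 GiB. The helper below is a back‑of‑envelope sketch covering weights only; activations and the KV cache add further overhead:

```python
def weight_vram_gib(params_billions: float, bytes_per_param: int = 2) -> float:
    """Rough weight-only memory estimate (GiB) at the given precision.
    bytes_per_param=2 corresponds to fp16/bf16."""
    return params_billions * 1e9 * bytes_per_param / 2**30

print(round(weight_vram_gib(4), 1))   # dense E4B at bf16: ~7.5 GiB
print(round(weight_vram_gib(31), 1))  # 31B dense: ~57.7 GiB -> multi-GPU
print(round(weight_vram_gib(26), 1))  # A4B stores ~48.4 GiB of experts,
                                      # but only ~4B params are active/token
```

Note the distinction for the A4B MoE: sparse activation cuts per‑token compute to the 4B‑active level, while the full expert pool must still be stored somewhere (resident, offloaded, or quantized), which is the trade‑off behind the single‑GPU claim.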