Llama.cpp releases b8299 update, boosting performance and adding new model support
Before the update, Llama.cpp's DeltaNet path lagged on Apple Silicon, in part because it redundantly repeated query and key tensors; with the b8299 patch it gains a chunked fused GDN path, a Metal‑backed gated‑delta‑net kernel, and Q/K repeat avoidance, delivering noticeably faster GPU inference for DeltaNet models.
Key Facts
- Key company: Llama.cpp
The b8299 patch marks the most substantial performance overhaul for Llama.cpp since its initial Apple‑Silicon support, adding a “chunked fused GDN” path that eliminates the costly repetition of query‑key tensors and introduces a Metal‑backed gated‑delta‑net (GDN) kernel. According to the commit log, the new path (issue #20340) restructures the computation graph so that Q and K tensors are generated only once per batch, a change that alone reduces memory traffic on M‑series GPUs. The Metal kernel (issue #20361) implements the GDN recurrence op in a fused fashion, supporting both scalar‑gate (GDA) and per‑row‑gate (KDA) modes for head sizes of 64 and 128, while gracefully falling back to CPU for unsupported configurations such as head‑size 32 or non‑contiguous tensors. Benchmarks supplied by the developers show a 25 percent throughput gain on the Qwen 3.5‑0.8B Q4_K_M model running on an M4 Max, climbing from 170 to 213 tokens per second.
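To make the GDA/KDA distinction concrete, here is a minimal NumPy sketch of a gated delta‑rule recurrence of the kind such a kernel fuses. The function name and the exact update rule are illustrative assumptions, not Llama.cpp's implementation: the state is first decayed by a gate (a scalar in GDA‑style mode, a per‑key‑channel vector in KDA‑style mode) and then receives a delta‑rule correction toward the incoming value.

```python
import numpy as np

def gdn_step(S, q, k, v, beta, gate):
    """One hypothetical gated delta-rule step (illustrative, not llama.cpp's code).

    S    : (d_k, d_v) recurrent state
    gate : scalar decay (GDA-style) or (d_k, 1) per-channel decay (KDA-style)
    beta : scalar write strength
    Returns the updated state and the readout for query q.
    """
    S = gate * S                            # decay; broadcasting covers both gate modes
    v_hat = S.T @ k                         # what the decayed state predicts for key k
    S = S + beta * np.outer(k, v - v_hat)   # delta-rule correction toward v
    return S, S.T @ q                       # read out with the query

# Toy usage: with full decay (gate=0) and beta=1 the state stores outer(k, v),
# so querying with the same unit-norm key retrieves v exactly.
d_k, d_v = 64, 64
S = np.zeros((d_k, d_v))
k = np.zeros(d_k); k[0] = 1.0               # unit-norm key
v = np.arange(d_v, dtype=float)
S, out = gdn_step(S, k, k, v, beta=1.0, gate=0.0)
```

A fused kernel evaluates these steps without materializing intermediates in device memory; the chunked path additionally processes the sequence in blocks so most of the work becomes batched matrix multiplies.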
Beyond the Apple‑Silicon improvements, the update also tightens the CUDA backend for gated‑delta‑net inference. The commit (issue #20391) adds a FastDiv routine and sharding of columns across warps, which lowers register pressure and frees warp‑scheduler capacity, thereby hiding data‑access latency. The developers note that these changes eliminate register spills for a stride value of 128 and enable more concurrent thread blocks, a benefit that translates to smoother scaling on NVIDIA GPUs running CUDA 12.4 or 13.1. The same patch refactors the llm_build_delta_net_base API, aligning the CUDA and Metal implementations under a common abstraction and making the codebase more maintainable across platforms.
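“FastDiv” refers to the standard trick of replacing integer division by a divisor that is constant at kernel‑launch time with a multiply and a shift, using a precomputed fixed‑point reciprocal. The sketch below shows the idea in Python under that general assumption; the actual CUDA routine in the patch will differ in integer width and layout.

```python
def fastdiv_magic(d: int) -> int:
    """Precompute a 64-bit fixed-point reciprocal of divisor d.
    Round-up method: exact for all 0 <= n < 2**32 with 1 <= d < 2**32."""
    return (1 << 64) // d + 1

def fastdiv(n: int, magic: int) -> int:
    """Compute n // d via multiply-and-shift, avoiding a hardware divide."""
    return (n * magic) >> 64

# In a GPU kernel the magic constant is computed once on the host and reused
# by every thread, e.g. to map a flat index to (row, col) for a fixed stride:
magic = fastdiv_magic(128)           # stride 128, the value called out in the patch notes
flat = 1_000_003
row = fastdiv(flat, magic)           # flat // 128 without a divide instruction
col = flat - row * 128               # remainder recovered with a multiply-subtract
```

On GPUs this matters because integer division is slow and expands to many instructions, consuming the registers whose pressure the patch is trying to relieve; a multiply‑shift pair is cheap and keeps the hot loop compact.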
The broader impact of the b8299 release is evident in the expanding hardware matrix that Llama.cpp now officially supports. The repository’s platform list now includes macOS/iOS (Apple‑Silicon and Intel), Linux (Ubuntu x64 with CPU, Vulkan, ROCm 7.2, and s390x), Windows (CPU, CUDA 12/13, Vulkan, SYCL, HIP), and openEuler (x86 and aarch64 with ACL Graph). This diversification, highlighted in the commit comments, signals the project’s ambition to become the de‑facto inference engine for a wide range of edge and server environments, from Raspberry Pi Zero USB sticks (as reported by Tom’s Hardware) to full‑size laptops and desktops (as noted by The Register and Ars Technica).
Industry observers see the performance gains as a catalyst for broader adoption of Llama.cpp in production workloads. The 25 percent speedup on Apple Silicon narrows the gap between on‑device inference and cloud‑based services, a development that could encourage developers to embed large‑language‑model capabilities directly into consumer devices. At the same time, the CUDA enhancements make the library more competitive with proprietary SDKs for NVIDIA GPUs, potentially attracting enterprise users who require high‑throughput batch processing without incurring the licensing costs of commercial frameworks. By consolidating both Metal and CUDA pathways under a unified, open‑source codebase, Llama.cpp positions itself as a versatile, cost‑effective alternative for organizations looking to scale AI workloads across heterogeneous hardware.
The b8299 update also underscores the collaborative nature of the project, with contributions from a diverse set of engineers—including Aman Gupta, Paul Flynn, Claude Opus (Anthropic), Oliver Simons (NVIDIA), and uvos—reflecting a cross‑industry effort to optimize LLM inference. The commit history notes extensive code‑clean‑up, contiguity validation for input tensors, and algorithm‑equivalence comments that improve both reliability and readability. Such community‑driven rigor is rare in the fast‑moving AI tooling space and may give Llama.cpp a durability advantage as the ecosystem matures.
In sum, the b8299 patch delivers a concrete performance uplift on Apple‑Silicon, extends GPU acceleration to CUDA platforms, and broadens the library’s hardware compatibility. For developers and enterprises alike, the update reduces the compute cost of running DeltaNet‑based models such as Qwen 3.5, making on‑device and edge deployments more feasible. As the AI landscape continues to fragment across devices and accelerators, Llama.cpp’s open‑source, multi‑backend strategy—bolstered by this latest code push—could become a decisive factor in shaping where and how large‑language‑models are deployed.
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.