ik_llama.cpp Beats llama.cpp in Qwen3/3.5 MoE Model Benchmark
Benchmarks using 22,568‑token prompts showed ik_llama.cpp outpacing llama.cpp on the Qwen3‑Coder‑Next MoE model, a surprising result given that both engines ran on identical hardware: a Ryzen 9 5950X, 64 GB of RAM and an RTX 5070 Ti.
Key Facts
- Key projects: llama.cpp and its ik_llama.cpp fork
ik_llama.cpp’s edge on the Qwen3‑Coder‑Next MoE model is stark. Running on a Ryzen 9 5950X, 64 GB DDR4 and an RTX 5070 Ti, the fork delivered prompt‑processing speeds of 451 tokens/s (unsloth Q4_K_XL) versus 309 tokens/s for upstream llama.cpp, a roughly 46% gain (benchmark report). The advantage persisted across quantizations and providers: Q4_K_M saw 455 t/s versus 312 t/s, Q4_K_L 441 t/s versus 310 t/s, and even the lower‑precision Q4_0 run posted 424 t/s against 317 t/s. Generation throughput, however, remained essentially flat at 33–34 t/s for both back‑ends, indicating that ik_llama’s optimizations chiefly accelerate the initial prompt embedding and attention passes rather than the token‑by‑token decoder loop.
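The relative gains follow directly from the raw throughput figures. A quick sketch of the arithmetic, using the quantization labels and tokens/s numbers quoted above, also converts each rate into wall-clock time for the 22,568-token benchmark prompt:

```python
# Reported prompt-processing throughput (tokens/s) for Qwen3-Coder-Next:
# (ik_llama.cpp, llama.cpp) pairs, taken from the figures quoted above.
RESULTS = {
    "Q4_K_XL": (451, 309),
    "Q4_K_M": (455, 312),
    "Q4_K_L": (441, 310),
    "Q4_0": (424, 317),
}

PROMPT_TOKENS = 22_568  # prompt size used in the benchmark


def speedup_pct(fork_tps, upstream_tps):
    """Relative gain of the fork over upstream, in percent."""
    return (fork_tps - upstream_tps) / upstream_tps * 100


for quant, (ik_tps, up_tps) in RESULTS.items():
    gain = speedup_pct(ik_tps, up_tps)
    # Wall-clock time to evaluate the full prompt under each engine.
    t_ik, t_up = PROMPT_TOKENS / ik_tps, PROMPT_TOKENS / up_tps
    print(f"{quant}: +{gain:.0f}% prompt speed "
          f"({t_ik:.0f}s vs {t_up:.0f}s on the {PROMPT_TOKENS:,}-token prompt)")
```

On a prompt this long, a throughput gap of this size translates to tens of seconds of difference in time-to-first-token, which is why the prompt-processing numbers matter more than they might first appear.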
The picture reverses on the larger Qwen3.5‑35B‑A3B MoE architecture. With a 180 k context window, 24 CPU‑only MoE workers and 16 GPU threads, llama.cpp outpaced ik_llama by roughly 25 % on prompt evaluation, hitting 2,353 t/s (ubergarm Q4_0) compared with ik_llama’s 1,801 t/s. Unsloth’s Q4_K_XL quantization showed a similar split: 2,201 t/s for llama.cpp versus 1,726 t/s for ik_llama. Notably, ik_llama generated marginally more tokens per second (≈58 t/s versus 57 t/s) and succeeded in loading GGUF files that llama.cpp failed to parse for the AesSedai provider. This suggests that while ik_llama’s prompt pipeline lags on massive MoE contexts, its loader and generation path are more robust for certain quantizations.
Smaller MoE models again favor ik_llama. On the 9‑billion‑parameter Crow‑9B (Q6_K) and the Qwen3.5‑9B (Q6_K) checkpoints, ik_llama recorded prompt speeds exceeding 4,100 t/s, outpacing llama.cpp’s 3,850 t/s on the former. Generation, however, swung the other way: llama.cpp achieved 81.7 t/s versus ik_llama’s 73.2 t/s on the Crow‑9B model, underscoring a recurring trade‑off where ik_llama accelerates the forward pass but lags in the autoregressive loop for smaller networks.
Across all tests the hardware stack remained constant, eliminating platform variance as a confounding factor. The benchmark parameters—context sizes (100 k for Qwen3‑Coder‑Next, 180 k for Qwen3.5‑35B‑A3B, 131 k for the 9 B models), flash‑attention enabled, and identical cache quantizations (q8_0 for keys and values)—were mirrored between the two binaries. The only differentiator was the backend implementation: ik_llama.cpp incorporates a custom kernel path for MoE routing and a modified token cache layout, whereas llama.cpp relies on the upstream reference kernels. The data therefore points to ik_llama’s kernel tweaks delivering measurable gains on models with moderate MoE depth, while the reference implementation retains superiority on ultra‑large contexts where memory bandwidth and CPU‑MoE coordination dominate.
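The mirrored parameters can be summarized in a small configuration table. Note that the key names below are descriptive placeholders chosen for this sketch, not the literal CLI flags of either binary, and the per-model pairings simply restate the settings listed above:

```python
# Benchmark parameters mirrored between llama.cpp and ik_llama.cpp,
# per the report above. Context sizes vary by model; the remaining
# settings were identical across all runs.
MODEL_CONTEXTS = {
    "Qwen3-Coder-Next": 100_000,
    "Qwen3.5-35B-A3B": 180_000,
    "Qwen3.5-9B": 131_000,
    "Crow-9B": 131_000,
}

COMMON = {
    "flash_attention": True,   # enabled for both backends
    "cache_type_k": "q8_0",    # key-cache quantization
    "cache_type_v": "q8_0",    # value-cache quantization
}

# Expand into one config dict per model run.
runs = {model: {"context": ctx, **COMMON}
        for model, ctx in MODEL_CONTEXTS.items()}
print(runs["Qwen3.5-35B-A3B"])
```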
The broader implication for developers is nuanced. For workloads centered on code‑generation or reasoning tasks that employ Qwen3‑Coder‑Next‑style MoE models, ik_llama.cpp offers a clear speed advantage without sacrificing generation quality. Conversely, enterprises deploying the full‑scale Qwen3.5‑35B‑A3B model should expect better prompt latency from llama.cpp, though they may benefit from ik_llama’s more tolerant GGUF loader and slightly higher generation throughput. As the Qwen3 family continues to mature—Alibaba’s open‑source releases have already matched top‑tier proprietary benchmarks (The Decoder, 2025)—the choice of inference engine will likely hinge on model size, MoE topology, and the specific balance between prompt latency and token‑generation speed.
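The prompt-latency versus generation-speed trade-off can be made concrete with a back-of-envelope latency model. The workload shape below (prompt and output lengths) is hypothetical; the throughput figures are the Crow-9B (Q6_K) numbers reported above:

```python
def total_seconds(prompt_tokens, output_tokens, pp_tps, tg_tps):
    """End-to-end latency: prompt processing plus autoregressive generation."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps


# Crow-9B (Q6_K) throughput figures from the benchmark above;
# the 20k-prompt / 1k-output workload shape is illustrative.
ik = total_seconds(20_000, 1_000, pp_tps=4_100, tg_tps=73.2)
up = total_seconds(20_000, 1_000, pp_tps=3_850, tg_tps=81.7)

print(f"ik_llama.cpp: {ik:.1f}s  llama.cpp: {up:.1f}s")
```

For this shape, llama.cpp's faster decoder outweighs ik_llama's faster prompt pass; with these particular rates, ik_llama only pulls ahead end-to-end when the prompt dwarfs the output by roughly ninety to one, which is why the right engine depends on the workload, not just the headline numbers.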
Sources
No primary source found (coverage-based)
- Reddit: r/LocalLLaMA
Reporting based on verified sources and public filings. Sector HQ editorial standards require multi-source attribution.