llama-bench Puts Qwen 3.5 35B MoE Through a 100K Context at 40+ TPS on an RTX 5060 Ti
Photo by Max Corahua P. (unsplash.com/@maxcorahua) on Unsplash
llama-bench results show Qwen 3.5 35B MoE handling a 100k-token context at 696.6 TPS read and generating 41.35 TPS on an RTX 5060 Ti (15.6 GiB), according to a recent report.
Quick Summary
- llama-bench results show Qwen 3.5 35B MoE handling a 100k-token context at 696.6 TPS read and generating 41.35 TPS on an RTX 5060 Ti (15.6 GiB), according to a recent report.
- Key players: llama.cpp (llama-bench) and Alibaba (Qwen)
The latest llama-bench run puts Alibaba's Qwen 3.5 35B Mixture-of-Experts (MoE) model through a 100k-token window on a single consumer-grade GPU, delivering prompt-ingestion ("read") speeds of 696.6 tokens per second (TPS) and generation rates of 41.35 TPS. The figures come directly from the benchmark report, which logs a read throughput of 696.60 ± 1.41 TPS and a generation throughput of 41.35 ± 0.18 TPS for a run labeled pp100000 / tg720, i.e., a 100,000-token prompt-processing pass followed by a 720-token generation slice [report].
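The reported throughputs translate directly into wall-clock time for a single pp100000/tg720 pass; the sketch below uses only the two figures from the report, and everything else is plain division:

```python
# Convert the reported llama-bench throughputs into end-to-end latency
# for one pp100000/tg720 pass. The TPS values are from the article.

PROMPT_TOKENS = 100_000   # pp100000: prompt (context) tokens processed
GEN_TOKENS = 720          # tg720: tokens generated after the prompt
READ_TPS = 696.60         # reported prompt-processing throughput
GEN_TPS = 41.35           # reported generation throughput

prefill_s = PROMPT_TOKENS / READ_TPS   # time to ingest the full context
gen_s = GEN_TOKENS / GEN_TPS           # time to produce the 720-token slice

print(f"prefill: {prefill_s:.1f} s, generation: {gen_s:.1f} s, "
      f"total: {prefill_s + gen_s:.1f} s")
```

In other words, even at near-700 TPS read speed, ingesting the full 100k-token context dominates the run at roughly two and a half minutes, while the 720-token generation slice takes well under half a minute.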
The test rig is a desktop-class platform built around an AMD Ryzen 7 9700X CPU reporting a peak clock of 5.55 GHz, paired with an NVIDIA GeForce RTX 5060 Ti running at 3.09 GHz and exposing 15.59 GiB of VRAM. System memory totals 47.61 GiB, of which 8.74 GiB is in use during the benchmark, roughly 18 % of the pool [report]. The GPU is exposed through a “GameViewer Virtual Display Adapter,” suggesting a headless or remote-display configuration of the kind often used for server-side inference.
The 35-billion-parameter MoE architecture splits its feed-forward layers into multiple expert sub-networks and routes each token to only a small subset of them, which keeps per-token compute modest while still scaling to large context windows. In this configuration, the report indicates the 100k-token context is held in GPU memory, a notable feat given the RTX 5060 Ti’s 15.6 GiB limit. The benchmark’s “read” metric reflects how quickly the model can ingest and process the full context, while the “gen” metric measures the sustained token-generation rate once the context is loaded [report].
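The routing idea can be illustrated with a generic top-k gating sketch. This is not Qwen's actual router; the expert count, gate logits, and top_k value below are made up for illustration:

```python
import math

def route_token(gate_logits, top_k=2):
    """Toy top-k MoE router: softmax over per-expert gate logits, then
    keep only the top_k experts and renormalize their weights. Only the
    chosen experts run their feed-forward pass, so per-token compute and
    memory traffic scale with top_k, not with the total expert count."""
    m = max(gate_logits)
    exps = [math.exp(x - m) for x in gate_logits]  # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    # indices of the top_k highest-probability experts
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

# hypothetical gate logits for one token over four experts
weights = route_token([2.0, 0.5, 1.5, -1.0], top_k=2)
```

Here only experts 0 and 2 would execute for this token; the other two contribute nothing to compute or memory traffic.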
From a performance‑per‑dollar perspective, the RTX 5060 Ti sits near the low‑end of the current NVIDIA lineup, yet it manages to sustain over 40 TPS on a 35 B‑parameter MoE model—a rate that would typically require a higher‑tier card for dense‑only transformers. The reported throughput suggests that MoE routing efficiently leverages the limited VRAM, activating only a subset of experts per token and thereby reducing the per‑token memory bandwidth demand. This efficiency aligns with the MoE design goal of delivering large‑scale language capabilities on commodity hardware [report].
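A quick roofline estimate shows why activating only a subset of experts matters on a bandwidth-limited card. All three inputs below are assumptions for illustration: the 448 GB/s figure is the commonly cited spec for the 16 GB RTX 5060 Ti, and the active-parameter count and quantization level are guesses, since the report does not state them:

```python
# Back-of-envelope roofline for generation throughput on a GPU that is
# memory-bandwidth bound: each generated token must stream the active
# weights once. Every number here is an assumption, not a measurement.

VRAM_BANDWIDTH_GBPS = 448   # assumed RTX 5060 Ti memory bandwidth (GB/s)
ACTIVE_PARAMS_B = 3.0       # assumed active (routed) params per token, billions
BYTES_PER_PARAM = 0.5       # assumed ~4-bit quantization

bytes_per_token = ACTIVE_PARAMS_B * 1e9 * BYTES_PER_PARAM
roofline_tps = VRAM_BANDWIDTH_GBPS * 1e9 / bytes_per_token
print(f"bandwidth-bound ceiling: {roofline_tps:.0f} tokens/s")
```

The observed 41.35 TPS sits well below this idealized ceiling, which is consistent with real-world overheads such as KV-cache traffic at 100k context, or weights partially resident in system RAM, but the point stands: routing to a few experts cuts per-token bandwidth demand far below what a dense 35B model would need.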
The benchmark also records a build identifier (a96a1120b / 8149), providing a reproducible reference for future comparisons as newer GPUs or driver stacks emerge. While the report does not compare against other models or hardware, the raw numbers establish a baseline: a consumer GPU in the roughly $400 class can ingest a 100k-token context at near-700 TPS and exceed 40 TPS generation on a 35B-parameter MoE model. For developers targeting high-throughput, long-context applications such as document summarization or code analysis, these results suggest that the performance gap between enterprise-grade accelerators and mainstream graphics cards is narrowing, provided the model architecture is MoE-enabled.
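Assuming the figures come from llama.cpp's llama-bench tool (consistent with the pp/tg naming and the build identifier), a run of this shape could be reproduced with an invocation along these lines; the model filename is a placeholder, not a path from the report:

```shell
# Hypothetical llama-bench invocation matching the pp100000/tg720 run:
# -p sets the prompt-processing length, -n the number of generated tokens.
llama-bench -m ./qwen3.5-35b-moe.gguf -p 100000 -n 720
```

llama-bench prints its results, including the mean ± standard deviation TPS columns quoted above, in a table at the end of the run.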
Sources
No primary source found (coverage-based)
- Reddit: r/LocalLLaMA
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.