Nvidia’s Qwen3.5-397B NVFP4 Benchmarks Reveal Top MoE Backend on 4× RTX PRO 6000 GPUs
Photo by Brecht Corbeel (unsplash.com/@brechtcorbeel) on Unsplash
While hype promised 130+ tokens per second on a Blackwell workstation, benchmarks on four RTX PRO 6000 GPUs top out at just 50.5 tok/s decode. The benchmark's author argues this is nonetheless the fastest verified result on this hardware, and traces the earlier 130+ tok/s claims to broken CUTLASS kernels.
Key Facts
- Key company: Nvidia
The benchmark suite, compiled by an independent researcher who spent more than eight hours testing every available mixture‑of‑experts (MoE) backend on a four‑GPU RTX PRO 6000 Blackwell workstation, shows a sustained decode speed of 50.5 tokens per second (tok/s) using the Marlin W4A16 backend with tensor‑parallelism = 4 and no multi‑token prediction (MTP). This figure, the author argues, is likely the highest anyone has achieved on SM120 hardware, directly contradicting community claims of 130+ tok/s that were based on broken CUTLASS kernels (source: benchmark report).
The test environment consisted of four RTX PRO 6000 GPUs (each with 96 GB GDDR7, SM 12.0 architecture), a Threadripper 24‑core/48‑thread CPU, 512 GB DDR5 RAM, PCIe Gen5 connectivity, and Windows 11 + WSL2. The model evaluated was NVIDIA’s NVFP4‑quantized Qwen 3.5‑397B‑A17B (≈140 GB, 397 B parameters, 17 B active per token). Across 16 configurations—including multiple Docker images, two inference frameworks, every MoE backend, and variations of EP/PP/TP and MTP—the Marlin TP = 4, no‑MTP setup emerged as the clear winner (50.5 tok/s), while the next‑best Marlin TP = 2 + PP = 2 achieved 49 tok/s (source: benchmark table).
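Back-of-envelope arithmetic shows why the TP = 4 configuration fits comfortably on this hardware. The figures below come from the report; the even-sharding assumption is a simplification of how tensor parallelism actually splits layers:

```python
# Rough per-GPU memory budget for the TP=4 run, using the report's numbers.
# Assumes the ~140 GB of NVFP4 weights shard roughly evenly across ranks,
# which is a simplification, not the exact layer-by-layer split.
model_gb = 140        # NVFP4-quantized Qwen3.5-397B-A17B checkpoint size
num_gpus = 4          # RTX PRO 6000 cards
vram_per_gpu_gb = 96  # GDDR7 per card

weights_per_gpu_gb = model_gb / num_gpus            # 35.0 GB of weights per GPU
headroom_gb = vram_per_gpu_gb - weights_per_gpu_gb  # 61.0 GB for KV cache etc.
print(weights_per_gpu_gb, headroom_gb)
```

With roughly 61 GB per card left over for KV cache and activations, memory is not the bottleneck here; the ceiling comes from the kernel path, as the CUTLASS failures below show.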
All attempts to leverage NVIDIA’s FlashInfer CUTLASS backend failed. The best‑case CUTLASS Docker run managed only 41 tok/s and skipped 80 fast kernels, while the worst case fell to 26 tok/s. Native CUTLASS runs produced nonsensical output at roughly 5 tok/s, and the default auto‑backend configuration yielded garbage at 6‑7 tok/s. The root cause, according to the researcher, is a bug in the CUTLASS library: on SM 12.0 (the desktop Blackwell SKU), every grouped GEMM kernel that should activate FP4 tensor cores aborts during initialization with the error “Failed to initialize cutlass TMA WS grouped gemm.” This forces the workload onto Marlin, which de‑quantizes FP4 weights to FP16, sacrificing roughly half the theoretical throughput (source: benchmark report, issue #3096).
The failure is architecture‑specific. SM 12.1 GPUs—such as those in NVIDIA’s DGX Spark—run the same NVFP4‑quantized MoE model without issue, delivering 356 TFLOPS of native FP4 performance. The researcher filed a CUTLASS issue (#3096) but has yet to receive a response from NVIDIA, leaving workstation users without a viable path to the advertised 130+ tok/s speeds (source: benchmark report).
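The SM-version split can be captured in a simple guard. This is an illustration of the report's finding (with a hypothetical helper name), not an official support matrix; in practice the capability tuple would come from `torch.cuda.get_device_capability()`:

```python
def cutlass_fp4_grouped_gemm_ok(capability: tuple) -> bool:
    """Heuristic based on the benchmark report and CUTLASS issue #3096:
    the TMA warp-specialized grouped-GEMM FP4 kernels abort during
    initialization on SM 12.0 (desktop Blackwell) but run on SM 12.1
    (DGX Spark). Any other architecture is outside this report's scope."""
    return capability != (12, 0)

print(cutlass_fp4_grouped_gemm_ok((12, 0)))  # RTX PRO 6000: falls back to Marlin
print(cutlass_fp4_grouped_gemm_ok((12, 1)))  # DGX Spark: native FP4 path works
```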
Multi‑token prediction, which is expected to improve throughput by processing several tokens in parallel, actually degrades performance on this platform. With MTP = 2 enabled, the Marlin backend drops to 39‑40 tok/s—a 22 % regression—because the de‑quantized FP16 activations differ from the FP4‑native expectations, reducing acceptance rates to 61‑85 % versus the intended 89 % and adding speculative overhead (source: benchmark report).
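The acceptance-rate effect can be sketched with a standard speculative-decoding model (a simplification, not the report's methodology): with k drafted tokens each accepted with probability alpha, and commits stopping at the first rejection, the expected tokens committed per verification step shrink quickly as alpha falls.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens committed per verification step when k speculative
    tokens are drafted, each accepted with probability alpha, and the chain
    stops at the first rejection: 1 + alpha + alpha^2 + ... + alpha^k."""
    return sum(alpha ** i for i in range(k + 1))

# Report's numbers: 89% intended acceptance vs. 61-85% observed under
# Marlin's de-quantized FP16 activations, with MTP drafting 2 tokens.
intended = expected_tokens_per_step(0.89, 2)  # ~2.68 tokens/step
observed = expected_tokens_per_step(0.61, 2)  # ~1.98 tokens/step
print(intended, observed)
```

Each MTP verification step also costs more than a plain decode step, so whether speculation pays off depends on that per-step overhead; at these reduced acceptance rates the report indicates the overhead outweighed the gain, consistent with the measured drop from 50.5 to 39‑40 tok/s.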
Community claims of 130‑150 tok/s on the same hardware, often cited from custom SGLang or vLLM forks, were examined and found to contain no kernel‑level modifications; the purported gains stem from altered Python‑level quantization settings that do not translate into real‑world speedups on SM 12.0 (source: benchmark report). In summary, the data indicate that, until NVIDIA resolves the CUTLASS initialization bug for SM 12.0, the practical ceiling for Qwen 3.5‑397B‑NVFP4 inference on a four‑GPU RTX PRO 6000 workstation sits at roughly 50 tok/s, far below the hype‑driven expectations.
Sources
No primary source found (coverage-based)
- Reddit - r/LocalLLaMA
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.