NVIDIA Sets World Record for DeepSeek‑R1 Inference on Single 8‑GPU System
671 billion parameters. That’s the size of DeepSeek‑R1, the model that, according to NVIDIA AI’s official Twitter account, has set a new inference world record on a single 8‑GPU system, now available as a NIM microservice preview.
Quick Summary
- 671 billion parameters: the size of DeepSeek‑R1, which NVIDIA reports has set a new inference world record on a single 8‑GPU system, offered as a NIM microservice preview.
- Key company: NVIDIA
NVIDIA’s latest benchmark shows that a single eight‑GPU Blackwell system can run the 671‑billion‑parameter DeepSeek‑R1 model at 253 tokens per second (TPS) per user, delivering a total system throughput of roughly 30,000 TPS, according to a post on NVIDIA AI’s official Twitter account. The record‑setting run used the NVL8 configuration, which links eight Blackwell GPUs in a tightly coupled topology to maximize inter‑GPU bandwidth and minimize latency. By exposing DeepSeek‑R1 as an NVIDIA NIM (NVIDIA Inference Microservice) preview on build.nvidia.com, the company lets developers experiment with the model in a secure, managed environment while showcasing the raw performance of its newest hardware generation.
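Those two figures give a sense of the concurrency involved. As a rough back‑of‑the‑envelope check (assuming both numbers describe the same serving configuration, which the post does not state explicitly), dividing aggregate throughput by per‑user throughput approximates the number of concurrent streams:

```python
# Back-of-the-envelope concurrency estimate from the two reported figures.
# Assumption (not stated in NVIDIA's post): both numbers come from the
# same serving configuration, so the ratio approximates concurrent users.
per_user_tps = 253      # reported tokens/second seen by a single user
aggregate_tps = 30_000  # reported total system throughput (approx.)

concurrent_streams = aggregate_tps / per_user_tps
print(f"~{concurrent_streams:.0f} concurrent streams at the headline per-user rate")
# -> ~119 concurrent streams
```

The ratio is only indicative: peak per‑user speed and peak aggregate throughput are typically measured at different batch sizes, since per‑user speed falls as concurrency rises.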
The achievement builds on NVIDIA’s co‑design approach, which Bloomberg reports was highlighted by a U.S. lawmaker as a key factor in DeepSeek‑R1’s efficiency. NVIDIA engineers worked closely with the model’s architects to align algorithmic optimizations, framework tweaks, and hardware capabilities, producing an inference pipeline that extracts more work per watt than prior generations. Tom’s Hardware corroborates the gains, noting roughly a 45% increase in inference throughput over NVIDIA’s previous DGX B200 Blackwell result and underscoring the impact of the co‑design effort on real‑world performance.
Beyond raw speed, the record demonstrates the scalability of NVIDIA’s NIM platform. By packaging DeepSeek‑R1 as a microservice, the company abstracts away the complexities of model deployment, letting enterprises spin up specialized agents without managing the underlying GPU fleet. The Twitter thread promoting the preview invites developers to “securely experiment and build your own specialized agents,” signaling NVIDIA’s intent to position the service as a turnkey solution for AI‑driven applications, from conversational assistants to domain‑specific analytics.
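For context on what experimenting with the preview looks like in practice, NIM endpoints hosted on build.nvidia.com are exposed through an OpenAI‑compatible REST API. The sketch below shows a minimal chat request; the endpoint URL and the model identifier deepseek-ai/deepseek-r1 reflect the preview catalog at the time of writing and may change, and NVIDIA_API_KEY is assumed to hold a key generated on build.nvidia.com:

```python
# Minimal sketch of calling the DeepSeek-R1 NIM preview endpoint.
# Assumptions: the endpoint URL and model id match the build.nvidia.com
# preview catalog, and NVIDIA_API_KEY holds a valid key from that site.
import os

from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # NVIDIA's hosted NIM gateway
    api_key=os.environ["NVIDIA_API_KEY"],
)

response = client.chat.completions.create(
    model="deepseek-ai/deepseek-r1",
    messages=[{"role": "user", "content": "Summarize the trade-offs of 671B-parameter inference."}],
    max_tokens=512,
    temperature=0.6,
)
print(response.choices[0].message.content)
```

Because the interface is OpenAI‑compatible, the same client code can later point at a self‑hosted NIM container simply by swapping the base_url.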
Industry observers see the milestone as a litmus test for the broader AI hardware race. While competitors such as AMD and Intel are rolling out their own high‑core‑count accelerators, NVIDIA’s ability to push a 671 B model to 30 k TPS on a single rack‑scale system sets a new performance bar. The record also hints at the practical limits of scaling massive language models; achieving higher throughput now hinges less on adding more GPUs and more on tightening the software‑hardware integration that NVIDIA has refined through its NIM ecosystem.
The DeepSeek‑R1 benchmark arrives as the AI community grapples with the trade‑offs between model size, latency, and cost. At 671 billion parameters, the model sits at the upper end of the current “large‑language‑model” spectrum, yet NVIDIA’s demonstration shows that such scale need not translate into prohibitive inference delays. If the preview service can maintain these throughput figures in production, it could accelerate the adoption of ultra‑large models across enterprises that previously shied away from the computational expense.
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.