Nvidia Powers GPU‑Initiated Networking on AWS, Boosting DeepSeek‑V3 via DeepEP on EFA
A new GPU‑Initiated Networking feature lets CUDA kernels bypass the CPU and send RDMA packets directly, and—thanks to AWS Annapurna Labs’ work on the EFA provider—now runs on AWS, enabling multi‑node DeepSeek‑V3 deployments with DeepEP over EFA.
Quick Summary
- NVIDIA's NCCL adds GPU‑Initiated Networking (GIN), letting CUDA kernels send RDMA packets directly, bypassing the CPU; AWS Annapurna Labs adapted the EFA provider so GIN now runs on AWS, enabling multi‑node DeepSeek‑V3 deployments with DeepEP over EFA.
- Key company: Nvidia
NVIDIA’s NCCL library has taken a decisive step forward with its newly unveiled GPU‑Initiated Networking (GIN) capability, which allows CUDA kernels to fire RDMA packets without ever touching the CPU, according to the NCCL release report. The change removes the traditional host‑side bottleneck that has long limited multi‑GPU scaling, especially in large‑language‑model serving, where latency and bandwidth are at a premium. By moving the networking stack onto the GPU, GIN lets each device push data straight into the network fabric, shaving microseconds off each collective operation. According to the same report, the feature was initially demonstrated on on‑premises clusters, but a concerted effort by AWS’s Annapurna Labs team adapted the Elastic Fabric Adapter (EFA) provider to understand and route these GPU‑originated RDMA verbs, making the technology usable on Amazon’s public cloud for the first time.
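The host-side hop that GIN removes can be illustrated with a toy latency model. This is a back-of-the-envelope sketch with assumed per-hop costs (the real feature runs as device-side CUDA code inside NCCL, not Python):

```python
# Toy latency model contrasting CPU-initiated and GPU-initiated sends.
# All per-hop costs are illustrative assumptions, not measured numbers.

HOST_PROXY_US = 5.0  # assumed CPU proxy overhead per operation (microseconds)

def cpu_initiated_us(kernel_us: float, nic_us: float, n_ops: int) -> float:
    # Classic path: the GPU kernel finishes, a host proxy thread posts the
    # RDMA verb, then the NIC transmits. The host hop is paid on every op.
    return n_ops * (kernel_us + HOST_PROXY_US + nic_us)

def gpu_initiated_us(kernel_us: float, nic_us: float, n_ops: int) -> float:
    # GIN-style path: the kernel posts the RDMA write from device code,
    # so the CPU proxy hop disappears from the critical path.
    return n_ops * (kernel_us + nic_us)
```

Under these assumed numbers, the saving is simply the host hop multiplied by the number of operations, which is why the benefit compounds for chatty collectives like MoE all-to-all.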
The practical impact of this integration was showcased in a multi‑node vLLM deployment of DeepSeek‑V3, the Chinese startup’s 671‑billion‑parameter Mixture‑of‑Experts model, running under DeepEP on an AWS HyperPod Slurm cluster. The experiment, detailed in the NCCL report, confirmed that DeepSeek‑V3 could be served across several GPU‑rich instances with end‑to‑end latency comparable to on‑premises setups, thanks to the direct‑to‑network path. The author of the report notes that the combination of GIN and EFA “now works on AWS,” highlighting the seamless hand‑off between NVIDIA’s software stack and AWS’s low‑latency interconnect. While the report does not publish raw throughput numbers, the successful test shows that the previously theoretical gains of GPU‑initiated networking can be realized in production‑grade cloud environments.
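As a rough sketch of what launching such a deployment might look like, the helper below assembles a per-node `vllm serve` command. The flag names follow vLLM's public CLI, but the parallelism values, the model identifier, the head address, and the DeepEP backend environment variable are illustrative assumptions, not the configuration used in the report:

```python
import shlex

# Hypothetical launch helper for a multi-node vLLM deployment of DeepSeek-V3
# with expert parallelism. Values below are placeholders for illustration.

def vllm_serve_cmd(head_addr: str, gpus_per_node: int = 8, num_nodes: int = 2):
    args = [
        "vllm", "serve", "deepseek-ai/DeepSeek-V3",
        "--tensor-parallel-size", str(gpus_per_node),
        "--enable-expert-parallel",               # shard MoE experts across ranks
        "--data-parallel-size", str(num_nodes),   # one DP rank per node (assumed)
        "--data-parallel-address", head_addr,
    ]
    # Assumed knob selecting DeepEP as the MoE all-to-all backend.
    env = {"VLLM_ALL2ALL_BACKEND": "deepep_high_throughput"}
    return env, shlex.join(args)
```

On a Slurm-managed HyperPod cluster, a command like this would typically be wrapped in an `srun`/`sbatch` script, one invocation per node.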
The timing of NVIDIA’s announcement dovetails with broader hardware advances that are reshaping AI inference performance. Tom’s Hardware reported that NVIDIA’s Blackwell Ultra GB300 GPU delivers a 45% uplift in DeepSeek‑R1 inference throughput compared with the prior‑generation GB200, a claim that underscores the synergy between raw compute power and the new networking model. By pairing the GB300’s higher tensor‑core density with GIN‑enabled EFA links, cloud operators can now push more tokens per second per dollar, a metric that matters to enterprises scaling LLM‑as‑a‑service. The report also points out that NVIDIA is positioning the Blackwell family as the “dominant” platform for MLPerf benchmarks, suggesting that the company expects the combined compute‑and‑communication stack to set new performance standards across both training and inference workloads.
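The tokens-per-second-per-dollar framing can be made concrete with a back-of-the-envelope calculation; every number below except the reported 45% uplift is an invented placeholder:

```python
# Back-of-the-envelope tokens-per-dollar comparison. The baseline throughput
# and both hourly prices are invented for illustration; only the 45% uplift
# figure comes from the reporting above.

def tokens_per_dollar(tokens_per_sec: float, hourly_price_usd: float) -> float:
    # Tokens served per dollar of instance time.
    return tokens_per_sec * 3600.0 / hourly_price_usd

base  = tokens_per_dollar(10_000.0, 40.0)          # assumed GB200 baseline
ultra = tokens_per_dollar(10_000.0 * 1.45, 48.0)   # 45% faster, assumed 20% pricier
ratio = ultra / base                               # efficiency ratio between the two
```

Even with an assumed 20% price premium, the 45% throughput uplift would leave roughly a 21% gain in tokens per dollar, which is the kind of arithmetic that drives instance-pricing decisions.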
Industry observers have taken note of the geopolitical undercurrents surrounding DeepSeek’s rapid ascent. CNBC highlighted that NVIDIA has been fielding questions about reports that DeepSeek is using Blackwell chips, a narrative that the company has not publicly refuted but has addressed through technical briefings. The outlet’s coverage indicates that the partnership between DeepSeek and NVIDIA is now more visible, with the DeepSeek‑V3 deployment on AWS serving as a tangible proof point of that collaboration. By enabling DeepSeek‑V3 to run on a public cloud without sacrificing the low‑latency interconnect that was once the exclusive domain of on‑premises data centers, NVIDIA effectively widens the market for its Blackwell GPUs and reinforces its position as the de facto hardware supplier for next‑generation LLMs.
Looking ahead, the convergence of GPU‑initiated networking, high‑throughput EFA, and ever‑more powerful Blackwell GPUs could reshape how AI workloads are provisioned at scale. The NCCL report’s author hints that the same GIN approach could be extended to other collective libraries and even to emerging data‑plane protocols, though no concrete roadmap is disclosed. If the performance gains observed in the DeepSeek‑V3 trial translate to broader workloads—such as retrieval‑augmented generation or multi‑modal inference—cloud providers may begin to price GPU‑heavy instances differently, factoring in the reduced CPU overhead and higher effective bandwidth. For now, the successful AWS integration marks a clear milestone: NVIDIA’s vision of a “network‑enabled GPU” is no longer a research prototype but a production‑ready capability that can accelerate the most demanding AI services today.
Sources
No primary source found (coverage-based)
- Reddit - r/LocalLLaMA
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.