Hugging Face launches three new Qwen models—27B, 35B, and 122B—all optimized with FP8
Photo by Becky Geist (unsplash.com/@proaudiovoices) on Unsplash
27 billion, 35 billion and 122 billion parameters—Hugging Face unveiled three new Qwen models optimized with FP8, according to a recent report.
Quick Summary
- 27 billion, 35 billion and 122 billion parameters—Hugging Face unveiled three new Qwen models optimized with FP8, according to a recent report.
- Key company: Qwen
- Also mentioned: HuggingFace
Hugging Face’s latest model listings signal a strategic push to democratize high‑performance inference for large language models (LLMs) that have traditionally required costly, specialized hardware. The three new Qwen variants—Qwen3.5‑27B‑FP8, Qwen3.5‑35B‑A3B‑FP8, and Qwen3.5‑122B‑A10B‑FP8—are all published on the Hugging Face Model Hub under the Qwen organization and are tagged “fp8,” indicating their weights have been quantized to 8‑bit floating‑point precision (see the model listings on Hugging Face). Because FP8 stores each weight in one byte rather than FP16’s two, the models can run up to three times faster on compatible GPUs while needing roughly half the memory of their FP16 counterparts, a claim supported by the technical documentation accompanying each release.
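To make the quantization claim concrete, here is a back‑of‑envelope comparison of weight storage at FP16 (2 bytes per parameter) versus FP8 (1 byte per parameter) for the three parameter counts in the announcement. This is a sketch of the arithmetic only; real deployments also need memory for activations and the KV cache, which it ignores:

```python
# Approximate weight storage for the three Qwen variants, comparing
# FP16 (2 bytes/param) with FP8 (1 byte/param). Parameter counts are
# taken from the model names; checkpoint overhead is not modelled.

def weight_memory_gb(params_billions: float, bytes_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bytes_per_param / 1e9

for name, params in [("Qwen3.5-27B", 27),
                     ("Qwen3.5-35B-A3B", 35),
                     ("Qwen3.5-122B-A10B", 122)]:
    fp16 = weight_memory_gb(params, 2)
    fp8 = weight_memory_gb(params, 1)
    print(f"{name}: FP16 ~ {fp16:.0f} GB, FP8 ~ {fp8:.0f} GB")
```

For the 122‑billion‑parameter model this works out to roughly 244 GB of weights at FP16 versus 122 GB at FP8, which is where the memory savings come from.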
The 27‑billion‑parameter model is positioned as an “image‑text‑to‑text” pipeline, enabling multimodal tasks such as caption generation and visual question answering. Its counterpart, the 35‑billion‑parameter variant, is a mixture‑of‑experts (MoE) model—the “A3B” suffix indicates that only about 3 billion parameters are activated per token—a design that typically improves scaling efficiency without proportional increases in compute cost. The largest offering, Qwen3.5‑122B‑A10B‑FP8, expands the MoE architecture further, activating roughly 10 billion parameters per token while delivering a total model size comparable to the most powerful commercial LLMs and retaining the FP8 memory savings. All three models are released under the Apache‑2.0 license, making them freely reusable for commercial and research purposes, and each is marked “endpoints_compatible,” suggesting they can be deployed through Hugging Face’s Inference Endpoints service without additional modification.
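The routing idea behind a mixture‑of‑experts layer can be sketched in a few lines: a router scores all experts and only the top‑k run for each token, so the active parameter count is a small fraction of the total. The expert count, top‑k value, and per‑expert size below are illustrative assumptions, not the actual Qwen3.5 configuration:

```python
import random

def route(router_scores: list[float], k: int) -> list[int]:
    """Return indices of the k highest-scoring experts."""
    order = sorted(range(len(router_scores)),
                   key=lambda i: router_scores[i], reverse=True)
    return order[:k]

# Hypothetical configuration: 64 experts, 4 active per token.
num_experts, k = 64, 4
params_per_expert = 1.5e9  # hypothetical per-expert size

total = num_experts * params_per_expert
active = k * params_per_expert

scores = [random.random() for _ in range(num_experts)]
print("experts used for this token:", route(scores, k))
print(f"total {total/1e9:.0f}B params, active {active/1e9:.0f}B per token")
```

This is why a model can advertise a very large total parameter count while its per‑token compute resembles a much smaller dense model.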
Alibaba’s involvement provides context for the timing of these releases. CNBC reported that Alibaba unveiled Qwen3.5 as part of China’s accelerating race to develop AI agents capable of more sophisticated conversational behavior (CNBC). While the report does not detail the FP8 quantization, the parallel rollout underscores a broader industry trend: Chinese AI firms are rapidly iterating on large‑scale transformer architectures and now exporting them to open‑source platforms. Hugging Face’s hosting of the Qwen models effectively bridges the gap between proprietary Chinese research and the global developer community, allowing practitioners worldwide to experiment with state‑of‑the‑art multimodal LLMs without needing direct access to Alibaba’s internal infrastructure.
From a market perspective, the FP8‑optimized Qwen series could reshape cost structures for enterprises seeking to embed large language capabilities into products. Traditional deployment of 100‑billion‑parameter models often requires clusters of high‑end GPUs, driving up both capital expenditure and operational overhead. By halving the memory demand and boosting throughput, FP8 models lower the barrier to entry for midsize firms and startups that lack deep pockets but still need cutting‑edge performance. The fact that Hugging Face reports zero downloads and only a single “like” for each model at the time of publication reflects the early stage of adoption, but the platform’s analytics suggest that visibility can increase rapidly once the models are integrated into popular inference services.
Finally, the technical choices embedded in the Qwen releases illustrate an emerging consensus on quantization as a viable path to scale. The use of the safetensors format across all three models points to a focus on security and efficiency: unlike Python pickle files, safetensors cannot execute arbitrary code on load, and the format supports fast, memory‑mapped loading. Moreover, the consistent tagging of “image‑text‑to‑text” and “conversational” pipelines indicates that these models are intended to serve both vision‑language and dialogue applications, a dual capability that aligns with the growing demand for multimodal AI assistants. As the ecosystem continues to coalesce around open‑source, FP8‑quantized LLMs, the Qwen series may become a reference point for future collaborations between cloud providers, AI labs, and the broader developer community.
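The safetensors on‑disk layout is simple enough to sketch in pure Python: an 8‑byte little‑endian header length, a JSON header describing each tensor, then a flat byte buffer. The minimal writer/reader below is a simplified illustration of that layout (treating each tensor as raw bytes), not a replacement for the safetensors library, but it shows why loading is fast and, unlike pickle, involves no code execution:

```python
import json
import struct

def write_st(path: str, tensors: dict[str, bytes]) -> None:
    """Write tensors (as raw bytes) in a simplified safetensors-style layout."""
    header, buf, off = {}, b"", 0
    for name, raw in tensors.items():
        header[name] = {"dtype": "U8", "shape": [len(raw)],
                        "data_offsets": [off, off + len(raw)]}
        buf += raw
        off += len(raw)
    hdr = json.dumps(header).encode()
    with open(path, "wb") as f:
        # 8-byte little-endian header length, JSON header, then the data buffer.
        f.write(struct.pack("<Q", len(hdr)) + hdr + buf)

def read_st(path: str) -> dict[str, bytes]:
    """Read the layout back: parse the JSON header, then slice the buffer."""
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(n))
        buf = f.read()
    return {k: buf[v["data_offsets"][0]:v["data_offsets"][1]]
            for k, v in header.items()}

write_st("demo.safetensors", {"w": b"\x01\x02\x03", "b": b"\x04"})
print(read_st("demo.safetensors"))
```

Because the header fully describes every tensor's offsets, a real loader can memory‑map the buffer and hand out tensor views without copying, which is the efficiency the article alludes to.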
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.