Three GPTQ‑Int4 Qwen models—35B‑A3B, 27B, and 397B‑A17B—arrive on Hugging Face.
Photo by Annie Spratt (unsplash.com/@anniespratt) on Unsplash
While many expected only incremental updates, three new Qwen models just landed on Hugging Face—35B‑A3B, 27B, and 397B‑A17B—each quantized with GPTQ‑Int4, the model listings indicate.
Key Facts
- Key company: Qwen
- Also mentioned: Hugging Face
Hugging Face’s model hub now hosts three new Qwen‑3.5 variants that push the limits of 4‑bit inference. The 35‑billion‑parameter “A3B” model, a 27‑billion‑parameter baseline, and a massive 397‑billion‑parameter “A17B” mixture‑of‑experts have all been quantized with GPTQ‑Int4, a technique that compresses weights to 4‑bit integer format while preserving most of the original accuracy. According to the model cards posted on Hugging Face, each release is tagged as “endpoints_compatible” and licensed under Apache‑2.0, signaling that developers can deploy them behind Hugging Face’s inference endpoints without custom serving code (Hugging Face model listings).
All three models share a common pipeline configuration: “image‑text‑to‑text” under the Transformers library, indicating they can handle multimodal prompts that combine visual and textual inputs. The 35B‑A3B and 397B‑A17B entries also carry the “qwen3_5_moe” tag, indicating that both use a mixture‑of‑experts architecture that routes each token through a subset of expert sub‑networks; by Qwen’s naming convention, the “A3B” and “A17B” suffixes denote roughly 3 billion and 17 billion active parameters per token. The model cards list the base models (Qwen/Qwen3.5‑35B‑A3B, Qwen/Qwen3.5‑27B, Qwen/Qwen3.5‑397B‑A17B) and their quantized counterparts, making the provenance clear for users who need to trace back to the original full‑precision checkpoints.
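For readers unfamiliar with the “image‑text‑to‑text” pipeline, the sketch below shows the chat‑style message format that Transformers multimodal chat templates accept, where a single user turn mixes an image part and a text part. The helper function, image URL, and question are illustrative placeholders, not taken from the model cards:

```python
# Minimal sketch of a multimodal ("image-text-to-text") chat message in the
# structure used by Transformers chat templates: a list of turns, each with a
# "content" list mixing typed parts. All values here are placeholders.
def build_multimodal_prompt(image_url: str, question: str) -> list[dict]:
    """Build a single-turn chat message combining an image and a text part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]

messages = build_multimodal_prompt(
    "https://example.com/chart.png",
    "Summarize the trend shown in this chart.",
)
```

A payload like this would typically be passed to a processor’s chat template before inference; the exact keys a given model expects should be checked against its model card.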
Despite the headline‑grabbing parameter counts, the new models have yet to see any community adoption. The Hugging Face download counters all read zero, and the “likes” metric shows only a single endorsement for the 35B‑A3B release, with none for the 27B or 397B‑A17B models. This suggests that the releases are still in an early testing phase, likely intended for internal benchmarking or for developers who need to experiment with extreme‑scale inference on limited hardware. The lack of traffic also mirrors the broader market trend where 4‑bit quantization is still a niche technique, primarily used by researchers and hobbyists who can tolerate the trade‑offs in latency and accuracy for the benefit of fitting massive models on consumer‑grade GPUs.
The quantization method itself—GPTQ‑Int4—has been highlighted in recent community discussions as a practical way to run large language models on modest compute. While the source material does not provide performance benchmarks, the “4‑bit” and “gptq” tags imply that the models were processed with the GPTQ algorithm, which quantizes weights layer by layer while adjusting the remaining weights to minimize reconstruction error in each layer’s outputs. By publishing the models in safetensors format, Hugging Face ensures that the files are both space‑efficient and safe to load, a detail noted in each model’s metadata. The “region:us” tag further indicates that the files are hosted on U.S. servers, which could reduce latency for developers in North America.
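To make the 4‑bit idea concrete, here is a toy round‑to‑nearest int4 quantizer. This is a simplified sketch, not the actual GPTQ algorithm (GPTQ additionally compensates for quantization error across weights); it only illustrates the core trade‑off of mapping floats into the 16 values an int4 can hold:

```python
# Toy illustration of symmetric 4-bit weight quantization (NOT GPTQ itself):
# each float is scaled into the int4 range [-8, 7] and rounded, so every
# weight is stored as a 4-bit code plus one shared scale factor.
def quantize_int4(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to int4 codes plus a shared scale."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid zero scale
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int4(codes: list[int], scale: float) -> list[float]:
    """Reconstruct approximate floats from int4 codes."""
    return [c * scale for c in codes]

weights = [0.12, -0.53, 0.70, -0.08]
codes, scale = quantize_int4(weights)
restored = dequantize_int4(codes, scale)
# `codes` all fit in 4 bits; `restored` approximates `weights` within one
# quantization step (here about 0.05).
```

Real GPTQ checkpoints group weights into blocks with per‑group scales and pick quantized values to minimize layer output error, which is why they retain far more accuracy than naive rounding at the same bit width.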
From a strategic standpoint, Hugging Face’s rollout of these three models underscores its commitment to democratizing access to cutting‑edge AI. By offering both a mid‑range 27B model and the flagship 397B mixture‑of‑experts version, the company caters to a spectrum of use cases—from research labs that need a relatively lightweight yet powerful multimodal model to enterprises willing to invest in the infrastructure required for the largest variant. The open‑source licensing and endpoint compatibility also lower the barrier for integration into existing products, a move that aligns with Hugging Face’s broader mission to make advanced models “as easy to use as a library function.”
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.