DeepSeek’s Open‑Source Models Match or Surpass Claude Opus 4.6 on Four of Five Key Benchmarks
Photo by Solen Feyissa (unsplash.com/@solenfeyissa) on Unsplash
While many still view Claude Opus 4.6 as the production benchmark, a recent report shows open‑source models now match or outpace it on four of five major tests, with DeepSeek V3.2 leading the charge.
Key Facts
- Key company: DeepSeek
DeepSeek V3.2’s performance on general‑reasoning benchmarks signals a turning point for open‑source LLMs. On SWE‑bench Verified, the model resolved 73.0 % of problems versus Claude Opus 4.6’s 80.8 %, a gap that narrows dramatically on LiveCodeBench, where DeepSeek posts 74.1 % against Opus’s 76 %, per the report’s author. More striking is DeepSeek’s 85.0 % on MMLU‑Pro, ahead of Opus’s 82.0 %, a result the report ties to stronger multilingual competence across CJK, Arabic, and European languages. The model runs with a 128 K‑token context window and sparse attention, at roughly 60 tokens per second with a first‑token latency of 1.18 seconds, making it “production‑ready for 90 %+ of general use cases” while costing five times less than GPT‑5 and twenty times less than Opus 4.6, according to the benchmark analysis.
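Teams that want to check the latency and throughput figures against their own deployment can do so with a short streaming probe over any OpenAI‑compatible endpoint. The sketch below is illustrative only: the base URL, API key, and `deepseek-chat` model name are assumptions for the example, not details confirmed by the report.

```python
import time
from openai import OpenAI

# Hypothetical endpoint, key, and model name -- substitute whichever
# OpenAI-compatible gateway actually serves DeepSeek V3.2 for you.
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

def measure_latency(prompt: str, model: str = "deepseek-chat") -> None:
    """Measure time-to-first-token and rough decode throughput."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices or not chunk.choices[0].delta.content:
            continue  # skip empty keep-alive / final chunks
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1

    end = time.perf_counter()
    print(f"time to first token: {first_token_at - start:.2f}s")
    # Chunk count only approximates token count, but it is close
    # enough to sanity-check a tokens-per-second claim.
    print(f"throughput: ~{chunks / (end - first_token_at):.0f} tokens/s")

measure_latency("Summarize the CAP theorem in two sentences.")
```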
The reasoning‑specialized DeepSeek R1 pushes the envelope further. On Humanity’s Last Exam, a high‑stakes evaluation of abstract problem‑solving, R1 achieved 50.2 % versus Opus’s 40.0 %, and on MMLU‑Pro it posted 88.9 % against Opus’s 82.0 %, per the same source. Inference speed drops to about 30 tokens per second, with a roughly 2‑second time‑to‑first‑token, because the model emits a verbose chain of thought before answering; the trade‑off yields depth that “matches GPT‑5.2 Pro on HLE” while being thirty times cheaper than the proprietary o1 model. The report therefore positions R1 as the “best open‑source reasoning model,” a claim that could reshape enterprise choices for tasks requiring rigorous logical deduction.
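That verbosity is visible at the API level: reasoner‑style endpoints typically return the chain of thought alongside the final answer, which is what inflates latency and token bills. Below is a minimal sketch, assuming a DeepSeek‑style API that exposes the trace in a separate `reasoning_content` field; the endpoint, the `deepseek-reasoner` model identifier, and the field name are assumptions here, not values from the report.

```python
from openai import OpenAI

# Hypothetical endpoint and key, as in the earlier sketch.
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

# "deepseek-reasoner" is an assumed identifier for R1-style inference.
response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "Is 2**31 - 1 prime? Answer briefly."}],
)

message = response.choices[0].message
# Reasoner-style APIs often expose the chain of thought as a separate
# field, so it can be logged or discarded without touching the answer.
trace = getattr(message, "reasoning_content", None)
if trace:
    print(f"chain of thought: {len(trace)} chars (all billed as output)")
print("final answer:", message.content)
```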
Agentic workloads, where models must orchestrate tool use and sub‑agent coordination, appear to be dominated by Kimi K2.5. Its 1‑trillion‑parameter mixture‑of‑experts architecture activates roughly 32 B parameters per token and supports a 256 K context window under a modified MIT license. When granted tool access, Kimi gains 20.1 benchmark points, eclipsing Opus’s 12.4‑point improvement and GPT‑5.2’s 11‑point gain. Its ability to spawn up to 100 sub‑agents in parallel and handle more than 1,500 tool calls without human oversight is unprecedented among open‑source offerings. Speed is another differentiator: Kimi delivers 334 tokens per second with a 0.31‑second first‑token latency, making it “the fastest large model” tested in the report, while still posting a respectable 76.8 % on SWE‑bench Verified and matching R1’s 50.2 % on Humanity’s Last Exam.
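Under the hood, those 1,500 unattended tool calls reduce to iterations of a simple request‑execute‑respond loop. The sketch below shows a minimal single‑tool version in the common OpenAI function‑calling style; the endpoint, the `kimi-k2.5` model name, and the `read_file` tool are illustrative assumptions, and Kimi’s parallel sub‑agent orchestration would layer scheduling on top of a loop like this.

```python
import json
from openai import OpenAI

# Assumed endpoint and key for illustration only.
client = OpenAI(base_url="https://api.moonshot.ai/v1", api_key="YOUR_KEY")

TOOLS = [{
    "type": "function",
    "function": {
        "name": "read_file",  # illustrative tool, not from the report
        "description": "Return the contents of a file in the repo.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

messages = [{"role": "user", "content": "What does setup.py declare as dependencies?"}]
while True:
    resp = client.chat.completions.create(
        model="kimi-k2.5",  # assumed model identifier
        messages=messages,
        tools=TOOLS,
    )
    msg = resp.choices[0].message
    if not msg.tool_calls:       # model produced a final answer
        print(msg.content)
        break
    messages.append(msg)         # keep the tool request in context
    for call in msg.tool_calls:  # execute each requested call
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": read_file(**args),
        })
```

Tool-use benchmark gains like the 20.1 points cited above come from exactly this pattern: the model decides when grounding itself with a tool result beats answering from parameters alone.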
Coding remains a niche where open‑source models have historically lagged, yet MiniMax M2.5 has quietly entered the top tier, according to the same report. While the report does not list exact scores for MiniMax, its inclusion among the “best coding models” suggests it is competitive with proprietary alternatives on standard programming benchmarks such as SWE‑bench and LiveCodeBench. The development dovetails with broader industry trends highlighted by VentureBeat, which notes a wave of new open‑source releases, from Deep Cogito’s hybrid reasoning models to Z.ai’s GLM‑4.5 family, that collectively raise the baseline for community‑driven AI capabilities.
Taken together, the data underscores a rapid erosion of the performance gap that has long justified premium pricing for closed‑source systems. For enterprises that prioritize cost efficiency without sacrificing accuracy on core reasoning, agentic, or coding tasks, the economics now favor open‑source deployments. The report’s bottom‑line conclusions, that DeepSeek V3.2 is five times cheaper than GPT‑5, DeepSeek R1 is thirty times cheaper than o1, and Kimi K2.5 offers the fastest inference among large models, provide a concrete financial calculus for decision‑makers; a rough sketch of that calculus follows below. Even as Anthropic’s Claude Opus 4.6 continues to dominate headline market share (54 % of the enterprise coding market, according to the author), the emergence of open‑source contenders that meet or exceed its benchmark scores on four of five key tests could catalyze a shift in procurement strategies, especially for firms willing to invest in the engineering overhead required to integrate and fine‑tune these models.
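To make that calculus concrete, the report’s multipliers can be applied to whatever rates a team actually pays. In the sketch below, only the 5x, 20x, and 30x ratios come from the report; the baseline price and token volume are placeholders, not real rate‑card figures.

```python
# Only the multipliers below come from the report; the baseline price
# and monthly volume are HYPOTHETICAL placeholders for illustration.
REPORT_RATIOS = {
    "DeepSeek V3.2 vs GPT-5": 5,
    "DeepSeek V3.2 vs Claude Opus 4.6": 20,
    "DeepSeek R1 vs o1": 30,
}

def monthly_cost(tokens: float, price_per_mtok: float) -> float:
    """Dollar cost for a month's token volume at a given $/Mtok rate."""
    return tokens / 1e6 * price_per_mtok

volume = 2e9           # hypothetical: 2B tokens/month
baseline_price = 10.0  # hypothetical: $10 per million tokens

for pairing, ratio in REPORT_RATIOS.items():
    closed = monthly_cost(volume, baseline_price)
    open_src = closed / ratio  # apply the report's cost multiple
    print(f"{pairing}: ${closed:,.0f}/mo -> ${open_src:,.0f}/mo")
```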
Sources
No primary source found (coverage-based)
- Reddit – r/LocalLLaMA
Reporting based on verified sources and public filings. Sector HQ editorial standards require multi-source attribution.