Anthropic’s Claude 4.6 Sonnet Tops BullshitBench v2 as Pentagon Taps It for Drone Pilot Phase
Anthropic’s Claude 4.6 Sonnet topped the BullshitBench v2 stress test and was selected for the pilot phase of the Pentagon’s drone-swarm program, reports indicate.
Key Facts
- Key company: Anthropic
Anthropic’s Claude 4.6 Sonnet emerged as the clear winner of BullshitBench v2, a stress test designed to expose hallucinations in large language models, according to the LLM Hallucination Index report. The benchmark, which pits models against deliberately false premises, showed Claude 4.6 Sonnet maintaining an “honesty gap” far narrower than that of its rivals, including the latest GPT‑5.2 and Gemini 3 Pro iterations. The report attributes the model’s advantage to Anthropic’s focus on “truth‑oriented” training objectives rather than the “yes‑man” helpfulness bias that dominates most commercial LLMs (E S, BullshitBench v2 analysis).
The Pentagon’s decision to adopt Claude 4.6 Sonnet for its drone‑swarm program aligns with Anthropic’s longstanding stance on human‑in‑the‑loop oversight. Bloomberg reported that Anthropic entered a $100 million competition earlier this year, proposing a voice‑controlled interface that would translate a commander’s spoken orders into digital commands for autonomous drone fleets. Crucially, the proposal explicitly excluded AI‑driven targeting or weapons decisions, insisting that humans would retain the ability to monitor and abort missions at any point. This approach mirrors Anthropic’s public argument in its dispute with the Department of Defense, where the company has warned that current models lack the reliability required for fully autonomous weapons (Bloomberg, March 3 2026).
Despite the technical merits of Claude 4.6 Sonnet, Anthropic did not secure the contract. Reuters and other outlets confirmed that the award went to a consortium involving SpaceX/xAI and two defense contractors partnered with OpenAI. The decision underscores the Pentagon’s preference for a diversified AI supply chain, blending the proven infrastructure of OpenAI with the aerospace expertise of SpaceX. Nonetheless, the selection of Claude 4.6 Sonnet for the pilot phase signals a strategic foothold for Anthropic within the defense sector, even as the broader competition remains out of reach.
Industry analysts have pointed to the “Reasoning Paradox” highlighted in the BullshitBench v2 findings: larger models with more compute often generate more convincing fabrications rather than self‑correcting. The report notes that models like GPT‑5.2 and Gemini 3 Pro, when given extra inference time, tend to rationalize false premises instead of flagging them. Anthropic’s architecture, by contrast, appears to resist this drift, delivering more consistent factuality under stress. This divergence is reshaping the 2026 “Reliability Hierarchy,” where Anthropic now occupies a leading position, according to the Hallucination Index (E S, March 3 2026).
The Pentagon’s pilot-phase adoption of Claude 4.6 Sonnet also raises broader policy questions. Recent Reuters coverage indicates that the State Department is shifting its AI procurement toward OpenAI, phasing out Anthropic products in other agencies. Meanwhile, Wired has reported political pressure to ban Anthropic from federal contracts altogether. Anthropic’s win on BullshitBench v2 and its partial entry into the defense arena therefore represent both a technical triumph and a precarious foothold in a rapidly politicized market. The company’s emphasis on human oversight may prove decisive as regulators and lawmakers grapple with the ethical boundaries of autonomous systems.
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.