Google finds longer chain‑of‑thought prompts reduce AI accuracy, reporting a –0.54 correlation between reasoning length and performance
Photo by Zulfugar Karimov (unsplash.com/@zulfugarkarimov) on Unsplash
A ‑0.54 correlation shows longer chain‑of‑thought prompts actually lower AI accuracy, with token length inversely tied to performance across eight model variants tested on AIME 2024/2025, HMMT 2025 and GPQA‑Diamond.
Quick Summary
- A ‑0.54 correlation shows longer chain‑of‑thought prompts actually lower AI accuracy, with token length inversely tied to performance across eight model variants tested on AIME 2024/2025, HMMT 2025 and GPQA‑Diamond.
- Key company: Google
Google’s new “Deep Thinking Ratio” (DTR) metric reframes how researchers evaluate chain‑of‑thought prompting, showing that raw token length is a misleading proxy for reasoning quality. In a paper posted to arXiv (https://arxiv.org/abs/2602.13517), Google’s AI team measured the correlation between total token count and answer accuracy across eight model variants—including GPT‑OSS, DeepSeek‑R1, and Qwen‑3—on three high‑stakes benchmark suites: AIME 2024/2025, HMMT 2025, and GPQA‑Diamond. The analysis revealed an average Pearson correlation of –0.54, meaning that longer reasoning chains tended to produce lower scores rather than the expected improvement (Google, 2025). The authors attribute the negative relationship to “spiraling” or “overthinking,” where models generate filler tokens that do not contribute to genuine problem solving.
To separate substantive reasoning from filler, the researchers introduced DTR, which quantifies the proportion of tokens that undergo deep processing in later transformer layers. By tracking how prediction distributions evolve across layers, they label early‑stabilizing tokens—often function words such as “and,” “is,” or “the”—as filler, while tokens whose logits continue to shift in deeper layers are counted as “deep” (Google, 2025). Across the same test set, DTR exhibited a strong positive correlation with accuracy (r = 0.82), outperforming raw length as a predictor by a wide margin. This suggests that the quality of token‑level reasoning, not the sheer quantity, drives performance on complex mathematical and logical problems.
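The paper's exact DTR computation is not reproduced here; as a rough sketch under assumed definitions, one can mark a token as "deep" if its prediction distribution still shifts noticeably (KL divergence above a threshold) across the later layers, and as filler if it stabilizes early. The layer-logit inputs, the `late_start` cutoff, and the KL threshold below are all hypothetical choices:

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def is_deep(layer_logits, late_start, threshold=0.05):
    """A token counts as 'deep' if its distribution keeps shifting across
    the later layers (assumed criterion, not the paper's exact rule)."""
    dists = [softmax(l) for l in layer_logits[late_start:]]
    return any(kl(a, b) > threshold for a, b in zip(dists, dists[1:]))

def deep_thinking_ratio(tokens_layer_logits, late_start=2):
    """Fraction of tokens classified as deep."""
    deep = sum(is_deep(tl, late_start) for tl in tokens_layer_logits)
    return deep / len(tokens_layer_logits)

# Toy example: one token settles early (filler), one keeps moving (deep).
stable   = [[2.0, 0.0, 0.0]] * 4
shifting = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0],
            [0.0, 0.0, 2.0], [2.0, 0.0, 0.0]]
dtr = deep_thinking_ratio([stable, shifting])
```

In this two-token toy, the early-stabilizing token is classed as filler and the late-shifting one as deep, giving a DTR of 0.5.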
The practical payoff of DTR is embodied in the “Think@n” inference strategy, which samples multiple reasoning paths, computes DTR from the first 50 tokens of each path, and discards the lower half of samples before a majority‑vote aggregation. In the authors’ experiments, the approach delivered equal or higher accuracy while cutting compute consumption by roughly 50%. For example, the GPT‑OSS‑120B‑medium model achieved 94.7% accuracy on AIME 2025 under Think@n, compared with 92.7% using a conventional chain‑of‑thought pipeline (Google, 2025). Token usage dropped from 355.6k to 181.9k per inference batch, illustrating how early termination of low‑DTR trajectories can dramatically reduce inference cost.
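The selection-and-vote step of Think@n is simple to sketch. Assuming each sampled path has already been scored with an early DTR estimate (from its first 50 tokens, per the paper's setup) and carries a final answer, the pruning and majority vote might look like this; the data shapes and helper name are illustrative, not from the paper:

```python
from collections import Counter

def think_at_n(paths):
    """Hypothetical Think@n aggregation sketch.

    paths: list of (early_dtr, final_answer) pairs, where early_dtr is
    the DTR estimated from each path's first 50 tokens.
    Keeps the top half of paths by early DTR, then majority-votes
    among the survivors.
    """
    ranked = sorted(paths, key=lambda p: p[0], reverse=True)
    survivors = ranked[: max(1, len(ranked) // 2)]
    votes = Counter(ans for _, ans in survivors)
    return votes.most_common(1)[0][0]

# Four sampled paths: two high-DTR paths agree on "42",
# two low-DTR paths (which get pruned) say "17".
answer = think_at_n([(0.80, "42"), (0.75, "42"), (0.30, "17"), (0.20, "17")])
```

Because the low-DTR paths are dropped before the vote, the compute they would have spent on their remaining tokens is saved, which is the mechanism behind the reported ~50% token reduction.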
These findings have immediate implications for developers running large language models on local hardware or edge devices. By estimating DTR after only 50 tokens, a system can abort unpromising reasoning streams before they consume the bulk of GPU memory and processing time. This early‑exit capability enables higher throughput within a fixed compute budget, a benefit that extends to cloud‑based multi‑agent platforms such as Verdant, which orchestrate several reasoning passes per request. In environments where latency and energy efficiency are paramount, DTR‑guided pruning could become a standard optimization layer for chain‑of‑thought applications.
Google’s work also challenges a long‑standing assumption in the AI community that longer, more verbose explanations inherently signal deeper understanding. The –0.54 correlation reported here runs counter to earlier anecdotal claims that “thinking aloud” improves model performance, and it aligns with recent observations from other groups that excessive token generation can mask model uncertainty. By providing a concrete, layer‑wise diagnostic (DTR) and a scalable inference recipe (Think@n), the paper offers a roadmap for both researchers and practitioners to refine prompting strategies without sacrificing accuracy. As chain‑of‑thought techniques continue to proliferate in education, scientific discovery, and enterprise analytics, the ability to distinguish genuine reasoning from filler will likely become a critical factor in the next generation of AI systems.
Sources
No primary source found (coverage-based)
- Reddit - r/LocalLLaMA
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.