Google's AI Overviews Top 90% Accuracy, New Independent Study Shows
91% of Google's AI Overviews were correct, the study found, with Gemini 3 hitting that mark versus 85% for Gemini 2, The Decoder reports.
Key Facts
- Key company: Google
Google's AI Overviews have nudged past the 90-percent accuracy threshold, but the triumph comes with a paradox: the answers are harder to trace back to reliable sources. The Oumi study, commissioned by the New York Times and detailed by The Decoder, evaluated 4,326 real-world queries using the SimpleQA benchmark. In the October round, Gemini 2 delivered correct overviews 85 percent of the time; by February, after the upgrade to Gemini 3, that figure rose to 91 percent. On paper, a six-point jump looks impressive, yet at Google's scale even a 9 percent error rate translates into millions of mis-answers per hour.
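That scale claim is simple back-of-the-envelope arithmetic. A minimal sketch in Python, assuming a hypothetical daily volume of two billion queries that trigger an AI Overview (neither the study nor Google reports an exact figure):

```python
# Back-of-the-envelope: how a 9% error rate scales with query volume.
# The daily volume below is an assumption for illustration only, not a
# figure from the Oumi study or from Google.
DAILY_QUERIES_WITH_OVERVIEWS = 2_000_000_000  # hypothetical
ERROR_RATE = 0.09  # 100% minus the 91% accuracy reported for Gemini 3

errors_per_day = DAILY_QUERIES_WITH_OVERVIEWS * ERROR_RATE
errors_per_hour = errors_per_day / 24

print(f"Errors per day:  {errors_per_day:,.0f}")   # 180,000,000
print(f"Errors per hour: {errors_per_hour:,.0f}")  # 7,500,000
```

Even if the true volume is a fraction of that assumption, the hourly error count stays in the millions.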
The researchers didn't stop at raw correctness. They also measured "verifiability": whether the linked sources actually support the AI's claim. Here the numbers swing the other way. With Gemini 2, 37 percent of the correct answers were "ungrounded," meaning the cited pages didn't fully back the response. Gemini 3, despite its higher accuracy, saw that share climb to 56 percent, according to Oumi's analysis. In other words, more than half of the right answers now come from sources that either don't confirm the claim or are too vague to be useful. The study's verification engine, HallOumi, flagged a litany of weak citations, from Facebook posts to Reddit threads, which the researchers suggest Google favors because such sources are less likely to sue over content reuse.
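Note that the two headline figures measure different populations: accuracy is computed over all answers, while the ungrounded share is computed only among the correct ones. A minimal sketch of how such an audit could be tallied, with hypothetical field names and toy data that roughly mirror the reported Gemini 3 proportions (this is not HallOumi's actual interface or the study's raw data):

```python
from dataclasses import dataclass

@dataclass
class AuditRecord:
    """One evaluated query; field names are hypothetical, for illustration."""
    answer_correct: bool   # does the overview state the right fact?
    sources_support: bool  # do the cited pages actually back the claim?

def summarize(records: list[AuditRecord]) -> dict[str, float]:
    correct = [r for r in records if r.answer_correct]
    # Accuracy: share of all answers that are factually right.
    accuracy = len(correct) / len(records)
    # "Ungrounded": correct answers whose citations do not support them.
    ungrounded = sum(1 for r in correct if not r.sources_support)
    return {
        "accuracy": accuracy,
        "ungrounded_share_of_correct": ungrounded / len(correct),
    }

# Toy data shaped to echo the reported Gemini 3 numbers (about 91%
# correct, 56% of correct answers ungrounded) -- not the study's data.
records = (
    [AuditRecord(True, False)] * 56    # correct but unsupported by citations
    + [AuditRecord(True, True)] * 44   # correct and grounded
    + [AuditRecord(False, False)] * 10 # incorrect
)
print(summarize(records))
# {'accuracy': 0.909..., 'ungrounded_share_of_correct': 0.56}
```

The split explains how both trends can hold at once: a model can answer more questions correctly while citing pages that back those answers less often.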
The report also highlights concrete slip-ups that illustrate the verification gap. When asked about the Classical Music Hall of Fame, the system located the correct Yo-Yo Ma listing but still reported that no record existed. A query about the river west of Goldsboro, North Carolina, pulled the right tourism site but misread the data, naming the Neuse River instead of the Little River. Even a seemingly simple fact, the opening year of the Bob Marley Museum, was off by a year: Gemini 3 stitched together a Facebook post, a travel blog, and a conflicting Wikipedia entry to produce "1987" instead of the correct "1986." These examples, cited by The Decoder, show that the AI can latch onto the right page yet still misinterpret or mis-aggregate the information.
Google has pushed back, saying in a brief statement released after the study's publication that the methodology has "serious holes." The critique centers on Oumi's reliance on an internal verification model rather than human fact-checkers, a point the researchers acknowledge but defend as the only scalable way to audit thousands of responses. Google's own disclaimer, "AI responses may include mistakes," now feels less like a caution and more like a pre-emptive legal shield, especially as the volume of erroneous answers grows with user traffic.
The broader implication is a trade-off between headline-grabbing accuracy and the transparency that underpins trust. If a user can't verify an answer because the linked source is a social-media post or a low-quality blog, the utility of a correct response diminishes. As Oumi's data suggests, the "correctness" metric may overstate how well the product actually serves users. For a product that sits at the front door of the web, the stakes are high: millions of people will act on these overviews every day, and a single misstep can ripple across news cycles, commerce, and public discourse. The study forces Google, and the industry at large, to reckon with the fact that hitting 90 percent accuracy is only half the battle; grounding those answers in verifiable, high-quality sources is the other half that still feels out of reach.
Sources
- The Decoder, reporting on the Oumi study (commissioned by the New York Times) evaluating Google's AI Overviews on the SimpleQA benchmark.