Google faces lawsuit as AI training data includes copyrighted bestseller novels
Photo by ZBRA Marketing (unsplash.com/@zbra) on Unsplash
Torbenkopp reports that leading AI models are trained on copyrighted bestseller novels, enabling near‑verbatim reproductions and prompting a lawsuit that could reshape the tech industry.
Quick Summary
- Torbenkopp reports that leading AI models are trained on copyrighted bestseller novels, enabling near‑verbatim reproductions and prompting a lawsuit that could reshape the tech industry.
- Key company: Google
- Also mentioned: Anthropic, xAI
The Stanford‑Yale study, released this week, provides the first systematic evidence that the world’s largest language models retain verbatim passages from copyrighted novels. By prompting the systems with carefully crafted “jailbreak” queries, the researchers extracted up to 76.8% of Harry Potter and the Sorcerer’s Stone from Google’s Gemini 2.5 and 70.3% of the same text from xAI’s Grok 3, while Anthropic’s Claude 3.7 Sonnet reproduced nearly an entire Hunger Games novel once the model’s safety filters were bypassed. The paper lists additional titles, including A Game of Thrones and The Hobbit, that were similarly recovered from OpenAI’s GPT‑4 and other leading models, demonstrating that the issue is industry‑wide rather than isolated to a single provider.
Google’s 2023 filing with the U.S. Copyright Office asserted that its models contain no direct copies of training material, arguing that the learning process is purely statistical and therefore non‑infringing. The new findings undercut that claim by showing that the models can regenerate large, coherent excerpts that match the original prose word‑for‑word. “If a model can output three‑quarters of a novel with high fidelity, the argument that it merely learned abstract patterns becomes tenuous,” the study’s authors wrote, noting that the reproduced text exceeds what would be expected from a purely statistical approximation.
The legal stakes are already materializing. In a 2025 federal case, a judge ruled that Anthropic’s use of copyrighted works could be considered “transformative” and thus qualify for fair use, but simultaneously held that the storage of pirated content constituted a clear infringement. Anthropic ultimately settled for $1.5 billion after the court’s mixed ruling. German courts reached a comparable conclusion last November, issuing a landmark judgment that labeled unlicensed training data a violation of the nation’s copyright law. Those precedents suggest that plaintiffs could argue Google’s Gemini series similarly breaches both U.S. and European statutes, especially given the quantitative evidence now on record.
Industry analysts have warned that the “fair use” defense hinges on the opacity of model internals. The Stanford‑Yale team’s methodology—probing models with prompts that coax out specific passages—exposes a practical test for infringement that regulators may adopt. If courts accept that near‑verbatim output demonstrates de facto storage, the burden will shift to AI firms to prove that their datasets are scrubbed of protected works or that they have secured licenses. Google’s earlier public statements, which downplayed the presence of copyrighted text, may now be scrutinized as misleading, potentially inviting additional claims for false advertising or consumer fraud.
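The headline figures in such studies boil down to a measurable question: what fraction of a book’s text reappears verbatim in a model’s output? As a rough illustration (a hypothetical metric sketched for this article, not the Stanford‑Yale team’s actual code), one can count how much of the original is covered by long verbatim runs that also appear in the generated text:

```python
def verbatim_overlap(original: str, generated: str, min_run: int = 50) -> float:
    """Fraction of `original` covered by verbatim runs of at least
    `min_run` characters that also appear in `generated`.

    Illustrative only; real memorization studies use more careful
    tokenization and matching than this character-level sketch.
    """
    covered = [False] * len(original)
    i = 0
    while i + min_run <= len(original):
        chunk = original[i:i + min_run]
        if chunk in generated:
            # Greedily extend the verbatim match past min_run characters.
            end = i + min_run
            while end < len(original) and original[i:end + 1] in generated:
                end += 1
            for k in range(i, end):
                covered[k] = True
            i = end
        else:
            i += 1
    return sum(covered) / max(len(original), 1)


# Toy example: the "model output" copies the first half of the "novel".
novel = "".join(f"word{i} " for i in range(80))       # stand-in for a book's text
model_output = novel[:275] + "### unrelated continuation"
print(f"{verbatim_overlap(novel, model_output):.0%} recovered verbatim")  # 50%
```

A threshold like this is exactly the kind of practical test the study’s methodology implies: a score near zero is consistent with statistical learning, while a score of 70% or more is hard to explain without de facto storage.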
Beyond litigation, the findings could reshape how AI developers curate training corpora. VentureBeat’s recent coverage of open‑source alternatives such as AI2’s Molmo models highlights a growing appetite for transparent data pipelines, but those projects still face the same copyright landscape. Should the lawsuit succeed, the industry may be forced to adopt licensing frameworks akin to those used for music and film streaming, dramatically increasing the cost of building large‑scale models and potentially slowing the pace of innovation. For now, Google’s legal team has not commented on the study, but the company is expected to file a motion to dismiss on the grounds that the extracted text results from “prompt engineering” rather than inherent data storage—a line of argument that will be tested in courts for the first time.
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.