Google Uses Copyrighted Bestselling Novels as Training Data for Its AI Models
Photo by Lucia Macedo (unsplash.com/@sample_in_photography) on Unsplash
According to Torbenkopp, Google uses copyrighted bestselling novels as training data for its AI models, enabling the systems to produce near-verbatim copies and sparking new legal and ethical debates.
Quick Summary
- According to Torbenkopp, Google uses copyrighted bestselling novels as training data for its AI models, enabling the systems to produce near-verbatim copies and sparking new legal and ethical debates.
- Key company: Google
- Also mentioned: Anthropic, xAI
Google’s Gemini 2.5 reproduced 76.8 percent of Harry Potter und der Stein der Weisen with “high accuracy,” according to a joint Stanford‑Yale study that probed the output of the world’s leading language models (source: Torbenkopp). The researchers triggered the result by “jailbreaking” the model—bypassing its built‑in safety filters with crafted prompts—showing that the system can emit thousands of words that match the original text almost verbatim. Comparable experiments yielded a 70.3 percent match from Anthropic’s Claude 3.7 Sonnet and similarly high‑fidelity reconstructions of The Hobbit and A Game of Thrones from xAI’s Grok 3 (source: Torbenkopp). The findings overturn the industry’s long‑standing claim that these models never store exact copies of copyrighted material.
The study’s authors argue that the ability to regenerate such large swaths of text suggests more than “pattern learning.” In a 2023 letter to the U.S. Copyright Office, Google asserted that its models contain no literal copies of training data, whether text, images, or other formats (source: Torbenkopp). The new evidence, however, calls that statement into question. If a model can output three‑quarters of a novel with minimal deviation, the “transformative‑learning” defense—central to the “fair use” argument—faces a practical test: at what point does learning become storage? Legal scholars cited in the report note that U.S. courts have already grappled with this line, deeming Anthropic’s use of copyrighted works “potentially fair” while simultaneously ruling that the act of retaining pirated content is “fundamentally infringing” (source: Torbenkopp).
The ramifications for the broader AI ecosystem are immediate. In the United States, a recent ruling forced Anthropic to settle for $1.5 billion after a jury concluded that its training practices violated copyright, despite the company’s reliance on the fair‑use defense (source: Torbenkopp). Germany’s Federal Court of Justice issued a landmark decision last November that echoed the U.S. stance, signaling that European courts may also reject blanket claims of non‑storage (source: Torbenkopp). If similar lawsuits target Google, the company could face multibillion‑dollar exposure, especially given the scale of its Gemini product line and the commercial reliance of enterprises on its API.
Industry observers note that the controversy could reshape data‑curation practices across the sector. VentureBeat’s coverage of open‑source alternatives, such as AI2’s Molmo models that now outperform GPT‑4o and Claude on certain benchmarks, underscores a growing appetite for transparent training pipelines (source: VentureBeat). While Molmo’s developers publish their data sources, the major cloud providers have largely kept theirs proprietary, a stance that may become untenable under heightened legal scrutiny. Moreover, the ability to extract near‑exact passages from proprietary works could erode trust among publishers and authors, prompting them to demand stricter licensing terms or to block crawling altogether.
For Google, the immediate challenge is twofold: mitigate the technical vulnerability that allows jailbreaking and rebuild its legal narrative around fair use. The company’s 2023 correspondence with the Copyright Office emphasized that “no copy of the training data is present” in the model (source: Torbenkopp), a claim now vulnerable to empirical refutation. Until Google can demonstrate that its architecture truly abstracts away source text—perhaps by integrating differential privacy or other data‑sanitization techniques—it may find itself defending against a wave of infringement claims that could reshape the economics of AI development.
Sources
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.