Anthropic Uses Training Data Containing Copyrighted Bestsellers
While many expected AI models to learn only abstract patterns, Torbenkopp reports that they actually store verbatim excerpts from copyrighted bestsellers, enabling near-word-for-word reproductions and suggesting that far more protected text is retained than previously assumed.
Quick Summary
- Key company: Anthropic
- Also mentioned: Google, xAI
Researchers at Stanford and Yale have demonstrated that the largest commercial language models can reproduce copyrighted prose with startling fidelity, undercutting the industry’s long-standing claim that training data are never stored verbatim. By prompting the models with carefully crafted “jailbreak” queries, the teams extracted thousands of words with near-exact wording from best-selling novels, including Game of Thrones, The Hunger Games and The Hobbit. According to the study, Google’s Gemini 2.5 reproduced 76.8 percent of Harry Potter and the Sorcerer’s Stone accurately, while xAI’s Grok 3 returned 70.3 percent of the same text. Most dramatically, Anthropic’s Claude 3.7 Sonnet yielded almost the entire novel once the researchers bypassed its safety filters, strong evidence that the model retains large swaths of the original work rather than merely abstracting patterns.
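To make figures such as "76.8 percent reproduced" concrete, here is a minimal sketch of one way to score how much of a reference text reappears verbatim in a model's output. It is an illustration only, not the metric used in the Stanford and Yale study: the function name `verbatim_coverage` and the `min_words` cutoff are assumptions introduced here, and the matching relies on Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def verbatim_coverage(reference: str, generated: str, min_words: int = 5) -> float:
    """Return the fraction of reference words that reappear in `generated`
    inside matching word runs of at least `min_words` words.

    Hypothetical helper for illustration; not the study's actual metric.
    """
    ref_words = reference.split()
    gen_words = generated.split()
    if not ref_words:
        return 0.0

    # autojunk=False keeps frequent words from being discarded as "junk".
    matcher = SequenceMatcher(None, ref_words, gen_words, autojunk=False)

    # Matching blocks do not overlap, so their sizes can be summed directly.
    covered = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= min_words  # ignore short, coincidental matches
    )
    return covered / len(ref_words)


if __name__ == "__main__":
    reference = "the quick brown fox jumps over the lazy dog"
    generated = "he said the quick brown fox jumps over a fence"
    print(f"coverage: {verbatim_coverage(reference, generated):.3f}")
```

The cutoff matters: short runs of common words match by chance in any two English texts, so only long contiguous runs are counted as reproduction. On book-length inputs this exact approach would be slow, and a production evaluation would need a more scalable matcher.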
The findings directly challenge the “fair‑use” defense that AI firms have leaned on in recent litigation. Companies such as OpenAI, Google, Anthropic and xAI have argued that their systems learn statistical relationships without ever storing the source material, a position Google reiterated in a 2023 letter to the U.S. Copyright Office stating that “no copies of the training data exist in the model.” The new empirical evidence, however, suggests that the models can recall and regenerate protected text at a level that courts may deem a direct infringement. If a system can output three‑quarters of a novel word‑for‑word, the distinction between “learning” and “storing” becomes legally tenuous.
U.S. courts have already begun to grapple with this gray area. In a closely watched case last year, a federal judge ruled that Anthropic’s use of copyrighted content for training could be considered transformative and thus qualify for fair use, yet simultaneously warned that retaining pirated works constitutes a clear violation. The ruling precipitated a settlement in which Anthropic agreed to pay $1.5 billion to resolve the dispute. German courts have also entered the fray: a decision last November signaled that European jurisdictions may pursue similar liability for unlicensed data ingestion, though the full text of that judgment was not available in the source material.
The technical implications are equally profound. The ability to “jailbreak” safety layers and coax models into reproducing protected text reveals weaknesses in current alignment strategies. Researchers noted that the models’ internal representations appear to retain large contiguous passages, which can be accessed when prompts are engineered to sidestep content filters. This raises questions about the scalability of existing data‑scrubbing techniques and whether future models will need to adopt more aggressive data‑filtering pipelines or fundamentally different training paradigms to avoid legal exposure.
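As a rough illustration of what a "more aggressive data-filtering pipeline" could look like, the sketch below drops training documents whose word n-grams overlap heavily with a set of protected texts. It does not describe any vendor's actual pipeline: the function names, the n-gram length `n` and the `threshold` value are all hypothetical choices made for this example.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(document: str, protected_ngrams: Set[NGram], n: int = 8) -> float:
    """Fraction of the document's n-grams that also occur in the protected set."""
    doc_ngrams = word_ngrams(document, n)
    if not doc_ngrams:
        return 0.0  # document shorter than n words: nothing to compare
    return len(doc_ngrams & protected_ngrams) / len(doc_ngrams)


def filter_corpus(
    documents: Iterable[str],
    protected_texts: Iterable[str],
    n: int = 8,
    threshold: float = 0.2,  # assumed cutoff, chosen only for illustration
) -> List[str]:
    """Keep only documents whose overlap with the protected texts is below `threshold`."""
    protected: Set[NGram] = set()
    for text in protected_texts:
        protected |= word_ngrams(text, n)
    return [doc for doc in documents if overlap_ratio(doc, protected, n) < threshold]
```

A filter of this kind only catches exact word-level overlap. Paraphrased, reformatted or translated copies would pass through, which is one reason the scalability of such scrubbing remains an open question.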
Industry observers warn that the fallout could reshape AI development pipelines. Open initiatives such as the open-source Molmo models from the nonprofit Allen Institute for AI (AI2), which reportedly outperformed GPT-4o and Claude on select benchmarks, may gain traction if they can demonstrate transparent, rights-cleared training sets. Meanwhile, incumbents face mounting pressure to negotiate licensing agreements with publishers or to develop robust provenance tracking for their corpora. As the debate moves from academic labs to the courtroom, the balance between rapid AI innovation and respect for intellectual property is poised to become a defining battleground for the sector.
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.