xAI Uses Training Data Containing Copyrighted Bestseller Novels
According to Torbenkopp, xAI’s language models were trained on datasets that include copyrighted bestseller novels, enabling the AI to generate near‑verbatim reproductions and sparking fresh legal and ethical battles for the industry.
Quick Summary
- According to Torbenkopp, xAI’s language models were trained on datasets that include copyrighted bestseller novels, enabling the AI to generate near‑verbatim reproductions and sparking fresh legal and ethical battles for the industry.
- Key company: xAI
- Also mentioned: Google, Anthropic
The study from Stanford and Yale, released this week, shows that the “Grok” model behind xAI’s flagship chatbot can reproduce more than 70 percent of a protected novel word for word when prompted with the right jailbreak sequence. The researchers demonstrated that Grok 3 reproduced 70.3 percent of The Hobbit with “high accuracy,” and that Google’s Gemini 2.5 managed 76.8 percent of Harry Potter and the Sorcerer’s Stone under identical conditions. Anthropic’s Claude 3.7 Sonnet was coaxed into emitting nearly the entire text of The Hunger Games after the team bypassed the model’s safety layers. These findings undercut the industry‑wide claim, most famously restated in a 2023 letter to the U.S. Copyright Office by Google, that “no copies of training data exist in the model” (Google, 2023).
The implications for xAI are immediate. According to the Torbenkopp report, the company’s training pipelines incorporated a swath of best‑selling fiction without securing licenses from authors or publishers. Because the model can regenerate large passages verbatim, the “fair‑use” defense that AI firms have leaned on—asserting that training is a purely transformative process that never stores the source text—now looks shaky. If a system can output three‑quarters of a novel on demand, the line between “learning patterns” and “storing copies” blurs, raising the specter of direct infringement. Legal scholars cited in the study note that the U.S. district court that last year deemed Anthropic’s use “transformative” also warned that “the storage of pirated works is fundamentally infringing,” a stance that led to a $1.5 billion settlement (Reuters, 2024).
The fallout could reverberate across the entire generative‑AI market. Bloomberg reports that xAI is burning roughly $1 billion a month on Grok development, a cash burn that will intensify scrutiny from investors and regulators alike (Bloomberg, 2025). If courts begin to treat near‑verbatim reproductions as evidence of unlawful copying, the cost of retrofitting data pipelines—scrubbing copyrighted text, negotiating licenses, or building entirely synthetic corpora—could dwarf current operating expenses. Meanwhile, rival firms such as OpenAI and Anthropic are already fielding lawsuits over similar allegations; Perplexity AI, for example, faces a suit from the Berlusconi media empire for training on unlicensed film and TV content (Bloomberg, 2025). The convergence of these legal pressures suggests a looming industry‑wide reckoning over data provenance.
Stakeholders are already positioning themselves for the next round of battles. Elon Musk’s xAI, which is seeking billions in fresh capital according to Reuters (June 2025), must now answer to both shareholders demanding profitability and rights‑holders demanding compensation. The study’s authors argue that “the ability to regenerate protected text at scale” could serve as a decisive metric in future copyright litigation, potentially reshaping how courts evaluate “substantial similarity.” If judges adopt that standard, the current practice of training on massive, uncurated web scrapes may become untenable, forcing a shift toward more transparent, consent‑driven data collection.
For now, the public sees the most tangible symptom: AI chatbots that can finish a beloved saga with uncanny fidelity. That novelty is a double‑edged sword—captivating users while exposing the firms behind the scenes to unprecedented legal risk. As the Stanford‑Yale team’s paper makes clear, the era of “black‑box” training data is ending; the industry will have to decide whether to double down on aggressive data harvesting or to rebuild on a foundation that respects copyright from the ground up.
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.