Britannica Sues OpenAI Over Massive Copyright Infringement in AI Training Practices
100,000 articles. That’s how many copyrighted pieces Britannica says OpenAI scraped to train its models, prompting the encyclopedia and dictionary publisher to sue the AI firm for “massive copyright infringement,” TechCrunch reports.
Key Facts
- Key company: OpenAI
Britannica’s complaint, filed in New York federal court, alleges that OpenAI harvested roughly 100,000 of the publisher’s online articles—content that Britannica and its Merriam‑Webster imprint own outright—without any license or consent, then fed those texts into the training pipelines for ChatGPT and its sister models. The suit says the AI firm not only used the material to teach its large language models (LLMs) but also continues to surface “full or partial verbatim reproductions” of Britannica’s prose when users query the system, a practice the publisher characterizes as “massive copyright infringement” (TechCrunch). In addition, the complaint claims OpenAI’s Retrieval‑Augmented Generation (RAG) feature, which pulls live data from the web to supplement ChatGPT’s answers, directly taps Britannica’s database, effectively turning the encyclopedia into a free, searchable backend for a commercial product.
Beyond the copyright claims, Britannica invokes the Lanham Act, arguing that ChatGPT’s occasional “hallucinations”—fabricated statements that nonetheless bear the Britannica brand—mislead consumers and dilute the publisher’s trademark. The lawsuit contends that such false attributions “starve web publishers like [Britannica] of revenue” because users receive AI‑generated answers instead of paying for the original subscription or licensing fees (TechCrunch). The complaint further warns that these hallucinations jeopardize public access to reliable information, a point that underscores the broader stakes of the case for the information ecosystem.
Britannica is not alone in challenging OpenAI’s data‑scraping practices. The complaint notes a wave of parallel lawsuits targeting the same AI lab, including actions by The New York Times, Ziff Davis (owner of Mashable, CNET, IGN, PC Mag, and others), and more than a dozen newspapers across the United States and Canada such as the Chicago Tribune, Denver Post, Sun Sentinel, Toronto Star, and the Canadian Broadcasting Corporation (TechCrunch). A similar suit against the competitor Perplexity AI remains pending, suggesting that the publishing industry is coalescing around a coordinated legal front to force AI firms to renegotiate the terms of data use (TechCrunch).
OpenAI has not publicly responded to the filing, but the company’s prior statements on data provenance indicate it relies on a combination of licensed datasets, publicly available web content, and user‑generated inputs to train its models. Legal scholars cited by The Verge note that the lack of explicit licensing for large swaths of copyrighted material weakens OpenAI’s “fair use” defense, especially when the model’s output reproduces protected text verbatim (The Verge). If the court finds that OpenAI’s RAG workflow effectively copies Britannica’s articles in real time, the precedent could force the AI lab to implement stricter filtering or pay retroactive royalties for past usage.
The outcome of the case could reshape the economics of generative AI. Publishers argue that without compensation, AI services erode the financial foundations that support investigative reporting, editorial fact‑checking, and the upkeep of scholarly resources. Conversely, OpenAI and other AI developers contend that the broad ingestion of publicly available text is essential to building models that can understand and generate human language at scale. As the litigation proceeds, industry observers will watch for any settlement terms that might establish a licensing framework—potentially turning today’s “massive infringement” claim into a new revenue stream for content creators while imposing compliance costs on AI providers.
Reporting based on verified sources and public filings. Sector HQ editorial standards require multi-source attribution.