Anthropic and OpenAI Face Scraping Scandal as Unauthorized Data Trains ChatGPT and Claude
OpenAI, Anthropic, Meta and other AI firms are accused of scraping billions of online documents, including Reddit posts, GitHub code and academic papers, without consent, prompting lawsuits from Reddit, The New York Times and the Authors Guild, according to recent reports.
Key Facts
- Key companies: OpenAI, Anthropic
- Also mentioned: Meta
The Common Crawl dataset that OpenAI used to train GPT‑3, which the company’s 2020 paper described as encompassing “nearly the entire internet,” now sits at the center of three high‑profile lawsuits, according to a report on Tiamat. Reddit’s filing in the Northern District of California alleges that OpenAI harvested more than 300 million Reddit posts without any user consent, compensation, or opt‑out mechanism, and that Anthropic mirrored the practice for its Claude model, as did Meta for its LLaMA series. The New York Times has joined the litigation, seeking $5 billion in damages for alleged copyright infringement, while the Authors Guild has filed a separate suit claiming that the training pipelines systematically appropriate authors’ works without permission (Tiamat, Mar 8).
The scale of the alleged data grabs is staggering. In addition to the Reddit corpus, OpenAI’s training diet reportedly included 2.3 billion web pages scraped from CommonCrawl.org, millions of books harvested from Project Gutenberg, Google Books and commercial publishers, and countless code repositories from GitHub—some of which contain proprietary business logic and trade secrets (Tiamat). Anthropic and Meta are said to have used comparable “scraping economies,” pulling academic papers, personal blogs, and even YouTube transcripts to bulk‑train their large language models (Engadget). The lawsuits argue that each of these datasets represents billions of dollars in extracted value that was never shared with the original creators, a claim echoed by CNBC’s coverage of Reddit’s suit against Anthropic for “unfair competition” and breach of contract.
Legal scholars are watching the fair‑use debate intensify. While courts have not yet ruled on whether commercial AI training on copyrighted material without permission qualifies as fair use, the plaintiffs contend that the absence of any opt‑out or notification mechanism violates both copyright law and data‑privacy statutes such as FERPA, COPPA and the California Consumer Privacy Act (CCPA). Tiamat notes documented FERPA breaches in university classrooms where students’ assignments fed directly into ChatGPT, and an FTC investigation into child‑focused platforms that could expose minors’ behavioral data to AI recommendation engines (Tiamat). If judges side with the plaintiffs, the rulings could force the industry to redesign its data pipelines, potentially curbing the rapid scaling that has defined the sector over the past three years.
Industry insiders warn that the litigation could reshape revenue models for AI providers. OpenAI and Anthropic have built their commercial offerings—ChatGPT Plus, enterprise API access, and Claude‑powered services—on the premise that the underlying models contain a “unique blend of public and private knowledge” that drives premium pricing (Tiamat). If courts deem that knowledge unlawfully appropriated, the firms may be compelled to either compensate millions of individual contributors or rebuild their models using only licensed or user‑consented data, a costly and time‑consuming process. The stakes are high: the lawsuits collectively seek damages that could dwarf the $6.6 billion funding round that propelled OpenAI to a $157 billion valuation earlier this year (The Information).
For now, the AI community remains divided. Some developers argue that large‑scale web scraping is a de‑facto standard practice that enables rapid innovation, while others—particularly open‑source advocates—see the lawsuits as a necessary corrective to an industry that has “privatized the public’s digital labor” without sharing the returns (TechCrunch). As the legal battles unfold, the outcomes will likely set the parameters for how future generative models are trained, how creators are compensated, and whether the current growth trajectory of AI can be sustained without a fundamental rethink of data ethics.
Sources
No primary source found (coverage-based)
- Dev.to AI Tag
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.