Judge Calls Anthropic’s Training of LLMs with Authors’ Works ‘Quintessentially Transformative’ But Gives No Pass on Piracy

“Like any reader aspiring to be a writer, Anthropic’s LLMs trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different.” – Judge William Alsup

On Monday, the U.S. District Court for the Northern District of California issued a mixed order on fair use as it relates to generative AI, in part likening the training of Large Language Models (LLMs) to the process of human learning, in a case brought against generative AI company Anthropic by a group of authors.

The lawsuit was filed by journalists and book authors Andrea Bartz, Charles Graeber and Kirk Wallace Johnson in August 2024 against Anthropic on behalf of a class of plaintiffs, alleging widespread copyright infringement of “hundreds of thousands of copyrighted books.” The suit challenged only the inputs of the LLMs, not the outputs.

Anthropic’s core product is the AI chatbot Claude, which the complaint claimed was fed “known pirated versions of Plaintiffs’ works” in order to train the chatbot to generate human-like responses. “An essential component of Anthropic’s business model—and its flagship ‘Claude’ family of large language models (or “LLMs”)—is the largescale theft of copyrighted works,” said the complaint.

Far from compensating the plaintiffs for their works, Anthropic “has taken multiple steps to hide the full extent of its copyright theft,” it continued.

According to reports cited in the complaint, Anthropic “has raised $7.6 billion from tech giants like Amazon and Google” and, as of December 2023, the company was valued in excess of $18 billion. The company has become particularly popular with corporate clients, including Slack, ZoomInfo, Asana, Bridgewater, LexisNexis, and Jane Street Capital, according to the lawsuit.

The infringement allegations stem chiefly from Anthropic’s admission in a December 2021 paper that it created a training dataset relying mostly on “The Pile,” which is “an 800 GB+ open-source dataset created for large language model training,” according to the complaint. One of the architects of the Pile, Shawn Presser, created a dataset called “Books3” in the Pile, which, according to the plaintiffs, is “a trove of pirated books.” Books3 consists of “all of Bibliotik,” according to public posts by Presser, and Bibliotik, according to sources cited in the complaint, is a “notorious pirated collection” of “pirated books.”

The complaint additionally argued that Anthropic purchased millions of copies of print books, some that overlapped with the digital pirated copies it obtained, “tore off the bindings, scanned every page, and stored them in digitized, searchable files” in order to create a “central library” of “all the books in the world” to retain “forever,” according to Monday’s order, which was authored by Judge William Alsup.

In his analysis, Alsup first said, with respect to the copies of the works used to train specific LLMs, that “the purpose and character of using copyrighted works to train LLMs to generate new text was quintessentially transformative.” He explained:

“Like any reader aspiring to be a writer, Anthropic’s LLMs trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different. If this training process reasonably required making copies within the LLM or otherwise, those copies were engaged in a transformative use.”

However, with respect to the copies used to build a central library, Alsup drew a distinction. He found that the print copies Anthropic purchased, and disposed of as it scanned them, were bought “fair and square,” and that the mere format change from print to digital was a transformative use. But he rejected Anthropic’s argument that the pirated copies should equally qualify as fair use. Anthropic contended that because it intended eventually to use the pirated copies in the central library to train LLMs, the use should be deemed transformative. The district court dismissed this argument, finding that the actual use was not transformative and that “piracy was the point: To build a central library that one could have paid for, just as Anthropic later did, but without paying for it.”

Alsup also found that the second fair use factor, the nature of the copyrighted works, pointed against fair use for all the copies at issue because the authors’ works were clearly expressive. But the third factor, with respect to the works used to train LLMs, did favor fair use because the amount and substantiality of the portion of each work used was necessary to the transformative use, according to Alsup. For the works purchased for the central library, the analysis on the third factor was the same, but for the pirated copies used for the central library, the third factor pointed against fair use, said the order. It added:

“[Anthropic’s] purpose, it says, was to train LLMs. But its objective conduct was to seek ‘all the books in the world’ and then retain them even after deciding it would not make further copies from them for training — indicating there were other further uses. Against the purpose of acquiring all the books one could on the chance some might prove useful for training LLMs and maybe other stuff too, almost any unauthorized copying would have been too much.”

Finally, with respect to the effect of the use upon the market value of the copyrighted works, the fourth fair use factor, Alsup found that only the use of the pirated works to create a central library weighed against fair use.

Overall, Alsup granted summary judgment for Anthropic that the training use was a fair use and that the print-to-digital format change was a fair use. But he denied summary judgment for Anthropic that the pirated library copies must be treated as training copies and ordered a trial with respect to the pirated copies to determine damages, including potentially for willfulness. “That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for the theft but it may affect the extent of statutory damages,” Alsup noted.

In an analysis of the case for Truth on the Market, Kristian Stout, director of innovation policy at the International Center for Law & Economics (ICLE), said the decision provides a “clear roadmap” for AI companies with respect to inputs, chiefly that “companies should acquire training materials through legitimate channels—purchase, licensing, or authorized access,” but that “output liability emerges as the next frontier.”



Join the Discussion


  • Anon, June 26, 2025 09:41 am

    A second case has also dropped on the topic.

    My main thrust (training being necessarily transformative) appears to be holding firm.

  • Anon, June 25, 2025 08:35 am

    The race is not over, so I do not celebrate.

    That being said, the level of technical transformation necessary should have made this an easy call for all attorneys trained in this space.
