The battle over whether U.S. copyright law permits artificial intelligence (AI) training on copyrighted works is no longer a theoretical debate. In 2025, three federal district court decisions began to sketch the boundaries of what counts as fair use in this context.
1. Bartz v. Anthropic: Training as Exceedingly Transformative
In Bartz v. Anthropic PBC, book authors challenged Anthropic’s use of their books, both lawfully acquired and pirated, to train the Claude large language models. On June 23, 2025, Judge William Alsup of the Northern District of California issued a summary judgment opinion addressing whether various actions taken by Anthropic constituted fair use.
With regard to the use of copyrighted materials to train LLMs, Judge Alsup described the use as “transformative—spectacularly so,” which bears on the first fair use factor. The opinion analogizes model training to human reading and learning. That purpose, the court held, is fundamentally different, i.e., transformative, from the authors’ purpose in writing books to be read for entertainment or education.
The same order blessed Anthropic’s “print-to-digital” process as fair use. Anthropic bought physical copies of books, scanned them, destroyed the paper copies, and stored searchable digital versions for internal use.
The opinion drew a sharp line, however, at Anthropic’s acquisition of millions of pirated books downloaded from shadow libraries. Judge Alsup viewed the acquisition of these books and the training of the LLM on them as distinct steps, each to be evaluated separately for fair use. He found that, while the training was fair use, the acquisition of libraries of stolen books was not fair use on the record before him. There was no doctrinal “get out of jail free” card for downloading pirated copies simply because the intended end use (training) might be transformative. The Bartz case has since settled, based primarily on this decision, with Anthropic agreeing to pay up to $1.5 billion.
2. Kadrey v. Meta: Fair Use But With a Market-Harm Warning Label
Two days later, in Kadrey v. Meta Platforms Inc., Judge Vince Chhabria of the Northern District of California likewise held that training an LLM on copyrighted books—including books obtained from shadow libraries—was fair use on the record before him. After discovery, the parties filed cross-motions for partial summary judgment on fair use. Judge Chhabria denied plaintiffs’ motion and granted Meta’s. But while Judge Chhabria ruled in Meta’s favor, the takeaway for LLM developers was not all positive and was highly fact-specific to the evidence of record in that case.
On the first factor, the court agreed that using books to train the Llama models was highly transformative. The books exist to be read; Meta’s purpose was to extract patterns and statistical relationships from the corpus to power a general-purpose text generator. Unlike Judge Alsup, Judge Chhabria refused to carve the shadow-library downloads off as a separate, non-transformative act of piracy. He analyzed Meta’s copying of books from those sites together with its use of the copies to train Llama as a single instance of reproduction to be judged under fair use.
The reason Judge Chhabria’s decision wasn’t a complete victory for LLM developers like Meta lies in its discussion of the fourth fair use factor, which looks to the effect of the use on the potential market for the original work. He found no triable issue that Llama was regurgitating plaintiffs’ books in a way that would serve as a market substitute for them. He also concluded that the plaintiffs had not shown the existence of a well-defined, likely-to-develop licensing market for training on their books that Meta had usurped. The latter finding is especially interesting because, since the decision issued, a number of licensing deals have been struck between copyright owners (like media companies) and LLM developers.
In that sense, Kadrey is as much a roadmap for future plaintiffs as a win for Meta: it affirms fair use in this instance while warning that a richer evidentiary record might tip factor four in a different direction.
3. ROSS Intelligence: When Training Cannibalizes the Market
If Bartz and Kadrey occupy one side of the developing fair use landscape, Thomson Reuters Enterprise Centre GmbH v. ROSS Intelligence Inc. sits at the other. ROSS set out to build an AI-driven legal research tool that would compete directly with Westlaw. After Thomson Reuters refused to license its content, ROSS obtained “Bulk Memos” generated from Westlaw’s copyrighted headnotes and used them in its training pipeline. On February 11, 2025, Judge Stephanos Bibas, sitting by designation in the District of Delaware, granted partial summary judgment to Thomson Reuters, holding that the headnotes were protectable and that ROSS’s fair-use defense failed as a matter of law. To be clear, ROSS Intelligence differs substantively from Bartz and Kadrey: while the technology at issue in ROSS Intelligence is AI-related, it is not generative AI.
On the purpose-and-character factor, ROSS argued that its use was intermediate and transformative: headnotes were copied only to generate numerical weights, never displayed to users, and the resulting system was not a database of headnotes but an AI tool. The court rejected that framing, distinguishing the intermediate-copying precedents ROSS invoked (which involved computer code) and finding that the first factor favored Thomson Reuters.
On the market-effect factor, the court emphasized that ROSS was trying to build a substitute for Westlaw using Westlaw’s own curated content. That kind of commercial substitution lies at the core of factor four’s concerns. With both the first and fourth factors cutting sharply against ROSS, the court held that the training use was not fair use.
The result in ROSS underscores that AI developers may be on shakier ground when their systems are designed to replicate a rightsholder’s core product rather than to generate new works that only indirectly compete with it. Notably, the ROSS Intelligence decision is currently on appeal in the Third Circuit, where the court will be deciding at least two core questions: whether short quotes/paraphrases of judicial holdings in headnotes are copyrightable at all, and whether fair use protects ROSS’s internal use of those headnotes as training data.
What These Cases Signal
Taken together, these three decisions show that U.S. fair-use doctrine is not marching in a single direction for AI training, and it will take some time for appellate decisions to start providing a more unified approach. Whether training generative AI models on copyrighted material is infringing or constitutes fair use is a moving target. The decision in Thomson Reuters v. Ross Intelligence is on appeal to the Third Circuit, and the decisions in Kadrey v. Meta (still in discovery) and Bartz v. Anthropic (settled) are not binding. Numerous other cases that address these issues are also making their way through the courts and may not result in binding precedent any time soon. In addition, output-based infringement may increasingly become a factor. Whether training a generative AI model constitutes fair use will, at the present moment, generally depend on a variety of factors, including how transformative the training and outputs are and whether the outputs compete with, or dilute the market for, the original works. A model’s output may also infringe if it is identical or substantially similar to a copyrighted work, and consideration must likewise be given to whether that output impacts the market for the work.
The courts take a dim view of uses (training or outputs) that substitute for the copyright owners’ work. Thomson Reuters v. Ross Intelligence held that Westlaw’s headnotes and Key Number system are original works protectable by copyright and that Ross’s use of the copyrighted headnotes to train its own AI search platform was not fair use. The decision emphasized factor four of the fair use analysis set forth in 17 U.S.C. § 107: the infringing use’s effect on the copyrighted work’s “potential market” or “value.” Bartz v. Anthropic’s finding of fair use relied on the determination that copies of the works “did not and will not displace demand for copies of the Authors’ works …” If the proposed use serves as a replacement for a paid product, the likelihood of successfully claiming fair use decreases.
Even if training generative AI models is considered “transformative,” use of tainted source material may result in liability. Recent decisions, including Kadrey v. Meta (still in discovery) and Bartz v. Anthropic, highlight the distinction between transformative training practices and wholesale copying from unauthorized sources such as shadow libraries. Even where training itself may be framed as fair use, reliance on pirated or unlicensed content can undermine that defense. Clean, well-documented sourcing is therefore critical to reducing exposure.
Future decisions may also flesh out the concern regarding market harm, including dilution of the market for copyrighted works. While the fair use analysis has traditionally focused on whether the secondary use displaces sales of the original work, the copyright owner is also harmed if an AI system produces outputs that erode the demand for the owner’s work. Kadrey v. Meta noted the danger of this “‘indirect’ substitution,” which may “enable the rapid generation of countless works that compete with the originals, even if those works aren’t themselves infringing.” The Copyright Office has taken a preliminary position that section 107 encompasses any “effect” on the potential market, and that the “speed and scale at which AI systems generate content pose a serious risk of diluting markets for works of the same kind as in their training data.” This dilution, according to the Copyright Office, can come in the form of works in the same genre as the copyrighted works or works that are stylistically similar.
Coping with Legal Uncertainty
Given the unsettled state of the law as of the end of 2025, overlooking any of these risks could result in significant legal exposure. And, as always, the viability of a fair use defense turns on a fact-specific application of the four statutory factors under 17 U.S.C. § 107. Accordingly, both the methods used in training and the potential as well as actual uses of generative AI should be carefully documented and subjected to rigorous legal analysis for possible copyright implications. Although this area of practice cannot yet be navigated without risk, early and deliberate attention to these issues provides the best foundation for managing uncertainty and mitigating exposure.
Join the Discussion
4 comments so far.
Anon
December 30, 2025 10:00 am
I hear your point of “enabling data owners to exercise informed consent about what enters training pipelines.”
But that begs the question as to whether “data owners” actually have that legal right over what enters training pipelines.
They (often) do not**.
This is the crux of the matter.
It is critical for creatives, data owners and the like to recognize the limits of any legal rights that they believe that they have.
** In most cases. Certainly, there are cases in which data owners may have ‘locked down’ their data to such an extent that breaking in may entail a violation of law (such as, for example, a contract term violation, or another legal regime such as HIPAA).
But be aware that even breaking through some level of encryption or other ‘lock-down’ may not entail a violation of law or of rights if the use of that data falls within Fair Use.
In such cases, there is no actual legal ‘ownership’ of data that prevents a use within Fair Use. As you may well be aware, even a notice not to do something can be NOT legally binding; robots.txt, e.g., is only a non-binding suggestion.
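To make the robots.txt point concrete, here is a minimal sketch using Python’s standard-library robots.txt parser. The site URL is hypothetical, and “GPTBot” (OpenAI’s published crawler token) is used purely for illustration; the key point is that the file only expresses a preference, and each crawler decides for itself whether to consult or honor it.

```python
# Sketch: robots.txt is purely advisory. A "polite" crawler asks the parser
# before fetching; nothing in the protocol forces any crawler to do so.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt opting a site out of one AI crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/post"))       # False: asked to stay out
print(rp.can_fetch("SomeOtherBot", "https://example.com/post"))  # True: no rule applies
```

Even the “False” answer above binds no one: a crawler that never calls the parser fetches the page anyway, which is exactly why robots.txt is a non-binding suggestion rather than a legal lock.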
Hunkguk Do
December 29, 2025 09:06 pm
Thank you for this critical clarification and the important legal guidance.
Your distinction between “defensive protection” and “sabotage” is absolutely right, and I deeply appreciate you taking the time to explain the boundaries.
To clarify our intent: Our focus is not on “collapsing” AI models but on enabling data owners to exercise informed consent about what enters training pipelines—essentially a “Do Not Track” mechanism for AI training, not a weapon.
Your analogy about nuclear waste in a stream is particularly illuminating.
We recognize that even property owners have responsibilities to the broader ecosystem, and the same principle must apply in the digital commons.
This dialogue has been invaluable in helping us refine our approach to stay within legal and ethical boundaries while still addressing the legitimate concerns of data sovereignty.
Thank you again for your thoughtful engagement.
Anon
December 29, 2025 10:23 am
Hunkguk Do,
Thank you for sharing your own insights in reply. While conversations in this medium are difficult to maintain (or engage in), I thought that I would share my own views sparked by what you shared.
Your comment of “These three decisions highlight a critical tension: while courts evaluate fair use through the lens of market substitution…” first brings to mind an excellent piece of legal writing, for which I must credit Prof. Edward Lee, on the attempt to improperly broaden exactly what the limited rights of copyright under US law are allowed to protect: that is, the single explicit item of expression to which copyright attaches. The notion of market substitution is illicitly and improperly being used to broaden copyright in an attempt to protect a business model of digital goods, one that amounts merely to the concept of “style of,” which expressly cannot be protected.
And
“… the technical reality of vector embeddings makes post-hoc enforcement nearly impossible” draws out the technical, transformative nature that requires a finding of Fair Use under existing Supreme Court law, in that what is technically happening is happening to unprotected meta-characteristics, at scale, such that no single work can properly meet the legal test for showing infringement culpability.
This single statement of yours shows a double reasoning why copyright infringement cannot be legally present in the training of AI engines.
This, though, is far less a matter of “Legal remedies arrive years after the harm” and far more a matter of understanding that legal remedies are not available for a “harm” that is not legally present. One cannot BE harmed with respect to a set of business methods that are not legally protected to begin with.
Does this mean that digital business methods need something different?
Yes!
But that is not the horrible answer that many seem to want it to be.
Quite in fact, that is expressly the disruptive promise that promoting innovation is meant to call forth.
Technology always carries the nature of moving faster than ‘law’, and this is the very reason why innovation law must be open-ended and NOT constrained merely by existing business models. If that were allowed to happen, those in control of existing business models would choke out true innovation.
Second Prong:
You are correct to be hesitant in viewing what may well be destruction of another’s property! Items such as the one you mention having created (along the lines of Nightshade and the like) that cause harm to another’s property run a very real risk of being illegal in and of themselves – and may bring causes of action for knowingly being placed in the stream of commerce. This is especially true given that these knowingly cause sabotage – as opposed to a more benign ‘just protecting my own item’ approach.
It is critical – absolutely critical – to understand that the rights of a grant of copyright are a bundle of limited rights and creatives simply do not have absolute dominion over their own creations so as to be able to cause a toxic harm in the larger stream of commerce.
One analogy here that might help would be real property. One may have a stream on one’s own land, but one is not permitted to throw nuclear waste into the stream, even if one does so within the bounds of one’s property lines.
Hunkguk Do
December 25, 2025 11:08 pm
Thank you for sharing your insight.
These three decisions highlight a critical tension: while courts evaluate fair use through the lens of market substitution, the technical reality of vector embeddings makes post-hoc enforcement nearly impossible.
Once content is transformed into vector embeddings and integrated into training datasets, tracing the relationship back to original works becomes mathematically intractable. Watermarks don’t prevent training. Encryption eliminates usability. Legal remedies arrive years after the harm.
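As a purely illustrative sketch of the “mathematically intractable” point above: an embedding is a lossy, many-to-one projection, so inverting a vector back to one unique source text is ill-posed. The toy bag-of-words model below is an assumption for illustration only; production embedding models use learned dense representations, but the same many-to-one property holds.

```python
# Toy illustration: embeddings are lossy, many-to-one projections, so a
# vector cannot be uniquely traced back to its source text.
import numpy as np

def toy_embed(text: str, dim: int = 16) -> np.ndarray:
    """Hash each word into one of `dim` buckets and count occurrences."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

a = toy_embed("the quick brown fox jumps over the lazy dog")
b = toy_embed("the lazy dog jumps over the quick brown fox")  # same words, different text

# Distinct texts collapse to the identical vector: word order is simply gone,
# so nothing in the vector identifies which original produced it.
print(np.allclose(a, b))  # True
```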
I recently developed technology that causes AI models to collapse when they attempt to train on content without user consent (applicable to text, PDFs, images, etc.). But I’m genuinely conflicted about commercialization:
Legal remedies vs. technical prevention—which truly serves creators and the ecosystem?
Is technology that sabotages AI models for unauthorized training too aggressive? Or is it the natural evolution of digital rights in an age where My Data, My Algorithm, and now My Content are all effectively controlled by platforms & AI models?
The law is important, but technology and markets move faster. Perhaps we need technical safeguards that operate at the data layer, not just legal frameworks that operate in courtrooms.