“Apple did not compensate creators for use of their copyrighted works and concealed the sources of their training datasets to evade legal scrutiny.” – Complaint
Taking their cue from the recent Bartz v. Anthropic saga, two State University of New York professors who co-authored a neuroscience book filed a class action complaint on October 9 with the U.S. District Court for the Northern District of California, alleging that Apple Inc. committed mass copyright infringement by using pirated books to train its artificial intelligence systems. Plaintiffs Susana Martinez-Conde and Stephen Macknik claimed that Apple built its Apple Intelligence platform, including its OpenELM and Foundation Models, by making unauthorized copies of copyrighted works without permission or compensation.
Apple infringed upon Martinez-Conde, Macknik, and Class members’ copyrighted materials by reproducing their registered works without obtaining authorization to build databases of training materials, according to the filing. Central to the allegations was Apple’s use of datasets containing Books3, described as a “notorious ‘shadow library,’ a dataset of pirated, copyrighted books.” This dataset, derived from a private tracker called Bibliotik, contained approximately 196,640 books, including Martinez-Conde and Macknik’s international bestseller, Sleights of Mind: What the Neuroscience of Magic Reveals About Our Everyday Deceptions.
According to the lawsuit, Apple’s own documentation indicated its use of the infringing materials. Apple’s model card and GitHub repository for its Open Efficient Language Models (OpenELM) stated that the pre-training dataset included the Pile and a subset of RedPajama. Books3 is a well-known component of the Pile, a dataset curated by the research organization EleutherAI.
Furthermore, the “Books” component of the RedPajama dataset was described as a direct copy of the Books3 dataset. By using these datasets, Apple trained its OpenELM models on a known collection of pirated works, thereby directly infringing on the copyrights of thousands of authors, Martinez-Conde and Macknik argue.
Books3 was removed from the website that hosts each OpenELM model, Hugging Face, in October 2023 with a message stating it was “defunct and no longer accessible due to reported copyright infringement.” The dataset’s creator, Shawn Presser, acknowledged the copyright concerns, stating that “we almost didn’t release the data sets at all because of copyright concerns.”
The allegations extended to Apple’s Foundation Models, which are central to the Apple Intelligence features integrated across its products. A research paper released by Apple in July 2024, the Foundation Language Model (FLM) Paper, identified three sources for its training data: licensed data from publishers, “curated publicly available or open-sourced datasets,” and information crawled by its web-crawler Applebot. The complaint contended that the terms “publicly available” and “open-sourced” are commonly used within the AI industry to describe pirated content, falsely suggesting that the works were made publicly available by their authors.
Apple’s licensed data was characterized as a “limited amount” and was not used during what Apple calls “core pre-training” but rather during a subsequent phase called “continued pre-training,” according to the complaint.
Applebot, a web-crawling program that had been scraping internet data for nearly a decade, came under particular scrutiny. The complaint stated that Apple only disclosed in June 2024 that this scraped data was being used for AI training. It further alleged Apple’s Foundation Models had necessarily been trained well before the July 2024 release of the FLM Paper describing them. By the time Apple disclosed this use in June 2024, it was too late for any opt-outs to matter, as Apple had already scraped the data and trained language models with it.
The FLM Paper stated that Applebot “employs advanced crawling strategies to prioritize high-quality and diverse content” and that “high-quality filtering plays a critical role in overall model performance.” Apple used “model-based classifiers” to filter this scraped data for quality, and these classifiers are themselves trained on datasets that include unlicensed copyrighted works, the filing alleged.
The market for licensing AI training data is growing rapidly, with some researchers estimating its value at around $2.5 billion and projecting it could reach nearly $30 billion within a decade. Despite this growth, Apple did not license the authors’ books, which contrasted sharply with its reported deal with Shutterstock, valued between $25 million and $50 million, for images, and its negotiations with news publishers like Condé Nast and NBC News, according to the complaint.
“Apple did not compensate creators for use of their copyrighted works and concealed the sources of their training datasets to evade legal scrutiny,” the complaint states. It adds that “[g]ood writing in training data makes AI outputs better and models more valuable,” which is why high-quality copyrighted works are prioritized for training.
Apple’s conduct has directly harmed the market for the authors’ works, Martinez-Conde and Macknik argued. They contended that outputs from Apple Intelligence will compete with and dilute the market for human-authored books, pointing to the ongoing problem of “low-quality sham ‘books’” and unauthorized AI-generated summaries flooding online marketplaces. The day after Apple officially introduced Apple Intelligence, the company’s value increased by over $200 billion, described in the suit as “the single most lucrative day in the history of the company.”
Apple also trained its models on unauthorized copies of eBooks it sells to users through Apple Books, the plaintiffs allege. Copying and using such eBook files for any purpose beyond the explicit, limited scope of Apple’s license to sell them constitutes copyright infringement, the complaint said.
Martinez-Conde and Macknik are seeking to represent a class of all owners of a registered U.S. copyright for any work Apple used without authorization to train its AI models. They are pursuing statutory damages for willful infringement, an injunction, and the destruction of all AI models and training datasets built using the copyrighted works, including the OpenELM and Foundation Models.
The complaint cited the recent decision in Bartz v. Anthropic, emphasizing that “the person who copies the textbook from a pirate website has infringed already, full stop.” In that case, a settlement described as “the largest publicly reported copyright recovery in history, larger than any other copyright class action settlement or any individual copyright case litigated to final judgment,” was recently preliminarily approved.
Join the Discussion
4 comments so far.
Anon
October 20, 2025 01:36 pm
Thanks for the return, TFCFM – that being said, my reply holds, and the one from whom one would seek legal redress would be the one that committed the copyright infringement.
Are you suggesting otherwise?
(and yes, I am putting ‘all due weight’ on assertions made in lawsuits, given that assertions are just that: assertions)
TFCFM
October 17, 2025 10:39 am
I wonder how analogous my hypothetical situation is to the situation in this matter.
Apple appears to be accused of “using” and “accessing” a copyrighted work, but I don’t see any indication that they’re accused of having made an unauthorized copy of it. (In my hypothetical, I should perhaps have been more specific that Dr. K read the neighbor’s copy, rather than making a new copy… and eliminated the guesswork by specifying that neighbor’s copy absolutely was unauthorized by the copyright owner — but such details detracted from the story-telling, I thought).
Here, Apple appears to have been accused of ‘accessing’ an unauthorized copy and letting its AI algorithm ‘read’ that unauthorized copy and ‘learn’ from it.
Anon
October 15, 2025 04:33 pm
TFCFM,
Interesting question, and I suspect (but could be pleasantly incorrect) that you are wanting an answer of “Yes, A. Knowitall has infringed,” but alas, the infringement is on the neighbor rather than the reader of the infringed item that the neighbor is guilty of. For all that A. Knowitall knows, that photocopied item may not even be an infringing copy under a whole host of facts not present.
It matters not at all any payment of Knowitall to his neighbor.
TFCFM
October 15, 2025 11:31 am
Dr. Aye Knowitall, widely known as a voracious reader and bibliophile, is walking around his neighborhood.
“Hey, Knowitall! Wanna read a great book?” asks a neighbor. “I gots that new best-seller, The World’s Greatest Book.”
Intrigued, the doctor agrees and enters. He’s presented with what is obviously a photocopy of a library-owned book (including pix of the classification label and fingers-on-the-photocopier), but he reads it anyway. As usual, he learns a lot and even remembers several clever quotes which he later employs to (he thinks) wow his friends (at least the ones who no longer read books).
Has A. Knowitall infringed the copyright of the author/publisher of TWGB?
(Does it matter if Knowitall gave his neighbor a dollar to read the copy?)