Comedian Sarah Silverman Takes Aim at OpenAI and Meta for Copyright Infringement

“‘Generative artificial intelligence’ is just human intelligence, repackaged and divorced from its creators.” – Saveri Law Firm

Last week, comedian Sarah Silverman and authors Christopher Golden and Richard Kadrey sued OpenAI in a U.S. district court, alleging the company’s generative AI product, ChatGPT, infringes on their copyrighted content. In addition to copyright infringement, the trio also claimed that the AI company violated the Digital Millennium Copyright Act (DMCA) and unfair competition laws, and that it was unjustly enriched.

The lawsuit accuses OpenAI of “copying massive amounts of text” used to train ChatGPT to produce new text from prompts. Language models like OpenAI’s rely on datasets of text or other media to train their generative capabilities.

While ChatGPT can produce a wide variety of texts based on user prompts, this lawsuit is restricted to material produced using copyrighted books.

“Much of the material in OpenAI’s training datasets, however, comes from copyrighted works—including books written by Plaintiffs—that were copied by OpenAI without consent, without credit, and without compensation,” the lawsuit says.

Training

The lawsuit claims that long-form books are a key ingredient for OpenAI “because books offer the best examples of high-quality long-form writing.” The authors also cited a 2018 OpenAI paper that outlines how crucial long-form writing is to training ChatGPT.

To train ChatGPT, OpenAI originally used BookCorpus, a dataset that compiled and copied books without offering the authors of copyrighted materials compensation. OpenAI has also stated that a more recent model used two book datasets, which the lawsuit claims likely contain over 350,000 books, based on OpenAI’s estimates of the number of books contained in each dataset.

Since there are only a few internet-based datasets that contain that much material—one being Project Gutenberg, a platform that hosts ebooks of classics that have expired copyrights, and the other being “shadow library” websites like Library Genesis (aka LibGen), Z-Library (aka B-ok), Sci-Hub, and Bibliotik—the lawsuit speculates that one of the book datasets is likely to contain copyrighted material.

The most recent model, GPT-4, was released with no information about its dataset. OpenAI said it withheld this information because of the competitive landscape of AI and safety concerns.

Shadow Libraries

Of main concern to the authors is OpenAI’s use of so-called shadow libraries, which aggregate illegal torrents of books.

“These flagrantly illegal shadow libraries have long been of interest to the AI-training community,” the plaintiffs claimed.

While OpenAI has kept key information about its datasets private, the lawsuit claims that the company publicly acknowledged its ability to filter out texts that it deemed to be inappropriate.

While the lawsuit is unable to point to a smoking gun in a publicly available dataset, the plaintiffs used alternative means to exhibit potential copyright infringement.

ChatGPT Interrogation

The lawsuit includes a section where the plaintiffs attempt to show that ChatGPT was trained using copyrighted material by giving it prompts to generate text.

The authors claim that ChatGPT produced accurate results when prompted to summarize their books. The summaries spanned multiple pages and ran through the plot and contents of the books.

“ChatGPT retains knowledge of particular works in the training dataset and is able to output similar textual content,” the authors claimed.

However, the lawsuit continued, “at no point did ChatGPT reproduce any of the copyright management information Plaintiffs included with their published works.”

ChatGPT can also be used to produce texts written in a similar style to an author, but the lawsuit does not address the question of whether this constitutes copyright infringement.

Other Lawsuits

The increased use of generative AI has prompted a barrage of concerns surrounding intellectual property.

At the same time the authors sued OpenAI, they also filed a similar lawsuit against Meta accusing the company of training its generative AI model LLaMA with copyrighted materials.

The lawyers representing the three authors in these cases also filed a nearly identical class action lawsuit last week on behalf of two authors that also accuses ChatGPT of copyright infringement.

Joseph Saveri and Matthew Butterick of the Joseph Saveri Law Firm said in a press release that they are fighting this legal battle “because AI needs to be fair and ethical for everyone” and called OpenAI and Meta “industrial-strength plagiarists that violate the rights of book authors.”

“As usual, ‘generative artificial intelligence’ is just human intelligence, repackaged and divorced from its creators,” wrote the lawyers.

Alec Pronk: Alec is a freelance journalist and editor who has covered topics ranging from international law to US foreign policy. He holds a master’s degree in political [...]

2 comments so far.

Anon
July 13, 2023 10:59 am
I have to wonder why these suits insist on ignoring the aspect of “generative” in Generative Artificial Intelligence?

Also, there seems to be a rampant** misperception that ‘training’ somehow simply stashes extant works and then the machine literally copies protected aspects of those works in its generative output.

** Sadly, I include in this group several attorneys with whom I have had conversations and who should know better.

B,

To your point (2) – and not including you in my comment above, just to be clear – I think it to be a logical fallacy, or at best an issue of law not yet decided by any court, that there NEED BE any ‘payments’ for use of anything in training sets. As for your quip on paying Silverman and possibly muting her, that would be a rather dangerous precedent given the sheer volume of ingested data points.

Think about it: “for training” appears to scream out “Fair Use.” I would draw a stronger parallel to the Google API case than to the recent Warhol case, by the by, as to an analysis under Fair Use.

B
July 12, 2023 03:23 pm
I’m old enough to remember when Silverman was funny.

That said, this suit will likely fail under copyright law unless: (1) ChatGPT or Meta are reproducing Silverman’s expressions, or (2) the training set(s) included works not paid for.

There’s no evidence that ChatGPT or Meta are reproducing Silverman’s expressions that I am aware of.

Assuming that Meta and OpenAI didn’t pay for Silverman’s book, both companies should immediately write a check for $24 (under 17 USC 504(b)) or $750 (under 17 USC 504(c)), then ask for dismissal.

Heck – make it an even $1K