Meta Hit With Class Action for Allegedly Using Pirated Books to Train AI Models

Facebook parent Meta Platforms Inc. is the latest target of litigation aimed at Big Tech companies that allegedly use copyright-protected books to train their artificial intelligence models without the authors’ consent.

Lieff Cabraser Heimann & Bernstein and Cowan, DeBaets, Abrahams & Sheppard filed a class action on behalf of lead plaintiff Christopher Farnsworth, author of the “Nathaniel Cade” fiction series, against Meta on Tuesday, claiming it stole “hundreds of thousands” of copyrighted books from a pirated online collection to build its large language model set, “Llama.” The complaint, filed in the U.S. District Court for the Northern District of California in San Jose, alleges copyright infringement under 17 U.S. Code § 501. Counsel has not yet appeared for the defendant.

Meta first launched its flagship LLM family, then stylized as LLaMA, in Feb. 2023 as part of the Big Tech race to compete with OpenAI’s trailblazing generative AI chatbot, ChatGPT, which debuted in Nov. 2022. Meta released “Llama 2” for commercial use in July 2023 and, on April 18, 2024, released its latest iteration, “Llama 3,” which it used to build its AI assistant, “Meta AI.”

According to the complaint, Meta downloaded and copied almost 200,000 copyrighted books from “Books3,” a library of copyrighted works scraped by developer Shawn Presser from the pirated book website Bibliotik. “Books3” is part of “The Pile,” an open-source online dataset hosted by nonprofit EleutherAI that was specifically designed to train large language models. LLMs are conditioned to simulate human communication by ingesting and processing massive quantities of data that effectively “teach” them to generate predictive written responses. The complaint claims that Meta publicly disclosed it used data from Books3 to train its LLMs in a Feb. 2023 research paper.

Meta and the plaintiff’s counsel did not immediately respond to requests for comment.

“These platforms are operating on the principle ‘move fast and break things and pay for it later,’” said Sullivan & Worcester partner Mike Palmisciano, who specializes in transactional intellectual property matters. “Let’s develop these products, become kind of essential in the marketplace, and then figure out how we go from there.”

This is not the first time Meta has faced allegations of stealing copyrighted material from Books3 for AI training purposes. A coalition of writers including comedian Sarah Silverman sued both Meta and OpenAI in California federal court in July 2023 on similar claims of copyright infringement. The Associated Press reported on Sept. 27 that Meta’s CEO, Mark Zuckerberg, will be deposed as part of the class action against Meta.

Lieff Cabraser, along with co-counsel at Susman Godfrey, is also representing the plaintiffs in a class action filed in August that accuses the AI startup Anthropic of misappropriating the texts on Books3 to train its own LLM collection, “Claude.”

Palmisciano said that these types of copyright infringement claims will continue to escalate until a regulatory solution or court judgment “sets the guidelines for what’s permissible in the AI context.”

“I think the fair use argument that’s being made in the defense is hard to square with decades of case law on copyright fair use,” he said. “That being said, I would assume at some point we will get a … Supreme Court ruling on what constitutes fair use in the AI context and whether this type of large dataset ingestion is transformative in a way that protects the providers.”

Until the high court rules on the fair use issue, Palmisciano predicts that companies targeted by the litigation will continue to reach one-off settlements and licensing agreements.

“That seems to be what a lot of early financing for platforms like OpenAI is earmarked toward,” he said. “They have their tech development, of course, but they’re also reaching these really expensive and extensive licensing agreements for content that they’ve already ingested into their platform.”
