Latest Developments on Training GenAI with Copyrighted Works and Some 'What Ifs?'

Luca Schirru

‘Boring’ is not a word that can be used to describe the past few days for those interested in litigation involving copyright issues in the development and use of Generative AI systems. Two major cases saw significant updates, with the courts issuing orders that addressed one of the main questions raised in these lawsuits: is the use of copyrighted materials to train Generative AI systems fair use?

This blog post aims to briefly describe each case’s key points related to fair use and to highlight what was left unresolved, including the ‘what if’ scenarios that were hinted at but not decided upon.

Bartz, Graeber & Johnson v. Anthropic

Judge William Alsup’s order on fair use addressed not only the different copies of copyrighted material made for training generative AI systems but also uses related to Anthropic’s practice of keeping copies as a “permanent, general-purpose resource”. It also distinguished between legally purchased copies and millions of pirated copies retained by Anthropic, applying a different fair use analysis to each category.

Regarding the overall analysis of fair use for copyrighted works used to train Anthropic’s Generative AI system, Judge Alsup found that the use “was exceedingly transformative and was a fair use.” Among the four factors, only the second factor weighed against fair use.

The digitization of legally purchased books was also considered fair use, not because of the purpose of training AI systems, but for a much simpler reason: “because all Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies”. For this specific use, of the four factors, only factor two weighed against fair use, while factor four remained neutral.

On the other hand, Judge Alsup clearly stated that using pirated copies to create the “general-purpose library” was not fair use, even if some copies might be used to train LLMs. All four factors weighed against it. Specifically, the order “denies summary judgment for Anthropic that the pirated library copies must be treated as training copies. We will have a trial on the pirated copies used to create Anthropic’s central library and the resulting damages, actual or statutory (including for willfulness).”

Kadrey v. Meta

At the very beginning of the order, Judge Vince Chhabria clarified that the case questions whether using copyrighted material to train generative AI models without permission or remuneration is illegal and affirmed that:

“although the devil is in the details, in most cases the answer will likely be yes. What copyright law cares about, above all else, is preserving the incentive for human beings to create artistic and scientific works. Therefore, it is generally illegal to copy protected works without permission. And the doctrine of “fair use,” which provides a defense to certain claims of copyright infringement, typically doesn’t apply to copying that will significantly diminish the ability of copyright holders to make money from their works (thus significantly diminishing the incentive to create in the future).”

Judge Chhabria explained further that “by training generative AI models with copyrighted works, companies are creating something that often will dramatically undermine the market for those works, and thus dramatically undermine the incentive for human beings to create things the old-fashioned way.” According to him, this would primarily affect not classic works or renowned authors but rather the market for the “typical human-created romance or spy novel,” which could be substantially diminished by similar AI-created works.

However, all these points were framed as “this Court’s general understanding of generative AI models and their capabilities”, with Judge Chhabria emphasizing that “Courts can’t decide cases based on general understandings. They must decide cases based on the evidence presented by the parties.”

Despite this general understanding that “copying the protected works, however transformative, involves the creation of a product with the ability to severely harm the market for the works being copied, and thus severely undermine the incentive for human beings to create”, Judge Chhabria found two of the plaintiffs’ three market harm theories “clear losers,” and the third, a “potentially winning” argument, underdeveloped:

“First, the plaintiff might claim that the model will regurgitate their works (or outputs that are substantially similar), thereby allowing users to access those works or substitutes for them for free via the model. Second, the plaintiff might point to the market for licensing their works for AI training and contend that unauthorized copying for training harms that market (or precludes the development of that market). Third, the plaintiff might argue that, even if the model can’t regurgitate their own works or generate substantially similar ones, it can generate works that are similar enough (in subject matter or genre) that they will compete with the originals and thereby indirectly substitute for them. In this case, the first two arguments fail. The third argument is far more promising, but the plaintiffs’ presentation is so weak that it does not move the needle, or even raise a dispute of fact sufficient to defeat summary judgment.”

In the overall analysis of the four factors, only the second factor weighed against Meta. Summary judgment was granted to Meta regarding the claim of copyright infringement from using plaintiffs’ books for AI training. Nevertheless, Judge Chhabria clarified that “this ruling does not stand for the proposition that Meta’s use of copyrighted materials to train its language models is lawful. It stands only for the proposition that these plaintiffs made the wrong arguments and failed to develop a record in support of the right one.”

The use of pirated copies was also addressed in Kadrey v. Meta. In this case, “there is no dispute that Meta torrented LibGen and Anna’s Archive […].” According to Judge Chhabria, while downloading from shadow libraries wouldn’t automatically win the plaintiffs’ case, it was relevant for the fair use analysis, especially regarding “bad faith” and whether the downloads benefited or perpetuated unlawful activities.

Lessons learned from these decisions and one big “what if?”

Expression matters in training LLMs

Both cases clarified that books are valuable for training because of their creative expression, quality, and consistency. Meta’s argument, relying on precedents like Sega and Google Books about accessing “functional elements” or “non-expressive elements,” was rejected. Judge Chhabria emphasized, “Meta’s use of the plaintiffs’ books does depend on the books’ creative expression,” unlike Google Books’ “content agnostic” technology:

“The database wouldn’t work any better or worse if it contained books full of complete gibberish or written in unknown languages. If someone searched for that text, those books would appear. Here, by contrast, if Meta’s LLMs are to generate high-quality text, they need coherent, reasonably high-quality training data. In other words, they need high-quality expression. Therefore, the “intermediate copying” cases don’t apply.”

Training LLMs requires multiple copies

As summarized in Judge Alsup’s decision, training LLMs requires multiple copies of the books. These include, but are not limited to, a copy taken from the central library for the training set; a ‘clean’ copy made after the removal of repeated or lower-value elements; a tokenized copy derived from the clean one, which was itself copied several times during the training process; and ‘compressed’ copies of the works the model was trained upon, which were retained.

Training humans may not be the best analogy for training AI systems, and this matters for the fourth factor

The two judges disagreed on the analogy to training humans and on the weight of the fourth factor. Judge Chhabria expressly rejected Judge Alsup’s analogy, under which the harm caused by using copyrighted material to train AI systems would be comparable to the harm caused by using works to train schoolchildren to write well, which would “result in an explosion of competing works.”

Judge Chhabria stated that “when it comes to market effects, using books to teach children to write is not remotely like using books to create a product that a single individual could employ to generate countless competing works with a minuscule fraction of the time and creativity it would otherwise take. This inapt analogy is not a basis for blowing off the most important factor in the fair use analysis.”

One big “What if?”

When reading both orders, a question constantly came to mind: what if the outputs of the Generative AI systems were infringing? Judge Alsup clearly states that “Here, if the outputs seen by users had been infringing, Authors would have a different case. And, if the outputs were ever to become infringing, Authors could bring such a case.” Judge Chhabria also recognized that Meta’s efforts to prevent models from ‘memorizing’ and outputting copyrighted material were successful, as the models were not reproducing a significant percentage of the authors’ books.

But what if the outputs were found infringing, as some plaintiffs claim in other lawsuits? Would this change the fair use analysis? Other ‘what if’ scenarios were also raised, as seen in Judge Chhabria’s decision: What if the case involved news articles? What if the use were solely for nonprofit purposes?

For these, all we can do is wait for the next chapters.