Meta admits using pirated books to train AI, but won't pay for it

Lee Duna · 11 months ago

archomrade [he/him] · 11 months ago

Your link is merely proposed recommendations. That is not legislation nor case law.

It’s also not talking about building AI, but circumventing DRM in order to preserve art. They’re saying that there should be an exemption to the illegal practice of circumventing DRM in certain, limited circumstances. However, they’re still only suggesting this! So not only does this not apply to your argument, it isn’t even actually in force.

At the bottom of the document, the Library of Congress approves all recommendations and adopts them as legal defenses against copyright claims. This is established law, not merely recommendations. Please understand the legal processes we’re discussing here.

Regardless, I’m not arguing that exemption classes 7(a) and 7(b) actually apply to AI and LLM’s, only that they serve as precedent guidance on how they should be treated in any suit raised. Granted, OpenAI is not a research institution, so this classification would not apply on those grounds, but the way they treat the work being challenged is still relevant. LLM’s are transformative in nature. Their use and nature are distinctly similar to that of a searchable database described in Authors Guild, Inc. v. HathiTrust and Authors Guild v. Google (the legal strength is even greater here, since LLM outputs are creative, and do not provide ‘copied’ expressions as a matter of course - fringe cases notwithstanding), and as such we have no reason to expect they’d view it differently in the case of an LLM. Training data is a utilitarian precursor to an expressive tool, as repeatedly affirmed as fair use in existing precedent.

The flaw here is that the work isn’t processed in situ; it is copied into a training database, then processed. The processing may be fine, but the copying is illegal.

Fair use describes exemptions to the illegality of unauthorized copies; it explicitly asserts that the copying is legal for a given use. See Authors Guild, Inc. v. HathiTrust and Authors Guild v. Google for reference. It’s worth pointing out the distinction between a right to control unauthorized use and a right to control unauthorized access, and admittedly this would be the weakest point in Meta’s case. However, I share the paper author’s perspective on illicit sources:

On the other hand, as Michael Carroll argues, there are strong arguments to be made that copying from an infringing source may still be fair use. Carroll argues that ‘[t]reating an otherwise fair use as unfair because it was made from an infringing source would lead a court to deny the public access to the products of secondary uses that fair use is designed to encourage.’ He notes that significant doubt exists as to whether good faith is a consideration in fair use at all. Judge Pierre Leval has also persuasively argued that using a good faith inquiry in fair use analysis ‘produces anomalies that conflict with the goals of copyright and adds to the confusion surrounding the doctrine.’ Moreover, even if good faith is part of the broader fair use calculus, courts have found that knowing use of an infringing source is not bad faith when the user acts in the reasonable belief that their use is a fair use. There is no recognized ‘fruit of the poisonous tree’ doctrine in copyright law.

The argument being proposed in the paper (for once, you are correct that this is not established law) is that in other, different cases where TDM is used as a precursor to expressive use, the collection of data for that purpose has been found to be lawful (provided sufficient security is used to prevent infringing, non-exempt abuses). However, the issue we’re discussing is novel. The paper is proposing frameworks for how to apply existing precedent to the novel use-case being investigated. There is no case law to refer to that addresses this specific situation. I can’t tell if you’re just trying to debate-bro me or actually discuss the merits of the case, but I’d just remind you that none of this is settled, nor am I suggesting it is. My perspective is that precedent supports training data for LLM’s as a fair use, and that strengthening copyright in the way proposed does not mitigate the harm being claimed by plaintiffs, and in fact increases harm to the greater public by gatekeeping access to automation tools and consolidating the benefits to already gigantic companies.

If AI is used to pass off as someone else, then the AI manufacturer has built a tool that facilitates an illegal act, by copying the original work.

That’s not an issue for copyright, but I agree it ought to be addressed. Once again, the harm doesn’t stem from the use of copyrighted material, it stems from the technology itself (the harm doesn’t change whether the material is authorized or not, nor does it change to whom harm is done). I really have to stress again that the issues and concerns being raised over AI cannot be sufficiently addressed through the use of copyright law.

TWeaK · 11 months ago

At the bottom of the document, the Library of Congress approves all recommendations and adopts them as legal defenses against copyright claims. This is established law, not merely recommendations.

Thank you for the clarification.

Their use and nature are distinctly similar to that of a searchable database described in Authors Guild, Inc. v. HathiTrust and Authors Guild v. Google (the legal strength is even greater here, since LLM outputs are creative, and do not provide ‘copied’ expressions as a matter of course - fringe cases notwithstanding), and as such we have no reason to expect they’d view it differently in the case of an LLM. Training data is a utilitarian precursor to an expressive tool, as repeatedly affirmed as fair use in existing precedent.

This is indeed a complicated subject, and thank you again for your insight. These are very good example cases, because Google’s searchable book database is exactly the same as the training databases LLM’s use to develop their transform nodes.

The difference between the Authors Guild cases and this one, as I see it, is that Google and HathiTrust are acting to preserve information and art for future generations - there is an inherent benefit to society front and centre with their goals. With LLM’s, the goal is to develop a commercial product. Yes, people can use it for free (right now) but ultimately they expect to sell access and profit from it. Also, no one else gets access to their training database, it is kept as some sort of trade secret.

for once, you are correct that this is not established law

Yay!

My perspective is that precedent supports training data for LLM’s as a fair use, and that strengthening copyright in the way proposed does not mitigate the harm being claimed by plaintiffs, and in fact increases harm to the greater public by gatekeeping access to automation tools and consolidating the benefits to already gigantic companies.

I wouldn’t want to restrict or gatekeep access to art for genuine fair purpose uses. I agree with the Authors Guild rulings in those circumstances; I just disagree that LLM’s are a similar enough circumstance to deserve the same exemption for how they’re developed.

I really have to stress again that the issues and concerns being raised over AI cannot be sufficiently addressed through the use of copyright law.

I agree. Certainly, not copyright law as it exists right now, and even then there are so many aspects of the use of AI that fall well outside the scope of copyright law.

Ultimately, my gripe is that a commercial business has used copyrighted work to develop a product without paying the rightsholders. Their product is their own unique creation, but the copyrighted work their product learned from was not. The training database they’ve used is not “research” because it is not scholarly; even if it were research, it is highly commercial in nature and as such does not warrant a fair use exemption.