Hiltzik: How AI makes the mess in copyright law even worse

The snap judgment among legal experts was that a federal judge’s dismissal on Nov. 7 of a copyright infringement lawsuit against OpenAI, the leader in advanced chatbots, will short-circuit an ever-growing effort by artists and writers to keep AI firms from stealing their content.

There’s no question that the ruling handed down Thursday by Judge Colleen McMahon in New York landed with a thud among lawyers trying to bring such cases.

McMahon went beyond merely dismissing the lawsuit brought against OpenAI by Raw Story Media, the owner of progressive news websites. She undermined the basic argument that content creators have made against AI firms: that the process of feeding their AI models data indiscriminately “scraped” from the internet inevitably involves using copyrighted content without permission.

McMahon’s ruling, based on a Supreme Court decision in an unrelated case, “could leave AI copyright claims on shaky ground,” wrote Los Angeles intellectual property lawyer Aaron Moss on his website. The judge not only dismissed Raw Story’s case; she implied that copyright holders in general might be unable to show enough harm from AI scraping to win an infringement case.

That’s because the amount of content fed to AI bots such as OpenAI’s ChatGPT to “train” them is so immense that it’s almost impossible to pinpoint any particular content that has been infringed when the bot spits out an answer to a user’s query.

“Given the quantity of information,” McMahon asserted, “the likelihood that ChatGPT would output plagiarized content from one of [Raw Story’s] articles seems remote.”

McMahon’s ruling may also undermine what has been a growing trend toward the licensing of copyrighted content by AI developers, in part to forestall copyright infringement claims. Dow Jones, the parent of the Wall Street Journal, reached a licensing deal with OpenAI in May that could be worth more than $250 million over five years. That followed multimillion-dollar licensing deals OpenAI reached with Axel Springer, the owner of Business Insider and Politico; the Financial Times; and the Associated Press.

“This court is allowing this thriving, lucrative market for licensed content for AI training to be taken away from Raw Story Media,” Peter Csathy, chairman of Creative Media, a Los Angeles entertainment and media marketing and consulting firm, told me.

That may have happened because Raw Story didn’t make much of that market’s potential in its lawsuit. In its complaint it mentioned the licensing deals OpenAI reached with the Associated Press and Axel Springer, but noted only that the AI firm has “offered no compensation” to Raw Story.

For all that, the full import of McMahon’s decision is anything but clear. That’s because the case brings together two muddy legal regimes: copyright law, which is renowned for its craziness and confusion; and AI law, which may be years away from coalescing into coherence.

At least 12 lawsuits against AI developers alleging copyright violations are currently wending their way through the federal courts — with plaintiffs including the publishers of Mother Jones, the Wall Street Journal and the New York Times; the music recording industry; and writers Michael Chabon and Sarah Silverman.

Intermediate court rulings in these cases contradict each other and raise issues that haven’t been seen before even in high-tech intellectual property law.

Judges have struggled even to define how copyright infringement principles apply to technology that doesn’t output exact copies of copyrighted works but “mimics” them — rather like how the beverage machine in Douglas Adams’ “Hitchhiker’s Guide to the Galaxy” delivered “a cupful of liquid that was almost, but not quite, entirely unlike tea.”

All those cases are still in their early stages. “I don’t put a lot of stock in anyone who tells you how these cases are going to turn out,” Moss says.

Before wading into the legal morass these lawsuits are attempting to navigate, let’s take a quick look at how the technology is developed and why copyright has become an issue.

The models that are in the forefront of artificial intelligence research and development just now don’t think for themselves. They’re repositories of billions of articles, software lines and music or art made by humans. When asked a question, they ply through their database and try to synthesize from it the most probable answer. Often they get it right; often they get it wrong.

Sometimes they’re confused enough to output obvious errors, as Apple researchers found when asking the models to solve math problems written in plain English. Sometimes they show that they don’t know what they don’t know, and fill in the blanks in their knowledge with fabrications — or as AI developers call them, “hallucinations.”

As McMahon observed, the sheer volume of materials the bots draw from and the synthesizing process make it unlikely that any answer will replicate any specific content exactly.

That has been an obstacle for some of the plaintiffs in the copyright cases. Most of those claiming their written content has been infringed assert chiefly that the databases fed to some AI models are known to include their books or other writing. (At least one of the content repositories used by some AI developers includes three of my own books, but I’m not a party to any of the lawsuits.)

In its lawsuit, the New York Times cites text output by OpenAI’s ChatGPT-4 that reproduces portions of its articles verbatim, without credit or permission. (Microsoft, named as a defendant as an investor in OpenAI and a user of its technology, replied that the New York Times had effectively “coaxed” the chatbot to reproduce its texts by artfully framing its queries to elicit infringing answers.)

That brings us back to Raw Story Media’s lawsuit. The company, which operates the Raw Story and AlterNet news sites, didn’t fashion its claim as a copyright infringement complaint. Instead, it asserted that OpenAI had deliberately removed author, title and copyright labels — collectively known as copyright management information, or CMI — from the articles it imported to train its bots.

Raw Story argued that this process facilitated future infringement by leaving users unaware that they were receiving, and possibly distributing, copyrighted material without permission.

Deliberately removing CMI with the intention of fostering copyright violations is a direct violation of the 1998 Digital Millennium Copyright Act, which governs intellectual property rights of producers of digital content. Raw Story sought damages for OpenAI’s violation of the law and an injunction requiring the AI company to remove from its database all Raw Story content from which the CMI had been removed.

That’s where Raw Story ran into a roadblock erected by the Supreme Court. In a 5-4 decision involving the credit bureau TransUnion in 2021, the court declared that it is not enough for a plaintiff to sue over a defendant’s violation of a federal statute. To have the standing to bring a federal case, the court ruled, a plaintiff must show that they have suffered a “concrete harm” stemming from the violation.

Raw Story couldn’t make that showing: It produced no evidence that any of its content had been copied in answers to user queries, and therefore no evidence that it had suffered “concrete harm.” As a result, McMahon dismissed the lawsuit on grounds that Raw Story didn’t have standing to bring it.

Indeed, McMahon seemed irked at the thought that Raw Story was trying to pull a fast one. “Let’s be clear about what’s really at stake here,” she wrote. The supposed injury for which Raw Story was seeking relief, she wrote, “is not the exclusion of CMI” from OpenAI’s database, but the “use of Plaintiffs’ articles to develop Chat GPT without compensation for Plaintiffs.”

McMahon gave Raw Story the opportunity to refile its lawsuit to show that it was damaged by OpenAI’s acts. She didn’t sound sanguine, calling herself “skeptical” that the company will be able to allege a “cognizable injury.”

But Csathy contends that McMahon overlooked the possibility that her ruling might undermine the licensing market — if AI developers can remove CMI from training data with impunity, they might not feel any need to license copyrighted material in the future. “There’s some real substantial money there,” he says.

Raw Story may well cite the loss of licensing income as a “cognizable injury” if and when it files an amended complaint. That would be a new wrinkle in a field that at this point is virtually nothing but wrinkles.
