OpenAI’s ChatGPT and image generator DALL-E, as well as Google’s Bard and Stability AI’s Stable Diffusion, were all trained on billions of news articles, books, images, videos and blog posts scraped from the internet, much of which is copyrighted.
This past week, comedian Sarah Silverman filed a lawsuit against OpenAI and Facebook parent company Meta, alleging they used a pirated copy of her book in training data because the companies’ chatbots can summarize her book accurately. Novelists Mona Awad and Paul Tremblay filed a similar lawsuit against OpenAI. And more than 5,000 authors, including Jodi Picoult, Margaret Atwood and Viet Thanh Nguyen, have signed a petition asking tech companies to get consent from and give credit and compensation to writers whose books were used in training data.
Two class-action lawsuits were filed against OpenAI and Google, both alleging the companies violated the rights of millions of internet users by using their social media comments to train conversational AIs. And the Federal Trade Commission opened an investigation into whether OpenAI violated consumer rights with its data practices.
Meanwhile, Congress held the second of two hearings focusing on AI and copyright Wednesday, hearing from representatives of the music industry, Photoshop maker Adobe, Stability AI and concept artist and illustrator Karla Ortiz.
“These AI companies use our work as training data and raw materials for their AI models without consent, credit, or compensation,” Ortiz, who has worked on movies such as “Black Panther” and “Guardians of the Galaxy,” said in prepared remarks. “No other tool solely relies on the works of others to generate imagery. Not Photoshop, not 3D, not the camera, nothing comes close to this technology.”
The wave of lawsuits, high-profile complaints and proposed regulation could pose the biggest barrier yet to the adoption of “generative” AI tools, which have gripped the tech world since OpenAI launched ChatGPT to the public late last year and spurred executives from Microsoft, Google and other tech giants to declare the tech is the most important innovation since the advent of the mobile phone.
Artists say the livelihoods of millions of creative workers are at stake, especially because AI tools are already being used to replace some human-made work. Mass scraping of art, writing and movies from the web for AI training is a practice creators say they never considered or consented to.
But in public appearances and in responses to lawsuits, the AI companies have argued that the use of copyrighted works to train AI falls under fair use — a concept in copyright law that creates an exception if the material is changed in a “transformative” way.
“The AI models are basically learning from all of the information that’s out there. It’s akin to a student going and reading books in a library and then learning how to write and read,” Kent Walker, Google’s president of global affairs, said in an interview Friday. “At the same time you have to make sure that you’re not reproducing other people’s works and doing things that would be violations of copyright.”
Creators’ demands for more say over how their copyrighted content is used are part of a larger reckoning as AI shifts long-standing ground rules and norms for the internet. For years, websites have been happy to let Google and other tech giants scrape their data in exchange for showing up in search results or plugging into digital advertising networks, both of which helped them make money or reach new customers.
There are some precedents that could work in the tech companies’ favor, like a 1992 U.S. Appeals Court ruling that allowed companies to reverse engineer other firms’ software code to design competing products, said Andres Sawicki, a law professor at the University of Miami who studies intellectual property. But many people say there’s an intuitive unfairness to huge, wealthy companies using the work of creators to make new moneymaking tools without compensating anyone.
“The generative AI question is really hard,” he said.
The battle over who will benefit from AI is already getting contentious.
In Hollywood, AI has become a flash point for writers and actors who have recently gone on strike. Studio executives want to preserve the right to use AI to come up with ideas, write scripts and even replicate the voices and images of actors. Workers see AI as an existential threat to their livelihoods.
The content creators are finding allies among major social media companies, which have also seen the comments and discussions on their sites scraped and used to teach AI bots how human conversation works.
On Friday, Twitter owner Elon Musk said the website was contending with companies and organizations “illegally” scraping his site constantly, to the point where he decided to limit the number of tweets individual accounts could look at in an attempt to stop the mass scraping.
“We had multiple entities trying to scrape every tweet ever made,” Musk said.
Other social networks, including Reddit, have also tried to stop their content from being collected, beginning to charge millions of dollars for use of their application programming interfaces, or APIs — the technical gateways through which other apps and computer programs interact with social networks.
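In practice, putting an API behind a paywall means every request must carry a key tied to a paid plan, with heavy usage metered and cut off. The sketch below is purely illustrative — the plan names, quotas and status codes are hypothetical, not any real network’s API — but it shows the gatekeeping logic such a paid gateway performs.

```python
# Hypothetical sketch of a paid API gateway: requests must present a valid
# key, and each key's usage is metered against its plan's quota.

def handle_request(api_key, usage, plans):
    """Return an HTTP-style status code for one incoming API request."""
    if api_key not in plans:
        return 401  # unknown or unpaid key: reject outright
    if usage.get(api_key, 0) >= plans[api_key]["quota"]:
        return 429  # quota exhausted: throttle further requests
    usage[api_key] = usage.get(api_key, 0) + 1  # meter this request
    return 200  # serve the data

# Example: a plan allowing two requests before throttling kicks in.
plans = {"acme": {"quota": 2}}
usage = {}
```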
Some companies are being proactive in signing deals with AI companies to license their content for a fee. On Thursday, the Associated Press agreed to license its archive of news stories going back to 1985 to OpenAI. As part of the deal, the news organization will get access to OpenAI’s technology to experiment with using it in its own work.
A June statement released by Digital Content Next, a trade group that includes the New York Times and The Washington Post among other online publishers, said that the use of copyrighted news articles in AI training data would “likely be found to go far beyond the scope of fair use as set forth in the copyright act.”
“Creative professionals around the world use ChatGPT as a part of their creative process, and we have actively sought their feedback on our tools from day one,” said Niko Felix, a spokesman for OpenAI. “ChatGPT is trained on licensed content, publicly available content, and content created by human AI trainers and users.”
Spokespeople for Facebook and Microsoft declined to comment. A spokesperson for Stability AI did not return a request for comment.
“We’ve been clear for years that we use data from public sources — like information published to the open web and public data sets — to train the AI models behind services like Google Translate,” said Google General Counsel Halimah DeLaine Prado. “American law supports using public information to create new beneficial uses, and we look forward to refuting these baseless claims.”
Fair use is a strong defense for AI companies, because most outputs from AI models do not explicitly resemble the work of specific humans, Sawicki, the copyright law professor, said. But if creators suing the AI companies can show enough examples of AI outputs that are very similar to their own works, they will have a solid argument that their copyright is being violated, he said.
Companies could avoid that by building filters into their bots to make sure they don’t spit out anything that is too similar to an existing piece of art, Sawicki said. YouTube, for example, already uses technology to detect when copyrighted works are uploaded to its site and automatically takes them down. In theory, AI companies could build algorithms that could spot outputs that are highly similar to existing art, music or writing.
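A minimal sketch of such an output filter is below. It is an assumption for illustration only — production systems would use embeddings or fingerprinting more like YouTube’s Content ID — but it shows the basic idea: measure overlap between a generation and known protected works, and suppress near-duplicates.

```python
# Illustrative output filter: block generations that overlap too heavily
# with known copyrighted text, using word 5-gram overlap as a crude proxy
# for similarity. Real systems would use far more robust matching.

def ngrams(text, n=5):
    """Set of word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(candidate, reference, n=5):
    """Fraction of the candidate's n-grams that also appear in the reference."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand & ngrams(reference, n)) / len(cand)

def filter_output(generated, protected_works, threshold=0.5):
    """Return the text only if it isn't too similar to any protected work."""
    for work in protected_works:
        if overlap_score(generated, work) >= threshold:
            return None  # suppress the near-duplicate output
    return generated
```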
The computer science techniques that enable modern-day “generative” AI have been theorized for decades, but it wasn’t until Big Tech companies such as Google, Facebook and Microsoft combined their massive data centers of powerful computers with the huge amounts of data they had collected from the open internet that the bots began to show impressive capabilities.
By crunching through billions of sentences and captioned images, the companies have created “large language models” that can predict the most plausible thing to say or draw in response to any prompt, based on patterns in all the writing and images they’ve ingested.
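The core task can be shown with a toy model. The sketch below uses simple word-pair counts standing in for the billions of neural-network parameters in a real system — an assumption for illustration, not how production models work — but the job is the same: given the words so far, predict the likeliest continuation seen in training.

```python
# Toy illustration of next-word prediction. Real large language models use
# neural networks trained on billions of sentences; this bigram counter
# captures only the basic idea of learning continuations from data.
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count which word tends to follow each word in the training text."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent continuation seen in training, if any."""
    followers = counts.get(word.lower())
    return followers.most_common(1)[0][0] if followers else None

# Example: after seeing "the cat" twice in training, the model predicts
# "cat" as the likeliest word to follow "the".
counts = train_bigrams(["the cat sat on the mat", "the cat ate the fish"])
```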
In the future, AI companies will use more curated and controlled data sets to train their AI models, and the practice of throwing heaps of unfiltered data scraped from the open internet at models will be looked back on as “archaic,” said Margaret Mitchell, chief ethics scientist at AI start-up Hugging Face. Beyond the copyright problems, using open web data also introduces potential biases into the chatbots.
“It’s such a silly approach and an unscientific approach, not to mention an approach that hits on people’s rights,” Mitchell said. “The whole system of data collection needs to change, and it’s unfortunate that it needs to change via lawsuits, but that is often how tech operates.”
Mitchell said she wouldn’t be surprised if OpenAI has to delete one of its models completely by the end of the year because of lawsuits or new regulation.
OpenAI, Google and Microsoft do not release information on what data they use to train their models, saying that it could allow bad actors to replicate their work and use the AIs for malicious purposes.
A Post analysis of an older version of OpenAI’s main large language model showed that the company had used data from news sites, Wikipedia and a notorious database of pirated books that has since been seized by the Department of Justice.
Not knowing what exactly goes into the models makes it even harder for artists and writers to get compensation for their work, Ortiz, the illustrator, said during the Senate hearing.
“We need to ensure there’s clear transparency,” Ortiz said. “That is one of the starting foundations for artists and other individuals to be able to gain consent, credit and compensation.”