State-of-the-art AI systems can help you escape a parking ticket, write an academic essay or trick you into thinking that Pope Francis is a fashionista. But the virtual libraries behind this awesome technology are vast, and there are concerns that they are operating in violation of copyright and personal data laws.
The huge data sets used to train the latest generation of these AI systems, like those behind ChatGPT and Stable Diffusion, are likely to contain billions of images pulled from the internet, millions of pirated e-books, 16 years of proceedings of the European Parliament and the whole of English-language Wikipedia.
But the industry’s voracious appetite for big data is starting to cause problems, as regulators and courts around the world crack down on developers scraping content without consent or notice. In response, AI labs are fighting to keep their data sets secret, or even daring regulators to press the issue.
In Italy, ChatGPT has been banned after the country’s data protection regulator said there was no legal basis to justify the collection and “mass storage” of personal data used to train GPT’s AI. On Tuesday, Canada’s privacy commissioner followed suit with an investigation into the company in response to a complaint alleging “the collection, use and disclosure of personal information without consent.”
Britain’s data watchdog expressed concerns of its own. “Data protection law still applies when the personal information you are processing comes from publicly accessible sources,” said Stephen Almond, director of technology and innovation at the Information Commissioner’s Office.
Michael Wooldridge, a professor of computer science at Oxford University, says that “large language models” (LLMs), such as those behind OpenAI’s ChatGPT and Google’s Bard, suck up colossal amounts of data.
“This includes the entire world wide web, everything. Every link on every page is followed, and every link on those pages is followed… In that unimaginable amount of data, there is probably a lot of data about you and me,” he says, adding that comments about a person and their work could also be swept up by an LLM. “And it’s not stored in some big database somewhere where we can look to see exactly what information it has on me. It’s all buried in huge, opaque neural networks.”
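The crawl Wooldridge describes is, at heart, a breadth-first traversal of hyperlinks: fetch a page, follow every link on it, then follow every link on those pages in turn. A minimal sketch in Python gives a sense of the mechanism; the seed URL, page limit and helper names here are illustrative assumptions, not details of any real training pipeline.

```python
# Minimal sketch of a breadth-first web crawl, as described above.
# The seed URL and max_pages limit are illustrative, not from any real pipeline.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=100):
    """Breadth-first crawl: every link on every page is queued and followed."""
    queue, seen, corpus = deque([seed_url]), {seed_url}, {}
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue  # skip pages that fail to load or decode
        corpus[url] = html  # raw page text that would later be cleaned and tokenised
        parser = LinkCollector()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return corpus


# e.g. crawl("https://example.com") returns a dict mapping each visited URL to its HTML
```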
Wooldridge says copyright is a “gathering storm” for AI companies. LLMs may have accessed copyrighted material, such as news articles. In fact, the GPT-4-assisted chatbot attached to Microsoft’s Bing search engine cites news sites in its responses. “I didn’t give explicit permission for my work to be used as training data, but it almost certainly was, and now it contributes to what these models know,” he says.
“Many artists are very concerned that their livelihoods are at risk from generative AI. Expect to see legal battles,” he adds.
Lawsuits have already surfaced, with stock photo company Getty Images suing UK startup Stability AI, the company behind the AI image generator Stable Diffusion, claiming that it violated copyright by using millions of unlicensed Getty photos to train its system. In the US, a group of artists is suing Midjourney and Stability AI in a lawsuit that claims the companies “violated the rights of millions of artists” by developing their products using artists’ work without their permission.
Uncomfortably for Stability, Stable Diffusion occasionally spits out images with a Getty Images watermark intact, examples of which the photo agency included in its lawsuit. In January, researchers at Google even managed to get the Stable Diffusion system to near-perfectly recreate one of the unlicensed images it was trained on, a portrait of the American evangelist Anne Graham Lotz.
Copyright lawsuits and regulatory actions against OpenAI are hampered by the company’s absolute secrecy about its training data. In response to the Italian ban, Sam Altman, CEO of OpenAI, which developed ChatGPT, said: “We believe that we are following all privacy laws.” But the company has refused to share information about what data was used to train GPT-4, the latest version of the underlying technology that powers ChatGPT.
Even in its technical report describing the AI, the company says only briefly that it was trained “using publicly available data (such as data from the Internet) and data licensed from third-party providers.” Further information is withheld, it says, due to “both the competitive landscape and the safety implications of large-scale models like GPT-4.”
Others take the opposite approach. EleutherAI describes itself as a “nonprofit AI research lab” and was founded in 2020 with the goal of recreating GPT-3 and releasing it to the public. To that end, the group put together the Pile, an 825-gigabyte collection of data sets gathered from all corners of the internet. It includes 100GB of e-books taken from the pirate site Bibliotik, another 100GB of computer code scraped from GitHub, and a 228GB collection of websites gathered from across the internet since 2008; all of it, the group acknowledges, without the consent of the authors involved.
Eleuther argues that the data sets in the Pile have already been shared so widely that their compilation “does not constitute significantly more harm.” But the group doesn’t take the legal risk of directly hosting the data, instead turning to a group of anonymous “data enthusiasts” called the Eye, whose copyright takedown policy is a video of a chorus of clothed women pretending to masturbate imaginary penises while singing.
Some of the information chatbots produce has also been shown to be false. ChatGPT falsely accused an American law professor, Jonathan Turley of George Washington University, of sexually harassing one of his students, citing a nonexistent news article. The Italian regulator also pointed to the fact that ChatGPT’s responses do not “always match the factual circumstances” and that “inaccurate personal data is processed.”
An annual report on the progress of AI showed that commercial players dominated the industry, ahead of academic institutions and governments.
According to the AI Index Report 2023, compiled by California-based Stanford University, last year there were 32 major machine learning models produced by industry, compared to three produced by academia. Until 2014, most major models came from academia, but since then the cost of developing AI models, including staff and computing power, has increased.
“Overall, large language and multimodal models are getting bigger and more expensive,” the report says. An early iteration of the LLM behind ChatGPT, known as GPT-2, had 1.5 billion parameters, analogous to neurons in a human brain, and cost approximately $50,000 to train. By comparison, Google’s PaLM had 540 billion parameters and an estimated training cost of $8 million.
This has raised concerns that corporate entities take a less measured approach to risk than government-backed or academic projects. Last week, a letter whose signatories included Elon Musk and Apple co-founder Steve Wozniak called for an immediate pause of at least six months on “giant AI experiments.” The letter said there were concerns that tech companies were creating “increasingly powerful digital minds” that no one could “reliably understand, predict or control.”
Dr Andrew Rogoyski, from the Institute for Human-Centered AI at the University of Surrey in England, said: “Big AI means that these AIs are only being created by large, for-profit corporations, which unfortunately means that our interests as human beings are not necessarily well represented.”
He added: “We have to focus our efforts on making AI smaller, more efficient, requiring less data, less electricity, so that we can democratize access to AI.”