OpenAI, Google, and other tech companies train their chatbots with huge amounts of data drawn from books, Wikipedia articles, news, and other Internet sources. But in the future they hope to use something called synthetic data.
This is because technology companies may soon exhaust the high-quality text the Internet has to offer for developing artificial intelligence. The companies also face copyright lawsuits from authors, news organizations and computer programmers for using their works without permission. (In one such lawsuit, The New York Times sued OpenAI and Microsoft.)
They believe that synthetic data will help reduce copyright issues and boost the supply of training materials needed for A.I. Here's what you should know about it.
What is synthetic data?
It is data generated by artificial intelligence.
Does that mean tech companies want A.I. that is trained by A.I.?
Yes. Rather than training A.I. models with text written by people, tech companies like Google, OpenAI, and Anthropic hope to train their technology with data generated by other A.I. models.
Does synthetic data work?
Not exactly. A.I. models make mistakes and invent things. They have also been shown to pick up the biases that appear in the Internet data they were trained on. So if companies use A.I. to train A.I., they can end up amplifying their own flaws.
Is synthetic data widely used by tech companies right now?
No. Tech companies are experimenting with it. But because of the potential flaws of synthetic data, it is not a big part of the way A.I. systems are built today.
So why do tech companies say synthetic data is the future?
Companies believe they can refine the way synthetic data is created. OpenAI and others have explored a technique in which two different A.I. models work together to generate synthetic data that is more useful and reliable.
One A.I. model generates the data. A second model then judges the data, much as a human would, deciding whether it is good or bad, accurate or not. A.I. models are actually better at judging text than at writing it.
“If you give the technology two things, it is pretty good at picking which one looks better,” said Nathan Lile, chief executive of the artificial intelligence startup SynthLabs.
The idea is that this will provide the high-quality data needed to train an even better chatbot.
Does this technique work?
Somewhat. It all comes down to that second A.I. model. How good is it at judging text?
Anthropic has been the most vocal about its efforts to make this work. It fine-tunes the second A.I. model using a “constitution” curated by the company's researchers. This teaches the model to choose text that supports certain principles, such as freedom, equality and fraternity, or life, liberty and personal security. Anthropic's method is known as “Constitutional A.I.”
Here's how two A.I. models work together to produce synthetic data using a process like Anthropic's:
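The generate-and-judge loop can be sketched in a few lines of code. This is a toy illustration only: both "models" below are hypothetical stand-in functions, not real A.I. systems or any company's actual API, and the scoring rule is a deliberately crude substitute for a judge model ranking text against a written constitution.

```python
# Toy sketch of the two-model process described above.
# generator() and judge() are hypothetical stand-ins for two A.I. models;
# a real system would call large language models at both steps.

def generator(prompt: str) -> list[str]:
    """Stand-in for the first model: propose candidate texts."""
    return [
        f"{prompt}. Everyone deserves freedom and equality.",
        f"{prompt}. Do whatever it takes to win.",
    ]

def judge(candidate: str, constitution: list[str]) -> int:
    """Stand-in for the second model: score a text against the
    constitution's principles (here, crudely, by counting keywords)."""
    return sum(word in candidate.lower() for word in constitution)

def make_synthetic_example(prompt: str, constitution: list[str]) -> str:
    """Generate candidates, keep the one the judge prefers.
    The kept text becomes a piece of synthetic training data."""
    candidates = generator(prompt)
    return max(candidates, key=lambda c: judge(c, constitution))

constitution = ["freedom", "equality", "safety"]
print(make_synthetic_example("Write advice for a new manager", constitution))
```

Repeated at scale, the winners of each comparison accumulate into a synthetic training set that, the companies hope, reflects the constitution's principles better than raw Internet text would.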
Still, humans are needed to ensure the second ai model stays on track. That limits the amount of synthetic data this process can generate. And researchers disagree about whether a method like Anthropic's will continue to improve ai systems.
Does synthetic data help companies avoid using copyrighted information?
Not entirely. The A.I. models that generate synthetic data were themselves trained on human-created data, much of it copyrighted. So copyright holders can still argue that companies like OpenAI and Anthropic used copyrighted text, images, and videos without permission.
Jeff Clune, a computer science professor at the University of British Columbia who previously worked as a researcher at OpenAI, said A.I. models could ultimately become more powerful than the human brain in some ways. But they will get there because they learned from the human brain.
“To borrow from Newton: A.I. is seeing further by standing on the shoulders of giant human data sets,” he said.