Online data has long been a valuable commodity. For years, Meta and Google have used data to target their online advertising. Netflix and Spotify have used it to recommend more movies and music. Political candidates have turned to data to know which groups of voters to focus their attention on.
Over the past 18 months, it has become increasingly clear that digital data is also crucial to the development of artificial intelligence. Here is what to know.
The more data, the better.
The success of AI depends on data. That is because AI models become more accurate and more humanlike with more data.
In the same way that a student learns by reading more books, essays, and other information, large language models (the systems that underlie chatbots) also become more accurate and powerful if they are fed more data.
Some large language models, such as OpenAI's GPT-3, released in 2020, were trained on hundreds of billions of “tokens,” which are essentially words or fragments of words. More recent large language models were trained on more than three trillion tokens.
Online data is a valuable and finite resource.
Technology companies are using up publicly available online data to develop their AI models faster than new data is being produced. According to one prediction, high-quality digital data will be exhausted by 2026.
Tech companies are doing everything they can to get more data.
In the race for more data, OpenAI, Google, and Meta are turning to new tools, changing their terms of service, and engaging in internal debates.
At OpenAI, researchers created a program in 2021 that converted the audio of YouTube videos into text and then fed the transcripts into one of its AI models, going against YouTube's terms of service, people with knowledge of the development said.
(The New York Times has sued OpenAI and Microsoft for using copyrighted news articles without permission for AI development. OpenAI and Microsoft have said they used news articles in transformative ways that did not violate copyright law.)
Google, which owns YouTube, also used YouTube data to develop its artificial intelligence models, wading into a legal gray area of copyright, people with knowledge of the action said. And Google revised its privacy policy last year so it could use publicly available material to develop more artificial intelligence products.
At Meta, executives and lawyers last year debated how to get more data for AI development and discussed buying a major publisher like Simon & Schuster. In private meetings, they weighed putting copyrighted works into their AI model, even if it meant they would be sued later, according to recordings of the meetings, which were obtained by The Times.
One solution may be “synthetic” data.
OpenAI, Google and other companies are exploring using their AI to create more data. The result would be what is known as “synthetic” data. The idea is that AI models generate new text that can then be used to build better AI.
Synthetic data is risky because AI models can make mistakes. Relying on such data can compound those errors.