Data is at the heart of today's advanced AI systems, but it's costing more and more, putting it out of reach for all but the wealthiest tech companies.
Last year, OpenAI researcher James Betker wrote a post on his personal blog about the nature of generative AI models and the datasets on which they're trained. In it, Betker claimed that training data, not a model's design, architecture or any other characteristic, is the key to increasingly sophisticated and capable AI systems.
“If trained on the same data set long enough, virtually all models converge to the same point,” Betker wrote.
Is Betker right? Is training data the biggest determinant of what a model can do, whether answering a question, drawing human hands, or generating a realistic cityscape?
It is certainly plausible.
Statistical machines
Generative AI systems are basically probabilistic models: a huge pile of statistics. They guess, based on vast numbers of examples, which data makes the most “sense” to place where (for example, the word “go” before “to the market” in the sentence “I go to the market”). It seems intuitive, then, that the more examples a model has to draw on, the better its performance.
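To make that intuition concrete, here is a minimal sketch in Python of “guessing the next word from statistics.” It's a toy bigram counter over an invented three-sentence corpus, nothing like a real generative model, but it shows the same basic idea: the more examples the counter sees, the more reliable its guesses become.

```python
from collections import Counter, defaultdict

# Invented toy corpus for this sketch.
corpus = [
    "i go to the market",
    "i go to the park",
    "we go to the market",
]

# "Training": count how often each word follows each preceding word.
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1

def most_likely_next(word):
    """Return the word most often observed after `word` in the examples."""
    return following[word].most_common(1)[0][0]

print(most_likely_next("go"))   # -> "to" (follows "go" in all three sentences)
print(most_likely_next("the"))  # -> "market" (seen twice, vs. "park" once)
```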
“It appears that the performance improvements are coming from data,” Kyle Lo, senior applied research scientist at the Allen Institute for AI (AI2), a nonprofit AI research organization, told TechCrunch, “at least once you have a stable training setup.”
Lo gave the example of Meta's Llama 3, a text-generating model released earlier this year, which outperforms AI2's own OLMo model despite the two being architecturally very similar. Llama 3 was trained on far more data than OLMo, which Lo believes explains its superiority on many popular AI benchmarks.
(I'll note here that the benchmarks in wide use in the AI industry today aren't necessarily the best gauge of a model's performance, but outside of qualitative tests like our own, they're one of the few measures we have to go on.)
That's not to say that training on exponentially larger datasets is a sure path to exponentially better models. Models operate on a “garbage in, garbage out” paradigm, Lo notes, so data curation and quality matter a great deal, perhaps more than sheer quantity.
“It is possible for a small model with carefully designed data to outperform a large model,” he added. “For example, Falcon 180B, a large model, ranks 63rd on the LMSYS benchmark, while Llama 2 13B, a much smaller model, ranks 56th.”
In an interview with TechCrunch last October, OpenAI researcher Gabriel Goh said that higher-quality annotations contributed greatly to the improved image quality in DALL-E 3, OpenAI's text-to-image model, compared with its predecessor, DALL-E 2. “I think this is the main source of the improvements,” he said. “The text annotations are a lot better than they were (with DALL-E 2); it's not even comparable.”
Many AI models, including DALL-E 3 and DALL-E 2, are trained by having human annotators label data so that a model learns to associate those labels with other observed characteristics of that data. For example, a model that's fed many pictures of cats with annotations for each breed will eventually “learn” to associate terms like bobtail and shorthair with their distinctive visual traits.
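As a rough sketch of that association step, everything below is invented for illustration: real models learn features from raw pixels rather than hand-named attributes, and this is not OpenAI's actual pipeline. But you can think of training on annotations as accumulating label-to-feature co-occurrence statistics.

```python
from collections import Counter, defaultdict

# Invented toy dataset: stand-in "visual features" paired with human labels.
annotated = [
    ({"short_tail": True,  "long_fur": False}, "bobtail"),
    ({"short_tail": True,  "long_fur": False}, "bobtail"),
    ({"short_tail": False, "long_fur": False}, "shorthair"),
    ({"short_tail": False, "long_fur": False}, "shorthair"),
]

# "Training": count how often each feature value co-occurs with each label.
cooccurrence = defaultdict(Counter)
for features, label in annotated:
    for name, value in features.items():
        cooccurrence[(name, value)][label] += 1

def predict(features):
    """Vote for the label most associated with the observed features."""
    votes = Counter()
    for name, value in features.items():
        votes.update(cooccurrence[(name, value)])
    return votes.most_common(1)[0][0]

print(predict({"short_tail": True, "long_fur": False}))  # -> "bobtail"
```

With more annotated examples, uninformative features (here, fur length, which both breeds share) wash out, while genuinely distinctive ones dominate the vote, which is why annotation quality and volume matter so much.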
Misbehavior
Experts like Lo worry that the growing emphasis on large, high-quality training datasets will centralize AI development among the few players with billion-dollar budgets that can afford to acquire these sets. Major innovation in synthetic data or fundamental architecture could upend the status quo, but neither appears to be on the near horizon.
“In general, entities controlling content that's potentially useful for AI development are incentivized to lock up their materials,” Lo said. “And as access to data closes off, we're basically blessing a few early movers on data acquisition and pulling up the ladder so that no one else can get access to the data to catch up.”
Indeed, where the race for more training data hasn't led to unethical (and perhaps even illegal) behavior, like secretly aggregating copyrighted content, it has rewarded tech giants with deep pockets to spend on data licensing.
Generative AI models, such as OpenAI's, are trained mostly on images, text, audio, videos and other data, some of it copyrighted, scraped from public web pages (including, problematically, AI-generated data). The OpenAIs of the world claim that fair use shields them from legal reprisal. Many rights holders disagree, but, at least for now, there's little they can do to prevent the practice.
There are many, many examples of generative AI vendors acquiring massive datasets through questionable means in order to train their models. OpenAI reportedly transcribed more than a million hours of YouTube videos without YouTube's blessing, or the blessing of creators, to power its flagship GPT-4 model. Google recently expanded its terms of service in part to allow it to tap public Google Docs, restaurant reviews on Google Maps and other online material for its AI products. And Meta reportedly considered risking lawsuits to train its models on IP-protected content.
Meanwhile, companies large and small are relying on workers in third-world countries paid only a few dollars per hour to create annotations for training sets. Some of these annotators, employed by gigantic startups like Scale AI, work literal days on end to complete tasks that expose them to graphic depictions of violence and bloodshed, without any benefits or guarantees of future gigs.
Rising costs
In other words, even the most aboveboard data deals aren't exactly fostering an open and equitable generative AI ecosystem.
OpenAI has spent hundreds of millions of dollars licensing content from news publishers, stock media libraries and more to train its AI models, a budget far beyond that of most academic research groups, nonprofits and startups. Meta has gone so far as to weigh acquiring the publisher Simon & Schuster for the rights to e-book excerpts (ultimately, Simon & Schuster sold to private equity firm KKR for $1.62 billion in 2023).
With the market for AI training data expected to grow from roughly $2.5 billion today to close to $30 billion within a decade, data brokers and platforms are rushing to charge top dollar, in some cases over the objections of their user bases.
Stock media library Shutterstock has inked deals with AI vendors ranging from $25 million to $50 million, while Reddit claims to have made hundreds of millions from licensing data to organizations like Google and OpenAI. Few platforms with rich data accumulated organically over the years haven't reportedly signed deals with generative AI developers, from Photobucket to Tumblr to the question-and-answer site Stack Overflow.
The data is the platforms' to sell, at least depending on which legal arguments one believes. But in most cases, users aren't seeing a dime of the profits. And it's harming the wider AI research community.
“Smaller players won't be able to afford these data licenses, and therefore won't be able to develop or study AI models,” Lo said. “I worry this could lead to a lack of independent scrutiny of AI development practices.”
Independent efforts
If there's a ray of sunshine through the gloom, it's the few independent, nonprofit efforts to create massive datasets anyone can use to train a generative AI model.
EleutherAI, a nonprofit grassroots research group that started as a Discord collective in 2020, is working with the University of Toronto, AI2 and independent researchers to create The Pile v2, a set of billions of text passages sourced primarily from the public domain.
In April, AI startup Hugging Face released FineWeb, a filtered version of the Common Crawl (the eponymous dataset maintained by the nonprofit Common Crawl, composed of billions upon billions of web pages) that Hugging Face claims improves model performance on many benchmarks.
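Hugging Face's actual pipeline is far more elaborate than anything that fits in a few lines, so the sketch below only gestures at the kind of heuristic quality filtering that such datasets apply to raw web crawls; the thresholds and rules here are invented for illustration.

```python
def looks_like_quality_text(page_text):
    """Crude, invented heuristics of the kind web-scale filters rely on."""
    words = page_text.split()
    if len(words) < 50:  # drop near-empty pages
        return False
    if len(set(words)) / len(words) < 0.3:  # drop highly repetitive boilerplate
        return False
    if sum(w.isalpha() for w in words) / len(words) < 0.6:  # drop markup-heavy pages
        return False
    return True

# Imagine billions of crawled pages here; both toy examples below are
# too short to pass the length check, so `kept` ends up empty.
crawled_pages = ["<html>buy buy buy</html>", "a short but coherent sentence"]
kept = [page for page in crawled_pages if looks_like_quality_text(page)]
print(kept)  # -> []
```

Filters like these are cheap enough to run over billions of pages, which is part of why curation at this scale is a resource game in the first place.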
Some efforts to release open training datasets, like the group LAION's image sets, have run up against copyright, data privacy and other, equally serious, ethical and legal challenges. But some of the more dedicated data curators have pledged to do better. The Pile v2, for example, removes problematic copyrighted material found in its progenitor dataset, The Pile.
The question is whether any of these open efforts can hope to keep pace with Big Tech. As long as data collection and curation remain a matter of resources, the answer is likely no, at least not until some research breakthrough levels the playing field.