With the LLM craze, exemplified by the widely popular GPT engines, every company, big or small, is racing either to develop a better model than the existing ones or to package current models in an innovative way that solves a problem.
Now, while finding use cases and building products around them is fine, the real questions are how we will train a model, whether it is actually better than existing models, what its impact will be, and what techniques we will use. This article addresses these questions and breaks down a troubling issue raised by the researchers.
Current GPT engines, such as ChatGPT, and other large language models, whether general-purpose or built for a specific niche, have been trained on publicly and widely accessible Internet data.
This gives us an idea of where the data comes from. The source is ordinary people reading, writing, tweeting, commenting, and reviewing information.
There are two widely accepted ways to increase how well a model works and how magical a non-technical person will find it. One is to grow the data that you train your model on. The second is to increase the number of parameters it uses. Think of the parameters as the internal weights the model learns, the knobs it tunes as it absorbs information about a topic.
Until now, models have worked with human-produced data in every form: audio, video, images, and text. Treated as one large corpus, this data was semantically authentic and contained both common and rare occurrences, what we often refer to as variety in the data. All the live flavors were intact. These models could therefore learn a realistic data distribution and be trained to predict not only the most probable (common) classes but also the less frequent classes or tokens.
Now this variety is threatened by the infusion of machine-generated data, for example an article written by an LLM or an image generated by an AI. And this problem is bigger than it seems at first glance, because it compounds over time.
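To make this concrete, here is a minimal, hypothetical sketch (not from the paper; the vocabulary and probabilities are invented for illustration): each generation estimates token frequencies from the previous generation's samples and then generates from that estimate. A rare token that happens to be missed in one generation's sample can never reappear in later ones.

```python
import random
from collections import Counter

random.seed(0)

# Invented toy vocabulary: one token is deliberately rare.
vocab = ["common", "uncommon", "rare"]
true_probs = {"common": 0.90, "uncommon": 0.09, "rare": 0.01}

def sample(probs, n):
    """'Generate' n tokens from the current model's distribution."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=n)

def estimate(samples):
    """'Train' the next model: estimate token probabilities from data."""
    counts = Counter(samples)
    total = len(samples)
    return {t: counts[t] / total for t in vocab}

# Each generation trains only on the previous generation's output.
probs = dict(true_probs)
for generation in range(5):
    data = sample(probs, 100)
    probs = estimate(data)
    print(generation, {t: round(p, 2) for t, p in probs.items()})
```

Once the estimated probability of "rare" hits zero in any generation, every subsequent generation inherits that zero: the tail of the distribution is gone for good.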
Now, according to the researchers, this problem is quite frequent and dangerously impactful in models that follow a continual learning process. Unlike traditional machine learning, which learns from a static data distribution, continual learning attempts to learn from a dynamic distribution, where data arrives sequentially. Such approaches tend to be task-based: data is supplied with delineated task boundaries, for example classifying dogs versus cats, or recognizing handwritten digits. The setting considered here is closer to task-free continual learning, where data distributions change gradually without any notion of separate tasks.
Model collapse is a degenerative process affecting generations of learned generative models: the generated data contaminates the training set of the next generation of models, and being trained on contaminated data, they misperceive reality. Model collapse is thus a direct consequence of data poisoning, where data poisoning, in broader terms, means anything that leads to training data that does not accurately represent reality. The researchers used several mathematically tractable models that mimic LLMs to show how real this problem is and how it grows over time. As their results show, almost all LLMs are susceptible to it.
Now that we know what the problem is and what is causing it, the obvious question is: how do we solve it? The answer is quite simple and is also suggested by the article.
- Maintain the authenticity of the content: keep it real.
- Add more collaborators to review training data to ensure realistic data distribution.
- Regulate the use of machine-generated data as training data.
With all of this, the paper highlights how worrisome this seemingly insignificant issue can be: training LLMs from scratch is expensive, so most organizations use pre-trained models as a starting point to some degree.
Now even critical services, such as life-science use cases, supply chain management, and even the entire content industry, are rapidly moving to LLMs for their regular assignments and suggestions; it will be interesting to see how LLM developers keep the data realistic and continually improve their models.
Check out the Paper.
Anant is a Computer Science Engineer currently working as a Data Scientist with a background in Finance and AI-as-a-Service products. He is interested in creating AI-powered solutions that create better data points and solve everyday problems in powerful and efficient ways.