In machine learning, more data usually means better results, but it also means more expensive and time-consuming labeling. What if we could use the huge amounts of unlabeled data that are typically easy to obtain? This is where pseudo-labeling comes in handy.
TL;DR: I conducted a case study on the MNIST dataset and improved my model's accuracy from 90% to 95% by applying iterative, confidence-based pseudo-labeling. This article covers what pseudo-labeling is, along with practical tips and insights from my experiments.
Pseudo-labeling is a type of semi-supervised learning. It bridges the gap between supervised learning (where all data is labeled) and unsupervised learning (where none of the data is labeled).
The procedure I followed was as follows (a minimal code sketch appears after the list):
- We start with a small amount of labeled data and train our model on it.
- The model makes predictions on unlabeled data.
- We select the predictions in which the model is most confident (for example, above 95% confidence) and treat them as if they were real labels, hoping they are trustworthy enough.
- We add this “pseudo-labeled” data to our training set and retrain the model.
- We can repeat this process multiple times, allowing the model to learn from the growing set of pseudo-labeled data.
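To make this concrete, here is a minimal sketch of such a loop. It uses scikit-learn and synthetic data rather than the exact model and dataset from the case study; the classifier, the 0.95 threshold, and the data split are illustrative assumptions.

```python
# Minimal pseudo-labeling loop sketch (illustrative; not the exact code from the case study).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: a small labeled set, a large unlabeled pool, and a test set.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_labeled, X_unlabeled = X_train[:200], X_train[200:]
y_labeled = y_train[:200]                      # the labels of the "unlabeled" pool are discarded

THRESHOLD = 0.95                               # confidence threshold for accepting pseudo-labels
for iteration in range(10):
    model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    print(f"iter {iteration}: test accuracy = {model.score(X_test, y_test):.3f}")

    if len(X_unlabeled) == 0:
        break
    proba = model.predict_proba(X_unlabeled)   # predicted class probabilities
    confident = proba.max(axis=1) >= THRESHOLD # keep only high-confidence predictions
    if not confident.any():
        break

    # Add confident predictions to the training set as pseudo-labels ...
    X_labeled = np.vstack([X_labeled, X_unlabeled[confident]])
    y_labeled = np.concatenate([y_labeled, proba[confident].argmax(axis=1)])
    # ... and remove them from the unlabeled pool.
    X_unlabeled = X_unlabeled[~confident]
```

Retraining from scratch in every iteration keeps the sketch simple; fine-tuning the previous model is another option.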
While this approach may introduce some incorrect labels, the benefit comes from the significantly larger amount of training data.
The idea of a model learning from its own predictions may raise some eyebrows. After all, aren’t we trying to create something from nothing, relying on an “echo chamber” in which the model simply reinforces its own initial biases and errors?
This concern is valid. You might be reminded of the legendary Baron Münchhausen, who claimed to have pulled himself (and his horse) out of a swamp by his own hair – something physically impossible. Similarly, if a model relies solely on its own potentially erroneous predictions, it risks getting caught in a self-reinforcing loop, like people trapped in echo chambers who hear only their own beliefs reflected back at them.
So can pseudo-labeling really be effective without falling into this trap?
The answer is yes. Although the story of Baron Münchhausen is obviously a tall tale, it is possible to imagine a blacksmith who advances through the ages. He starts with basic stone tools (the initial labeled data). With these, he forges rudimentary copper tools (pseudo-labels) from raw ore (unlabeled data). These copper tools, while still crude, allow him to take on previously unviable tasks, which ultimately lead to tools made of bronze, iron, and so on. This iterative process is crucial: steel swords cannot be forged with a stone hammer.
Just like the blacksmith, we can achieve a similar progression in machine learning through:
- Rigorous thresholds: The out-of-sample accuracy of the model is limited by the proportion of correct training labels. If 10% of the labels are incorrect, the model's accuracy will not significantly exceed 90%. Therefore, it is important to let as few incorrect labels as possible slip in.
- Measurable feedback: Constantly evaluating model performance on a separate test set acts as a reality check, ensuring that we are making real progress and not just reinforcing existing errors.
- Human in the loop: Incorporating human feedback, in the form of manual review of pseudo-labels or manual labeling of low-confidence data, can provide valuable course correction (a small routing sketch follows this list).
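As a small illustration of the human-in-the-loop point, confidence scores can be used to decide which samples are pseudo-labeled automatically and which are sent for manual review. The `proba` array and the 0.95 threshold below are illustrative assumptions:

```python
import numpy as np

# proba: predicted class probabilities for the unlabeled pool, shape (n_samples, n_classes).
# A tiny fabricated example; in practice this comes from the model's probability outputs.
proba = np.array([[0.98, 0.01, 0.01],
                  [0.40, 0.35, 0.25],
                  [0.05, 0.94, 0.01]])

THRESHOLD = 0.95
confidence = proba.max(axis=1)

auto_idx = np.where(confidence >= THRESHOLD)[0]   # safe to pseudo-label automatically
review_idx = np.where(confidence < THRESHOLD)[0]  # candidates for manual review / labeling

print("pseudo-label automatically:", auto_idx)    # -> [0]
print("send to human annotator:  ", review_idx)   # -> [1 2]
```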
Pseudo-labeling, when done correctly, can be a powerful tool for making the most of small labeled datasets, as we will see in the following case study.
I performed my experiments on the MNIST dataset, a classic collection of 28×28 pixel handwritten digit images widely used to evaluate machine learning models. It consists of 60,000 training images and 10,000 test images. The goal is to predict, from the 28×28 pixels, which digit is written.
I trained a simple CNN on an initial set of 1,000 labeled images, leaving 59,000 unlabeled. I then used the trained model to predict the labels of the unlabeled images. Predictions with confidence above a certain threshold (e.g. 95%) were added to the training set, along with their predicted labels. The model was then retrained on this expanded dataset. This process was repeated iteratively, up to ten times or until there was no more unlabeled data.
This experiment was repeated with different numbers of initially labeled images and confidence thresholds.
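To make the setup concrete, here is a condensed sketch of the data split and a simple CNN of the kind described. The exact architecture and training hyperparameters are assumptions, not necessarily those used in my experiments; the iterative loop is the one sketched earlier, with the softmax outputs of `model.predict` serving as confidences.

```python
import tensorflow as tf

# MNIST: 60,000 training images and 10,000 test images of 28x28 grayscale pixels.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train[..., None] / 255.0, x_test[..., None] / 255.0

# Keep 1,000 labeled images; treat the remaining 59,000 as unlabeled.
x_labeled, y_labeled = x_train[:1000], y_train[:1000]
x_unlabeled = x_train[1000:]                  # their labels are ignored from here on

# A simple CNN; the architecture here is an illustrative assumption.
def make_cnn():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),  # softmax outputs double as confidences
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = make_cnn()
model.fit(x_labeled, y_labeled, epochs=5, verbose=0)
print("test accuracy:", model.evaluate(x_test, y_test, verbose=0)[1])
# model.predict(x_unlabeled) then feeds the confidence-threshold selection from the loop above.
```

From here, the confidence-threshold selection and retraining are repeated on the expanded labeled set, as described above.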
Results
The following table summarizes the results of my experiments, comparing the performance of pseudo-labeling to training on the full labeled dataset.
Even with a small set of initially labeled data, pseudo-labeling can produce remarkable results, increasing accuracy by 4.87% for 1,000 initial labeled samples. When only 100 initial samples are used, the effect is even stronger. In practice, however, it would be prudent to manually label more than 100 samples.
Interestingly, the final test accuracy of the experiment with 100 initial training samples exceeded the proportion of correct training labels.
Looking at the graphs above, it becomes clear that, overall, higher thresholds lead to better results, provided that at least some predictions exceed the threshold. In future experiments, one could try varying the threshold with each iteration.
Furthermore, accuracy improves even in later iterations, indicating that the iterative nature provides a real benefit.
- Pseudo-labeling is best applied when unlabeled data is abundant but labeling is expensive.
- Monitor test accuracy: It is important to track the model's performance on a separate test dataset across iterations.
- Manual labeling can still be useful: If you have the resources, focus on manually labeling low-confidence data. However, humans are not perfect either, and labeling high-confidence data can be safely delegated to the model.
- Keep track of AI-generated labels: If more manually labeled data becomes available later, you will probably want to discard the pseudo-labels and repeat the procedure, which will produce more accurate pseudo-labels (see the sketch after this list).
- Be careful when interpreting the results: When I first did this experiment a few years ago, I focused on the accuracy of the remaining unlabeled training data. This accuracy falls with more iterations! However, this is likely because the remaining data is harder to predict – the model was never sure about it in earlier iterations. I should have focused on the test accuracy, which actually improves with more iterations.
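One simple way to keep track of AI-generated labels is to store their provenance alongside the labels themselves; the array names below are illustrative:

```python
import numpy as np

# Keep provenance alongside the labels so pseudo-labels can be discarded or refreshed later.
y_manual = np.array([3, 1, 7])            # manually labeled examples
y_pseudo = np.array([0, 4])               # labels produced by the model

y_all = np.concatenate([y_manual, y_pseudo])
is_pseudo = np.concatenate([np.zeros(len(y_manual), dtype=bool),
                            np.ones(len(y_pseudo), dtype=bool)])

# Later, when more manual labels arrive, drop the pseudo-labels and rerun the procedure.
y_trusted = y_all[~is_pseudo]
```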
The repository containing the experiment code can be found here.
Related paper: Iterative Pseudo-Labeling with Deep Feature Annotation and Confidence-Based Sampling