In machine learning, more data usually means better results, but it also means more expensive and time-consuming labeling. What if we could use the huge amounts of unlabeled data that are typically easy to obtain? This is where pseudo-labeling comes in handy.
TL;DR: I conducted a case study on the MNIST dataset and improved my model's accuracy from 90% to 95% by applying iterative, confidence-based pseudo-labeling. This article covers what pseudo-labeling is, along with practical tips and insights from my experiments.
Pseudo-labeling is a type of semi-supervised learning. It bridges the gap between supervised learning (where all data is labeled) and unsupervised learning (where none of the data is labeled).
The procedure I followed was as follows (a minimal code sketch appears after the list):
- We start with a small amount of labeled data and train our model on it.
- The model makes predictions on unlabeled data.
- We select the predictions in which the model is most confident (for example, above 95% confidence) and treat them as if they were real labels, hoping they are trustworthy enough.
- We add this “pseudo-labeled” data to our training set and retrain the model.
- We can repeat this process multiple times, allowing the model to learn from the growing set of pseudo-labeled data.
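To make this concrete, here is a minimal sketch of such a loop. It uses scikit-learn and synthetic data rather than the exact model and dataset from the case study; the classifier, the 0.95 threshold, and the data split are illustrative assumptions.

```python
# Minimal pseudo-labeling loop sketch (illustrative; not the exact code from the case study).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: a small labeled set, a large unlabeled pool, and a test set.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_labeled, X_unlabeled = X_train[:200], X_train[200:]
y_labeled = y_train[:200]                      # the labels of the "unlabeled" pool are discarded

THRESHOLD = 0.95                               # confidence threshold for accepting pseudo-labels
for iteration in range(10):
    model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    print(f"iter {iteration}: test accuracy = {model.score(X_test, y_test):.3f}")

    if len(X_unlabeled) == 0:
        break
    proba = model.predict_proba(X_unlabeled)   # predicted class probabilities
    confident = proba.max(axis=1) >= THRESHOLD # keep only high-confidence predictions
    if not confident.any():
        break

    # Add confident predictions to the training set as pseudo-labels ...
    X_labeled = np.vstack([X_labeled, X_unlabeled[confident]])
    y_labeled = np.concatenate([y_labeled, proba[confident].argmax(axis=1)])
    # ... and remove them from the unlabeled pool.
    X_unlabeled = X_unlabeled[~confident]
```

Retraining from scratch in every iteration keeps the sketch simple; fine-tuning the previous model is another option.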
While this approach may introduce some incorrect labels, the benefit comes from the significantly larger amount of training data.
The idea of a model learning from its own predictions may raise some eyebrows. After all, aren’t we trying to create something from nothing, relying on an “echo chamber” in which the model simply reinforces its own initial biases and errors?
This concern is valid. You might be reminded of the legendary Baron Münchhausen, who claimed to have pulled himself (and his horse) out of a swamp by his own hair – something physically impossible. Similarly, if a model relies solely on its own potentially erroneous predictions, it risks getting caught in a self-reinforcing loop, like people trapped in echo chambers who hear only their own beliefs reflected back at them.
So can pseudo-labeling really be effective without falling into this trap?
The answer is yes. Although the story of Baron Münchhausen is obviously a tall tale, it is possible to imagine a blacksmith who advances through the ages. He starts with basic stone tools (the initial labeled data). With these, he forges rudimentary copper tools (pseudo-labels) from raw ore (unlabeled data). These copper tools, while still crude, allow him to take on previously unviable tasks, which ultimately lead to tools made of bronze, iron, and so on. This iterative process is crucial: steel swords cannot be forged with a stone hammer.
Just like the blacksmith, we can achieve a similar progression in machine learning through:
- Rigorous thresholds: The out-of-sample accuracy of the model is limited by the proportion of correct training labels. If 10% of the labels are incorrect, the model's accuracy will not significantly exceed 90%. Therefore, it is important to let as few incorrect labels as possible slip in.
- Measurable feedback: Constantly evaluating model performance on a separate test set acts as a reality check, ensuring that we are making real progress and not just reinforcing existing errors.
- Human in the loop: Incorporating human feedback, in the form of manual review of pseudo-labels or manual labeling of low-confidence data, can provide valuable course correction (a small routing sketch follows this list).
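As a small illustration of the human-in-the-loop point, confidence scores can be used to decide which samples are pseudo-labeled automatically and which are sent for manual review. The `proba` array and the 0.95 threshold below are illustrative assumptions:

```python
import numpy as np

# proba: predicted class probabilities for the unlabeled pool, shape (n_samples, n_classes).
# A tiny fabricated example; in practice this comes from the model's probability outputs.
proba = np.array([[0.98, 0.01, 0.01],
                  [0.40, 0.35, 0.25],
                  [0.05, 0.94, 0.01]])

THRESHOLD = 0.95
confidence = proba.max(axis=1)

auto_idx = np.where(confidence >= THRESHOLD)[0]   # safe to pseudo-label automatically
review_idx = np.where(confidence < THRESHOLD)[0]  # candidates for manual review / labeling

print("pseudo-label automatically:", auto_idx)    # -> [0]
print("send to human annotator:  ", review_idx)   # -> [1 2]
```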
Pseudo-labeling, when done correctly, can be a powerful tool for making the most of small labeled datasets, as we will see in the following case study.
I performed my experiments on the MNIST dataset, a classic collection of 28×28 pixel handwritten digit images widely used to evaluate machine learning models. It consists of 60,000 training images and 10,000 test images. The goal is to predict, from the 28×28 pixels, which digit is written.
I trained a simple CNN on an initial set of 1,000 labeled images, leaving 59,000 unlabeled. I then used the trained model to predict the labels of the unlabeled images. Predictions with confidence above a certain threshold (e.g. 95%) were added to the training set, along with their predicted labels. The model was then retrained on this expanded dataset. This process was repeated iteratively, up to ten times or until there was no more unlabeled data.
This experiment was repeated with different numbers of initially labeled images and confidence thresholds.
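To make the setup concrete, here is a condensed sketch of the data split and a simple CNN of the kind described. The exact architecture and training hyperparameters are assumptions, not necessarily those used in my experiments; the iterative loop is the one sketched earlier, with the softmax outputs of `model.predict` serving as confidences.

```python
import tensorflow as tf

# MNIST: 60,000 training images and 10,000 test images of 28x28 grayscale pixels.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train[..., None] / 255.0, x_test[..., None] / 255.0

# Keep 1,000 labeled images; treat the remaining 59,000 as unlabeled.
x_labeled, y_labeled = x_train[:1000], y_train[:1000]
x_unlabeled = x_train[1000:]                  # their labels are ignored from here on

# A simple CNN; the architecture here is an illustrative assumption.
def make_cnn():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),  # softmax outputs double as confidences
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = make_cnn()
model.fit(x_labeled, y_labeled, epochs=5, verbose=0)
print("test accuracy:", model.evaluate(x_test, y_test, verbose=0)[1])
# model.predict(x_unlabeled) then feeds the confidence-threshold selection from the loop above.
```

From here, the confidence-threshold selection and retraining are repeated on the expanded labeled set, as described above.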
Results
The following table summarizes the results of my experiments, comparing the performance of pseudo-labeling to training on the full labeled dataset.
Even with a small set of initially labeled data, pseudo-labeling can produce remarkable results, increasing accuracy by 4.87% for 1,000 initial labeled samples. When only 100 initial samples are used, the effect is even stronger. In practice, however, it would be prudent to manually label more than 100 samples.
Interestingly, the final test accuracy of the experiment with 100 initial training samples exceeded the proportion of correct training labels.
Looking at the graphs above, it becomes clear that, overall, higher thresholds lead to better results, provided that at least some predictions exceed the threshold. In future experiments, one could try varying the threshold with each iteration.
Furthermore, accuracy improves even in later iterations, indicating that the iterative nature provides a real benefit.
- Pseudo-labeling is best applied when unlabeled data is abundant but labeling is expensive.
- Monitor test accuracy: It is important to track the model's performance on a separate test dataset across iterations.
- Manual labeling can still be useful: If you have the resources, focus on manually labeling low-confidence data. However, humans are not perfect either, and labeling high-confidence data can be safely delegated to the model.
- Keep track of AI-generated labels: If more manually labeled data becomes available later, you will probably want to discard the pseudo-labels and repeat the procedure, which will produce more accurate pseudo-labels (see the sketch after this list).
- Be careful when interpreting the results: When I first did this experiment a few years ago, I focused on the accuracy of the remaining unlabeled training data. This accuracy falls with more iterations! However, this is likely because the remaining data is harder to predict – the model was never sure about it in earlier iterations. I should have focused on the test accuracy, which actually improves with more iterations.
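One simple way to keep track of AI-generated labels is to store their provenance alongside the labels themselves; the array names below are illustrative:

```python
import numpy as np

# Keep provenance alongside the labels so pseudo-labels can be discarded or refreshed later.
y_manual = np.array([3, 1, 7])            # manually labeled examples
y_pseudo = np.array([0, 4])               # labels produced by the model

y_all = np.concatenate([y_manual, y_pseudo])
is_pseudo = np.concatenate([np.zeros(len(y_manual), dtype=bool),
                            np.ones(len(y_pseudo), dtype=bool)])

# Later, when more manual labels arrive, drop the pseudo-labels and rerun the procedure.
y_trusted = y_all[~is_pseudo]
```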
The repository containing the experiment code can be found here.
Related paper: Iterative Pseudo-Labeling with Deep Feature Annotation and Confidence-Based Sampling