DATA PREPROCESSING
Compiling a data set in which every class to predict has exactly the same number of examples can be challenging. In reality, data is rarely perfectly balanced, and this can be a problem when creating a classification model. When a model is trained on a data set where one class has more examples than the other, it generally becomes better at predicting the larger group and worse at predicting the smaller one. To help with this problem, we can use tactics like oversampling and undersampling: creating more examples for the smaller group or removing some examples from the larger group.
There are many different oversampling and undersampling methods (with intimidating names like SMOTE, ADASYN, and Tomek Links), but there don't seem to be many resources that visually compare how they work. So here we will use a simple 2D data set to show the changes each method makes to the data, so that we can see how different their results are. You'll see from the images that these various approaches provide different solutions, and who knows, one might be right for your specific machine learning challenge!
Oversampling
Oversampling makes a data set more balanced when one group has many fewer examples than the other. The way it works is by making more copies of the smaller group's examples. This helps the data set represent both groups more equally.
Undersampling
On the other hand, undersampling works by removing some of the examples from the larger group until it is close in size to the smaller group. The resulting data set is smaller, of course, but both groups end up with a more similar number of examples.
Hybrid sampling
The combination of oversampling and undersampling can be called “hybrid sampling.” It increases the size of the smaller group by creating or copying new examples and shrinks the larger group by deleting some of its examples. The goal is a data set that is more balanced, neither too large nor too small.
Let's use a simple artificial golf data set to show both oversampling and undersampling. This data set shows what type of golf activity a person performs in a particular weather condition.
Please note that while this small data set is good for understanding the concepts, in real applications you will want much larger data sets before applying these techniques, as sampling with too little data can lead to unreliable results.
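Since the article's actual golf data isn't reproduced here, below is a minimal stand-in sketch assuming two made-up weather features (temperature and humidity) and an imbalanced two-class activity label; all names and numbers are purely illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the golf data set: two weather features and an
# imbalanced activity label (values are illustrative, not the article's data).
rng = np.random.default_rng(42)

n_play, n_stay = 40, 10  # majority vs. minority class sizes
X = np.vstack([
    rng.normal(loc=[25, 50], scale=5, size=(n_play, 2)),  # "play" examples
    rng.normal(loc=[35, 80], scale=5, size=(n_stay, 2)),  # "stay home" examples
])
y = np.array(["play"] * n_play + ["stay"] * n_stay)

df = pd.DataFrame(X, columns=["temperature", "humidity"])
df["activity"] = y
print(df["activity"].value_counts())  # shows the 40-vs-10 imbalance
```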
Random oversampling
Random oversampling is a simple way to make the smaller group bigger. It works by duplicating the smaller group's examples at random until all classes are balanced.
Ideal for very small data sets that need to be balanced quickly
Not recommended for complicated data sets
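As a rough sketch of what this looks like in practice, the imbalanced-learn (imblearn) Python library, which the article itself doesn't prescribe, provides a RandomOverSampler; here it is applied to the X and y arrays from the data set sketch above:

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

# Duplicate minority-class rows at random until both classes have equal counts.
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)

print(Counter(y))      # original counts, e.g. {'play': 40, 'stay': 10}
print(Counter(y_res))  # balanced counts, e.g. {'play': 40, 'stay': 40}
```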
SMOTE
SMOTE (Synthetic Minority Oversampling Technique) is an oversampling technique that creates new examples by interpolating within the smaller group. Unlike random oversampling, it doesn't just copy what's there; it uses the smaller group's existing examples to generate new synthetic examples between them.
It's best when you have a decent number of examples to work with and need variety in your data
Not recommended if you have very few examples.
Not recommended if data points are too spread out or noisy
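A comparable sketch with imbalanced-learn's SMOTE class, again reusing X and y from above:

```python
from imblearn.over_sampling import SMOTE

# Create synthetic minority examples by interpolating between each minority
# point and one of its nearest minority-class neighbors.
smote = SMOTE(k_neighbors=5, random_state=42)  # k_neighbors must be smaller than the minority count
X_res, y_res = smote.fit_resample(X, y)
```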
ADASYN
ADASYN (Adaptive Synthetic) is like SMOTE, but it focuses on creating new examples in the hardest-to-learn parts of the smaller group. It finds the examples that are most difficult to classify and generates more new points around them. This helps the model better understand challenging areas.
It's better when some parts of your data are harder to classify than others
Best for complex data sets with challenging areas
Not recommended if your data is quite simple and direct
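A similar sketch using imbalanced-learn's ADASYN class:

```python
from imblearn.over_sampling import ADASYN

# Like SMOTE, but generates more synthetic points around minority examples
# that are surrounded by majority neighbors (the hard-to-learn regions).
adasyn = ADASYN(random_state=42)
X_res, y_res = adasyn.fit_resample(X, y)
```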
Undersampling reduces the larger group so that it is closer in size to the smaller group. There are a few ways to do this:
Random undersampling
Random undersampling removes examples from the larger group at random until it is the same size as the smaller group. Like random oversampling, the method is fairly simple, but it can discard important information that shows how the groups actually differ.
Ideal for very large data sets with many repetitive examples
Best when you need a quick and easy solution
Not recommended if all examples in your larger group are important
Not recommended if you cannot afford to lose information
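A sketch with imbalanced-learn's RandomUnderSampler, reusing X and y from the data set sketch:

```python
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class rows until both classes are the same size.
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
```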
Tomek Links
Tomek Links is an undersampling method that sharpens the “lines” between groups. It looks for pairs of examples from different groups that are very similar. When it finds a pair whose members are each other's nearest neighbors but belong to different groups, it removes the example from the larger group.
Best when your groups overlap too much
Best for cleaning up messy or noisy data
Best when you need clear boundaries between groups
Not recommended if your groups are already well separated
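A sketch with imbalanced-learn's TomekLinks class:

```python
from imblearn.under_sampling import TomekLinks

# Find pairs of opposite-class nearest neighbors (Tomek links) and remove
# the majority-class member of each pair; the method is deterministic,
# so no random_state is needed.
tl = TomekLinks()
X_res, y_res = tl.fit_resample(X, y)
```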
NearMiss
NearMiss is a family of undersampling techniques that work with different rules:
- NearMiss-1: Keeps the examples from the larger group that are closest to the examples from the smaller group.
- NearMiss-2: Keeps the examples from the larger group that have the smallest average distance to their three farthest neighbors in the smaller group.
- NearMiss-3: For each example in the smaller group, keeps a fixed number of its closest neighbors from the larger group.
The main idea here is to keep the most informative examples from the larger group and get rid of the ones that aren't as important.
Best when you want to control which examples to keep
Not recommended if you need a simple and quick solution
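A sketch with imbalanced-learn's NearMiss class, where the version parameter selects which of the three rules above is used:

```python
from imblearn.under_sampling import NearMiss

# version=1, 2 or 3 picks the NearMiss rule that decides which
# majority-class examples are kept.
nm = NearMiss(version=1)
X_res, y_res = nm.fit_resample(X, y)
```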
ENN
Edited nearest neighbors (ENN) removes examples that are likely noise or outliers. For each example in the larger group, it checks whether most of that example's nearest neighbors belong to the same group. If they don't, it removes the example. This helps create cleaner boundaries between groups.
Best for cleaning messy data
Best when you need to remove outliers
Best for creating cleaner group boundaries
Not recommended if your data is already clean and well organized
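A sketch with imbalanced-learn's EditedNearestNeighbours class:

```python
from imblearn.under_sampling import EditedNearestNeighbours

# Remove majority-class examples whose nearest neighbors mostly belong
# to a different class (likely noise or boundary points).
enn = EditedNearestNeighbours()
X_res, y_res = enn.fit_resample(X, y)
```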
SMOTETomek
SMOTETomek works by first creating new examples for the smaller group using SMOTE, then cleaning up messy boundaries by removing “confusing” examples using Tomek Links. This helps create a more balanced data set with clearer boundaries and less noise.
Best for severely imbalanced data
Best when you need more examples and clearer boundaries
Best when dealing with noisy, overlapping groups
Not recommended if your data is already clean and well organized
Not recommended for small data sets
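A sketch with imbalanced-learn's SMOTETomek class, which chains the two steps internally:

```python
from imblearn.combine import SMOTETomek

# First oversample the minority class with SMOTE, then remove Tomek links
# to clean up the boundary between the classes.
smt = SMOTETomek(random_state=42)
X_res, y_res = smt.fit_resample(X, y)
```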
SMOTEENN
SMOTEENN works by first creating new examples for the smaller group using SMOTE, then cleaning both groups by removing examples that do not fit well with their neighbors using ENN. Like SMOTETomek, this helps create a cleaner data set with clearer boundaries between groups.
Best for cleaning both groups at the same time
Best when you need more examples but cleaner data
Best when dealing with many outliers
Not recommended if your data is already clean and well organized
Not recommended for small data sets
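And a final sketch with imbalanced-learn's SMOTEENN class:

```python
from imblearn.combine import SMOTEENN

# First oversample the minority class with SMOTE, then apply ENN to remove
# examples from either class that disagree with their nearest neighbors.
sme = SMOTEENN(random_state=42)
X_res, y_res = sme.fit_resample(X, y)
```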