DATA PREPROCESSING
Compiling a data set in which every class to predict has exactly the same number of examples can be challenging. In reality, data is rarely perfectly balanced, and this can be a problem when creating a classification model. When a model is trained on a data set where one class has more examples than the other, it generally becomes better at predicting the larger group and worse at predicting the smaller one. To help with this problem, we can use tactics like oversampling and undersampling: creating more examples for the smaller group or removing some examples from the larger group.
There are many different oversampling and undersampling methods (with intimidating names like SMOTE, ADASYN, and Tomek Links), but there don't seem to be many resources that visually compare how they work. So here we will use a simple 2D data set to show the changes each method makes to the data, so that we can see how different their results are. You'll see from the images that these various approaches provide different solutions, and who knows, one might be right for your specific machine learning challenge!
Oversampling
Oversampling makes a data set more balanced when one group has many fewer examples than the other. The way it works is by making more copies of the smaller group's examples. This helps the data set represent both groups more equally.
Undersampling
On the other hand, undersampling works by removing some of the examples from the larger group until it is close in size to the smaller group. The resulting data set is smaller, of course, but both groups end up with a more similar number of examples.
Hybrid sampling
The combination of oversampling and undersampling can be called “hybrid sampling.” It increases the size of the smaller group by creating or copying new examples and shrinks the larger group by deleting some of its examples. The goal is a data set that is more balanced, neither too large nor too small.
Let's use a simple artificial golf data set to show both oversampling and undersampling. This data set shows what type of golf activity a person performs in a particular weather condition.
Please note that while this small data set is good for understanding the concepts, in real applications you will want much larger data sets before applying these techniques, as sampling with too little data can lead to unreliable results.
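Since the article's actual golf data isn't reproduced here, below is a minimal stand-in sketch assuming two made-up weather features (temperature and humidity) and an imbalanced two-class activity label; all names and numbers are purely illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the golf data set: two weather features and an
# imbalanced activity label (values are illustrative, not the article's data).
rng = np.random.default_rng(42)

n_play, n_stay = 40, 10  # majority vs. minority class sizes
X = np.vstack([
    rng.normal(loc=[25, 50], scale=5, size=(n_play, 2)),  # "play" examples
    rng.normal(loc=[35, 80], scale=5, size=(n_stay, 2)),  # "stay home" examples
])
y = np.array(["play"] * n_play + ["stay"] * n_stay)

df = pd.DataFrame(X, columns=["temperature", "humidity"])
df["activity"] = y
print(df["activity"].value_counts())  # shows the 40-vs-10 imbalance
```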
Random oversampling
Random oversampling is a simple way to make the smaller group bigger. It works by duplicating the smaller group's examples at random until all classes are balanced.
Ideal for very small data sets that need to be balanced quickly
Not recommended for complicated data sets
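As a rough sketch of what this looks like in practice, the imbalanced-learn (imblearn) Python library, which the article itself doesn't prescribe, provides a RandomOverSampler; here it is applied to the X and y arrays from the data set sketch above:

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

# Duplicate minority-class rows at random until both classes have equal counts.
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)

print(Counter(y))      # original counts, e.g. {'play': 40, 'stay': 10}
print(Counter(y_res))  # balanced counts, e.g. {'play': 40, 'stay': 40}
```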
SMOTE
SMOTE (Synthetic Minority Oversampling Technique) is an oversampling technique that creates new examples by interpolating within the smaller group. Unlike random oversampling, it doesn't just copy what's there; it uses the smaller group's existing examples to generate new synthetic examples between them.
It's best when you have a decent number of examples to work with and need variety in your data
Not recommended if you have very few examples.
Not recommended if data points are too spread out or noisy
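A comparable sketch with imbalanced-learn's SMOTE class, again reusing X and y from above:

```python
from imblearn.over_sampling import SMOTE

# Create synthetic minority examples by interpolating between each minority
# point and one of its nearest minority-class neighbors.
smote = SMOTE(k_neighbors=5, random_state=42)  # k_neighbors must be smaller than the minority count
X_res, y_res = smote.fit_resample(X, y)
```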
ADASYN
ADASYN (Adaptive Synthetic) is like SMOTE, but it focuses on creating new examples in the hardest-to-learn parts of the smaller group. It finds the examples that are most difficult to classify and generates more new points around them. This helps the model better understand challenging areas.
It's better when some parts of your data are harder to classify than others
Best for complex data sets with challenging areas
Not recommended if your data is quite simple and direct
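A similar sketch using imbalanced-learn's ADASYN class:

```python
from imblearn.over_sampling import ADASYN

# Like SMOTE, but generates more synthetic points around minority examples
# that are surrounded by majority neighbors (the hard-to-learn regions).
adasyn = ADASYN(random_state=42)
X_res, y_res = adasyn.fit_resample(X, y)
```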
Undersampling reduces the larger group so that it is closer in size to the smaller group. There are a few ways to do this:
Random undersampling
Random undersampling removes examples from the larger group at random until it is the same size as the smaller group. Like random oversampling, the method is fairly simple, but it can discard important information that shows how the groups actually differ.
Ideal for very large data sets with many repetitive examples
Best when you need a quick and easy solution
Not recommended if all examples in your larger group are important
Not recommended if you cannot afford to lose information
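A sketch with imbalanced-learn's RandomUnderSampler, reusing X and y from the data set sketch:

```python
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class rows until both classes are the same size.
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
```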
Tomek Links
Tomek Links is an undersampling method that sharpens the “lines” between groups. It looks for pairs of examples from different groups that are very similar. When it finds a pair whose members are each other's nearest neighbors but belong to different groups, it removes the example from the larger group.
Best when your groups overlap too much
Best for cleaning up messy or noisy data
Best when you need clear boundaries between groups
Not recommended if your groups are already well separated
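A sketch with imbalanced-learn's TomekLinks class:

```python
from imblearn.under_sampling import TomekLinks

# Find pairs of opposite-class nearest neighbors (Tomek links) and remove
# the majority-class member of each pair; the method is deterministic,
# so no random_state is needed.
tl = TomekLinks()
X_res, y_res = tl.fit_resample(X, y)
```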
NearMiss
NearMiss is a family of undersampling techniques that work with different rules:
- NearMiss-1: Keeps the examples from the larger group that are closest to the examples from the smaller group.
- NearMiss-2: Keeps the examples from the larger group that have the smallest average distance to their three farthest neighbors in the smaller group.
- NearMiss-3: For each example in the smaller group, keeps a fixed number of its closest neighbors from the larger group.
The main idea here is to keep the most informative examples from the larger group and get rid of the ones that aren't as important.
Best when you want to control which examples to keep
Not recommended if you need a simple and quick solution
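A sketch with imbalanced-learn's NearMiss class, where the version parameter selects which of the three rules above is used:

```python
from imblearn.under_sampling import NearMiss

# version=1, 2 or 3 picks the NearMiss rule that decides which
# majority-class examples are kept.
nm = NearMiss(version=1)
X_res, y_res = nm.fit_resample(X, y)
```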
ENN
Edited nearest neighbors (ENN) removes examples that are likely noise or outliers. For each example in the larger group, it checks whether most of that example's nearest neighbors belong to the same group. If they don't, it removes the example. This helps create cleaner boundaries between groups.
Best for cleaning messy data
Best when you need to remove outliers
Best for creating cleaner group boundaries
Not recommended if your data is already clean and well organized
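A sketch with imbalanced-learn's EditedNearestNeighbours class:

```python
from imblearn.under_sampling import EditedNearestNeighbours

# Remove majority-class examples whose nearest neighbors mostly belong
# to a different class (likely noise or boundary points).
enn = EditedNearestNeighbours()
X_res, y_res = enn.fit_resample(X, y)
```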
SMOTETomek
SMOTETomek works by first creating new examples for the smaller group using SMOTE, then cleaning up messy boundaries by removing “confusing” examples using Tomek Links. This helps create a more balanced data set with clearer boundaries and less noise.
Best for severely imbalanced data
Best when you need more examples and clearer boundaries
Best when dealing with noisy, overlapping groups
Not recommended if your data is already clean and well organized
Not recommended for small data sets
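A sketch with imbalanced-learn's SMOTETomek class, which chains the two steps internally:

```python
from imblearn.combine import SMOTETomek

# First oversample the minority class with SMOTE, then remove Tomek links
# to clean up the boundary between the classes.
smt = SMOTETomek(random_state=42)
X_res, y_res = smt.fit_resample(X, y)
```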
SMOTEENN
SMOTEENN works by first creating new examples for the smaller group using SMOTE, then cleaning both groups by removing examples that do not fit well with their neighbors using ENN. Like SMOTETomek, this helps create a cleaner data set with clearer boundaries between groups.
Best for cleaning both groups at the same time
Best when you need more examples but cleaner data
Best when dealing with many outliers
Not recommended if your data is already clean and well organized
Not recommended for small data sets
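And a final sketch with imbalanced-learn's SMOTEENN class:

```python
from imblearn.combine import SMOTEENN

# First oversample the minority class with SMOTE, then apply ENN to remove
# examples from either class that disagree with their nearest neighbors.
sme = SMOTEENN(random_state=42)
X_res, y_res = sme.fit_resample(X, y)
```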