The Earth is an outlier: the theory
What are outliers?
We live on an outlier. Earth is, as far as we know, the only life-bearing rock in the Milky Way galaxy. The other planets in our galaxy would be inliers, or normal data points, in a hypothetical database of stars and planets.
There are many definitions of outliers. In simple terms, we define outliers as data points that are significantly different from the majority in a data set. Outliers are the rare, extreme samples that do not fit or align with the rest of the values in a data set.
Statistically speaking, outliers come from a different distribution than the rest of the samples in at least one characteristic. They represent statistically significant abnormalities.
These definitions depend on what we consider “normal.” For example, it is perfectly normal for CEOs to earn millions of dollars, but if we add their salary information to a data set of household income, they become abnormal.
Outlier detection is the field of statistics and machine learning that uses various techniques and algorithms to detect such extreme samples.
Why bother with outlier detection?
But why bother? Why do we have to find them? What harm do they do? Well, consider this sample of 13 numbers, 12 of which range from 50 to 100. The remaining data point, 2534, is clearly an outlier.
```python
import numpy as np

array = [97, 87, 95, 62, 53, 66, 2534, 60, 68, 90, 52, 63, 65]
```
The mean and standard deviation are two of the most critical and widely used attributes of a distribution, so we need realistic values for these two metrics when fitting machine learning models.
Let's calculate them for our sample.
The mean:

```python
np.mean(array)
# 260.9230769230769
```
The standard deviation:
```python
np.std(array)
# 656.349984212042
```
Now, let’s do the same, removing the outlier:
```python
# Array without the outlier
array_wo = [97, 87, 95, 62, 53, 66, 60, 68, 90, 52, 63, 65]

np.mean(array_wo)
# 71.5

np.std(array_wo)
# 15.510748961069977
```
As you can see, the distribution without the outlier has a mean roughly 3.6 times smaller and a standard deviation more than 42 times smaller.
In addition to skewing the actual mean and standard deviation, outliers create noise in the training data. They introduce trends and attributes in the distributions that distract machine learning models from the actual patterns in the data, leading to performance losses.
Therefore, it is critical to find outliers, explore the reasons for their presence, and eliminate them if appropriate.
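As a minimal sketch of that workflow (using a simple three-standard-deviation rule, one of many possible criteria, applied to the sample above), outliers can be flagged and removed like this:

```python
import statistics

data = [97, 87, 95, 62, 53, 66, 2534, 60, 68, 90, 52, 63, 65]

mean = statistics.mean(data)
std = statistics.pstdev(data)  # population standard deviation, like np.std

# Keep only the points that lie within 3 standard deviations of the mean
cleaned = [x for x in data if abs(x - mean) / std <= 3]

print(cleaned)  # the same list with 2534 removed
```

Note that the outlier itself inflates both the mean and the standard deviation, which can mask less extreme outliers; robust variants such as the modified z-score address this.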
What you will learn in this tutorial
Once you understand the theory behind the process, outlier detection is easy to perform in code with libraries like PyOD or Sklearn. For example, here's how to perform outlier detection using the popular Isolation Forest algorithm.
```python
from pyod.models.iforest import IForest

iforest = IForest().fit(training_features)

# 0 for inliers, 1 for outliers
labels = iforest.labels_
outliers = training_features[labels == 1]

len(outliers)
# 136
```
Only a few lines of code are needed.
Therefore, this tutorial will focus more on theory. Specifically, we’ll look at outlier detection in the context of unsupervised learning, the concept of contamination in data sets, the difference between anomalies, outliers, and novelties, and univariate/multivariate outliers.
Let us begin.
Outlier detection is an unsupervised problem
Unlike many other ML tasks, outlier detection is an unsupervised learning problem. What do we mean by that?
For example, in classification, we have a set of features that are assigned to specific results. We have labels that tell us which sample is a dog and which is a cat.
In outlier detection, that is not the case. We have no prior knowledge of outliers when presented with a new data set. This causes several challenges (but nothing we can’t handle).
First, we won’t have an easy way to measure the effectiveness of outlier detection methods. In classification, we use metrics such as accuracy or precision to measure how well the algorithm fits our training data set. In outlier detection, we can’t use these metrics because we won’t have any labels that allow us to compare the predictions with the ground truth.
And since we can’t use traditional metrics to measure performance, we can’t perform hyperparameter tuning efficiently. This makes it even harder to find the best outlier classifier (an algorithm that returns inlier/outlier labels for each row in the data set) for the task at hand.
However, don’t despair. We’ll look at two great solutions in the next tutorial.
Anomalies vs. Outliers vs. Novelty
You will often see the terms “anomalies” and “novelties” cited alongside outliers in many sources. Although they are close in meaning, there are important distinctions.
An anomaly is a general term that encompasses anything out of the ordinary and abnormal. Anomalies can refer to irregularities in the training or test sets.
As for outliers, they exist only in the training data. Outlier detection refers to finding abnormal data points in the training set. Outlier classifiers only perform a fit to the training data and return inlier/outlier labels.
On the other hand, novelties exist only in the test set. In novelty detection, you have a clean data set with no outliers, and you are trying to see whether new, unseen observations have different attributes than the training samples. Hence, irregular instances in a test set are called novelties.
In summary, anomaly detection is the umbrella field that covers both outlier detection and novelty detection. Outliers refer only to abnormal samples in the training data, while novelties refer to abnormal samples in the test set.
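To make the fit/predict distinction concrete, here is a hedged sketch of novelty detection using scikit-learn's LocalOutlierFactor (the data and parameters are illustrative assumptions, not part of the original example):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)

# A clean training set: 200 points clustered around the origin
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# novelty=True switches LOF from outlier detection to novelty detection:
# fit on clean data, then score only unseen samples with predict()
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)

# Two unseen samples: one ordinary, one far away from the training cloud
X_test = np.array([[0.1, -0.2], [8.0, 8.0]])

print(lof.predict(X_test))  # 1 = inlier, -1 = novelty
```

In novelty mode the model never labels its own training data; only unseen points can be flagged, which is exactly the distinction described above.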
This distinction is essential for when we start using outlier classifiers in the next tutorial.
Univariate vs. Multivariate Outliers
Univariate and multivariate outliers refer to outliers in different types of data.
As the name suggests, univariate outliers exist in a single variable. An example is a very tall person in a data set of height measurements.
Multivariate outliers are a bit tricky. They refer to outliers with two or more attributes that, when viewed individually, do not appear to be outliers but only become outliers when all attributes are considered in unison.
An example of a multivariate outlier might be an old car with very low mileage. Each attribute of this car may look ordinary when viewed individually, but combined they are unusual, because older cars generally have high mileage commensurate with their age. (There are many old cars and many low-mileage cars, but there are few cars that are both old and low-mileage.)
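A sketch of this intuition on synthetic data (the fleet, the suspect car, and the Mahalanobis-distance criterion are all illustrative assumptions): each attribute of the suspect car falls inside its normal univariate range, yet the combination is extreme.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic fleet: age (years) and mileage (thousands of km) are correlated
age = rng.uniform(1, 20, size=200)
mileage = age * 15 + rng.normal(0, 10, size=200)
X = np.column_stack([age, mileage])

# Hypothetical multivariate outlier: an old car with very low mileage.
# Both values lie inside the univariate ranges of the fleet above.
suspect = np.array([18.0, 20.0])

mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(x):
    """Distance from the center, accounting for the age-mileage correlation."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

# A typical car scores low; the old low-mileage car scores far higher
print(mahalanobis(X[0]), mahalanobis(suspect))
```

The Mahalanobis distance is only one way to capture such joint structure; the algorithms mentioned below handle it in higher dimensions.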
When choosing an algorithm to detect them, the distinction between the types of outliers becomes important.
Since univariate outliers exist in data sets with only one column, you can use simple and lightweight methods like z-scores or modified z-scores.
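For instance, here is a minimal sketch of the modified z-score (the Iglewicz and Hoaglin formulation, with the common 3.5 cutoff) applied to the sample from earlier:

```python
import statistics

data = [97, 87, 95, 62, 53, 66, 2534, 60, 68, 90, 52, 63, 65]

median = statistics.median(data)
# Median absolute deviation (MAD): a robust estimate of spread
mad = statistics.median([abs(x - median) for x in data])

# Modified z-score: 0.6745 * (x - median) / MAD
scores = [0.6745 * (x - median) / mad for x in data]

# A common rule of thumb flags |score| > 3.5 as an outlier
outliers = [x for x, s in zip(data, scores) if abs(s) > 3.5]

print(outliers)  # [2534]
```

Unlike the plain z-score, the median and MAD are barely affected by the extreme value itself, so a single huge outlier cannot mask its own presence.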
Multivariate outliers pose a more significant challenge, as they may only show up when several columns of a data set are considered together. For that reason, you need to bring out the big guns: Isolation Forest, KNN, Local Outlier Factor, and the like.
In the next few tutorials, we’ll see how to use some of the above methods.
Conclusion
There you go! You now know all the essential terminology and theory behind outlier detection, and all that remains is to apply it in practice using outlier classifiers.
In the following parts of the article, we will cover some of the most popular and robust outlier classifiers from the PyOD library. Stay tuned!