Outliers are individuals who are very different from the majority of the population. Traditionally, practitioners have been somewhat distrustful of outliers, so ad hoc measures are often taken, such as removing them from the dataset.
However, when working with real data, outliers are commonplace. Sometimes they are even more important than the other observations! Take, for example, people who are outliers because they are high-paying customers: you don't want to discard them; in fact, you probably want to treat them with extra care.
An interesting (and rather unexplored) aspect of outliers is how they interact with machine learning models. My sense is that data scientists believe that outliers hurt the performance of their models. But this belief is probably based more on preconceived ideas than on actual evidence.
Thus, the question that I will try to answer in this article is the following:
Is a machine learning model more likely to make errors when making predictions about outliers?
Suppose we have a model that has been trained on these data points:
We receive new data points for which the model should make predictions.
Let's consider two cases:
- The new data point is an outlier, that is, it is different from the majority of the training observations.
- The new data point is “standard”, that is, it lies in an area that is quite “dense” with training points.
We would like to understand whether, in general, the outlier is more difficult to predict than the standard observation.
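To make the setup concrete, here is a minimal sketch of this kind of comparison, using synthetic data and a scikit-learn regressor rather than the actual dataset discussed in the article: we train a model on points drawn from a dense region, then compare its error on a new point inside that region against its error on a point far outside it.

```python
# Hypothetical illustration: synthetic data, not the article's dataset.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Training data: one feature clustered around 0 (the "dense" region),
# with target y = x^2 plus a little noise.
X_train = rng.normal(loc=0.0, scale=1.0, size=(1000, 1))
y_train = X_train[:, 0] ** 2 + rng.normal(scale=0.1, size=1000)

model = GradientBoostingRegressor().fit(X_train, y_train)

# Two new points: a "standard" one near the bulk of the training data,
# and an outlier far away from it.
new_points = {"standard": np.array([[0.5]]), "outlier": np.array([[6.0]])}

for name, x in new_points.items():
    y_true = x[0, 0] ** 2
    y_pred = model.predict(x)[0]
    print(f"{name}: true={y_true:.2f}, predicted={y_pred:.2f}, "
          f"absolute error={abs(y_true - y_pred):.2f}")
```

In this toy example the model has seen almost no training data near the outlier, so its prediction there is typically far worse than for the standard point; the rest of the article examines whether this intuition holds more generally.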