The way we build traditional machine learning models is to first train the model on a "training data set" (usually a data set of historical values) and then generate predictions on a new data set, the "inference data set." If the columns of the training data set and the inference data set do not match, your machine learning algorithm will usually fail. This is primarily due to new or missing factor levels in the inference data set.
The first problem: missing factors
For the following examples, assume you used the above data set to train your machine learning model. You one-hot encoded the data set into dummy variables, and your fully transformed training data looks like the table below.
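As a rough sketch, a training set along these lines and its dummy-encoded version could be built as follows (the values here are hypothetical and chosen only to match the column names used later in this article):
# Hypothetical training_data, for illustration only -- the factor
# levels are chosen to match the column names referenced below
import pandas as pd

training_data = pd.DataFrame({
    'numerical_1': [1, 2, 3, 4, 5, 6, 7, 8],
    'color_1_': ['black', 'blue', 'red', 'green',
                 'green', 'red', 'blue', 'black'],
    'color_2_': ['purple', 'pink', 'blue', 'purple',
                 'black', 'pink', 'blue', 'purple']
})

# Naive one-hot encoding of the categorical columns
training_data_dummies = pd.get_dummies(
    training_data, columns=['color_1_', 'color_2_']).astype(int)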
Now let's introduce the inference data set, this is what you would use to make predictions. Let's say it is given as below:
# Creating the inference_data DataFrame in Python
inference_data = pd.DataFrame({
    'numerical_1': [11, 12, 13, 14, 15, 16, 17, 18],
    'color_1_': ['black', 'blue', 'black', 'green',
                 'green', 'black', 'black', 'blue'],
    'color_2_': ['orange', 'orange', 'black', 'orange',
                 'black', 'orange', 'orange', 'orange']
})
Using a naive one-hot encoding strategy like the one we used above (pd.get_dummies), we would transform the categorical columns as follows:
# Converting categorical columns in inference_data to
# Dummy variables with integers
inference_data_dummies = pd.get_dummies(
    inference_data, columns=['color_1_', 'color_2_']).astype(int)
This would transform your inference data set in the same way and you would get the following data set:
Do you notice the problems? The first problem is that the inference data set is missing columns:
missing_columns = ['color_1__red', 'color_2__pink',
                   'color_2__blue', 'color_2__purple']
If you were to run this on a model trained with the “training data set” it would normally fail.
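One way to surface the mismatch programmatically is to compare the two sets of column names. Assuming the hypothetical frames sketched above, something like this shows the gap:
# Columns the model was trained on that are absent from the
# encoded inference data (uses the hypothetical frames from above)
missing_columns = set(training_data_dummies.columns) \
    - set(inference_data_dummies.columns)
print(sorted(missing_columns))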
The second problem: new factors
The other problem that can occur with one-hot encoding is that your inference data set may include new, unseen factor levels. Let's consider the same data sets again. If you look closely, you'll see that the encoded inference data set now has a new column: color_2__orange.
This is the opposite of the previous problem: our inference data set contains new columns that our training data set did not have. This is actually common and can happen whenever one of your factor variables gains new levels. For example, if the colors above represent car colors and a manufacturer suddenly started making orange cars, that level may not be present in the training data but could still appear in the inference data. In that case, you need a solid way to address the problem.
One could argue: why not simply list all the columns of the transformed training data set as the columns required for your inference data set? The problem here is that you often don't know in advance which factor levels the training data contains.
For example, new levels could be introduced periodically, which would make such a list difficult to maintain. On top of that comes the process of matching your inference data set to the training data: you would need to look up all of the transformed column names that went into the training algorithm and compare them against the transformed inference data set. Any missing columns would have to be inserted and filled with 0 values, and any additional columns, such as color_2__orange above, would have to be dropped; a sketch of this manual alignment is shown below. This is a rather cumbersome way to solve the problem, and fortunately there are better options available.
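For completeness, that manual alignment could look roughly like the following (again using the hypothetical frames from above):
# Manually align the encoded inference data to the training columns:
# missing columns are added and filled with 0, and unseen columns
# such as color_2__orange are dropped (hypothetical frames from above)
aligned_inference = inference_data_dummies.reindex(
    columns=training_data_dummies.columns, fill_value=0)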
The solution to this problem is quite simple; however, many of the packages and libraries that attempt to simplify the process of building prediction models fail to implement it well. The key is to have a function or class that is first fitted on the training data, and to then use that same fitted instance to transform both the training data set and the inference data set. Below we explore how this is done using Python and R.
In Python
Python is arguably one of the best programming languages for machine learning, largely due to its extensive developer network and mature package libraries, and its ease of use, which promotes rapid development.
The problems related to one-hot encoding that we described above can be mitigated by using the widely available and well-tested scikit-learn library, and more specifically the sklearn.preprocessing.OneHotEncoder class. So let's see how we can use it on our training and inference data sets to create a robust one-hot encoding.
from sklearn.preprocessing import OneHotEncoder

# Initialize the encoder
enc = OneHotEncoder(handle_unknown='ignore')
# Define columns to transform
trans_columns = ['color_1_', 'color_2_']
# Fit and transform the data
enc_data = enc.fit_transform(training_data[trans_columns])
# Get feature names
feature_names = enc.get_feature_names_out(trans_columns)
# Convert to DataFrame
enc_df = pd.DataFrame(enc_data.toarray(),
                      columns=feature_names)
# Concatenate with the numerical data
final_df = pd.concat([training_data[['numerical_1']],
                      enc_df], axis=1)
This produces a final DataFrame of transformed values as shown below:
If we break down the code above, we see that the first step is to initialize an instance of the encoder class. We use the option handle_unknown='ignore'
to avoid problems with unknown values for columns when we use the encoder to transform our inference data set.
After that, we combine the fit and transform actions into a single step with the fit_transform method. Finally, we create a new DataFrame from the encoded data and concatenate it with the rest of the original data set.
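The same fitted encoder instance can then be reused on the inference data set. A minimal sketch, assuming the inference_data frame and column names defined earlier:
# Reuse the already-fitted encoder on the inference data.
# Thanks to handle_unknown='ignore', unseen levels such as 'orange'
# are encoded as all zeros instead of raising an error.
inference_encoded = enc.transform(inference_data[trans_columns])

inference_enc_df = pd.DataFrame(inference_encoded.toarray(),
                                columns=feature_names)

final_inference_df = pd.concat([inference_data[['numerical_1']],
                                inference_enc_df], axis=1)
Because the encoder was fitted on the training data, the resulting columns match the training data set exactly: missing levels are filled with zeros and unseen levels are ignored.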