The way we build traditional machine learning models is to first train the model on a "training data set" (usually a data set of historical values) and then generate predictions on a new data set, the "inference data set." If the columns of the training data set and the inference data set do not match, your machine learning algorithm will usually fail. This is primarily due to new or missing factor levels in the inference data set.
The first problem: missing factors
For the following examples, assume you used the above data set to train your machine learning model. You one-hot encoded the data set into dummy variables, and your fully transformed training data looks like the table below.
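As a rough sketch, a training set along these lines and its dummy-encoded version could be built as follows (the values here are hypothetical and chosen only to match the column names used later in this article):
# Hypothetical training_data, for illustration only -- the factor
# levels are chosen to match the column names referenced below
import pandas as pd

training_data = pd.DataFrame({
    'numerical_1': [1, 2, 3, 4, 5, 6, 7, 8],
    'color_1_': ['black', 'blue', 'red', 'green',
                 'green', 'red', 'blue', 'black'],
    'color_2_': ['purple', 'pink', 'blue', 'purple',
                 'black', 'pink', 'blue', 'purple']
})

# Naive one-hot encoding of the categorical columns
training_data_dummies = pd.get_dummies(
    training_data, columns=['color_1_', 'color_2_']).astype(int)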
Now let's introduce the inference data set, this is what you would use to make predictions. Let's say it is given as below:
# Creating the inference_data DataFrame in Python
inference_data = pd.DataFrame({
    'numerical_1': [11, 12, 13, 14, 15, 16, 17, 18],
    'color_1_': ['black', 'blue', 'black', 'green',
                 'green', 'black', 'black', 'blue'],
    'color_2_': ['orange', 'orange', 'black', 'orange',
                 'black', 'orange', 'orange', 'orange']
})
Using a naive one-hot encoding strategy like the one we used above (pd.get_dummies), we would transform the categorical columns as follows:
# Converting categorical columns in inference_data to
# Dummy variables with integers
inference_data_dummies = pd.get_dummies(
    inference_data, columns=['color_1_', 'color_2_']).astype(int)
This would transform your inference data set in the same way and you would get the following data set:
Do you notice the problems? The first problem is that the inference data set is missing columns:
missing_columns = ['color_1__red', 'color_2__pink',
                   'color_2__blue', 'color_2__purple']
If you were to run this on a model trained with the “training data set” it would normally fail.
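One way to surface the mismatch programmatically is to compare the two sets of column names. Assuming the hypothetical frames sketched above, something like this shows the gap:
# Columns the model was trained on that are absent from the
# encoded inference data (uses the hypothetical frames from above)
missing_columns = set(training_data_dummies.columns) \
    - set(inference_data_dummies.columns)
print(sorted(missing_columns))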
The second problem: new factors
The other problem that can occur with one-hot encoding is that your inference data set may include new, unseen factor levels. Let's consider the same data sets again. If you look closely, you'll see that the encoded inference data set now has a new column: color_2__orange.
This is the opposite of the previous problem: our inference data set contains new columns that our training data set did not have. This is actually common and can happen whenever one of your factor variables gains new levels. For example, if the colors above represent car colors and a manufacturer suddenly started making orange cars, that level may not be present in the training data but could still appear in the inference data. In that case, you need a solid way to address the problem.
One could argue: why not simply list all the columns of the transformed training data set as the columns required for your inference data set? The problem here is that you often don't know in advance which factor levels the training data contains.
For example, new levels could be introduced periodically, which would make such a list difficult to maintain. On top of that comes the process of matching your inference data set to the training data: you would need to look up all of the transformed column names that went into the training algorithm and compare them against the transformed inference data set. Any missing columns would have to be inserted and filled with 0 values, and any additional columns, such as color_2__orange above, would have to be dropped; a sketch of this manual alignment is shown below. This is a rather cumbersome way to solve the problem, and fortunately there are better options available.
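For completeness, that manual alignment could look roughly like the following (again using the hypothetical frames from above):
# Manually align the encoded inference data to the training columns:
# missing columns are added and filled with 0, and unseen columns
# such as color_2__orange are dropped (hypothetical frames from above)
aligned_inference = inference_data_dummies.reindex(
    columns=training_data_dummies.columns, fill_value=0)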
The solution to this problem is quite simple; however, many of the packages and libraries that attempt to simplify the process of building prediction models fail to implement it well. The key is to have a function or class that is first fitted on the training data, and to then use that same fitted instance to transform both the training data set and the inference data set. Below we explore how this is done using Python and R.
In Python
Python is arguably one of the best programming languages for machine learning, largely due to its extensive developer network and mature package libraries, and its ease of use, which promotes rapid development.
The problems related to one-hot encoding that we described above can be mitigated by using the widely available and well-tested scikit-learn library, and more specifically the sklearn.preprocessing.OneHotEncoder class. So let's see how we can use it on our training and inference data sets to create a robust one-hot encoding.
from sklearn.preprocessing import OneHotEncoder

# Initialize the encoder
enc = OneHotEncoder(handle_unknown='ignore')
# Define columns to transform
trans_columns = ['color_1_', 'color_2_']
# Fit and transform the data
enc_data = enc.fit_transform(training_data[trans_columns])
# Get feature names
feature_names = enc.get_feature_names_out(trans_columns)
# Convert to DataFrame
enc_df = pd.DataFrame(enc_data.toarray(),
                      columns=feature_names)
# Concatenate with the numerical data
final_df = pd.concat([training_data[['numerical_1']],
                      enc_df], axis=1)
This produces a final DataFrame of transformed values as shown below:
If we break down the code above, we see that the first step is to initialize an instance of the encoder class. We use the option handle_unknown='ignore'
to avoid problems with unknown values for columns when we use the encoder to transform our inference data set.
After that, we combine the fit and transform actions into a single step with the fit_transform method. Finally, we create a new DataFrame from the encoded data and concatenate it with the rest of the original data set.
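The same fitted encoder instance can then be reused on the inference data set. A minimal sketch, assuming the inference_data frame and column names defined earlier:
# Reuse the already-fitted encoder on the inference data.
# Thanks to handle_unknown='ignore', unseen levels such as 'orange'
# are encoded as all zeros instead of raising an error.
inference_encoded = enc.transform(inference_data[trans_columns])

inference_enc_df = pd.DataFrame(inference_encoded.toarray(),
                                columns=feature_names)

final_inference_df = pd.concat([inference_data[['numerical_1']],
                                inference_enc_df], axis=1)
Because the encoder was fitted on the training data, the resulting columns match the training data set exactly: missing levels are filled with zeros and unseen levels are ignored.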