Using Scikit-learn pipelines can simplify your preprocessing and modeling steps, reduce code complexity, ensure consistency in data preprocessing, help with hyperparameter tuning, and make your workflow more organized, efficient, and easier to maintain. By integrating multiple transformations and the final model into a single entity, pipelines improve reproducibility and keep the entire workflow consistent.
In this tutorial, we will work with the Bank Churn dataset from Kaggle to train a random forest classifier. We will compare the conventional approach of data preprocessing and model training with a more efficient method that uses Scikit-learn pipelines and ColumnTransformers.
During data preprocessing, we will learn how to transform categorical and numerical columns individually. We'll start with a traditional style of code and then show a better way to perform the same processing.
After extracting the data from the zip file, load the `train.csv` file with "id" as the index column. Drop the unnecessary columns and shuffle the dataset.
import pandas as pd
bank_df = pd.read_csv("train.csv", index_col="id")
bank_df = bank_df.drop(["CustomerId", "Surname"], axis=1)
bank_df = bank_df.sample(frac=1)  # shuffle the rows
bank_df.head()
We have categorical, integer, and float columns. The dataset looks pretty clean.
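To confirm this, here is a quick sanity check of the column types and missing-value counts (a minimal sketch; `bank_df` is the DataFrame loaded above):

# Inspect column dtypes and count missing values per column
print(bank_df.dtypes)
print(bank_df.isna().sum())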
Simple Scikit-learn code
As a data scientist, I have written this code several times. Our goal is to fill in missing values for both categorical and numerical features. To achieve this, we will use a `SimpleImputer` with different strategies for each type of feature.
Once the missing values are filled in, we will convert categorical features to integers and apply a min-max scale to the numerical features.
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler
cat_col = [1, 2]
num_col = [0, 3, 4, 5, 6, 7, 8, 9]
# Filling missing categorical values
cat_impute = SimpleImputer(strategy="most_frequent")
bank_df.iloc[:, cat_col] = cat_impute.fit_transform(bank_df.iloc[:, cat_col])
# Filling missing numerical values
num_impute = SimpleImputer(strategy="median")
bank_df.iloc[:, num_col] = num_impute.fit_transform(bank_df.iloc[:, num_col])
# Encode categorical features as an integer array.
cat_encode = OrdinalEncoder()
bank_df.iloc[:, cat_col] = cat_encode.fit_transform(bank_df.iloc[:, cat_col])
# Scaling numerical values.
scaler = MinMaxScaler()
bank_df.iloc[:, num_col] = scaler.fit_transform(bank_df.iloc[:, num_col])
bank_df.head()
As a result, we got a clean, transformed data set with only integer or float values.
Scikit-learn pipelines code
Let's convert the above code using `Pipeline` and `ColumnTransformer`. Instead of applying each preprocessing step separately, we will create two pipelines: one for the numerical columns and the other for the categorical columns.
- In the numerical pipeline, we use simple imputation with a "mean" strategy and apply a min-max scaler for normalization.
- In the categorical pipeline, we use the simple imputer with the "most_frequent" strategy and an ordinal encoder to convert the categories into numerical values.
We combine the two pipelines using `ColumnTransformer`, providing each with its column indices so that it is applied only to those columns. For example, the categorical transformer pipeline will be applied to columns 1 and 2 only.
Note: `remainder="passthrough"` means that columns that have not been processed will be appended at the end. In our case, it is the target column.
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Identify numerical and categorical columns
cat_col = [1, 2]
num_col = [0, 3, 4, 5, 6, 7, 8, 9]
# Transformers for numerical data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler())
])
# Transformers for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder())
])
# Combine transformers into a ColumnTransformer
preproc_pipe = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, num_col),
        ('cat', categorical_transformer, cat_col)
    ],
    remainder="passthrough"
)
# Apply the preprocessing pipeline
bank_df = preproc_pipe.fit_transform(bank_df)
bank_df[0]  # display the first transformed row
After the transformation, the resulting array contains the transformed numerical values at the beginning and the transformed categorical values at the end, following the order of the pipelines in the ColumnTransformer.
array([0.712     , 0.24324324, 0.6       , 0.        , 0.33333333,
       1.        , 1.        , 0.76443485, 2.        , 0.        ,
       0.        ])
You can run the pipeline object in Jupyter Notebook to visualize the pipeline. Make sure you have the latest version of Scikit-learn.
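For example, you can enable the HTML diagram display and then evaluate the pipeline object in a notebook cell. A minimal sketch (in recent Scikit-learn versions the diagram display is already the default):

from sklearn import set_config

# Render estimators as an interactive HTML diagram in notebooks
set_config(display="diagram")

# Evaluating the pipeline object in a cell draws the diagram
preproc_pipe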
To train and evaluate our model, we need to split our data set into two subsets: training and testing.
To do this, we will first create dependent and independent variables and convert them to NumPy arrays. Then, we will use the `train_test_split` function to split the data set into two subsets.
from sklearn.model_selection import train_test_split
X = bank_df.drop("Exited", axis=1).values
y = bank_df.Exited.values
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=125
)
Simple Scikit-learn code
The conventional way to write training code is to first perform feature selection using `SelectKBest` and then feed the selected features to our Random Forest Classifier model.
We will first train the model using the training set and evaluate the results using the test data set.
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
KBest = SelectKBest(chi2, k="all")
X_train = KBest.fit_transform(X_train, y_train)
X_test = KBest.transform(X_test)
model = RandomForestClassifier(n_estimators=100, random_state=125)
model.fit(X_train,y_train)
model.score(X_test, y_test)
We achieved a reasonably good accuracy score.
Scikit-learn pipelines code
Let's use the `Pipeline` function to combine both training steps into one pipeline. We can then fit the model to the training set and evaluate it on the test set.
KBest = SelectKBest(chi2, k="all")
model = RandomForestClassifier(n_estimators=100, random_state=125)
train_pipe = Pipeline(
    steps=[
        ("KBest", KBest),
        ("RFmodel", model),
    ]
)
train_pipe.fit(X_train,y_train)
train_pipe.score(X_test, y_test)
We achieved similar results, but the code is more efficient and simpler. It is quite easy to add or remove steps from the training pipeline.
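Pipelines also simplify hyperparameter tuning: the step names become parameter prefixes, so a grid search can tune the model inside the pipeline directly. A minimal sketch (the grid values below are illustrative, not tuned):

from sklearn.model_selection import GridSearchCV

# The step name "RFmodel" prefixes the estimator's parameters
param_grid = {"RFmodel__n_estimators": [50, 100, 200]}

grid = GridSearchCV(train_pipe, param_grid, cv=3, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)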
Run the pipeline object to display the pipeline.
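If you are working outside a notebook, one option is to export the diagram to a standalone HTML file using Scikit-learn's `estimator_html_repr` utility (a minimal sketch; the output file name is arbitrary):

from sklearn.utils import estimator_html_repr

# Save the pipeline diagram as a standalone HTML file
with open("train_pipe.html", "w") as f:
    f.write(estimator_html_repr(train_pipe))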
Now, we will combine the preprocessing and training steps by creating another pipeline that contains both pipelines.
Here is the complete code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
#loading the data
bank_df = pd.read_csv("train.csv", index_col="id")
bank_df = bank_df.drop(["CustomerId", "Surname"], axis=1)
bank_df = bank_df.sample(frac=1)  # shuffle the rows
# Splitting data into training and testing sets
X = bank_df.drop("Exited", axis=1)
y = bank_df.Exited
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=125
)
# Identify numerical and categorical columns
cat_col = [1, 2]
num_col = [0, 3, 4, 5, 6, 7, 8, 9]
# Transformers for numerical data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler())
])
# Transformers for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder())
])
# Combine pipelines using ColumnTransformer
preproc_pipe = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, num_col),
        ('cat', categorical_transformer, cat_col)
    ],
    remainder="passthrough"
)
# Selecting the best features
KBest = SelectKBest(chi2, k="all")
# Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, random_state=125)
# KBest and model pipeline
train_pipe = Pipeline(
    steps=[
        ("KBest", KBest),
        ("RFmodel", model),
    ]
)
# Combining the preprocessing and training pipelines
complete_pipe = Pipeline(
    steps=[
        ("preprocessor", preproc_pipe),
        ("train", train_pipe),
    ]
)
# running the complete pipeline
complete_pipe.fit(X_train,y_train)
# model accuracy
complete_pipe.score(X_test, y_test)
Output:
Viewing the complete pipeline.
One of the main advantages of using pipelines is that you can save the pipeline together with the model. During inference, you only need to load the pipeline object, which will be ready to process the raw data and give you accurate predictions. There is no need to rewrite the processing and transformation functions in the application file, as everything works out of the box. This makes the machine learning workflow more efficient and saves time.
Let's first save the pipeline using the skops-dev/skops library.
import skops.io as sio
sio.dump(complete_pipe, "bank_pipeline.skops")
Then, load the saved pipeline and display it.
new_pipe = sio.load("bank_pipeline.skops", trusted=True)
new_pipe
As we can see, we have successfully loaded the pipeline.
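Because the loaded pipeline contains the preprocessing steps, it can predict directly on raw rows. A minimal sketch using the first few rows of the raw test set:

# X_test still holds raw, unprocessed columns; the pipeline
# imputes, encodes, and scales them internally before predicting
sample = X_test[:3]
print(new_pipe.predict(sample))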
To evaluate our loaded pipeline, we will make predictions on the test set and then calculate the accuracy and F1 scores.
from sklearn.metrics import accuracy_score, f1_score
predictions = new_pipe.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
f1 = f1_score(y_test, predictions, average="macro")
print("Accuracy:", str(round(accuracy, 2) * 100) + "%", "F1:", round(f1, 2))
It turns out that we have to focus on the minority class to improve our F1 score.
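One common starting point, offered here as a suggestion rather than part of the original workflow, is to re-weight the classes inside the pipeline's classifier; `RandomForestClassifier` supports this through its `class_weight` parameter:

# Weight classes inversely to their frequency so the minority
# class has more influence during training
model = RandomForestClassifier(
    n_estimators=100, random_state=125, class_weight="balanced"
)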
Project files and code are available in the Deepnote workspace. The workspace has two notebooks: one with the Scikit-learn pipeline and one without.
In this tutorial, we learned how Scikit-learn pipelines can help streamline machine learning workflows by chaining together sequences of data transformations and models. By combining preprocessing and model training into a single `Pipeline` object, we can simplify code, ensure consistent data transformations, and make our workflows more organized and reproducible.
Abid Ali Awan (@1abidaliawan) is a certified professional data scientist who loves building machine learning models. Currently, he focuses on content creation and writing technical blogs on data science and machine learning technologies. Abid has a Master's degree in Technology Management and a Bachelor's degree in Telecommunications Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.