First, we need to develop a model for our example. Since this article focuses on implementing the model rather than on its performance, we will build a simple model with a limited set of features.
In this example, we will predict the salary of a data professional based on characteristics such as experience level, job title, and company size.
See data here: https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries (CC0: Public Domain). I modified the data slightly to reduce the number of options for certain fields.
#import packages for data manipulation
import pandas as pd
import numpy as np
#import packages for machine learning
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.metrics import mean_squared_error, r2_score
#import packages for data management
import joblib
First, let's look at the data.
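To keep this walkthrough self-contained, here is a small illustrative frame with the same columns the article uses; the values below are made up, not rows from the Kaggle dataset, so substitute your local copy of the data when following along.

```python
import pandas as pd

# A few sample rows with the columns used in this article
# (values are illustrative, not from the real dataset)
salary_data = pd.DataFrame({
    "experience_level": ["EN", "MI", "SE", "EX"],
    "company_size": ["S", "M", "L", "M"],
    "employment_type": ["FT", "FT", "PT", "FT"],
    "job_title": ["Data Analyst", "Data Scientist", "Data Engineer", "Data Scientist"],
    "salary_in_usd": [60000, 95000, 120000, 200000],
})

# Inspect the first few rows and the overall shape
print(salary_data.head())
print(salary_data.shape)  # (4, 5)
```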
Since all of our features are categorical, we will use encoding to transform the data to numeric. First, we will use ordinal encoders to encode experience level and company size. These are ordinal because they represent a type of progression (0 = entry level, 1 = mid level, etc.).
For job title and employment type, we will create dummy variables for each option (note that we drop the first category to avoid multicollinearity).
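To see what drop_first does, here is a minimal sketch on a toy column (the values are illustrative): with three employment types, only two dummy columns remain, and the dropped category is implied when both are zero.

```python
import pandas as pd

# Toy frame to illustrate drop_first: three employment types
df = pd.DataFrame({"employment_type": ["FT", "PT", "CT", "FT"]})

dummies = pd.get_dummies(df, columns=["employment_type"],
                         drop_first=True, dtype=int)

# 'CT' (first alphabetically) is dropped; rows where both remaining
# dummies are 0 are implicitly 'CT'
print(list(dummies.columns))  # ['employment_type_FT', 'employment_type_PT']
```

Dropping one level avoids the "dummy variable trap", where the full set of dummies is perfectly collinear with the intercept.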
#use ordinal encoder to encode experience level
encoder = OrdinalEncoder(categories = [['EN', 'MI', 'SE', 'EX']])
salary_data['experience_level_encoded'] = encoder.fit_transform(salary_data[['experience_level']])
#use ordinal encoder to encode company size
encoder = OrdinalEncoder(categories = [['S', 'M', 'L']])
salary_data['company_size_encoded'] = encoder.fit_transform(salary_data[['company_size']])
#encode employment type and job title using dummy columns
salary_data = pd.get_dummies(salary_data, columns = ['employment_type', 'job_title'], drop_first = True, dtype = int)
#drop original columns
salary_data = salary_data.drop(columns = ['experience_level', 'company_size'])
Now that we have transformed our model inputs, we can create our training and test sets. We will feed these features into a simple linear regression model to predict the employee's salary.
#define independent and dependent features
x = salary_data.drop(columns = 'salary_in_usd')
y = salary_data['salary_in_usd']
#split between training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
x, y, random_state = 104, test_size = 0.2, shuffle = True)
#fit linear regression model
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
#make predictions
y_pred = regr.predict(X_test)
#print the coefficients
print("Coefficients: \n", regr.coef_)
#print the MSE
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
#print the R2 value
print("R2: %.2f" % r2_score(y_test, y_pred))
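Note that r2_score returns the plain R²; if you also want an adjusted R², which penalizes the score for the number of predictors, it can be computed from R² directly. A minimal sketch with illustrative numbers (not the actual test-set counts from this article):

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """n = number of observations, p = number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Illustrative: R^2 = 0.27 with 100 test rows and 10 predictors
print(round(adjusted_r2(0.27, 100, 10), 3))  # 0.188
```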
Let's see how our model did.
It looks like our R squared is only 0.27, ouch! Much more work would need to be done with this model; we would probably need more data and additional information about the observations. But for the sake of this article, we'll go ahead and save our model.
#save model using joblib
joblib.dump(regr, 'lin_regress.sav')
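Once the model is saved, it can be reloaded later with joblib.load and used for predictions without retraining. Here is a self-contained round-trip sketch using a tiny stand-in regression (the article's regr object works the same way):

```python
import joblib
import numpy as np
from sklearn import linear_model

# Fit a tiny stand-in model on data that follows y = 2x + 1 exactly
regr = linear_model.LinearRegression()
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])
regr.fit(X, y)

# Round-trip through joblib: dump to disk, then load it back
joblib.dump(regr, "lin_regress.sav")
loaded = joblib.load("lin_regress.sav")

# The reloaded model reproduces the original predictions
print(loaded.predict([[3.0]]))  # [7.]
```

This load step is what a serving script or API endpoint would run at startup, so the model is trained once and reused for every request.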