Image by author
GitHub Actions is a powerful feature of the GitHub platform that allows you to automate software development workflows, such as testing, building, and deploying code. This not only speeds up the development process but also makes it more reliable and efficient.
In this tutorial, we'll explore how to use GitHub Actions for a beginner machine learning (ML) project. From setting up our ML project on GitHub to creating a GitHub Actions workflow that automates your ML tasks, we'll cover everything you need to know.
GitHub Actions is a powerful tool that provides a continuous integration and continuous delivery (CI/CD) pipeline for all GitHub repositories for free. It automates the entire software development workflow, from building and testing to code deployment, all within the GitHub platform. You can use it to improve the efficiency of your development and deployment.
Key Features of GitHub Actions
Now we will learn about the key components of the workflow.
Workflows
Workflows are automated processes that you define in your GitHub repository. They are made up of one or more jobs and can be triggered by GitHub events such as a push, pull request, issue creation, or by workflows. Workflows are defined in a YML file within the .github/workflows directory of your repository. You can edit it and rerun the workflow directly from the GitHub repository.
Jobs and Steps
Within a workflow, jobs define a set of steps that are executed on the same executor. Each step of a job can execute commands or actions, which are reusable pieces of code that can perform a specific task, such as formatting the code or training the model.
Events
Workflows can be triggered by various GitHub events such as push, pull request, forks, stars, releases, and more. You can also schedule workflows to run at specific times using the cron syntax.
runners
Corridors are the virtual environments/machines where the workflows are executed. GitHub provides hosted runners in Linux, Windows, and macOS environments, or you can host your own runner for more control over the environment.
Behavior
Actions are reusable units of code that you can use as steps within your jobs. You can create your own actions or use actions shared by the GitHub community on the GitHub Marketplace.
GitHub Actions makes it easy for developers to automate their build, test, and deploy workflows directly within GitHub, helping to improve productivity and streamline the development process.
In this project, we will use two actions:
- actions/payment@v3: to check your repository so that the workflow can access the file and data.
- iterative/setup-cml@v2: to display the model metrics and confusion matrix below the commit as a message.
We will work on a simple machine learning project using the Bank turnover Kaggle dataset to train and evaluate a random forest classifier.
Setting
- We will create the GitHub repository by providing the name and description, verifying the readme and license.
- Go to the project manager and clone the repository.
- Change the directory to the repository folder.
- Start the code editor. In our case, it is VSCode.
$ git clone https://github.com/kingabzpro/GitHub-Actions-For-Machine-Learning-Beginners.git
$ cd .\GitHub-Actions-For-Machine-Learning-Beginners\
$ code .
- Create a `requirements.txt` file and add all the necessary packages to run the workflow successfully.
pandas
scikit-learn
numpy
matplotlib
skops
black
- Download the data from Kaggle using the link and extract it into the main folder.
- The data set is large, so we have to install GitLFS in our repository and crawl the train CSV file.
$ git lfs install
$ git lfs track train.csv
Training and evaluation code
In this section, we will write the code that will train, evaluate, and save the model pipelines. The code is from my previous tutorial, Optimize Your Machine Learning Workflow with Scikit-learn Pipelines. If you want to know how the scikit-learn process works, you should read it.
- Create a `train.py` file and copy and paste the following code.
- The code uses ColumnTransformer and Pipeline to preprocess the data and Pipeline for feature selection and model training.
- After evaluating the performance of the model, both the metrics and the confusion matrix are saved in the main folder. These metrics will be used later by the CML action.
- In the end, the final scikit-learn process is saved for model inference.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder
from sklearn.metrics import accuracy_score, f1_score
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
import skops.io as sio
# loading the data
bank_df = pd.read_csv("train.csv", index_col="id", nrows=1000)
bank_df = bank_df.drop(("CustomerId", "Surname"), axis=1)
bank_df = bank_df.sample(frac=1)
# Splitting data into training and testing sets
X = bank_df.drop(("Exited"), axis=1)
y = bank_df.Exited
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=125
)
# Identify numerical and categorical columns
cat_col = (1, 2)
num_col = (0, 3, 4, 5, 6, 7, 8, 9)
# Transformers for numerical data
numerical_transformer = Pipeline(
steps=(("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler()))
)
# Transformers for categorical data
categorical_transformer = Pipeline(
steps=(
("imputer", SimpleImputer(strategy="most_frequent")),
("encoder", OrdinalEncoder()),
)
)
# Combine pipelines using ColumnTransformer
preproc_pipe = ColumnTransformer(
transformers=(
("num", numerical_transformer, num_col),
("cat", categorical_transformer, cat_col),
),
remainder="passthrough",
)
# Selecting the best features
KBest = SelectKBest(chi2, k="all")
# Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, random_state=125)
# KBest and model pipeline
train_pipe = Pipeline(
steps=(
("KBest", KBest),
("RFmodel", model),
)
)
# Combining the preprocessing and training pipelines
complete_pipe = Pipeline(
steps=(
("preprocessor", preproc_pipe),
("train", train_pipe),
)
)
# running the complete pipeline
complete_pipe.fit(X_train, y_train)
## Model Evaluation
predictions = complete_pipe.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
f1 = f1_score(y_test, predictions, average="macro")
print("Accuracy:", str(round(accuracy, 2) * 100) + "%", "F1:", round(f1, 2))
## Confusion Matrix Plot
predictions = complete_pipe.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=complete_pipe.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=complete_pipe.classes_)
disp.plot()
plt.savefig("model_results.png", dpi=120)
## Write metrics to file
with open("metrics.txt", "w") as outfile:
outfile.write(f"\nAccuracy = {round(accuracy, 2)}, F1 Score = {round(f1, 2)}\n\n")
# saving the pipeline
sio.dump(complete_pipe, "bank_pipeline.skops")
We got a good result.
$ python train.py
Accuracy: 88.0% F1: 0.77
You can learn more about the inner workings of the code mentioned above by reading “Simplify your machine learning workflow with Scikit-learn Pipelines.”
We don't want Git to send output files as they are always generated at the end of the code, so we will add them to the .gitignore file.
Simply type `.gitignore` in the terminal to launch the file.
Add the following file names.
metrics.txt
model_results.png
bank_pipeline.skops
This is what it should look like in your VSCode.
Now we will prepare the changes, create a commit, and push the changes to the main GitHub branch.
git add .
git commit -m "new changes"
git push origin main
This is how your GitHub works repository should look like.
LMC
Before you start working on the workflow, it is important to understand the purpose of Continuous Machine Learning (CML) behavior. CML functions are used in the workflow to automate the process of generating a model evaluation report. What does this mean? Well, when we push changes to GitHub, a report will be automatically generated under the commit. This report will include performance metrics and a confusion matrix, and we will also receive an email with all this information.
GitHub Actions
It's time for the main part. We will develop a machine learning workflow to train and evaluate our model. This workflow will be triggered every time we push our code to the master branch or when someone submits a pull request to the master branch.
To create our first workflow, navigate to the “Actions” tab in the repository and click on the blue text “set up a workflow yourself”. It will create a YML file in the .github/workflows directory and provide us with the interactive code editor to add the code.
Add the following code to the workflow file. In this code, we are:
- Name our workflow.
- Configuring triggers on push and pull requests using on keys.
- Provide actions with written permission so that the CML action can create the message under the commit.
- Use the Ubuntu Linux runner.
- Use the `actions/checkout@v3` action to access all files in the repository, including the dataset.
- Using the `iterative/setup-cml@v2` action to install the CML package.
- Create the run to install all Python packages.
- Create the run to format the Python files.
- Create the run to train and evaluate the model.
- Create the run with GITHUB_TOKEN to move the model metrics and confusion matrix graph to the report.md file. Then use the CML command to create the report under the commit comment.
name: ML Workflow
on:
push:
branches: ( "main" )
pull_request:
branches: ( "main" )
workflow_dispatch:
permissions: write-all
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
with:
lfs: true
- uses: iterative/setup-cml@v2
- name: Install Packages
run: pip install --upgrade pip && pip install -r requirements.txt
- name: Format
run: black *.py
- name: Train
run: python train.py
- name: Evaluation
env:
REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
echo "## Model Metrics" > report.md
cat metrics.txt >> report.md
echo '## Confusion Matrix Plot' >> report.md
echo '!(Confusion Matrix)(model_results.png)' >> report.md
cml comment create report.md
This is what it should look like in your GitHub workflow.
After making the changes. The workflow will begin to execute the command sequentially.
After completing the workflow, we can view the logs by clicking on the recent workflow in the “Actions” tab, opening the build, and reviewing the logs for each task.
Now we can see the evaluation of the model in the confirmation messages section. We can access it by clicking on the confirmation link: fixed location in workflow kingabzpro/GitHub-Actions-For-Machine-Learning-Beginners@44c74fa
You will also receive an email from GitHub.
The source code is available in my GitHub repository: kingabzpro/GitHub-Actions-for-Machine-Learning-Beginners. You can clone it and try it yourself.
Machine Learning Operations (MLOps) is a vast field that requires knowledge of various tools and platforms to successfully build and deploy models in production. To get started with MLOps, it is recommended to follow a complete tutorial.”A Beginner's Guide to CI/CD for Machine Learning“. It will provide you with a solid foundation to effectively implement MLOps techniques.
In this tutorial, we cover what GitHub Actions are and how they can be used to automate your machine learning workflow. We also learned about CML actions and how to write scripts in YML format to run jobs correctly. If you still don't know where to start, I suggest you take a look at the only free course you need to become an MLOps engineer.
Abid Ali Awan (@1abidaliawan) is a certified professional data scientist who loves building machine learning models. Currently, he focuses on content creation and writing technical blogs on data science and machine learning technologies. Abid has a master's degree in technology management and a bachelor's degree in telecommunications engineering. His vision is to build an artificial intelligence product using a graph neural network for students struggling with mental illness.