Data Scientists are now expected to write production code to deploy their machine learning algorithms. Therefore, we need to be aware of software engineering standards and methods to ensure our models are deployed robustly and effectively. One such tool that is very well known in the developer community is `make`. This is a powerful command-line tool that has been known to developers for a long time, and in this article I want to show how it can be used to build efficient machine learning pipelines.
`make` is a terminal command/executable, just like `ls` or `cd`, that is available in most UNIX-like operating systems such as macOS and Linux.
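You can quickly verify that it is available on your machine straight from the terminal (on macOS it typically ships with the Xcode command line tools):

```bash
# Check that make is installed and which version you have
which make
make --version
```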
The purpose of `make` is to simplify and break down your workflow into a logical grouping of shell commands. It is used widely by developers and is increasingly being adopted by Data Scientists, as it simplifies the machine learning pipeline and enables more robust production deployment.
`make` is a powerful tool that Data Scientists should be utilising, for the following reasons:
- Automated setup of machine learning environments (see the sketch after this list)
- Clearer end-to-end pipeline documentation
- Easier testing of models with different parameters
- An obvious structure and execution order for your project
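As a hedged sketch of the first and third points above, here is how they might look in practice. The file names (`requirements.txt`, `train.py`) and the `ALPHA` parameter are hypothetical stand-ins for your own project:

```makefile
.PHONY: setup train

# Automate environment setup: create a virtual environment
# and install the project's dependencies into it
setup:
	python3 -m venv .venv
	.venv/bin/pip install -r requirements.txt

# Default parameter value; override it from the command line,
# e.g. `make train ALPHA=0.5`
ALPHA ?= 0.1

# Run training with the chosen parameter
train:
	.venv/bin/python train.py --alpha $(ALPHA)
```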
A `Makefile` is basically what the `make` command reads and executes from. It has three components:
- Targets: the files you are trying to build, or a phony target (declared with `.PHONY`) if you are just carrying out commands rather than producing a file.
- Dependencies: source files that need to exist, or be run, before this target is executed.
- Commands: as it says on the tin, the list of steps that produce the target.
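Putting the three components together, a rule in a `Makefile` has the following general shape (the file and script names below are placeholders, not part of any real project):

```makefile
# target: dependencies
#     commands  (recipe lines must be indented with a tab, not spaces)

# Build output.csv from input.csv by running a cleaning script
output.csv: input.csv clean.py
	python clean.py input.csv output.csv

# A phony target carries out commands without producing a file of that name
.PHONY: clean
clean:
	rm -f output.csv
```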
Let’s run through a very simple example to make this theory concrete.