Data Scientists are now expected to write production code to deploy their machine learning algorithms. Therefore, we need to be aware of software engineering standards and methods to ensure our models are deployed robustly and effectively. One such tool that is very well known in the developer community is `make`. This is a powerful command-line tool that has been known to developers for a long time, and in this article I want to show how it can be used to build efficient machine learning pipelines.
`make` is a terminal command/executable, just like `ls` or `cd`, that is available in most UNIX-like operating systems such as macOS and Linux.
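You can quickly verify that it is available on your machine straight from the terminal (on macOS it typically ships with the Xcode command line tools):

```bash
# Check that make is installed and which version you have
which make
make --version
```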
The purpose of `make` is to simplify and break down your workflow into a logical grouping of shell commands. It is used widely by developers and is increasingly being adopted by Data Scientists, as it simplifies the machine learning pipeline and enables more robust production deployment.
`make` is a powerful tool that Data Scientists should be utilising, for the following reasons:
- Automated setup of machine learning environments (see the sketch after this list)
- Clearer end-to-end pipeline documentation
- Easier testing of models with different parameters
- An obvious structure and execution order for your project
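As a hedged sketch of the first and third points above, here is how they might look in practice. The file names (`requirements.txt`, `train.py`) and the `ALPHA` parameter are hypothetical stand-ins for your own project:

```makefile
.PHONY: setup train

# Automate environment setup: create a virtual environment
# and install the project's dependencies into it
setup:
	python3 -m venv .venv
	.venv/bin/pip install -r requirements.txt

# Default parameter value; override it from the command line,
# e.g. `make train ALPHA=0.5`
ALPHA ?= 0.1

# Run training with the chosen parameter
train:
	.venv/bin/python train.py --alpha $(ALPHA)
```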
A `Makefile` is basically what the `make` command reads and executes from. It has three components:
- Targets: the files you are trying to build, or a phony target (declared with `.PHONY`) if you are just carrying out commands rather than producing a file.
- Dependencies: source files that need to exist, or be run, before this target is executed.
- Commands: as it says on the tin, the list of steps that produce the target.
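Putting the three components together, a rule in a `Makefile` has the following general shape (the file and script names below are placeholders, not part of any real project):

```makefile
# target: dependencies
#     commands  (recipe lines must be indented with a tab, not spaces)

# Build output.csv from input.csv by running a cleaning script
output.csv: input.csv clean.py
	python clean.py input.csv output.csv

# A phony target carries out commands without producing a file of that name
.PHONY: clean
clean:
	rm -f output.csv
```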
Let’s run through a very simple example to make this theory concrete.