Building algorithm-independent models with MLflow | by Mena Wang, PhD | August 2024

Does this sound interesting? If so, this article is here to help you get started with mlflow.pyfunc.

First, let's look at a simple example of creating a toy mlflow.pyfunc class.
Then, we will define a mlflow.pyfunc class that encapsulates a machine learning pipeline (an estimator plus some preprocessing logic, for example). We will also train, log, and load this ML pipeline for inference.
Finally, let's dive into the encapsulated mlflow.pyfunc object, explore the rich metadata and artifacts automatically tracked for us by mlflow, and better understand all the power that mlflow.pyfunc offers.

All code and configuration are available on GitHub.

First, let's create a simple toy mlflow.pyfunc model and then use it with the mlflow workflow.

Step 1: Create the model
Step 2: Log the model
Step 3: Load the logged model to perform inference

# Step 1: Create a mlflow.pyfunc model
import mlflow.pyfunc


class ToyModel(mlflow.pyfunc.PythonModel):
    """
    ToyModel is a simple example implementation of an MLflow Python model.
    """

    def predict(self, context, model_input):
        """
        A basic predict function that takes a model_input list and returns a new list
        where each element is increased by one.

        Parameters:
        - context (Any): An optional context parameter provided by MLflow.
        - model_input (list of int or float): A list of numerical values that the model will use for prediction.

        Returns:
        - list of int or float: A list with each element of model_input increased by one.
        """
        # Return a list (not a generator) so the result can be printed and reused
        return [x + 1 for x in model_input]

As you can see in the example above, you can create a mlflow.pyfunc model to implement any custom Python function you see fit for your machine learning solution; it does not have to be a standard machine learning algorithm.

You can then log this model and load it later to perform inference.

# Step 2: log this model as an mlflow run
with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=ToyModel(),
    )
    run_id = mlflow.active_run().info.run_id

# Step 3: load the logged model to perform inference
model = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")
# dummy new data
x_new = [1, 2, 3]
# model inference for the new data
print(model.predict(x_new))

[2, 3, 4]
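The string passed to load_model in Step 3 is a run-relative URI of the form runs:/<run_id>/<artifact_path>. As a small sketch of that convention, the URI can be composed and taken apart with plain string handling. The helper names build_runs_uri and split_runs_uri and the run id "abc123" are hypothetical and not part of MLflow.

```python
from typing import Tuple


def build_runs_uri(run_id: str, artifact_path: str = "model") -> str:
    """Compose a run-relative model URI like the one used in Step 3."""
    return f"runs:/{run_id}/{artifact_path}"


def split_runs_uri(uri: str) -> Tuple[str, str]:
    """Split 'runs:/<run_id>/<artifact_path>' into (run_id, artifact_path)."""
    prefix = "runs:/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a runs:/ URI: {uri!r}")
    run_id, _, artifact_path = uri[len(prefix):].partition("/")
    return run_id, artifact_path


# "abc123" is a made-up run id used only for illustration
uri = build_runs_uri("abc123")
print(uri)
print(split_runs_uri(uri))
```

The artifact_path part must match the value given to log_model in Step 2, which is why both steps use "model".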

Now, let's create an ML pipeline that encapsulates an estimator with additional custom logic.

In the following example, the XGB_PIPELINE class is a wrapper that integrates the estimator with preprocessing steps, which can be desirable for some MLOps implementations. Built on mlflow.pyfunc, this wrapper is estimator-agnostic and provides a uniform model representation. Specifically,

fit(): Instead of exposing XGBoost's native API (xgboost.train()) directly, this class provides a .fit() method that follows sklearn conventions, allowing easy integration into sklearn pipelines and ensuring consistency across different estimators.
DMatrix(): DMatrix is a core data structure in XGBoost that optimizes data for training and prediction. In this class, the step that converts a pandas DataFrame into a DMatrix is wrapped inside fit() and predict(), so the class works directly with pandas DataFrames like other sklearn estimators.
predict(): This is the universal inference API of mlflow.pyfunc. It is the same for this ML pipeline, for the toy model above, and for any machine learning algorithm or custom logic we wrap in a mlflow.pyfunc model.

import json
import xgboost as xgb
import mlflow.pyfunc
from typing import Any, Dict, Union
import pandas as pd


class XGB_PIPELINE(mlflow.pyfunc.PythonModel):
    """
    XGB_PIPELINE is an example implementation of an MLflow Python model with XGBoost.
    """

    def __init__(self, params: Dict[str, Union[str, int, float]]):
        """
        Initialize the model with given parameters.

        Parameters:
        - params (Dict[str, Union[str, int, float]]): Parameters for the XGBoost model.
        """
        self.params = params
        self.xgb_model = None
        self.config = None

    def preprocess_input(self, model_input: pd.DataFrame) -> pd.DataFrame:
        """
        Preprocess the input data.

        Parameters:
        - model_input (pd.DataFrame): The input data to preprocess.

        Returns:
        - pd.DataFrame: The preprocessed input data.
        """
        processed_input = model_input.copy()
        # put any desired preprocessing logic here
        # (as a placeholder, this demo simply drops the first column)
        processed_input.drop(processed_input.columns[0], axis=1, inplace=True)
        return processed_input

    def fit(self, X_train: pd.DataFrame, y_train: pd.Series):
        """
        Train the XGBoost model.

        Parameters:
        - X_train (pd.DataFrame): The training input data.
        - y_train (pd.Series): The target values.
        """
        processed_model_input = self.preprocess_input(X_train.copy())
        # convert the DataFrame to XGBoost's DMatrix inside the class
        dtrain = xgb.DMatrix(processed_model_input, label=y_train)
        self.xgb_model = xgb.train(self.params, dtrain)

    def predict(self, context: Any, model_input: pd.DataFrame) -> Any:
        """
        Predict using the trained XGBoost model.

        Parameters:
        - context (Any): An optional context parameter provided by MLflow.
        - model_input (pd.DataFrame): The input data for making predictions.

        Returns:
        - Any: The prediction results.
        """
        processed_model_input = self.preprocess_input(model_input.copy())
        dmatrix = xgb.DMatrix(processed_model_input)
        return self.xgb_model.predict(dmatrix)
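To make the estimator-agnostic idea concrete, here is a minimal sketch that deliberately leaves out MLflow and XGBoost. MeanPipeline, MedianPipeline, and run_inference are hypothetical stand-ins, not code from the article's repository. They share the same fit() and predict(context, model_input) shape as XGB_PIPELINE, so the calling code does not change when the "estimator" inside is swapped.

```python
from statistics import mean, median
from typing import Any, List, Optional


class MeanPipeline:
    """Hypothetical stand-in pipeline: predicts the mean of the training targets."""

    def __init__(self) -> None:
        self.value: Optional[float] = None

    def fit(self, X_train: List[List[float]], y_train: List[float]) -> None:
        self.value = mean(y_train)

    def predict(self, context: Any, model_input: List[List[float]]) -> List[float]:
        # One constant prediction per input row
        return [self.value for _ in model_input]


class MedianPipeline(MeanPipeline):
    """Same interface, different 'estimator' inside."""

    def fit(self, X_train: List[List[float]], y_train: List[float]) -> None:
        self.value = median(y_train)


def run_inference(pipeline: Any, rows: List[List[float]]) -> List[float]:
    # The caller relies only on predict(context, model_input),
    # so swapping the estimator requires no change here.
    return pipeline.predict(None, rows)


X = [[0.0], [1.0], [2.0]]
y = [1.0, 2.0, 9.0]
for pipeline in (MeanPipeline(), MedianPipeline()):
    pipeline.fit(X, y)
    print(type(pipeline).__name__, run_inference(pipeline, [[5.0], [6.0]]))
```

In the real pipeline above, the same role is played by subclassing mlflow.pyfunc.PythonModel, which is what lets mlflow log and load the object through one inference API.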

Now, let's train and log this model.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import pandas as pd

# Generate synthetic datasets for demo
x, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# train and log the model
with mlflow.start_run(run_name='xgb_demo') as run:
    # Create an instance of XGB_PIPELINE
    params = {
        'objective': 'reg:squarederror',
        'max_depth': 3,
        'learning_rate': 0.1,
    }
    model = XGB_PIPELINE(params)
    # Fit the model
    model.fit(X_train=pd.DataFrame(X_train), y_train=y_train)
    # Log the model
    model_info = mlflow.pyfunc.log_model(
        artifact_path='model',
        python_model=model,
    )
    run_id = mlflow.active_run().info.run_id

The model has been logged successfully. Now, let's load it for inference.

loaded_model = mlflow.pyfunc.load_model(model_uri=model_info.model_uri)
loaded_model.predict(pd.DataFrame(X_test))

array([ 4.11692047e+00,  7.30551958e+00, -2.36042137e+01, -1.31888123e+02,
...

The above process is pretty straightforward, isn't it? This represents the basic functionality of the mlflow.pyfunc object. Now, let's dig deeper to explore all the power mlflow.pyfunc has to offer.

1. model_info

In the example above, the model_info object returned by mlflow.pyfunc.log_model() is an instance of the mlflow.models.model.ModelInfo class. It contains metadata and information about the logged model. For example:

Screenshot: some attributes of the model_info object

Feel free to run dir(model_info) to explore further, or consult the source code for all defined attributes. The attribute I use the most is model_uri, which indicates where the logged model can be found within the mlflow tracking system.

2. loaded_model

It is worth clarifying that loaded_model is not an instance of the XGB_PIPELINE class, but rather a wrapper object provided by mlflow.pyfunc for algorithm-agnostic inference. As shown below, an error is raised if you try to access attributes of the XGB_PIPELINE class through loaded_model.

print(loaded_model.params)

AttributeError: 'PyFuncModel' object has no attribute 'params'

3. unwrapped_model

Okay, you might be wondering: where is the trained instance of XGB_PIPELINE? Is it logged and retrievable through mlflow as well?

Don't worry, it is stored safely and can be easily unwrapped as shown below.

unwrapped_model = loaded_model.unwrap_python_model()
print(unwrapped_model.params)

{'objective': 'reg:squarederror', 'max_depth': 3, 'learning_rate': 0.1}

This is how it's done. With the unwrapped_model, you can access any attribute or method of your custom ML pipeline this way! Sometimes I add useful methods like explain_model or post_processing to the custom pipeline, or include useful attributes to track the model training process and offer diagnostics … Well, I'd better stop here and leave that for upcoming articles. Suffice it to say that you can freely customize your ML pipeline for your use case and know that:

You will have access to all these custom methods and attributes for downstream use, and
This custom model will be wrapped inside the uniform mlflow.pyfunc inference API and therefore enjoys seamless migration to other estimators if needed.
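As a simplified mental model of how a uniform wrapper relates to the custom pipeline inside it, consider the sketch below. It is not MLflow's actual PyFuncModel implementation. SketchWrapper and TinyPipeline are hypothetical classes written only to show the pattern: the wrapper exposes predict(data), supplies the context itself, hides the pipeline's attributes, and hands the original object back on request.

```python
from typing import Any, Dict, List


class TinyPipeline:
    """Hypothetical stand-in for a custom pipeline such as XGB_PIPELINE."""

    def __init__(self, params: Dict[str, Any]) -> None:
        self.params = params

    def predict(self, context: Any, model_input: List[float]) -> List[float]:
        return [x * self.params["scale"] for x in model_input]


class SketchWrapper:
    """Simplified illustration of a uniform wrapper; NOT MLflow's PyFuncModel."""

    def __init__(self, python_model: Any, context: Any = None) -> None:
        self._python_model = python_model
        self._context = context

    def predict(self, data: List[float]) -> List[float]:
        # The wrapper supplies the context, so callers pass only the data.
        return self._python_model.predict(self._context, data)

    def unwrap_python_model(self) -> Any:
        # Hand back the original custom object with all its attributes.
        return self._python_model


wrapper = SketchWrapper(TinyPipeline({"scale": 2}))
print(wrapper.predict([1, 2, 3]))            # inference through the uniform API
print(hasattr(wrapper, "params"))            # the wrapper itself has no params
print(wrapper.unwrap_python_model().params)  # the unwrapped pipeline does
```

The design choice this illustrates is that callers depend only on the wrapper's predict(data), while anything pipeline-specific stays reachable through an explicit unwrap step.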

4. Context

You may have noticed that there is a context parameter in the predict methods of both mlflow.pyfunc classes defined above. But interestingly, this parameter is not needed when making predictions with the loaded model. Why?

loaded_model = mlflow.pyfunc.load_model(model_uri)
# the context parameter is not needed when calling `predict`
loaded_model.predict(model_input)

This is because loaded_model above is a wrapper object provided by mlflow. If we use the unwrapped model, we must pass the context explicitly, as shown below; otherwise the code will raise an error.

unwrapped_model = loaded_model.unwrap_python_model()
# need to provide context manually
unwrapped_model.predict(context=None, model_input=model_input)

So what is this context? And what role does it play in the predict method?

The context is a PythonModelContext object containing artifacts that the pyfunc model can use when making inferences. It is created implicitly and automatically by the log_model() method.

Navigate to the mlruns subfolder in your project repository, which is automatically created by mlflow when you log a mlflow model. Find the folder named after the model's run_id. Inside, you will find the automatically logged model artifacts, as shown below.
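To illustrate how a model might use the artifacts carried by the context, here is a minimal sketch that runs without MLflow. ThresholdModel, the "config" artifact name, and the config.json file are all hypothetical, and a SimpleNamespace stands in for the PythonModelContext. In a real project the class would subclass mlflow.pyfunc.PythonModel, the file would be supplied through the artifacts argument of mlflow.pyfunc.log_model(), and load_context() would be called when the model is loaded.

```python
import json
import tempfile
from pathlib import Path
from types import SimpleNamespace
from typing import Any, List


class ThresholdModel:
    """Hypothetical model; in MLflow it would subclass mlflow.pyfunc.PythonModel."""

    def load_context(self, context: Any) -> None:
        # context.artifacts maps artifact names to local file paths.
        with open(context.artifacts["config"]) as f:
            self.threshold = json.load(f)["threshold"]

    def predict(self, context: Any, model_input: List[float]) -> List[int]:
        # Flag each value that is strictly above the loaded threshold
        return [int(x > self.threshold) for x in model_input]


# Stand-in for the context object that MLflow would provide.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps({"threshold": 0.5}))
    fake_context = SimpleNamespace(artifacts={"config": str(config_path)})

    model = ThresholdModel()
    model.load_context(fake_context)
    print(model.predict(fake_context, [0.2, 0.7, 0.5]))
```

Neither model in this article passes extra artifacts, which is why their predict methods never touch the context and why context=None works for the unwrapped model.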

# get run_id of a loaded model
print(loaded_model.metadata.run_id)

38a617d0f30645e8ae95eea4642a03c2

Screenshot: artifacts folder of a logged `mlflow.pyfunc` model
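If you prefer to inspect that folder from code rather than a file browser, the sketch below lists a run's artifact files with pathlib. It assumes the local file-store layout mlruns/<experiment_id>/<run_id>/artifacts/, which can differ across MLflow versions and tracking setups. The function name list_run_artifacts and the demo run id "demo_run" are hypothetical, and the demo builds a throwaway folder instead of reading a real mlruns directory.

```python
import tempfile
from pathlib import Path
from typing import List


def list_run_artifacts(tracking_root: Path, run_id: str) -> List[str]:
    """Return artifact file paths, relative to the run's artifacts folder."""
    # Assumed layout: <tracking_root>/<experiment_id>/<run_id>/artifacts/...
    matches = sorted(tracking_root.glob(f"*/{run_id}/artifacts"))
    if not matches:
        raise FileNotFoundError(f"no artifacts folder for run {run_id}")
    root = matches[0]
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


# Demo on a temporary folder that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "mlruns" / "0" / "demo_run" / "artifacts" / "model"
    model_dir.mkdir(parents=True)
    for name in ("MLmodel", "requirements.txt"):
        (model_dir / name).touch()
    found = list_run_artifacts(Path(tmp) / "mlruns", "demo_run")
    print(found)
```

With a real run, tracking_root would point at your project's mlruns folder and run_id would be the value printed by loaded_model.metadata.run_id.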

Pretty neat, huh? Feel free to explore these artifacts at your leisure; below are screenshots of the requirements and MLmodel files from the folder, for your reference.

The requirements file specifies the versions of the dependencies required to recreate the environment for running the model.