Does this sound interesting? If so, this article is here to help you get started with mlflow.pyfunc.
- First, let's go through a simple toy example of creating an mlflow.pyfunc class.
- Then, we will define an mlflow.pyfunc class that encapsulates a machine learning pipeline (an estimator plus some preprocessing logic, for example). We will also train, log, and load this ML pipeline for inference.
- Finally, let's take a deep dive into the encapsulated mlflow.pyfunc object, explore the rich metadata and artifacts automatically tracked for us by mlflow, and get a better understanding of the full power that mlflow.pyfunc offers.
All code and configuration are available on GitHub.
First, let's create a simple toy mlflow.pyfunc model and then use it with the mlflow workflow.
- Step 1: Create the model
- Step 2: Log the model
- Step 3: Load the logged model to perform inference
# Step 1: Create a mlflow.pyfunc model
import mlflow.pyfunc


class ToyModel(mlflow.pyfunc.PythonModel):
    """
    ToyModel is a simple example implementation of an MLflow Python model.
    """

    def predict(self, context, model_input):
        """
        A basic predict function that takes a model_input list and returns a new list
        where each element is increased by one.

        Parameters:
        - context (Any): An optional context parameter provided by MLflow.
        - model_input (list of int or float): A list of numerical values that the model will use for prediction.

        Returns:
        - list of int or float: A list in which each element of model_input is increased by one.
        """
        return [x + 1 for x in model_input]
As you can see in the example above, you can create an mlflow.pyfunc model to implement any custom Python function you see fit for your machine learning solution; it does not have to be a standard machine learning algorithm.
You can then log this model and load it later to perform inference.
# Step 2: log this model as an mlflow run
with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=ToyModel()
    )
    run_id = mlflow.active_run().info.run_id
# Step 3: load the logged model to perform inference
model = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")
# dummy new data
x_new = [1, 2, 3]
# model inference for the new data
print(model.predict(x_new))
[2, 3, 4]
Now, let's create an ML pipeline that encapsulates an estimator with additional custom logic.
In the following example, the XGB_PIPELINE class is a wrapper that integrates the estimator with preprocessing steps, which may be desirable for some MLOps implementations. Leveraging mlflow.pyfunc, this wrapper is estimator-agnostic and provides a uniform model representation. Specifically,
- fit(): Instead of exposing the native XGBoost API (xgboost.train()) directly, this class provides a .fit() method that calls it internally. This adheres to sklearn conventions, allowing easy integration into sklearn pipelines and ensuring consistency between different estimators.
- DMatrix(): DMatrix is a core data structure in XGBoost that optimizes data for training and prediction. The step that transforms a pandas DataFrame into a DMatrix is wrapped inside the class, so the class works directly with pandas DataFrames like all other sklearn estimators.
- predict(): This is the mlflow.pyfunc universal model inference API. It is the same for this ML pipeline, for the toy model above, and for any machine learning algorithm or custom logic we wrap in an mlflow.pyfunc model.
import json
import xgboost as xgb
import mlflow.pyfunc
from typing import Any, Dict, Union
import pandas as pd


class XGB_PIPELINE(mlflow.pyfunc.PythonModel):
    """
    XGB_PIPELINE is an example implementation of an MLflow Python model with XGBoost.
    """

    def __init__(self, params: Dict[str, Union[str, int, float]]):
        """
        Initialize the model with given parameters.

        Parameters:
        - params (Dict[str, Union[str, int, float]]): Parameters for the XGBoost model.
        """
        self.params = params
        self.xgb_model = None
        self.config = None

    def preprocess_input(self, model_input: pd.DataFrame) -> pd.DataFrame:
        """
        Preprocess the input data.

        Parameters:
        - model_input (pd.DataFrame): The input data to preprocess.

        Returns:
        - pd.DataFrame: The preprocessed input data.
        """
        processed_input = model_input.copy()
        # put any desired preprocessing logic here
        # (this demo simply drops the first column)
        processed_input.drop(processed_input.columns[0], axis=1, inplace=True)
        return processed_input

    def fit(self, X_train: pd.DataFrame, y_train: pd.Series):
        """
        Train the XGBoost model.

        Parameters:
        - X_train (pd.DataFrame): The training input data.
        - y_train (pd.Series): The target values.
        """
        processed_model_input = self.preprocess_input(X_train.copy())
        # convert the DataFrame to XGBoost's native data structure
        dtrain = xgb.DMatrix(processed_model_input, label=y_train)
        self.xgb_model = xgb.train(self.params, dtrain)

    def predict(self, context: Any, model_input: pd.DataFrame) -> Any:
        """
        Predict using the trained XGBoost model.

        Parameters:
        - context (Any): An optional context parameter provided by MLflow.
        - model_input (pd.DataFrame): The input data for making predictions.

        Returns:
        - Any: The prediction results.
        """
        # apply the same preprocessing as in fit() before predicting
        processed_model_input = self.preprocess_input(model_input.copy())
        dmatrix = xgb.DMatrix(processed_model_input)
        return self.xgb_model.predict(dmatrix)
Now, let's train and log this model.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import pandas as pd

# Generate synthetic datasets for demo
x, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# train and log the model
with mlflow.start_run(run_name='xgb_demo') as run:
    # Create an instance of XGB_PIPELINE
    params = {
        'objective': 'reg:squarederror',
        'max_depth': 3,
        'learning_rate': 0.1,
    }
    model = XGB_PIPELINE(params)
    # Fit the model
    model.fit(X_train=pd.DataFrame(X_train), y_train=y_train)
    # Log the model
    model_info = mlflow.pyfunc.log_model(
        artifact_path='model',
        python_model=model,
    )
    run_id = mlflow.active_run().info.run_id
The model has been logged successfully. Now, let's load it for inference.
loaded_model = mlflow.pyfunc.load_model(model_uri=model_info.model_uri)
loaded_model.predict(pd.DataFrame(X_test))
array([ 4.11692047e+00, 7.30551958e+00, -2.36042137e+01, -1.31888123e+02,
...
The above process is pretty straightforward, isn't it? This represents the basic functionality of the mlflow.pyfunc object. Now, let's dig deeper to explore the full power that mlflow.pyfunc has to offer.
1. Model information
In the example above, the model_info object returned by mlflow.pyfunc.log_model() is an instance of the mlflow.models.model.ModelInfo class. It contains metadata and information about the logged model.
Feel free to run dir(model_info) to explore further, or consult the source code for all defined attributes. The attribute I use the most is model_uri, which indicates where the logged model can be found within the mlflow tracking system.
2. loaded_model
It is worth clarifying that loaded_model is not an instance of the XGB_PIPELINE class, but rather a wrapper object provided by mlflow.pyfunc for algorithm-agnostic inference. As shown below, an error is raised if you try to retrieve attributes of the XGB_PIPELINE class from loaded_model.
print(loaded_model.params)
AttributeError: 'PyFuncModel' object has no attribute 'params'
3. unwrapped_model
Okay, you might be wondering: where is the trained instance of XGB_PIPELINE? Is it logged and retrievable through mlflow as well?
Don't worry, it is stored safely and can be easily unwrapped as shown below.
unwrapped_model = loaded_model.unwrap_python_model()
print(unwrapped_model.params)
{'objective': 'reg:squarederror', 'max_depth': 3, 'learning_rate': 0.1}
This is how it's done. With unwrapped_model, you can access any property or method of your custom ML pipeline! Sometimes I add useful methods like explain_model or post_processing to the custom pipeline, or include useful attributes to track the model training process and offer diagnostics … Well, I'd better stop here and leave that for the next articles. Suffice it to say that you can freely customize your ML pipeline for your use case and know that
- You will have access to all these custom methods and attributes for further use, and
- This custom model will be wrapped inside the uniform mlflow.pyfunc inference API and therefore enjoys seamless migration to other estimators if needed.
4. Context
You may have noticed that there is a context parameter in the predict methods of both mlflow.pyfunc classes defined above. But interestingly, this parameter is not needed when making predictions with the loaded model. Why?
loaded_model = mlflow.pyfunc.load_model(model_uri)
# the context parameter is not needed when calling `predict`
loaded_model.predict(model_input)
This is because loaded_model above is a wrapper object provided by mlflow, which supplies the context for us. If we use the unwrapped model instead, we must pass the context explicitly as shown below; otherwise the code will raise an error.
unwrapped_model = loaded_model.unwrap_python_model()
# need to provide context manually
unwrapped_model.predict(context=None, model_input=model_input)
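To make the wrapper idea concrete, here is a pure-Python sketch of the pattern. It is a deliberately simplified stand-in and not MLflow's actual implementation; PipelineSketch and WrapperSketch are hypothetical names invented for this illustration.

```python
class PipelineSketch:
    """Hypothetical custom model with one attribute and a pyfunc-style predict."""

    def __init__(self, params):
        self.params = params

    def predict(self, context, model_input):
        # context is accepted but unused, as in the classes above
        return [x + 1 for x in model_input]


class WrapperSketch:
    """Simplified stand-in for the object returned by load_model (not MLflow code)."""

    def __init__(self, python_model, context=None):
        self._python_model = python_model
        self._context = context

    def predict(self, model_input):
        # The wrapper supplies the stored context, so callers never pass it.
        return self._python_model.predict(self._context, model_input)

    def unwrap_python_model(self):
        return self._python_model


wrapped = WrapperSketch(PipelineSketch({"max_depth": 3}))
print(wrapped.predict([1, 2, 3]))            # no context argument needed
print(hasattr(wrapped, "params"))            # False: the wrapper hides custom attributes
print(wrapped.unwrap_python_model().params)  # accessible again once unwrapped
```

The same shape explains both observations so far: the wrapper exposes only a uniform predict, and it holds on to the context so that the caller does not have to.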
So what is this context? And what role does it play in the predict method?
The context is a PythonModelContext object containing artifacts that the pyfunc model can use when making inferences. It is created implicitly and automatically by the log_model() method.
Navigate to the mlruns subfolder in your project repository, which is automatically created by mlflow when you log an mlflow model. Find the folder named after the model's run_id. Inside, you will find the model artifacts that were logged automatically, as shown below.
# get run_id of a loaded model
print(loaded_model.metadata.run_id)
38a617d0f30645e8ae95eea4642a03c2
Pretty neat, huh? Feel free to explore these artifacts at your leisure; below are screenshots of the requirements and MLmodel files in the folder, for your reference.
The requirements file below specifies the versions of the dependencies required to recreate the environment for running the model.
The MLmodel file below defines the metadata and configuration required to load and serve the model, in YAML format.