Setting up an MLflow server locally is simple. Use the following command:
mlflow server --host 127.0.0.1 --port 8080
Then set the tracking URI so your client points at the local server:
mlflow.set_tracking_uri("http://127.0.0.1:8080")
For more advanced settings, see MLflow documentation.
For this article, we use the California Housing dataset (CC BY license). However, you can apply the same principles to log and track any dataset of your choice.
For more information on the California Housing dataset, see the scikit-learn documentation.
mlflow.data.dataset.Dataset
Before diving into logging, evaluating, and retrieving datasets, it is important to understand how MLflow represents them. MLflow provides the mlflow.data.dataset.Dataset object, which represents datasets used with MLflow Tracking.
class mlflow.data.dataset.Dataset(source: mlflow.data.dataset_source.DatasetSource, name: Optional[str] = None, digest: Optional[str] = None)
This object comes with key properties:
- A required source parameter (the data source of your dataset, such as an mlflow.data.dataset_source.DatasetSource object), plus digest (a fingerprint of your dataset) and name (the name of your dataset), which can be set via parameters.
- schema and profile, which describe the dataset's structure and statistical properties.
- Information about the dataset's source, such as its storage location.
You can easily convert the dataset into a dictionary using to_dict() or a JSON string using to_json().
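Here is a quick sketch using a toy DataFrame (mlflow.data.from_pandas is covered in detail later in this article):

import mlflow.data
import pandas as pd

# A small throwaway DataFrame; any DataFrame works the same way.
df = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 4.0]})
dataset = mlflow.data.from_pandas(df, name='toy-dataset')

# to_dict() returns the dataset metadata as a plain dictionary...
print(dataset.to_dict())
# ...and to_json() returns the same metadata as a JSON string.
print(dataset.to_json())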
Support for popular data set formats
MLflow makes it easy to work with various types of datasets through specialized classes that extend the core mlflow.data.dataset.Dataset. As of this writing, these are some of the notable dataset classes supported by MLflow:
- pandas: mlflow.data.pandas_dataset.PandasDataset
- NumPy: mlflow.data.numpy_dataset.NumpyDataset
- Spark: mlflow.data.spark_dataset.SparkDataset
- Hugging Face: mlflow.data.huggingface_dataset.HuggingFaceDataset
- TensorFlow: mlflow.data.tensorflow_dataset.TensorFlowDataset
- Evaluation datasets: mlflow.data.evaluation_dataset.EvaluationDataset
All of these classes come with a convenient mlflow.data.from_* API for loading datasets directly into MLflow. This makes it easy to build and manage datasets, regardless of their underlying format.
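As an illustration, here is a minimal sketch of one of these helpers, mlflow.data.from_numpy, using arbitrary random arrays:

import mlflow.data
import numpy as np

# Build a NumpyDataset directly from in-memory arrays.
features = np.random.rand(100, 4)
targets = np.random.rand(100)

numpy_dataset = mlflow.data.from_numpy(features, targets=targets, name='random-numpy-data')

# The digest is a fingerprint computed from the array contents.
print(numpy_dataset.digest)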
mlflow.data.dataset_source.DatasetSource
The mlflow.data.dataset_source.DatasetSource class is used to represent the source of a dataset in MLflow. When creating an mlflow.data.dataset.Dataset object, the source parameter can be specified either as a string (for example, a file path or URL) or as an instance of the mlflow.data.dataset_source.DatasetSource class.
class mlflow.data.dataset_source.DatasetSource
If a string is provided as the source, MLflow internally calls the resolve_dataset_source function. This function iterates through a predefined list of DatasetSource classes to determine the most appropriate source type. However, MLflow's ability to accurately resolve the dataset source is limited, especially when the candidate_sources argument (a list of potential sources) is set to None, which is the default value.
In cases where the DatasetSource class cannot resolve the raw source, an MLflow exception is raised. As a best practice, I recommend explicitly creating and using an instance of the appropriate mlflow.data.dataset_source.DatasetSource subclass when defining the origin of the dataset, as in the sketch after the following list of subclasses.
class HTTPDatasetSource(DatasetSource)
class DeltaDatasetSource(DatasetSource)
class FileSystemDatasetSource(DatasetSource)
class HuggingFaceDatasetSource(DatasetSource)
class SparkDatasetSource(DatasetSource)
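For example, here is a minimal sketch that constructs the source explicitly with the HTTPDatasetSource subclass listed above (the DataFrame and names are placeholders):

import mlflow.data
import pandas as pd
from mlflow.data.http_dataset_source import HTTPDatasetSource

# Constructing the source explicitly skips MLflow's string-resolution step.
source = HTTPDatasetSource(url='https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz')

df = pd.DataFrame({'a': [1, 2, 3]})
dataset = mlflow.data.from_pandas(df, source=source, name='explicit-source-example')

# The source serializes cleanly, e.g. {"url": "https://..."}.
print(dataset.source.to_json())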
One of the easiest ways to log datasets in MLflow is through the mlflow.log_input() API. It allows you to log datasets in any format supported by mlflow.data.dataset.Dataset, which can be extremely useful when managing large-scale experiments.
Step-by-step guide
First, let's fetch the California Housing dataset and convert it to a pandas.DataFrame for easier handling. Here, we create a data frame that combines the feature data (california_data) and the target data (california_target).
from sklearn.datasets import fetch_california_housing
import pandas as pd

california_housing = fetch_california_housing()
california_data: pd.DataFrame = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
california_target: pd.DataFrame = pd.DataFrame(california_housing.target, columns=['Target'])
california_housing_df: pd.DataFrame = pd.concat([california_data, california_target], axis=1)
To register the dataset with meaningful metadata, we define some parameters such as data source URL, dataset name, and target column. These will provide useful context when we retrieve the data set later.
If we look deeper into the fetch_california_housing source code, we can see that the data originated from https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz.
from mlflow.data.dataset_source import DatasetSource
from mlflow.data.http_dataset_source import HTTPDatasetSource

dataset_source_url: str = 'https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz'
dataset_source: DatasetSource = HTTPDatasetSource(url=dataset_source_url)
dataset_name: str = 'California Housing Dataset'
dataset_target: str = 'Target'
dataset_tags = {
    'description': california_housing.DESCR,
}
Once the data and metadata are defined, we can convert the pandas.DataFrame into an mlflow.data.Dataset object.
import mlflow
from mlflow.data.pandas_dataset import PandasDataset

dataset: PandasDataset = mlflow.data.from_pandas(
    df=california_housing_df, source=dataset_source, targets=dataset_target, name=dataset_name
)

print(f'Dataset name: {dataset.name}')
print(f'Dataset digest: {dataset.digest}')
print(f'Dataset source: {dataset.source}')
print(f'Dataset schema: {dataset.schema}')
print(f'Dataset profile: {dataset.profile}')
print(f'Dataset targets: {dataset.targets}')
print(f'Dataset predictions: {dataset.predictions}')
print(dataset.df.head())
Example output:
Dataset name: California Housing Dataset
Dataset digest: 55270605
Dataset source:
Dataset schema: ['MedInc': double (required), 'HouseAge': double (required), 'AveRooms': double (required), 'AveBedrms': double (required), 'Population': double (required), 'AveOccup': double (required), 'Latitude': double (required), 'Longitude': double (required), 'Target': double (required)]
Dataset profile: {'num_rows': 20640, 'num_elements': 185760}
Dataset targets: Target
Dataset predictions: None
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude Target
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 3.521
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 3.413
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 3.422
Note that you can even convert the dataset to a dictionary to access additional properties like source_type:
for k, v in dataset.to_dict().items():
    print(f"{k}: {v}")
name: California Housing Dataset
digest: 55270605
source: {"url": "https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz"}
source_type: http
schema: {"mlflow_colspec": [{"type": "double", "name": "MedInc", "required": true}, {"type": "double", "name": "HouseAge", "required": true}, {"type": "double", "name": "AveRooms", "required": true}, {"type": "double", "name": "AveBedrms", "required": true}, {"type": "double", "name": "Population", "required": true}, {"type": "double", "name": "AveOccup", "required": true}, {"type": "double", "name": "Latitude", "required": true}, {"type": "double", "name": "Longitude", "required": true}, {"type": "double", "name": "Target", "required": true}]}
profile: {"num_rows": 20640, "num_elements": 185760}
Now that we have our dataset ready, it's time to log it to an MLflow run. This allows us to capture the metadata of the dataset, making it part of the experiment for future reference.
with mlflow.start_run():
    mlflow.log_input(dataset=dataset, context='training', tags=dataset_tags)
View run sassy-jay-279 at: http://127.0.0.1:8080/#/experiments/0/runs/5ef16e2e81bf40068c68ce536121538c
View experiment at: http://127.0.0.1:8080/#/experiments/0
Let's explore the dataset in the MLflow UI. You will find your dataset in the default experiment. In the "Datasets used" section, you can see the context of the dataset, which in this case is marked as used for training. Additionally, all relevant fields and properties of the dataset are displayed.
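You can also retrieve the logged dataset metadata programmatically. Here is a minimal sketch using mlflow.get_run (substitute the run ID printed by your own run):

import mlflow

# Fetch the finished run and inspect the datasets logged to it.
run = mlflow.get_run(run_id='5ef16e2e81bf40068c68ce536121538c')

for dataset_input in run.inputs.dataset_inputs:
    print(dataset_input.dataset.name)         # California Housing Dataset
    print(dataset_input.dataset.digest)       # 55270605
    print(dataset_input.dataset.source_type)  # http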
Congratulations! You have logged your first dataset!