Editor's Image | Midjourney
This tutorial demonstrates how to use the Hugging Face Datasets library to load datasets from different sources with just a few lines of code.
The Hugging Face Datasets library simplifies loading and processing datasets, providing a unified interface to thousands of datasets hosted on the Hugging Face Hub. The library also implements several performance metrics for evaluating transformer-based models.
Initial setup
Some Python development environments may require the Datasets library to be installed before importing.
!pip install datasets
import datasets
How to load a Hugging Face Hub dataset by name
Hugging Face hosts a large number of datasets on its Hub. The following function lists them by name:
from datasets import list_datasets
list_datasets()
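Note that in newer versions of the Datasets library, list_datasets() has been deprecated and may no longer be importable. If the snippet above fails, an equivalent listing can be obtained through the huggingface_hub package (a minimal sketch, assuming huggingface_hub is installed):
from huggingface_hub import list_datasets
# Print the identifiers of a handful of datasets hosted on the Hub
for ds in list_datasets(limit=5):
    print(ds.id)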
We will load one of them, namely the emotions dataset for classifying emotions in tweets, by specifying its name:
from datasets import load_dataset
data = load_dataset("jeffnyman/emotions")
If you want to load a dataset that you found while browsing the Hugging Face website and you are not sure of its correct name, click the “copy” icon next to the dataset name on its Hub page.
The dataset is loaded into a DatasetDict object containing three subsets or folds: training, validation, and testing.
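Printing the loaded object, for example, displays this structure:
print(data)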
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})
Each fold is in turn a Dataset object. Using dictionary-style indexing, we can retrieve the training fold:
train_data = data["train"]
The length of this dataset object indicates the number of training instances (tweets).
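len(train_data)  # number of tweets in the training fold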
which gives the following result:
16000
Getting a single instance by index (say, the fourth) is as easy as using list-style indexing:
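train_data[3]  # fourth training instance; indexing starts at 0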
which returns a Python dictionary with the dataset's two attributes as keys: text, the input tweet, and label, indicating the emotion it has been classified with.
{'text': 'i am ever feeling nostalgic about the fireplace i will know that it is still on the property',
'label': 2}
We can also obtain several consecutive instances at once by means of slicing:
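train_data[:2]  # slice bounds are illustrative; any range works the same way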
This operation returns a single dictionary as before, but now each key has an associated list of values instead of a single value.
{'text': ['i didnt feel humiliated', ...],
'label': [0, ...]}
Finally, to access a single attribute value, we specify two indexes: one for its position and one for the attribute name or key:
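train_data[3]["text"]  # or equivalently train_data["text"][3]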
Loading your own data
If instead of using the Hugging Face Hub you would like to use your own dataset, the Datasets library also allows you to do so with the same load_dataset() function, passing two arguments: the file format of the dataset to load (such as “csv”, “text”, or “json”) and the path or URL where it is located.
This example loads the Palmer Archipelago penguin dataset from a public GitHub repository:
url = "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"
dataset = load_dataset('csv', data_files=url)
Convert a dataset to a Pandas DataFrame
Last but not least, sometimes it is convenient to convert the loaded data into a Pandas DataFrame object, which makes it easy to manipulate, analyze, and visualize the data with the extensive functionality of the Pandas library.
penguins = dataset["train"].to_pandas()
penguins.head()
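From here on, all the usual Pandas functionality applies. For example, assuming the standard palmerpenguins column names (species, body_mass_g):
# Average body mass per penguin species (missing values are ignored by default)
penguins.groupby("species")["body_mass_g"].mean()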
Now that you have learned how to efficiently load datasets using the dedicated Hugging Face library, the next step is to leverage them by using Large Language Models (LLMs).
Ivan Palomares Carrascosa is a leader, writer, speaker, and advisor on AI, machine learning, deep learning, and LLMs. He trains and guides others to leverage AI in the real world.