The Hugging Face Transformers library provides tools to easily load and use pre-trained language models (LMs) based on the Transformer architecture. But did you know that this library also allows you to define and train your own transformer model from scratch? This tutorial illustrates how to do so through a step-by-step sentiment classification example.
Important note: Training a transformer model from scratch is computationally expensive, and a training cycle typically takes hours, at a minimum. To run the code in this tutorial, access to high-performance computing resources, either on-premises or through a cloud provider, is highly recommended.
Step by step process
Initial setup and loading the dataset
Depending on the type of Python development environment you are working in, you may need to install the Hugging Face Transformers and Datasets libraries, as well as the Accelerate library for training your transformer model in a distributed computing setting.
!pip install transformers datasets
!pip install accelerate -U
Once the necessary libraries are installed, let's load the emotion dataset for sentiment classification of Twitter messages from the Hugging Face Hub:
from datasets import load_dataset
dataset = load_dataset('jeffnyman/emotions')
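Before tokenizing, it is worth taking a quick look at what was loaded. The check below assumes the dataset exposes a 'train' split with 'text' and 'label' fields, as the standard emotion dataset does; adjust the keys if your copy differs.
# Inspect the available splits and a single raw example
print(dataset)
print(dataset['train'][0])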
To use the data to train a transformer-based LM, the text needs to be tokenized. The following code initializes a BERT tokenizer (BERT is a family of transformer models suitable for text classification tasks), defines a function to tokenize the text data with padding and truncation, and applies it to the dataset in batches.
from transformers import AutoTokenizer
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokenized_datasets = dataset.map(tokenize_function, batched=True)
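If you want to confirm what the tokenizer produced, you can inspect one processed example. This is just an optional sanity check; the 'input_ids' and 'attention_mask' fields shown below are the standard outputs of a BERT tokenizer.
# Peek at the first tokenized training example
sample = tokenized_datasets['train'][0]
print(sample['input_ids'][:20])       # first 20 token IDs (padded/truncated to max_length)
print(sample['attention_mask'][:20])  # 1 marks real tokens, 0 marks padding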
Before proceeding with the initialization of the transformer model, let's verify the unique labels in the dataset. Checking the set of class labels up front helps avoid GPU-related errors during training caused by inconsistent or out-of-range labels. We will use this set of labels later.
unique_labels = set(tokenized_datasets['train']['label'])
print(f"Unique labels in the training set: {unique_labels}")

def check_labels(dataset):
    for label in dataset['train']['label']:
        if label not in unique_labels:
            print(f"Found invalid label: {label}")

check_labels(tokenized_datasets)
Next, we define a model configuration and instantiate the transformer model with it. This is where we specify hyperparameters of the transformer architecture, such as the embedding size, the number of attention heads, and the pre-computed set of unique labels, which is key to building the final output layer for sentiment classification.
from transformers import BertConfig
from transformers import BertForSequenceClassification
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=512,
    num_labels=len(unique_labels)
)
model = BertForSequenceClassification(config)
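Because this model starts from randomly initialized weights rather than pre-trained ones, it can be helpful to see how large the architecture you just configured is. The snippet below is a small optional check using plain PyTorch.
# Count the trainable parameters of the freshly initialized model
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {num_params:,}")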
We are almost ready to train our transformer model. We just need to instantiate two objects: TrainingArguments, with specifications about the training cycle such as the number of epochs, and Trainer, which brings together the model instance, the training arguments, and the datasets used for training and validation.
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)
It's time to train the model, then sit back and relax. Remember that this instruction will take a long time to complete:
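trainer.train()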
Once trained, your transformer model should be ready to take input examples and predict their sentiment.
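As a minimal sketch of what that could look like (the example sentence is hypothetical, and mapping the predicted index back to an emotion name depends on your dataset's label scheme), you could run something like the following:
import torch

model.eval()
text = "I can't wait to see my friends this weekend!"  # hypothetical input example
inputs = tokenizer(text, padding="max_length", truncation=True, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = logits.argmax(dim=-1).item()
print(f"Predicted label index: {predicted_class}")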
Troubleshooting
If you encounter persistent issues when setting up or running your training loop, you may need to inspect the configuration of the GPU/CPU resources being used. For example, if you are using a CUDA GPU, adding this statement to the beginning of your code can help surface errors in your training loop:
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
This setting makes CUDA operations synchronous, so errors are reported at the exact call that caused them, providing more immediate and accurate error messages for debugging.
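Relatedly, a quick way to check which device your environment actually exposes is the standard PyTorch query below; this is an optional diagnostic, not something the training code above requires.
import torch

# Report whether a CUDA-capable GPU is visible to PyTorch
if torch.cuda.is_available():
    print(f"CUDA device detected: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device detected; training will fall back to the CPU.")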
On the other hand, if you are testing this code on a Google Colab instance, you are likely to see this error message at runtime, even if you have previously installed the accelerate library:
ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.21.0`: Please run `pip install transformers[torch]` or `pip install accelerate -U`
To fix this issue, try restarting your session from the 'Runtime' menu – the accelerate library usually requires restarting the runtime environment after installation.
Summary and conclusion
This tutorial shows the key steps to create a transformer-based LM from scratch using the Hugging Face libraries. The main steps and elements involved can be summarized as follows:
- Loading the dataset and tokenizing the text data.
- Initializing your model with a configuration instance suited to the type of model (language task) it is intended for, for example BertConfig.
- Setting up TrainingArguments and Trainer instances and running the training cycle.
As a next learning step, we recommend exploring how to make predictions and inferences with your newly trained model.
Ivan Palomares Carrascosa is a leader, writer, speaker, and adviser on AI, machine learning, deep learning, and LLMs. He trains and guides others to leverage AI in the real world.