Over the past year and a half, the natural language processing (NLP) landscape has seen a notable evolution, primarily thanks to the rise of large language models (LLMs) like OpenAI's GPT family.
These powerful models have revolutionized our approach to handling natural language tasks, offering unprecedented capabilities in translation, sentiment analysis, and automated text generation. Their ability to understand and generate human-like text has opened up possibilities that were previously considered unattainable.
However, despite their impressive capabilities, the path to training these models is fraught with challenges, including the significant financial and time investments they require.
This brings us to the fundamental role of fine-tuning LLMs.
By refining these pre-trained models to better suit specific applications or domains, we can significantly improve their performance on particular tasks. This step not only raises their quality but also extends their usefulness to a wide range of sectors.
This guide breaks the process down into 7 easy steps for tailoring any LLM to a specific task.
LLMs are a specialized category of machine learning algorithms designed to predict the next word in a sequence based on the context provided by previous words. These models are based on the Transformer architecture, a breakthrough in machine learning first described in Google's paper "Attention Is All You Need".
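To make "predicting the next word from context" concrete, here is a minimal sketch that uses a simple bigram counter. This is a deliberately tiny stand-in, not how a Transformer works internally: the corpus is made up, and the function names are hypothetical. It only shows the shape of the task, which is mapping previous words to a likely next word.

```python
from collections import Counter, defaultdict


def train_bigram_counts(corpus):
    """Count, for every word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            counts[current_word][next_word] += 1
    return counts


def predict_next_word(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Made-up toy corpus, for illustration only
toy_corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the cat slept",
]
bigram_counts = train_bigram_counts(toy_corpus)
print(predict_next_word(bigram_counts, "the"))  # prints: cat
```

An LLM replaces these raw counts with a neural network that looks at the whole preceding context rather than a single word, but the objective is the same.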
Models like GPT (Generative Pre-Trained Transformer) are examples of pre-trained language models that have been exposed to large volumes of textual data. This extensive training allows them to grasp the underlying rules of language use, including how words are combined to form coherent sentences.
A key strength of these models lies in their ability to not only understand natural language but also produce text that closely mimics human writing based on the inputs they receive.
So what's the best thing about this?
These models are now open to the masses via API.
What is fine-tuning and why is it important?
Fine-tuning is the process of taking a pre-trained model and improving it with additional training on a domain-specific data set.
Most LLMs have very good natural language skills and general knowledge, but fall short on specific task-oriented problems. Fine-tuning improves model performance on those problems at a much lower computational cost than building a model from scratch.
Simply put, fine-tuning tailors the model to perform better on specific tasks, making it more effective and versatile in real-world applications. This process is essential to improve an existing model for a particular task or domain.
Let's illustrate this concept by fine-tuning a real model in just 7 steps.
Step 1: Be clear about our specific objective
Let's imagine that we want to infer the sentiment of any text and we decide to try GPT-2 for this task.
Unsurprisingly, we soon find that it does not do this well out of the box. So a natural question comes to mind:
Can we do anything to improve its performance?
And of course, the answer is that we can!
We can leverage fine-tuning by training the pre-trained GPT-2 model from the Hugging Face Hub on a dataset containing tweets and their corresponding sentiments, so that its performance improves.
So our ultimate goal is to have a model that is good at inferring sentiment from text.
Step 2: Choose a pre-trained model and data set
The second step is to choose which model to use as the base model. In our case we have already chosen one: GPT-2. So let's fine-tune it.
Screenshot of Hugging Face Datasets Hub. Selecting the GPT2 model from OpenAI.
Always keep in mind to select a model that fits your task.
Step 3: Load data to use
Now that we have our model and our main task, we need some data to work with.
But don't worry, Hugging Face has it all taken care of!
This is where its datasets library comes into play.
In this example, we will use the Hugging Face datasets library to import a dataset of tweets tagged with their corresponding sentiment (Positive, Neutral, or Negative).
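import pandas as pd
from datasets import load_dataset
# Download the tweet sentiment dataset from the Hugging Face Hub
dataset = load_dataset("mteb/tweet_sentiment_extraction")
# Convert the training split to a DataFrame for easy inspection
df = pd.DataFrame(dataset["train"])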
The data looks like this:
The data set to use.
Step 4: Tokenizer
Now we have our model and the data set to fine-tune it on. So the natural next step is to load a tokenizer. Since LLMs work with tokens (and not words!), we need a tokenizer to convert the text into the token IDs our model expects.
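To see why tokens and words are not the same thing, here is a toy greedy longest-match tokenizer. It is only a sketch: the vocabulary and the function `toy_tokenize` are hypothetical, and GPT-2 actually uses byte-level byte-pair encoding with a vocabulary of roughly 50,000 entries. The point is simply that one word can turn into several token IDs.

```python
def toy_tokenize(text, vocab):
    """Greedily split `text` into the longest pieces found in `vocab`."""
    token_ids = []
    position = 0
    while position < len(text):
        # Try the longest candidate piece first, then shrink it
        for end in range(len(text), position, -1):
            piece = text[position:end]
            if piece in vocab:
                token_ids.append(vocab[piece])
                position = end
                break
        else:
            raise ValueError(f"No vocabulary entry covers {text[position]!r}")
    return token_ids


# Hypothetical six-entry vocabulary, for illustration only
toy_vocab = {"un": 0, "happi": 1, "ness": 2, " ": 3, "happy": 4, "so": 5}

print(toy_tokenize("so unhappiness", toy_vocab))  # prints: [5, 3, 0, 1, 2]
```

Here the single word "unhappiness" becomes three tokens ("un", "happi", "ness"), which is why sequence lengths are measured in tokens rather than words.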
We can easily tokenize the entire data set by leveraging the map method.
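from datasets import load_dataset
from transformers import GPT2Tokenizer
# Loading the dataset to train our model
dataset = load_dataset("mteb/tweet_sentiment_extraction")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# GPT-2 has no padding token by default, so we reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token
def tokenize_function(examples):
    # Pad or truncate every tweet to the tokenizer's maximum length
    return tokenizer(examples["text"], padding="max_length", truncation=True)
# batched=True passes groups of rows to tokenize_function at once
tokenized_datasets = dataset.map(tokenize_function, batched=True)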
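The effect of `padding="max_length"` and `truncation=True` can be sketched in plain Python. The helper `pad_or_truncate` below is hypothetical and only mimics the behavior on a single list of token IDs; the real tokenizer does this internally, and for the `gpt2` checkpoint the maximum length should be 1024 tokens rather than the 5 used here.

```python
def pad_or_truncate(token_ids, max_length, pad_token_id):
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    input_ids = list(token_ids[:max_length])       # truncation: drop extra tokens
    attention_mask = [1] * len(input_ids)          # 1 marks a real token
    padding_needed = max_length - len(input_ids)
    input_ids += [pad_token_id] * padding_needed   # padding on the right
    attention_mask += [0] * padding_needed         # 0 marks padding
    return input_ids, attention_mask


PAD_ID = 50256  # GPT-2's end-of-text token ID, reused as the padding token

short_ids, short_mask = pad_or_truncate([10, 11, 12], max_length=5, pad_token_id=PAD_ID)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_token_id=PAD_ID)

print(short_ids, short_mask)  # prints: [10, 11, 12, 50256, 50256] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # prints: [1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
```

Because every example ends up the same length, short tweets carry a lot of padding. That keeps the code simple, at the cost of extra memory and compute per example.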
Bonus: To reduce our processing requirements, we create two smaller subsets:
- The training set: To fine-tune our model.
- The test set: To evaluate it.
# Shuffle with a fixed seed for reproducibility, then keep the first 1,000 rows
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
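The shuffle-then-select pattern amounts to drawing a reproducible random sample. The sketch below reproduces the idea with the standard library only; `shuffle_and_select` is a hypothetical helper, and the datasets library uses its own random number generator, so the exact rows chosen will differ from this version.

```python
import random


def shuffle_and_select(rows, seed, n):
    """Return the first `n` rows of a seeded random permutation of `rows`."""
    shuffled = list(rows)                  # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order
    return shuffled[:n]


all_rows = list(range(10_000))  # stand-in for row indices of a larger dataset

first_sample = shuffle_and_select(all_rows, seed=42, n=1000)
second_sample = shuffle_and_select(all_rows, seed=42, n=1000)

print(len(first_sample))              # prints: 1000
print(first_sample == second_sample)  # prints: True
```

Fixing the seed means anyone rerunning the code trains and evaluates on the same 1,000 examples, which makes results comparable between runs.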
Step 5: Initialize our base model
Once we have the data set to use, we load our model and specify the number of expected labels. From the tweet sentiment data set, we know that there are three possible labels:
- 0 or Negative
- 1 or Neutral
- 2 or Positive
from transformers import GPT2ForSequenceClassification
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=3)
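With `num_labels=3`, the classification head produces three raw scores (logits) per input, one for each label above. The sketch below shows, under that assumption, how such scores can be turned into a readable prediction. The names `ID2LABEL`, `softmax`, and `predict_label` are hypothetical helpers, and the logits are made-up numbers rather than real model output.

```python
import math

# Label IDs as listed above
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}
LABEL2ID = {name: class_id for class_id, name in ID2LABEL.items()}


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(score - highest) for score in logits]  # shift for stability
    total = sum(exps)
    return [value / total for value in exps]


def predict_label(logits):
    """Return the most likely label name and its probability."""
    probabilities = softmax(logits)
    best_id = max(range(len(probabilities)), key=probabilities.__getitem__)
    return ID2LABEL[best_id], probabilities[best_id]


example_logits = [0.1, 0.2, 2.5]  # made-up scores for a single tweet
label, probability = predict_label(example_logits)
print(label)  # prints: positive
```

Dictionaries like these can typically also be passed to `from_pretrained` as the `id2label` and `label2id` configuration arguments, so that the saved model reports label names instead of bare integers.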
Step 6: Define the evaluation method
The Transformers library provides a class called "Trainer" that handles both the training and the evaluation of our model. Therefore, before starting the actual training, we need to define a function to evaluate the fine-tuned model.
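import numpy as np
import evaluate
# Load the accuracy metric from the evaluate library
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    # eval_pred holds the model's logits and the true labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)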
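To see what this function computes, here is the same logic in plain Python on a tiny hand-made batch. The logits and labels are invented for illustration, and `accuracy_from_logits` is a hypothetical stand-in for `compute_metrics`, not part of any library.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare it with the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(pred == true for pred, true in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up scores for four tweets over (negative, neutral, positive)
toy_logits = [
    [2.0, 0.1, -1.0],  # predicted 0
    [0.3, 0.2, 1.5],   # predicted 2
    [0.1, 0.9, 0.4],   # predicted 1
    [1.2, 0.8, 0.1],   # predicted 0
]
toy_labels = [0, 2, 2, 0]

print(accuracy_from_logits(toy_logits, toy_labels))  # prints: {'accuracy': 0.75}
```

Three of the four predictions match their labels, giving an accuracy of 0.75.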
Step 7: Fine-tune using the Trainer method
The last step is to fine-tune the model. To do this, we configure the training arguments along with the evaluation strategy and create the Trainer object.
To run the training we simply call the train() method.
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    output_dir="test_trainer",
    #evaluation_strategy="epoch",
    per_device_train_batch_size=1,  # Reduce batch size here
    per_device_eval_batch_size=1,  # Optionally, reduce for evaluation as well
    gradient_accumulation_steps=4
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()
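A batch size of 1 combined with 4 gradient accumulation steps means the weights are updated once every 4 examples. The arithmetic can be sketched as follows, assuming a single device and the default of 3 training epochs (the code above does not set the number of epochs). The helper `training_schedule` is hypothetical, and the Trainer's own step count may differ slightly when the data does not divide evenly.

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_epochs, num_devices=1):
    """Estimate the effective batch size and the number of weight updates."""
    effective_batch_size = per_device_batch_size * grad_accum_steps * num_devices
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # One weight update happens after every `grad_accum_steps` batches
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return {
        "effective_batch_size": effective_batch_size,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": math.ceil(num_epochs * updates_per_epoch),
    }


# 1,000 training examples, batch size 1, 4 accumulation steps, 3 epochs (assumed)
schedule = training_schedule(1000, 1, 4, num_epochs=3)
print(schedule)
# prints: {'effective_batch_size': 4, 'updates_per_epoch': 250, 'total_updates': 750}
```

Gradient accumulation trades speed for memory: only one example is held on the device at a time, yet each update still averages over 4 examples.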
Once our model has been fine-tuned, we use the test set to evaluate its performance. The Trainer object already contains an optimized evaluate() method.
trainer.evaluate()
This is a basic process for fine-tuning any LLM.
Also, remember that fine-tuning an LLM is compute-intensive, so your local computer may not have enough power to do it.
Nowadays, fine-tuning pre-trained large language models, such as GPT, is crucial for improving their performance in specific domains. It lets us harness their natural language capabilities while improving their efficiency and potential for customization, making the process accessible and cost-effective.
By following these seven simple steps, from selecting the right model and data set to training and evaluating the fine-tuned model, we can achieve superior model performance in specific domains.
For those who want to check out the full code, it is available on my large language models GitHub repository.
Josep Ferrer is an analytics engineer from Barcelona. He graduated in physics engineering and currently works in the field of data science applied to human mobility. He is a part-time content creator focused on data science and technology. Josep writes about all things AI, covering the application of the ongoing explosion in this field.