Introduction
Recently, with the rise of large language models (LLMs) and AI, we have seen countless advancements in natural language processing. Models for text, code, image, and video generation have achieved near human-level performance on many tasks. These models perform exceptionally well on general knowledge questions. However, models such as GPT-4o, Llama 2, Claude, and Gemini are trained largely on publicly available data, so they often fail to answer the domain- or topic-specific questions that matter most for organizational tasks.
Fine-tuning helps developers and businesses adapt pre-trained models to a domain-specific dataset so that they achieve high accuracy and consistency on domain-related queries. Fine-tuning improves model performance without requiring extensive computing resources because pre-trained models have already learned general language patterns from vast amounts of public data.
In this blog, we will discuss why we should fine-tune pre-trained models and how to do so using the Lamini platform, which allows us to fine-tune and evaluate models without using a lot of computational resources.
So, let's get started!
Learning objectives
- Explore the need to fine-tune open-source LLMs using Lamini
- Learn about the use of Lamini and instruction-tuned models
- Gain a practical understanding of the end-to-end model fine-tuning process
This article was published as part of the Data Science Blogathon.
Why fine-tune large language models?
Pre-trained models are mostly trained on a large amount of general data and are very likely to lack context or domain-specific knowledge. Pre-trained models can also hallucinate and produce inaccurate or inconsistent results. Popular chatbots based on large language models, such as ChatGPT, Gemini, and Bing Chat, have repeatedly shown that pre-trained models are prone to such inaccuracies. This is where fine-tuning comes to the rescue: it helps tailor pre-trained language models to subject-specific tasks and questions. Other ways to align models with your goals include prompt engineering and few-shot prompting.
Still, fine-tuning remains a superior option when it comes to performance metrics. Methods such as parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) have further improved model fine-tuning and helped developers build better models. Let's see how fine-tuning fits into a large language model context.
import pandas as pd

# Load the fine-tuning dataset (JSON Lines: one JSON record per line)
filename = "lamini_docs.json"
instruction_dataset_df = pd.read_json(filename, lines=True)
instruction_dataset_df
# Convert it into a Python dictionary of the form {column: {row_index: value}}
examples = instruction_dataset_df.to_dict()
# Prepare a sample for fine-tuning by concatenating the first prompt/response pair
if "question" in examples and "answer" in examples:
    text = examples["question"][0] + examples["answer"][0]
elif "instruction" in examples and "response" in examples:
    text = examples["instruction"][0] + examples["response"][0]
elif "input" in examples and "output" in examples:
    text = examples["input"][0] + examples["output"][0]
else:
    text = examples["text"][0]
# Use a prompt template to create an instruction-tuned dataset for fine-tuning
prompt_template_qa = """### Question:
{question}
### Answer:
{answer}"""
The above code shows that instruction tuning uses a prompt template to turn raw question-and-answer pairs into an instruction-formatted dataset. With such a custom dataset, we can tune the pre-trained model for a specific use case.
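As a minimal sketch of how such a template is applied to a single record, consider the snippet below. The helper name `format_example` and the sample record are illustrative assumptions and are not part of the Lamini API.

```python
# Prompt template mirroring the question/answer format shown above.
PROMPT_TEMPLATE_QA = """### Question:
{question}
### Answer:
{answer}"""


def format_example(question: str, answer: str) -> str:
    """Fill the template with one question/answer pair (hypothetical helper)."""
    return PROMPT_TEMPLATE_QA.format(
        question=question.strip(), answer=answer.strip()
    )


# Made-up record, for illustration only.
sample = {
    "question": "What is Lamini? ",
    "answer": "A platform for fine-tuning LLMs.",
}
print(format_example(sample["question"], sample["answer"]))
```

Stripping whitespace before formatting keeps the section markers on their own lines, so every training example has the same shape.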
The next section will examine how Lamini can help tune large language models (LLMs) for custom datasets.
How to fine-tune open-source LLMs using Lamini?
The Lamini platform allows users to seamlessly tune and deploy models without extensive hardware configuration costs or requirements. Lamini provides an end-to-end stack to develop, train, tune, and deploy models as per user convenience and model requirements. Lamini also provides its own hosted GPU computing network to train models cost-effectively.
Lamini’s memory tuning and computational optimization tools help train and tune models with high accuracy while controlling costs. Models can be hosted anywhere, on a private cloud or via Lamini’s GPU network. Below, we’ll walk through a step-by-step guide to preparing data for tuning large language models (LLMs) using the Lamini platform.
Data preparation
In general, preparing data for any fine-tuning task means selecting a domain-specific dataset and then cleaning it, formatting it into prompts, tokenizing it, and storing it. After loading the dataset, we preprocess it to convert it into an instruction-tuned dataset. We format each sample of the dataset into an instruction, question, and answer format to better fit it for our use cases. Please refer to the source of the dataset using the link provided here. Let's look at a code example showing how to prepare and tokenize the data for training using the Lamini platform.
import pandas as pd

# Load the dataset and store it as an instruction dataset
filename = "lamini_docs.json"
instruction_dataset_df = pd.read_json(filename, lines=True)
examples = instruction_dataset_df.to_dict()
if "question" in examples and "answer" in examples:
    text = examples["question"][0] + examples["answer"][0]
elif "instruction" in examples and "response" in examples:
    text = examples["instruction"][0] + examples["response"][0]
elif "input" in examples and "output" in examples:
    text = examples["input"][0] + examples["output"][0]
else:
    text = examples["text"][0]
prompt_template = """### Question:
{question}
### Answer:"""
# Store the fine-tuning examples in an instruction format
num_examples = len(examples["question"])
finetuning_dataset = []
for i in range(num_examples):
    question = examples["question"][i]
    answer = examples["answer"][i]
    text_with_prompt_template = prompt_template.format(question=question)
    finetuning_dataset.append({"question": text_with_prompt_template,
                               "answer": answer})
In the above example, we have formatted the “questions” and “answers” with a prompt template and collected them in a separate list, ready for tokenization and padding before training the LLM.
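If the formatted examples need to be written to disk before tokenization, one option is the JSON Lines format, with one record per line. The sketch below uses only the Python standard library. The helper names `save_jsonl` and `load_jsonl`, the file name, and the sample record are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_jsonl(records, path):
    """Write one JSON object per line (hypothetical helper)."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with Path(path).open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up record in the same shape as the fine-tuning examples above.
demo_records = [
    {"question": "### Question:\nWhat is Lamini?\n### Answer:",
     "answer": "A platform for fine-tuning LLMs."}
]

# Write to a temporary directory so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "finetuning_dataset.jsonl"
    save_jsonl(demo_records, demo_path)
    print(load_jsonl(demo_path) == demo_records)
```

A file written this way can be read back with `pd.read_json(path, lines=True)`, the same call used to load the original dataset.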
Tokenize the dataset
# Tokenization of the dataset with padding and truncation
# (assumes `tokenizer` has already been created, e.g. with AutoTokenizer.from_pretrained)
def tokenize_function(examples):
    if "question" in examples and "answer" in examples:
        text = examples["question"][0] + examples["answer"][0]
    elif "input" in examples and "output" in examples:
        text = examples["input"][0] + examples["output"][0]
    else:
        text = examples["text"][0]
    # Padding: reuse the end-of-sequence token as the padding token
    tokenizer.pad_token = tokenizer.eos_token
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        padding=True,
    )
    # Cap the sequence length at 2048 tokens
    max_length = min(
        tokenized_inputs["input_ids"].shape[1],
        2048
    )
    # Truncation: drop tokens from the left when the text is too long
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=max_length
    )
    return tokenized_inputs
The above code takes a dataset sample as input, tokenizes it with padding, and truncates it to at most 2048 tokens. The resulting tokenized samples can be used to fine-tune the pre-trained models. Now that the dataset is ready, we will discuss training and evaluating the models using the Lamini platform.
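To make the effect of padding and left-side truncation concrete, here is a small sketch that operates on plain lists of token IDs, without any tokenizer library. The function names, the token IDs, and the padding ID of 0 are made up for illustration. A real tokenizer also handles special tokens, which this sketch ignores.

```python
def truncate(ids, max_length, side="left"):
    """Keep at most max_length tokens, dropping from the given side."""
    if len(ids) <= max_length:
        return list(ids)
    if side == "left":
        # Left truncation keeps the END of the sequence.
        return list(ids[-max_length:])
    return list(ids[:max_length])


def pad_batch(batch, pad_id):
    """Right-pad every sequence to the longest one and build attention masks."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in batch
    ]
    return input_ids, attention_mask


# Two made-up token sequences of different lengths.
raw_batch = [[5, 6, 7, 8, 9], [1, 2]]
truncated = [truncate(ids, max_length=3, side="left") for ids in raw_batch]
input_ids, attention_mask = pad_batch(truncated, pad_id=0)
print(input_ids)
print(attention_mask)
```

Truncating from the left keeps the most recent tokens, which for question-and-answer text means the answer portion is the part most likely to be preserved.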
Fine-tuning process
Now that we have a dataset prepared in an instruction-tuning format, we will load it into the environment and fine-tune the pre-trained LLM using Lamini's easy-to-use training techniques.
Setting up an environment
To start fine-tuning open-source LLMs with Lamini, we first need to make sure that our code environment has the right resources and libraries installed. We need a suitable machine with enough GPU resources and the necessary libraries, such as transformers, datasets, torch, and pandas. We also need to securely load environment variables such as api_url and api_key, usually from environment files. Packages such as dotenv can load these variables. After preparing the environment, we load the dataset and models for training.
import os
import lamini
from lamini import Lamini

lamini.api_url = os.getenv("POWERML__PRODUCTION__URL")
lamini.api_key = os.getenv("POWERML__PRODUCTION__KEY")
# Import the necessary libraries
import datasets
import tempfile
import logging
import random
import config
import yaml
import time
import torch
import transformers
import pandas as pd
import jsonlines
# Load the transformer classes and the local helper utilities
# (utilities is a local module; it is assumed to provide tokenize_and_split_data and Trainer)
from utilities import *
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
from transformers import TrainingArguments
from llama import BasicModelRunner
logger = logging.getLogger(__name__)
global_config = None
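Because `os.getenv` silently returns `None` when a variable is missing, it can help to fail early with a clear message. The sketch below shows one way to do that with the standard library only. The helper name `require_env` and the variable `DEMO_API_KEY` are hypothetical and used purely for illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Placeholder value so the demo runs; a real key would come from the
# shell environment or an environment file, never from source code.
os.environ.setdefault("DEMO_API_KEY", "not-a-real-key")
api_key = require_env("DEMO_API_KEY")
print("Loaded a key of length", len(api_key))
```

If the python-dotenv package is installed, calling its `load_dotenv()` function before `require_env` populates the environment from a `.env` file first.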
Load dataset
After setting up logging for monitoring and debugging, prepare your dataset using datasets or other data handling libraries like jsonlines and pandas. After loading the dataset, we will configure a tokenizer and a model with training configurations for the training process.
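# Load the dataset from your local system or from the Hugging Face Hub
# Option 1: local JSON Lines file
dataset_name = "lamini_docs.jsonl"
dataset_path = f"/content/{dataset_name}"
use_hf = False
# Option 2: Hugging Face Hub dataset (overrides the local path above)
dataset_path = "lamini/lamini_docs"
use_hf = True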
Setting up the model, training settings, and tokenizer
Next, we select the model to fine-tune, “EleutherAI/pythia-70m,” and define its configuration in training_config, specifying the name of the pre-trained model and the dataset path. We initialize the AutoTokenizer with the model’s tokenizer and set the padding token to the end-of-sequence token. We then tokenize the data and split it into training and test datasets using a custom function, tokenize_and_split_data. Finally, we instantiate the base model using AutoModelForCausalLM, allowing it to perform causal language modeling tasks. Additionally, the code below selects the compute device for fine-tuning: a GPU if one is available, otherwise the CPU.
# Model name
model_name = "EleutherAI/pythia-70m"
# Training config
training_config = {
    "model": {
        "pretrained_name": model_name,
        "max_length": 2048
    },
    "datasets": {
        "use_hf": use_hf,
        "path": dataset_path
    },
    "verbose": True
}
# Set up the auto tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
train_dataset, test_dataset = tokenize_and_split_data(training_config, tokenizer)
# Load the base model for causal language modeling
base_model = AutoModelForCausalLM.from_pretrained(model_name)
# Select a GPU if one is available, otherwise fall back to the CPU
device_count = torch.cuda.device_count()
if device_count > 0:
    logger.debug("Select GPU device")
    device = torch.device("cuda")
else:
    logger.debug("Select CPU device")
    device = torch.device("cpu")
# Move the model to the selected device
base_model.to(device)
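The custom function tokenize_and_split_data is not shown in this article. As a rough idea of what the splitting step might involve, the sketch below shuffles a list of records with a fixed seed and holds out a fraction for testing. The function name `split_train_test`, the 10% test size, and the seed are assumptions and do not come from the original helper.

```python
import random


def split_train_test(records, test_size=0.1, seed=123):
    """Shuffle records reproducibly and hold out a fraction for testing."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be between 0 and 1")
    shuffled = list(records)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_size)))
    return shuffled[n_test:], shuffled[:n_test]


# Ten made-up records stand in for tokenized examples.
demo_records = [{"id": i} for i in range(10)]
train_split, test_split = split_train_test(demo_records)
print(len(train_split), len(test_split))
```

Fixing the seed means repeated runs evaluate on the same held-out examples, which makes loss values comparable across experiments.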
Training setup to fine-tune the model
Finally, we configure the training arguments with the hyperparameters used to fine-tune on the custom training dataset. These include the learning rate, number of epochs, batch size, output directory, evaluation steps, save steps, warm-up steps, and the evaluation and logging strategies.
max_steps = 3
# Trained model name
trained_model_name = f"lamini_docs_{max_steps}_steps"
output_dir = trained_model_name
training_args = TrainingArguments(
    # Learning rate
    learning_rate=1.0e-5,
    # Number of training epochs
    num_train_epochs=1,
    # Max steps to train for (each step is a batch of data)
    # Overrides num_train_epochs, if not -1
    max_steps=max_steps,
    # Batch size for training
    per_device_train_batch_size=1,
    # Directory to save model checkpoints
    output_dir=output_dir,
    # Other arguments
    overwrite_output_dir=False,  # Whether to overwrite the content of the output directory
    disable_tqdm=False,  # Whether to disable progress bars
    eval_steps=120,  # Number of update steps between two evaluations
    save_steps=120,  # Number of update steps between two checkpoint saves
    warmup_steps=1,  # Number of warmup steps for the learning rate scheduler
    per_device_eval_batch_size=1,  # Batch size for evaluation
    evaluation_strategy="steps",  # Named eval_strategy in newer transformers releases
    logging_strategy="steps",
    logging_steps=1,
    optim="adafactor",
    gradient_accumulation_steps=4,
    gradient_checkpointing=False,
    # Parameters for selecting the best checkpoint
    load_best_model_at_end=True,
    save_total_limit=1,
    metric_for_best_model="eval_loss",
    greater_is_better=False
)
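Two consequences of the settings above are easy to check with a little arithmetic: the effective batch size per optimizer update, and the steps at which evaluation would run. The sketch below is a hedged illustration. The helper names are hypothetical, and it assumes a single device and that evaluation triggers at every multiple of eval_steps up to max_steps.

```python
def effective_batch_size(per_device_batch_size,
                         gradient_accumulation_steps,
                         num_devices=1):
    """Number of examples contributing to one optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def evaluation_points(max_steps, eval_steps):
    """Update steps at which a step-based evaluation would run."""
    return list(range(eval_steps, max_steps + 1, eval_steps))


# Values taken from the training arguments above.
batch = effective_batch_size(per_device_batch_size=1,
                             gradient_accumulation_steps=4)
print("Effective batch size:", batch)
print("Examples seen in 3 steps:", 3 * batch)
print("Evaluation points:", evaluation_points(max_steps=3, eval_steps=120))
```

With max_steps set to 3 and eval_steps set to 120, the list of evaluation points is empty, so a short demo run like this one finishes before any step-based evaluation occurs. For a real training run, max_steps would need to be considerably larger.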
After setting up the training arguments, the code estimates the model’s floating-point operations (FLOPs) from the input size and the gradient accumulation steps, which gives insight into the computational load. It also reports the model’s memory footprint in gigabytes. Once these calculations are complete, a trainer is initialized with the base model, the FLOPs estimate, the total number of training steps, and the prepared training and evaluation datasets. This setup enables monitoring of resource usage, which is critical to efficiently handling large-scale model fine-tuning. At the end of training, the fine-tuned model is ready for deployment to the cloud to serve users as an API.
# Model parameters
model_flops = (
    base_model.floating_point_ops(
        {
            "input_ids": torch.zeros(
                (1, training_config["model"]["max_length"])
            )
        }
    )
    * training_args.gradient_accumulation_steps
)
print(base_model)
print("Memory footprint", base_model.get_memory_footprint() / 1e9, "GB")
print("Flops", model_flops / 1e9, "GFLOPs")
# Set up a trainer
# (Trainer is assumed to be the custom class from the local utilities module,
# since model_flops and total_steps are not arguments of transformers.Trainer)
trainer = Trainer(
    model=base_model,
    model_flops=model_flops,
    total_steps=max_steps,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
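For a feel of the magnitudes involved, the memory footprint and FLOPs figures can be approximated by hand. The sketch below uses two rough rules of thumb: weights stored in 32-bit floats take 4 bytes per parameter, and one forward and backward pass costs about 6 floating-point operations per parameter per token. The round figure of 70 million parameters is an assumption based on the model name, and the actual values printed by the code above will differ.

```python
def memory_footprint_gb(num_parameters, bytes_per_parameter=4):
    """Approximate weight memory in gigabytes (ignores buffers and activations)."""
    return num_parameters * bytes_per_parameter / 1e9


def approx_training_flops(num_parameters, num_tokens):
    """Rule-of-thumb cost of one forward and backward pass."""
    return 6 * num_parameters * num_tokens


# Assumed round numbers, for illustration only.
NUM_PARAMETERS = 70_000_000   # rough size suggested by the name "pythia-70m"
MAX_LENGTH = 2048             # max_length from the training config
GRAD_ACCUM_STEPS = 4          # gradient_accumulation_steps

flops_per_update = (
    approx_training_flops(NUM_PARAMETERS, MAX_LENGTH) * GRAD_ACCUM_STEPS
)
print("Memory footprint (GB):", memory_footprint_gb(NUM_PARAMETERS))
print("GFLOPs per update:", flops_per_update / 1e9)
```

Estimates like these are mainly useful as a sanity check: if the reported footprint is far larger than expected, the model may have been loaded at a higher precision or with extra buffers.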
Conclusion
In conclusion, this article provides a detailed guide to understanding the need for tuning LLMs using the Lamini platform. It gives a complete overview of why we should tune the model for custom datasets and business use cases and the benefits of using Lamini tools. We also saw a step-by-step guide to tuning the model using a custom dataset and LLM with Lamini tools. Let us summarize the key takeaways from the blog.
Key takeaways
- Understanding why fine-tuning is needed compared with prompt engineering and retrieval-augmented generation methods.
- Using platforms such as Lamini for easy-to-use hardware setup and deployment techniques for user-tailored models.
- Preparing data for the fine-tuning task and setting up a pipeline to train a base model using a wide range of hyperparameters.
The media displayed in this article is not the property of Analytics Vidhya and is used at the discretion of the author.
Frequently Asked Questions
A. The fine-tuning process starts with understanding the context-specific requirements, preparing the dataset, tokenizing, and setting up training configurations such as hardware requirements, training settings, and training arguments. Finally, a training job is run for model development.
A. Fine-tuning an LLM means training a base model on a specific custom dataset. This produces accurate and contextually relevant results for specific queries based on the use case.
A. Lamini offers integrated language model fine-tuning, inference, and GPU setup for smooth, efficient, and cost-effective LLM development.