Small Language Models (SLMs) are having a significant impact on AI. They deliver solid performance while remaining efficient and cost-effective. A notable example is Llama 3.2 3B: it performs exceptionally well on retrieval-augmented generation (RAG) tasks, reducing computational cost and memory usage while maintaining high accuracy. This article walks through fine-tuning the Llama 3.2 3B model and shows how smaller models can excel at RAG tasks, pushing the boundaries of what compact AI solutions can achieve.
What is Llama 3.2 3B?
The Llama 3.2 3B model, developed by Meta, is a multilingual SLM with 3 billion parameters, designed for tasks such as question answering, summarization and dialogue systems. It outperforms many open source models in industry benchmarks and supports multiple languages. Available in multiple sizes, Llama 3.2 delivers efficient computational performance and includes quantized versions for faster, more memory-efficient deployment in mobile and edge environments.
Also Read: Top 13 Small Language Models (SLM)
Finetuning Llama 3.2 3B
Fine-tuning is essential to adapt an SLM or LLM to specific domains or tasks, such as medical, legal, or RAG applications. While pre-training gives language models the ability to generate text on a variety of topics, fine-tuning retrains the model on domain- or task-specific data to improve relevance and performance. To address the high computational cost of tuning all parameters, techniques such as parameter-efficient fine-tuning (PEFT) train only a subset of the model's parameters, optimizing resource usage while maintaining performance.
LoRA
One such PEFT method is Low-Rank Adaptation (LoRA).
In LoRA, the update applied to a weight matrix W of the SLM or LLM is expressed as a product of two low-rank matrices:
ΔW = W_A * W_B
If W has m rows and n columns, then W_A has m rows and r columns, and W_B has r rows and n columns. Here r is much smaller than m or n, so instead of training m*n values, we only train r*(m+n) values. r is called the rank, and it is a hyperparameter we can choose.
def lora_linear(x):
    h = x @ W  # regular linear
    h += scale * (x @ W_A @ W_B)  # low-rank update
    return h
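To get a feel for the savings, here is a minimal sketch; the 4096 x 4096 matrix shape and rank 16 are illustrative assumptions, not values taken from the Llama 3.2 architecture:

# Compare full fine-tuning vs. LoRA trainable-parameter counts (illustrative).
m, n = 4096, 4096          # assumed weight-matrix shape, for illustration only
r = 16                     # LoRA rank

full_params = m * n        # values trained when updating W directly
lora_params = r * (m + n)  # values trained with the low-rank factors W_A and W_B

print(f"Full fine-tuning: {full_params:,}")   # 16,777,216
print(f"LoRA (r={r}): {lora_params:,}")       # 131,072
print(f"Reduction: ~{full_params // lora_params}x")  # ~128x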
Check out: Parameter-Efficient Fine-Tuning of Large Language Models with LoRA and QLoRA
Let's implement LoRA in the Llama 3.2 3B model.
Required Libraries
- unsloth – 2024.12.9
- datasets – 3.1.0
Installing the Unsloth version above will also install compatible PyTorch, transformers, and NVIDIA GPU libraries. We can use Google Colab to access a GPU.
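A minimal install cell for Colab might look like the following; the pinned versions simply mirror the list above, so adjust them as needed:

# Run in a Colab cell; the versions mirror the requirements listed above.
!pip install unsloth==2024.12.9 datasets==3.1.0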
Let's see the implementation now!
Import the libraries
from unsloth import FastLanguageModel, is_bfloat16_supported, train_on_responses_only
from datasets import load_dataset, Dataset
from trl import SFTTrainer, apply_chat_template
from transformers import TrainingArguments, DataCollatorForSeq2Seq, TextStreamer
import torch
Initialize the model and tokenizer
max_seq_length = 2048
dtype = None # None for auto-detection.
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/Llama-3.2-3B-Instruct",
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
# token = "hf_...", # use if using gated models like meta-llama/Llama-3.2-11b
)
For other Unsloth-compatible models, consult this document: https://docs.unsloth.ai/get-started/all-our-models
Initialize the model for PEFT
model = FastLanguageModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha = 16,
lora_dropout = 0,
bias = "none",
use_gradient_checkpointing = "unsloth",
random_state = 42,
use_rslora = False,
loftq_config = None,
)
Description of each parameter
- r: LoRA rank; higher values improve accuracy but use more memory (suggested: 8–128). A quick check of the resulting trainable-parameter count is sketched after this list.
- target_modules: modules to fine-tune; include all of them for best results.
- lora_alpha: scaling factor; usually set equal to or double the rank r.
- lora_dropout: dropout rate; set to 0 for optimized, faster training.
- bias: bias type; "none" is optimized for speed and minimal overfitting.
- use_gradient_checkpointing: reduces memory usage for long-context training; "unsloth" is highly recommended.
- random_state: seed for deterministic runs, ensuring reproducible results (e.g., 42).
- use_rslora: enables rank-stabilized LoRA, which automates the choice of lora_alpha.
- loftq_config: initializes LoRA with the top r singular vectors of the weights for higher accuracy, at the cost of more memory.
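To verify how few parameters LoRA actually trains, the LoRA-wrapped model exposes the standard peft helper for counting trainable parameters (this assumes the usual peft API is reachable through Unsloth's wrapper; the manual loop below works on any PyTorch model):

# Print the trainable vs. total parameter counts of the LoRA-wrapped model.
model.print_trainable_parameters()

# Equivalent manual check that works on any PyTorch model.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")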
Data processing
We will use RAG data for fine-tuning. Download the dataset from Hugging Face:
dataset = load_dataset("neural-bridge/rag-dataset-1200", split = "train")
The dataset has three keys, as shown below:
Dataset({
    features: ['context', 'question', 'answer'],
    num_rows: 960
})
The data must be in the chat format expected by the language model, so let's convert the dataset to the required format:
def convert_dataset_to_dict(dataset):
    dataset_dict = {
        "prompt": [],
        "completion": []
    }
    for row in dataset:
        user_content = f"Context: {row['context']}\nQuestion: {row['question']}"
        assistant_content = row['answer']
        dataset_dict["prompt"].append([
            {"role": "user", "content": user_content}
        ])
        dataset_dict["completion"].append([
            {"role": "assistant", "content": assistant_content}
        ])
    return dataset_dict
converted_data = convert_dataset_to_dict(dataset)
dataset = Dataset.from_dict(converted_data)
dataset = dataset.map(apply_chat_template, fn_kwargs={"tokenizer": tokenizer})
After mapping, each example in the dataset contains the chat-formatted prompt and completion strings expected by the model.
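To confirm the formatting, print one converted example (a quick inspection of the prompt and completion fields produced by apply_chat_template):

# Inspect the first converted example to confirm the chat formatting.
print(dataset[0]["prompt"])
print(dataset[0]["completion"])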
Setting trainer parameters
We can now initialize the trainer to fine-tune the SLM:
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
max_seq_length = max_seq_length,
data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
dataset_num_proc = 2,
packing = False, # Can make training 5x faster for short sequences.
args = TrainingArguments(
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
warmup_steps = 5,
# num_train_epochs = 1, # Set this for 1 full training run.
max_steps = 6, # using small number to test
learning_rate = 2e-4,
fp16 = not is_bfloat16_supported(),
bf16 = is_bfloat16_supported(),
logging_steps = 1,
optim = "adamw_8bit",
weight_decay = 0.01,
lr_scheduler_type = "linear",
seed = 3407,
output_dir = "outputs",
report_to = "none", # Use this for WandB etc
),
)
Description of some of the parameters:
- per_device_train_batch_size: batch size per device; increase it to use more GPU memory, but be aware of padding inefficiencies (suggested: 2).
- gradient_accumulation_steps: simulates larger batch sizes without additional memory usage; increase it for smoother loss curves (suggested: 4). The effective batch size is the product of these two settings, as sketched after this list.
- max_steps: total training steps; set it for quick test runs (e.g., 60), or use `num_train_epochs` for full passes over the dataset (e.g., 1–3).
- learning_rate: controls the speed and convergence of training; lower rates improve accuracy but slow down training (2e-4 is used here).
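For reference, the effective batch size seen by the optimizer with the configuration above is:

# Effective batch size with the settings above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
print(per_device_train_batch_size * gradient_accumulation_steps)  # 8 examples per optimizer step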
Train the model only on the responses by specifying the instruction and response parts of the chat template:
trainer = train_on_responses_only(
trainer,
instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)
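To check that the prompt tokens are actually masked out of the loss, you can decode one training example and its labels; this sanity check follows the pattern used in the Unsloth example notebooks, where tokens with label -100 are ignored by the loss:

# Decode a training example and its labels to confirm that only the
# assistant response contributes to the loss (-100 marks masked tokens).
sample = trainer.train_dataset[0]
print(tokenizer.decode(sample["input_ids"]))

space = tokenizer(" ", add_special_tokens = False).input_ids[0]
print(tokenizer.decode([space if t == -100 else t for t in sample["labels"]]))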
Tuning the model
trainer_stats = trainer.train()
The returned trainer_stats object summarizes the run, including runtime and loss metrics.
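You can print these metrics directly; the exact keys depend on the transformers version, so treat this as a quick inspection rather than a fixed schema:

# Inspect the metrics returned by trainer.train(), e.g. train_runtime and train_loss.
print(trainer_stats.metrics)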
Test and save the model
Let's use the model for inference:
FastLanguageModel.for_inference(model)
messages = [
    {"role": "user", "content": "Context: The sky is typically clear during the day. Question: What color is the water?"},
]
inputs = tokenizer.apply_chat_template(
messages,
tokenize = True,
add_generation_prompt = True,
return_tensors = "pt",
).to("cuda")
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
use_cache = True, temperature = 1.5, min_p = 0.1)
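If you prefer the generated text as a plain string instead of streaming it, you can decode only the newly generated tokens (a minimal variant of the call above):

# Generate without streaming and decode only the newly generated tokens.
outputs = model.generate(input_ids = inputs, max_new_tokens = 128, use_cache = True)
generated_tokens = outputs[0][inputs.shape[-1]:]
print(tokenizer.decode(generated_tokens, skip_special_tokens = True))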
To save the fine-tuned model with the LoRA weights merged into the base weights, use the following code:
model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit")
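The merged model can later be reloaded for inference just like the base model; the local path "model" simply matches the directory used in the save call above:

# Reload the merged model from the local "model" directory for inference.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "model",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)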
Check out: Guide to Fine-Tuning Large Language Models
Conclusion
Fine-tuning Llama 3.2 3B for RAG tasks demonstrates how efficiently smaller models can deliver high performance at low computational cost. Techniques such as LoRA optimize resource usage while maintaining accuracy. This approach empowers domain-specific applications, making advanced AI more accessible, scalable, and cost-effective, driving innovation in retrieval-augmented generation and democratizing AI for real-world challenges.
Read also: Introduction to Meta Llama 3.2
Frequently asked questions
Q. What is retrieval-augmented generation (RAG)?
A. RAG combines retrieval systems with generative models to improve responses by grounding them in external knowledge, making it ideal for tasks such as question answering and summarization.
Q. Why use Llama 3.2 3B for RAG tasks?
A. Llama 3.2 3B offers a balance of performance, efficiency, and scalability, making it suitable for RAG tasks while reducing computational and memory requirements.
Q. What is LoRA?
A. Low-Rank Adaptation (LoRA) minimizes resource usage by training only low-rank matrices instead of all model parameters, enabling efficient fine-tuning on constrained hardware.
Q. What dataset is used for fine-tuning?
A. Hugging Face hosts the neural-bridge/rag-dataset-1200 dataset, which contains contexts, questions, and answers, used to fine-tune the Llama 3.2 3B model for better task performance.
Q. Can Llama 3.2 3B run on mobile and edge devices?
A. Yes, Llama 3.2 3B, especially in its quantized form, is optimized for memory-efficient deployment in mobile and edge environments.