Small Language Models (SLMs) are having a significant impact on AI. They deliver solid performance while remaining efficient and cost-effective. A notable example is Llama 3.2 3B: it performs exceptionally well on retrieval-augmented generation (RAG) tasks, reducing computational cost and memory usage while maintaining high accuracy. This article walks through fine-tuning the Llama 3.2 3B model and shows how smaller models can excel at RAG tasks, pushing the boundaries of what compact AI solutions can achieve.
What is Llama 3.2 3B?
The Llama 3.2 3B model, developed by Meta, is a multilingual SLM with 3 billion parameters, designed for tasks such as question answering, summarization and dialogue systems. It outperforms many open source models in industry benchmarks and supports multiple languages. Available in multiple sizes, Llama 3.2 delivers efficient computational performance and includes quantized versions for faster, more memory-efficient deployment in mobile and edge environments.
Also Read: Top 13 Small Language Models (SLM)
Finetuning Llama 3.2 3B
Fine-tuning is essential to adapt an SLM or LLM to specific domains or tasks, such as medical, legal, or RAG applications. While pre-training gives language models the ability to generate text on a variety of topics, fine-tuning retrains the model on domain- or task-specific data to improve relevance and performance. To address the high computational cost of tuning all parameters, techniques such as parameter-efficient fine-tuning (PEFT) train only a subset of the model's parameters, optimizing resource usage while maintaining performance.
LoRA
One such PEFT method is Low-Rank Adaptation (LoRA).
In LoRA, the update applied to a weight matrix W of the SLM or LLM is expressed as a product of two low-rank matrices:
ΔW = W_A * W_B
If W has m rows and n columns, then W_A has m rows and r columns, and W_B has r rows and n columns. Here r is much smaller than m or n, so instead of training m*n values, we only train r*(m+n) values. r is called the rank, and it is a hyperparameter we can choose.
def lora_linear(x):
    h = x @ W  # regular linear
    h += scale * (x @ W_A @ W_B)  # low-rank update
    return h
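To get a feel for the savings, here is a minimal sketch; the 4096 x 4096 matrix shape and rank 16 are illustrative assumptions, not values taken from the Llama 3.2 architecture:

# Compare full fine-tuning vs. LoRA trainable-parameter counts (illustrative).
m, n = 4096, 4096          # assumed weight-matrix shape, for illustration only
r = 16                     # LoRA rank

full_params = m * n        # values trained when updating W directly
lora_params = r * (m + n)  # values trained with the low-rank factors W_A and W_B

print(f"Full fine-tuning: {full_params:,}")   # 16,777,216
print(f"LoRA (r={r}): {lora_params:,}")       # 131,072
print(f"Reduction: ~{full_params // lora_params}x")  # ~128x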
Check out: Parameter-Efficient Fine-Tuning of Large Language Models with LoRA and QLoRA
Let's implement LoRA in the Llama 3.2 3B model.
Required Libraries
- unsloth – 2024.12.9
- datasets – 3.1.0
Installing the Unsloth version above will also install compatible PyTorch, transformers, and NVIDIA GPU libraries. We can use Google Colab to access a GPU.
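A minimal install cell for Colab might look like the following; the pinned versions simply mirror the list above, so adjust them as needed:

# Run in a Colab cell; the versions mirror the requirements listed above.
!pip install unsloth==2024.12.9 datasets==3.1.0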
Let's see the implementation now!
Import the libraries
from unsloth import FastLanguageModel, is_bfloat16_supported, train_on_responses_only
from datasets import load_dataset, Dataset
from trl import SFTTrainer, apply_chat_template
from transformers import TrainingArguments, DataCollatorForSeq2Seq, TextStreamer
import torch
Initialize the model and tokenizer
max_seq_length = 2048
dtype = None # None for auto-detection.
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/Llama-3.2-3B-Instruct",
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
# token = "hf_...", # use if using gated models like meta-llama/Llama-3.2-11b
)
For other Unsloth-compatible models, consult this document: https://docs.unsloth.ai/get-started/all-our-models
Initialize the model for PEFT
model = FastLanguageModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha = 16,
lora_dropout = 0,
bias = "none",
use_gradient_checkpointing = "unsloth",
random_state = 42,
use_rslora = False,
loftq_config = None,
)
Description of each parameter
- r: LoRA rank; higher values improve accuracy but use more memory (suggested: 8–128). A quick check of the resulting trainable-parameter count is sketched after this list.
- target_modules: modules to fine-tune; include all of them for best results.
- lora_alpha: scaling factor; usually set equal to or double the rank r.
- lora_dropout: dropout rate; set to 0 for optimized, faster training.
- bias: bias type; "none" is optimized for speed and minimal overfitting.
- use_gradient_checkpointing: reduces memory usage for long-context training; "unsloth" is highly recommended.
- random_state: seed for deterministic runs, ensuring reproducible results (e.g., 42).
- use_rslora: enables rank-stabilized LoRA, which automates the choice of lora_alpha.
- loftq_config: initializes LoRA with the top r singular vectors of the weights for higher accuracy, at the cost of more memory.
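To verify how few parameters LoRA actually trains, the LoRA-wrapped model exposes the standard peft helper for counting trainable parameters (this assumes the usual peft API is reachable through Unsloth's wrapper; the manual loop below works on any PyTorch model):

# Print the trainable vs. total parameter counts of the LoRA-wrapped model.
model.print_trainable_parameters()

# Equivalent manual check that works on any PyTorch model.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")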
Data processing
We will use RAG data for fine-tuning. Download the dataset from Hugging Face:
dataset = load_dataset("neural-bridge/rag-dataset-1200", split = "train")
The dataset has three keys, as shown below:
Dataset({
    features: ['context', 'question', 'answer'],
    num_rows: 960
})
The data must be in the chat format expected by the language model, so let's convert the dataset to the required format:
def convert_dataset_to_dict(dataset):
    dataset_dict = {
        "prompt": [],
        "completion": []
    }
    for row in dataset:
        user_content = f"Context: {row['context']}\nQuestion: {row['question']}"
        assistant_content = row['answer']
        dataset_dict["prompt"].append([
            {"role": "user", "content": user_content}
        ])
        dataset_dict["completion"].append([
            {"role": "assistant", "content": assistant_content}
        ])
    return dataset_dict
converted_data = convert_dataset_to_dict(dataset)
dataset = Dataset.from_dict(converted_data)
dataset = dataset.map(apply_chat_template, fn_kwargs={"tokenizer": tokenizer})
After mapping, each example in the dataset contains the chat-formatted prompt and completion strings expected by the model.
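To confirm the formatting, print one converted example (a quick inspection of the prompt and completion fields produced by apply_chat_template):

# Inspect the first converted example to confirm the chat formatting.
print(dataset[0]["prompt"])
print(dataset[0]["completion"])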
Setting trainer parameters
We can now initialize the trainer to fine-tune the SLM:
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
max_seq_length = max_seq_length,
data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
dataset_num_proc = 2,
packing = False, # Can make training 5x faster for short sequences.
args = TrainingArguments(
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
warmup_steps = 5,
# num_train_epochs = 1, # Set this for 1 full training run.
max_steps = 6, # using small number to test
learning_rate = 2e-4,
fp16 = not is_bfloat16_supported(),
bf16 = is_bfloat16_supported(),
logging_steps = 1,
optim = "adamw_8bit",
weight_decay = 0.01,
lr_scheduler_type = "linear",
seed = 3407,
output_dir = "outputs",
report_to = "none", # Use this for WandB etc
),
)
Description of some of the parameters:
- per_device_train_batch_size: batch size per device; increase it to use more GPU memory, but be aware of padding inefficiencies (suggested: 2).
- gradient_accumulation_steps: simulates larger batch sizes without additional memory usage; increase it for smoother loss curves (suggested: 4). The effective batch size is the product of these two settings, as sketched after this list.
- max_steps: total training steps; set it for quick test runs (e.g., 60), or use `num_train_epochs` for full passes over the dataset (e.g., 1–3).
- learning_rate: controls the speed and convergence of training; lower rates improve accuracy but slow down training (2e-4 is used here).
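For reference, the effective batch size seen by the optimizer with the configuration above is:

# Effective batch size with the settings above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
print(per_device_train_batch_size * gradient_accumulation_steps)  # 8 examples per optimizer step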
Train the model only on the responses by specifying the instruction and response parts of the chat template:
trainer = train_on_responses_only(
trainer,
instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)
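To check that the prompt tokens are actually masked out of the loss, you can decode one training example and its labels; this sanity check follows the pattern used in the Unsloth example notebooks, where tokens with label -100 are ignored by the loss:

# Decode a training example and its labels to confirm that only the
# assistant response contributes to the loss (-100 marks masked tokens).
sample = trainer.train_dataset[0]
print(tokenizer.decode(sample["input_ids"]))

space = tokenizer(" ", add_special_tokens = False).input_ids[0]
print(tokenizer.decode([space if t == -100 else t for t in sample["labels"]]))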
Tuning the model
trainer_stats = trainer.train()
The returned trainer_stats object summarizes the run, including runtime and loss metrics.
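You can print these metrics directly; the exact keys depend on the transformers version, so treat this as a quick inspection rather than a fixed schema:

# Inspect the metrics returned by trainer.train(), e.g. train_runtime and train_loss.
print(trainer_stats.metrics)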
Test and save the model
Let's use the model for inference:
FastLanguageModel.for_inference(model)
messages = [
    {"role": "user", "content": "Context: The sky is typically clear during the day. Question: What color is the water?"},
]
inputs = tokenizer.apply_chat_template(
messages,
tokenize = True,
add_generation_prompt = True,
return_tensors = "pt",
).to("cuda")
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
use_cache = True, temperature = 1.5, min_p = 0.1)
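If you prefer the generated text as a plain string instead of streaming it, you can decode only the newly generated tokens (a minimal variant of the call above):

# Generate without streaming and decode only the newly generated tokens.
outputs = model.generate(input_ids = inputs, max_new_tokens = 128, use_cache = True)
generated_tokens = outputs[0][inputs.shape[-1]:]
print(tokenizer.decode(generated_tokens, skip_special_tokens = True))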
To save the fine-tuned model with the LoRA weights merged into the base weights, use the following code:
model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit")
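The merged model can later be reloaded for inference just like the base model; the local path "model" simply matches the directory used in the save call above:

# Reload the merged model from the local "model" directory for inference.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "model",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)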
Check out: Guide to Fine-Tuning Large Language Models
Conclusion
Fine-tuning Llama 3.2 3B for RAG tasks demonstrates how efficiently smaller models can deliver high performance at low computational cost. Techniques such as LoRA optimize resource usage while maintaining accuracy. This approach empowers domain-specific applications, making advanced AI more accessible, scalable, and cost-effective, driving innovation in retrieval-augmented generation and democratizing AI for real-world challenges.
Read also: Introduction to Meta Llama 3.2
Frequently asked questions
Q. What is retrieval-augmented generation (RAG)?
A. RAG combines retrieval systems with generative models to improve responses by grounding them in external knowledge, making it ideal for tasks such as question answering and summarization.
Q. Why use Llama 3.2 3B for RAG tasks?
A. Llama 3.2 3B offers a balance of performance, efficiency, and scalability, making it suitable for RAG tasks while reducing computational and memory requirements.
Q. What is LoRA?
A. Low-Rank Adaptation (LoRA) minimizes resource usage by training only low-rank matrices instead of all model parameters, enabling efficient fine-tuning on constrained hardware.
Q. What dataset is used for fine-tuning?
A. Hugging Face hosts the neural-bridge/rag-dataset-1200 dataset, which contains contexts, questions, and answers, used to fine-tune the Llama 3.2 3B model for better task performance.
Q. Can Llama 3.2 3B run on mobile and edge devices?
A. Yes, Llama 3.2 3B, especially in its quantized form, is optimized for memory-efficient deployment in mobile and edge environments.