Pre-trained large language models (LLMs) can only perform next-token prediction, which prevents them from answering questions. That's why these base models are then fine-tuned on pairs of instructions and answers to act as helpful assistants. However, this process can still be flawed: fine-tuned LLMs can be biased, toxic, harmful, etc. This is where Reinforcement Learning from Human Feedback (RLHF) comes into play.
RLHF provides different answers to the LLM, which are ranked according to a desired behavior (helpfulness, toxicity, etc.). The model learns to output the best answer among these candidates, hence mimicking the behavior we want to instill. Often seen as a way to censor models, this process has recently become popular for improving performance, as shown in neural-chat-7b-v3-1.
In this article, we will create NeuralHermes-2.5 by fine-tuning OpenHermes-2.5 using an RLHF-like technique: Direct Preference Optimization (DPO). To this end, we will introduce a preference dataset, describe how the DPO algorithm works, and apply it to our model. We will see that it significantly improves the performance of the base model on the Open LLM Leaderboard.
As usual, the code is available on GitHub and Google Colab.
Preference datasets are not standardized, but they typically consist of a collection of answers ranked by humans. This ranking is essential, as the RLHF process fine-tunes LLMs to output the preferred answer. Here is an example from Anthropic/hh-rlhf, a popular preference dataset:
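If you want to look at it yourself, here is a minimal sketch (using the same datasets library as the rest of this article) to load it and print one row:
from datasets import load_dataset

# Load the Anthropic/hh-rlhf preference dataset and look at one sample
hh = load_dataset("Anthropic/hh-rlhf", split="train")
print(hh[0]["chosen"])    # the human-preferred conversation
print(hh[0]["rejected"])  # the rejected alternative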
The structure of the data set is simple: for each row, there is a chosen (preferred) answer and a rejected answer. The goal of RLHF is to guide the model to generate the preferred response.
Preference datasets are notoriously costly and difficult to create, as they require collecting manual feedback from humans. This feedback is also subjective, can easily be biased toward safe (but incorrect) answers, and can be contradictory (different annotators have different values). Over time, several solutions have been proposed to tackle these issues, such as replacing human feedback with AI feedback (RLAIF).
These datasets also tend to be much smaller than fine-tuning datasets. To illustrate this, the excellent neural-chat-7b-v3-1 (best 7B LLM on the Open LLM Leaderboard when it was released) uses 518k samples for fine-tuning (Open-Orca/SlimOrca) but only 12.9k samples for RLHF (Intel/orca_dpo_pairs). In this case, the authors generated answers with GPT-4/3.5 to create the preferred answers, and with Llama 2 13b chat to create the rejected responses. It's a smart way to bypass human feedback and only rely on models with different levels of performance.
While the concept of RLHF has been used in robotics for a long time, it was popularized for LLMs in OpenAI's paper Fine-Tuning Language Models from Human Preferences. In this paper, the authors present a framework where a reward model is trained to approximate human feedback. This reward model is then used to optimize the fine-tuned model's policy using the Proximal Policy Optimization (PPO) algorithm.
The core concept of PPO revolves around making smaller, incremental updates to the policy, as larger updates can lead to instability or suboptimal solutions. From experience, this technique is unfortunately still unstable (loss diverges), difficult to reproduce (numerous hyperparameters, sensitive to random seeds), and computationally expensive.
This is where Direct Preference Optimization (DPO) comes into play. DPO simplifies control by treating the task as a classification problem. Concretely, it uses two models: the trained model (or policy model) and a copy of it called the reference model. During training, the goal is to make sure the trained model outputs higher probabilities for preferred answers than the reference model. Conversely, we also want it to output lower probabilities for rejected answers. It means we're penalizing the LLM for bad answers and rewarding it for good ones.
By using the LLM itself as a reward model and employing binary cross-entropy objectives, DPO efficiently aligns the model's outputs with human preferences without the need for extensive sampling, reward model fitting, or intricate hyperparameter adjustments. The result is a more stable, more efficient, and computationally less demanding process.
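To make the objective concrete, here is a simplified sketch of the DPO loss in PyTorch (just the idea, not TRL's actual implementation). It takes the summed log-probabilities of the chosen and rejected answers under the policy and reference models, and applies a binary cross-entropy to their scaled difference:
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # How much more (or less) likely each answer became under the trained policy
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Binary cross-entropy on the scaled margin: reward preferred answers, penalize rejected ones
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()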
In this example, we will fine-tune the excellent OpenHermes-2.5-Mistral-7B, which is a Mistral-7B model that was only supervised fine-tuned. To this end, we will use the Intel/orca_dpo_pairs dataset to align our model and improve its performance. We call this new model NeuralHermes-2.5-Mistral-7B.
The first step is to install the necessary libraries as follows.
pip install -q datasets trl peft bitsandbytes sentencepiece wandb
Once this is done, we can import the libraries. I'm also using the secrets tab in Google Colab to store my Hugging Face token.
import os
import gc
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from datasets import load_dataset
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from trl import DPOTrainer
import bitsandbytes as bnb
from google.colab import userdata
import wandb
# Defined in the secrets tab in Google Colab
hf_token = userdata.get('huggingface')
wb_token = userdata.get('wandb')
wandb.login(key=wb_token)
model_name = "teknium/OpenHermes-2.5-Mistral-7B"
new_model = "NeuralHermes-2.5-Mistral-7B"
OpenHermes-2.5-Mistral-7B uses a specific chat template, called ChatML. Below is an example of a conversation formatted with this template:
<|im_start|>system
You are a helpful chatbot assistant.<|im_end|>
<|im_start|>user
Hi<|im_end|>
<|im_start|>assistant
Hi, how can I help you?<|im_end|>
As you can see, ChatML defines different roles (system, user, assistant) and appends special tokens (<|im_start|> and <|im_end|>) to separate them. Moreover, DPOTrainer also requires a specific format with three columns: prompt, chosen, and rejected.
Our dataset contains four columns: system, question, chatgpt, and llama2-13b-chat. We will simply concatenate the system and question columns to the prompt column. We will also map the chatgpt column to "chosen" and llama2-13b-chat to "rejected". To format the dataset in a reliable way, we will use the tokenizer's apply_chat_template() function, which already uses ChatML.
def chatml_format(example):
    # Format system
    if len(example['system']) > 0:
        message = {"role": "system", "content": example['system']}
        system = tokenizer.apply_chat_template([message], tokenize=False)
    else:
        system = ""

    # Format instruction
    message = {"role": "user", "content": example['question']}
    prompt = tokenizer.apply_chat_template([message], tokenize=False, add_generation_prompt=True)

    # Format chosen answer
    chosen = example['chosen'] + "<|im_end|>\n"

    # Format rejected answer
    rejected = example['rejected'] + "<|im_end|>\n"

    return {
        "prompt": system + prompt,
        "chosen": chosen,
        "rejected": rejected,
    }
# Load dataset
dataset = load_dataset("Intel/orca_dpo_pairs")['train']
# Save columns
original_columns = dataset.column_names
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
# Format dataset
dataset = dataset.map(
chatml_format,
remove_columns=original_columns
)
Let's print a sample of the formatted data set to confirm that everything is working as expected:
{'prompt': '<|im_start|>system\nYou are an ai assistant. You will be given a task. You must generate a detailed and long answer.<|im_end|>\n<|im_start|>user\nGenerate an approximately fifteen-word sentence that describes all this data: Midsummer House eatType restaurant; Midsummer House food Chinese; Midsummer House priceRange moderate; Midsummer House customer rating 3 out of 5; Midsummer House near All Bar One<|im_end|>\n<|im_start|>assistant\n',
'chosen': 'Midsummer House is a moderately priced Chinese restaurant with a 3/5 customer rating, located near All Bar One.<|im_end|>\n',
'rejected': ' Sure! Here\'s a sentence that describes all the data you provided:\n\n"Midsummer House is a moderately priced Chinese restaurant with a customer rating of 3 out of 5, located near All Bar One, offering a variety of delicious dishes."<|im_end|>\n'}
We can see that the prompt combines the system and user instructions. Thanks to the add_generation_prompt=True argument, it also appends the beginning of the assistant's answer. If you want to skip this step, you can directly use the preprocessed dataset mlabonne/chatml_dpo_pairs.
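As a quick sanity check, here is a hypothetical snippet (reusing the tokenizer loaded above) showing what add_generation_prompt=True changes; the outputs are shown approximately in the comments:
# Illustrative check: format a single user message with and without add_generation_prompt
message = [{"role": "user", "content": "Hi"}]
print(tokenizer.apply_chat_template(message, tokenize=False))
# -> <|im_start|>user\nHi<|im_end|>\n
print(tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True))
# -> <|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n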
Next, we define the LoRA configurations to train the model. As described in Intel's blog post, we set the rank value to be equal to lora_alpha, which is unusual (2 * r as a rule of thumb). We also target all the linear modules with adapters.
# LoRA configuration
peft_config = LoraConfig(
r=16,
lora_alpha=16,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
    target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)
We are now ready to load the model we want to fine-tune with DPO. In this case, two models are required: the model to fine-tune and the reference model. This is mostly for the sake of readability, as the DPOTrainer object automatically creates a reference model if none is provided.
# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
load_in_4bit=True
)
model.config.use_cache = False

# Reference model
ref_model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
load_in_4bit=True
)
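Since the LoRA configuration above lists the linear modules by name, here is an optional, purely illustrative helper (not part of the original pipeline; the function name is made up) to check which linear layers the loaded 4-bit model actually contains:
# Illustrative helper: enumerate linear module names to double-check target_modules
def find_linear_module_names(model):
    names = set()
    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.Linear, bnb.nn.Linear4bit)):
            names.add(name.split(".")[-1])
    names.discard("lm_head")  # the output head is usually left out of LoRA adapters
    return sorted(names)

print(find_linear_module_names(model))  # expected to roughly match the target_modules list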
The final step consists of providing all the hyperparameters to TrainingArguments and DPOTrainer:
- Among them, the beta parameter is unique to DPO since it controls the divergence from the initial policy (0.1 is a typical value for it).
- Compared to the values described in Intel's blog post, we lower the learning rate (from 5e-4 to 5e-5) and the number of steps (from 1,000 to 200). I manually optimized these values after a few runs to stabilize training and achieve the best results.
We can now start training the model. Note that it requires an A100 GPU and takes about 1 hour to complete.
# Training arguments
training_args = TrainingArguments(
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
gradient_checkpointing=True,
learning_rate=5e-5,
lr_scheduler_type="cosine",
max_steps=200,
save_strategy="no",
logging_steps=1,
output_dir=new_model,
optim="paged_adamw_32bit",
warmup_steps=100,
bf16=True,
report_to="wandb",
)

# Create DPO trainer
dpo_trainer = DPOTrainer(
model,
ref_model,
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer,
peft_config=peft_config,
beta=0.1,
max_prompt_length=1024,
max_length=1536,
)
# Fine-tune model with DPO
dpo_trainer.train()
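As mentioned earlier, passing the reference model explicitly is mostly a readability choice. A sketch of the alternative, keeping every other argument the same, would let DPOTrainer build the frozen copy itself; training then proceeds identically.
# Alternative sketch: omit the explicit reference model
dpo_trainer = DPOTrainer(
    model,
    None,                      # DPOTrainer creates an internal frozen reference model
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    beta=0.1,
    max_prompt_length=1024,
    max_length=1536,
)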
Our model is now fine-tuned. You can check out the project on Weights & Biases (wandb.ai/mlabonne/NeuralHermes-2-5-Mistral-7B/runs/axe71gr0?workspace=user-mlabonne). Here are some interesting metrics to analyze:
Interestingly, the training loss quickly drops to zero (before 50 steps), despite the 100 warmup steps. Meanwhile, the other metrics keep evolving.
The train/rewards/chosen and train/rewards/rejected plots correspond to the mean difference between the log probabilities output by the trained and reference models. It makes sense that, over time, they diverge as our trained model learns the preferred answers. The train/rewards/margins plot shows the difference between these two. Finally, the train/rewards/accuracies plot shows the frequency of choosing the preferred answer. The trained model quickly reaches a perfect accuracy score, which is a good sign but could also mean that the difference between preferred and rejected answers is too obvious.
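For reference, these reward metrics are derived from the same quantities as the loss: the implicit reward is beta times the log-probability ratio against the reference model. Roughly (the variable names below are illustrative, not TRL's internals):
# Illustrative pseudocode for the logged reward metrics
chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)        # train/rewards/chosen
rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)  # train/rewards/rejected
reward_margins = chosen_rewards - rejected_rewards                      # train/rewards/margins
reward_accuracy = (chosen_rewards > rejected_rewards).float().mean()    # train/rewards/accuracies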
Now that it's trained, we can merge the adapter with the original model. We then save the merged model and the tokenizer before pushing them to the Hugging Face Hub.
# Save artifacts
dpo_trainer.model.save_pretrained("final_checkpoint")
tokenizer.save_pretrained("final_checkpoint")# Flush memory
del dpo_trainer, model, ref_model
gc.collect()
torch.cuda.empty_cache()
# Reload model in FP16 (instead of NF4)
base_model = AutoModelForCausalLM.from_pretrained(
model_name,
return_dict=True,
torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Merge base model with the adapter
model = PeftModel.from_pretrained(base_model, "final_checkpoint")
model = model.merge_and_unload()
# Save model and tokenizer
model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)
# Push them to the HF Hub
model.push_to_hub(new_model, use_temp_dir=False, token=hf_token)
tokenizer.push_to_hub(new_model, use_temp_dir=False, token=hf_token)
Let's see how our model performs in a real test. We'll format the prompt to ask a basic question: "What is a Large Language Model?"
# Format prompt
message = [
    {"role": "system", "content": "You are a helpful assistant chatbot."},
    {"role": "user", "content": "What is a Large Language Model?"}
]
tokenizer = AutoTokenizer.from_pretrained(new_model)
prompt = tokenizer.apply_chat_template(message, add_generation_prompt=True, tokenize=False)

# Create pipeline
pipeline = transformers.pipeline(
"text-generation",
model=new_model,
tokenizer=tokenizer
)
# Generate text
sequences = pipeline(
prompt,
do_sample=True,
temperature=0.7,
top_p=0.9,
num_return_sequences=1,
max_length=200,
)
print(sequences[0]['generated_text'])
Here is the model's response:
A large language model is a type of artificial intelligence (ai) system that has been trained on vast amounts of text data. These models are designed to understand and generate human language, allowing them to perform various natural language processing tasks, such as text generation, language translation, and question answering. Large language models typically use deep learning techniques, like recurrent neural networks (RNNs) or transformers, to learn patterns and relationships in the data, enabling them to generate coherent and contextually relevant responses. The size of these models, in terms of the number of parameters and the volume of data they are trained on, plays a significant role in their ability to comprehend and produce complex language structures.
Everything seems to be working, so we can now evaluate the merged model. As this is a general-purpose model, we can leverage the lm-evaluation-harness to evaluate it. Since the process is quite resource-intensive, we can also directly submit it for evaluation on the Open LLM Leaderboard. It took a few days, but here are the results compared to other OpenHermes models:
Compared to the original model, the NeuralHermes-2.5-Mistral-7B model improved the average score by 6.7 points (particularly on GSM8K). This is an unexpectedly large improvement, which showcases the power of Direct Preference Optimization.
In this article, we fine-tuned an already supervised fine-tuned model using DPO and created our own NeuralHermes-2.5 model. By leveraging a high-quality preference dataset, we built a sample-efficient fine-tuning pipeline that produced a significant improvement on the Open LLM Leaderboard. If you want to give it a try, you can find quantized variants of this model or use this Hugging Face Space.
Note that our fine-tuning pipeline can still be improved in different ways. For example, the preference dataset is still quite raw and could be improved with more filtering and by using different models. In addition, numerous hyperparameters can still be tweaked to achieve better results. In particular, the learning rate can still be lowered to train the model on more steps and inject more preference data.