Image by author
Mistral AI, one of the world's leading AI research companies, has recently released the base model for Mistral 7B v0.2.
This open source language model was unveiled during the company's hackathon event on March 23, 2024.
Mistral 7B models have 7.3 billion parameters, making them extremely powerful. They outperform Llama 2 13B and Llama 1 34B in almost all benchmarks. The latest V0.2 model introduces a 32k context window, among other advancements, improving its ability to process and generate text.
Additionally, the newly announced version is the base model behind the instruction-tuned variant, "Mistral-7B-Instruct-v0.2," which was released last year.
In this tutorial, I'll show you how to access and fine-tune this language model on Hugging Face.
We will be fine-tuning the base model Mistral 7B v0.2 using Hugging Face's AutoTrain functionality.
Hugging Face is known for democratizing access to machine learning models, enabling everyday users to develop advanced AI solutions.
AutoTrain, a feature of Hugging Face, automates the model training process, making it accessible and efficient.
It helps users select the best training parameters and techniques when tuning models, which is an otherwise daunting and time-consuming task.
Here are the steps to fine-tune your Mistral-7B model:
1. Set up the environment
You must first create an account with Hugging Face and then create a model repository.
To achieve this, simply follow the steps provided in this link and come back to this tutorial.
We will train the model in Python. When it comes to selecting a notebook environment for training, you can use either Kaggle Notebooks or Google Colab, both of which provide free access to GPUs.
If the training process takes too long, you may want to switch to a cloud platform like AWS SageMaker or Azure ML.
Finally, perform the following pip installations before you start coding with this tutorial:
!pip install -U autotrain-advanced
!pip install datasets transformers
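Optionally, you can confirm that your notebook actually has GPU access before you start. This quick check is an addition to the original steps and uses PyTorch, which comes preinstalled on Kaggle and Colab:
import torch

# Check whether a GPU is available in this notebook environment
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))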
2. Prepare your dataset
In this tutorial, we will use the Alpaca dataset from Hugging Face.
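To get a feel for what the data looks like before fine-tuning, you can optionally preview a few rows with the datasets library (this quick check is an addition to the original steps):
from datasets import load_dataset

# Load a small slice of the Alpaca dataset and preview its fields
preview = load_dataset("tatsu-lab/alpaca", split="train[:5]")
for example in preview:
    print(example["instruction"], "->", example["output"][:80])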
We will fine-tune the model on pairs of instructions and outputs and assess its ability to respond to a given instruction during evaluation.
To access and prepare this data set, run the following lines of code:
import os
import pandas as pd
from datasets import load_dataset

# Load and preprocess the dataset, keeping only rows without an additional input field
def preprocess_dataset(dataset_name, split_ratio='train[:10%]', input_col="input", output_col="output"):
    dataset = load_dataset(dataset_name, split=split_ratio)
    df = pd.DataFrame(dataset)
    chat_df = df[df[input_col] == ''].reset_index(drop=True)
    return chat_df

# Format each row according to AutoTrain requirements,
# wrapping the instruction and output in Mistral's [INST] tags
def format_interaction(row):
    formatted_text = f"<s>[INST] {row['instruction']} [/INST] {row['output']} </s>"
    return formatted_text

# Process and save the dataset
if __name__ == "__main__":
    dataset_name = "tatsu-lab/alpaca"
    processed_data = preprocess_dataset(dataset_name)
    processed_data['formatted_text'] = processed_data.apply(format_interaction, axis=1)

    save_path = 'formatted_data/training_dataset'
    os.makedirs(save_path, exist_ok=True)
    file_path = os.path.join(save_path, 'formatted_train.csv')
    processed_data[['formatted_text']].to_csv(file_path, index=False)
    print("Dataset formatted and saved.")
The first function loads the Alpaca dataset using the "datasets" library and filters it so that we only keep examples without an additional input field. The second function structures your data in a format that AutoTrain can understand.
After running the above code, the dataset will be loaded, formatted and saved to the specified path. When you open your formatted data set, you should see a single column named “formatted_text.”
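If you want to double-check the result, a quick optional way to verify the saved file is to read it back with pandas:
import pandas as pd

# Read the saved CSV back and confirm it has the expected single column
check_df = pd.read_csv('formatted_data/training_dataset/formatted_train.csv')
print(check_df.columns.tolist())  # should print ['formatted_text']
print(check_df.head(3))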
3. Set up your training environment
Now that you have successfully prepared the data set, let's proceed to set up your model training environment.
To do this, you must define the following parameters:
project_name = 'mistralai'
model_name = 'alpindale/Mistral-7B-v0.2-hf'
push_to_hub = True
hf_token = 'your_token_here'
repo_id = 'your_repo_here'
Here's a breakdown of the specifications above:
- You can specify any project name. This is where all your project and training files will be stored.
- The model_name parameter is the model you want to fine-tune. In this case, I have specified a path to the Mistral-7B v0.2 base model on Hugging Face.
- The hf_token variable must be set to your Hugging Face token, which can be obtained by navigating to this link (see the optional token check after this list).
- The repo_id should be set to the Hugging Face model repository that you created in the first step of this tutorial. For example, my repository ID is NatasshaS/Model2.
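As an optional sanity check (an addition to the original steps), you can verify that your token works by logging in with the huggingface_hub library, which is installed as a dependency of transformers:
from huggingface_hub import login

# Log in with your Hugging Face token; this raises an error if the token is invalid
login(token=hf_token)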
4. Configure the model parameters
Before tuning our model, we must define training parameters, which control aspects of the model's behavior, such as training duration and regularization.
These parameters influence key aspects such as how long the model trains, how it learns from the data, and how it avoids overfitting.
You can configure the following parameters for your model:
use_fp16 = True
use_peft = True
use_int4 = True
learning_rate = 1e-4
num_epochs = 3
batch_size = 4
block_size = 512
warmup_ratio = 0.05
weight_decay = 0.005
lora_r = 8
lora_alpha = 16
lora_dropout = 0.01
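To make two of these values more concrete (an illustrative note, not part of the original tutorial): warmup_ratio = 0.05 means roughly 5% of the total optimizer steps are spent ramping the learning rate up, and the scaling factor applied to the LoRA adapter updates is lora_alpha / lora_r:
# Illustrative arithmetic only; total_steps is a hypothetical example value
total_steps = 1000
warmup_steps = int(total_steps * warmup_ratio)  # 0.05 * 1000 = 50 warmup steps
lora_scaling = lora_alpha / lora_r              # 16 / 8 = 2.0 scaling applied to LoRA updates
print(warmup_steps, lora_scaling)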
5. Set environment variables
Let's now prepare our training environment by setting some environment variables.
This step ensures that the AutoTrain function uses the desired settings to tune the model, such as our project name and training preferences:
os.environ("PROJECT_NAME") = project_name
os.environ("MODEL_NAME") = model_name
os.environ("LEARNING_RATE") = str(learning_rate)
os.environ("NUM_EPOCHS") = str(num_epochs)
os.environ("BATCH_SIZE") = str(batch_size)
os.environ("BLOCK_SIZE") = str(block_size)
os.environ("WARMUP_RATIO") = str(warmup_ratio)
os.environ("WEIGHT_DECAY") = str(weight_decay)
os.environ("USE_FP16") = str(use_fp16)
os.environ("LORA_R") = str(lora_r)
os.environ("LORA_ALPHA") = str(lora_alpha)
os.environ("LORA_DROPOUT") = str(lora_dropout)
6. Start model training
Finally, let's start training the model using the autotrain command. This step involves specifying your model, dataset, and training settings, as shown below:
!autotrain llm \
--train \
--model "${MODEL_NAME}" \
--project-name "${PROJECT_NAME}" \
--data-path "formatted_data/training_dataset/" \
--text-column "formatted_text" \
--lr "${LEARNING_RATE}" \
--batch-size "${BATCH_SIZE}" \
--epochs "${NUM_EPOCHS}" \
--block-size "${BLOCK_SIZE}" \
--warmup-ratio "${WARMUP_RATIO}" \
--lora-r "${LORA_R}" \
--lora-alpha "${LORA_ALPHA}" \
--lora-dropout "${LORA_DROPOUT}" \
--weight-decay "${WEIGHT_DECAY}" \
$( (( "$USE_FP16" == "True" )) && echo "--mixed-precision fp16" ) \
$( (( "$USE_PEFT" == "True" )) && echo "--use-peft" ) \
$( (( "$USE_INT4" == "True" )) && echo "--quantization int4" ) \
$( (( "$PUSH_TO_HUB" == "True" )) && echo "--push-to-hub --token ${HF_TOKEN} --repo-id ${REPO_ID}" )
Make sure to change the data path to where your training data set is.
7. Model evaluation
Once your model has finished training, you should see a folder appear in your directory with the same title as your project name.
In my case, this folder is titled "mistralai."
Inside this folder, you can find files covering the weights, hyperparameters, and architectural details of your model.
Let's now check whether this fine-tuned model can accurately answer a question from our dataset. To do this, we must first run the following lines of code to generate 5 sample inputs and outputs from our dataset:
from datasets import load_dataset

# Reload the same slice of the Alpaca dataset so the variable is defined in this cell
dataset = load_dataset("tatsu-lab/alpaca", split="train[:10%]")

# Print out 5 sample inputs and outputs from our dataset
for i, example in enumerate(dataset):
    if i >= 5:
        break
    print(f"Instruction: {example['instruction']}")
    print(f"Output: {example['output']}\n---")
You should see a response similar to this, showing five sample data points:
Instruction: Give three tips for staying healthy.
Output: 1.Eat a balanced diet and make sure to include plenty of fruits and vegetables.
2. Exercise regularly to keep your body active and strong.
3. Get enough sleep and maintain a consistent sleep schedule.
---
Instruction: What are the three primary colors?
Output: The three primary colors are red, blue, and yellow.
---
Instruction: Describe the structure of an atom.
Output: An atom is made up of a nucleus, which contains protons and neutrons, surrounded by electrons that travel in orbits around the nucleus. The protons and neutrons have a positive charge, while the electrons have a negative charge, resulting in an overall neutral atom. The number of each particle determines the atomic number and the type of atom.
---
Instruction: How can we reduce air pollution?
Output: There are a number of ways to reduce air pollution, such as shifting to renewable energy sources, encouraging the use of public transportation, prohibiting the burning of fossil fuels, implementing policies to reduce emissions from industrial sources, and implementing vehicle emissions standards. Additionally, individuals can do their part to reduce air pollution by reducing car use, avoiding burning materials such as wood, and changing to energy efficient appliances.
---
Instruction: Describe a time when you had to make a difficult decision.
Output: I had to make a difficult decision when I was working as a project manager at a construction company. I was in charge of a project that needed to be completed by a certain date in order to meet the client's expectations. However, due to unexpected delays, we were not able to meet the deadline and so I had to make a difficult decision. I decided to extend the deadline, but I had to stretch the team's resources even further and increase the budget. Although it was a risky decision, I ultimately decided to go ahead with it to ensure that the project was completed on time and that the client's expectations were met. The project was eventually successfully completed and this was seen as a testament to my leadership and decision-making abilities.
Let's pass one of the above instructions to the model and check whether it generates an accurate result. Here is a function that provides an instruction to the model and gets a response from it:
# Function to provide an instruction to the model and return its response
def ask(model, tokenizer, question, max_length=128):
    inputs = tokenizer.encode(question, return_tensors="pt")
    outputs = model.generate(inputs, max_length=max_length, num_return_sequences=1)
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer
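Note that this function expects model and tokenizer to already be loaded. The tutorial doesn't show that step explicitly, so here is a minimal sketch, assuming AutoTrain saved the fine-tuned LoRA adapter and tokenizer into the project folder ("mistralai") and that the transformers and peft libraries are available:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Assumed paths: the base model on Hugging Face and the local AutoTrain output folder
base_model_name = "alpindale/Mistral-7B-v0.2-hf"
adapter_path = "mistralai"

tokenizer = AutoTokenizer.from_pretrained(adapter_path)
base_model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.float16, device_map="auto")

# Attach the fine-tuned LoRA weights to the base model
model = PeftModel.from_pretrained(base_model, adapter_path)
If the model ends up on a GPU, move the encoded inputs to the same device (for example, inputs = inputs.to(model.device)) inside ask() before calling generate().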
Finally, enter a question in this function as shown below:
question = "Describe a time when you had to make a difficult decision."
answer = ask(model, tokenizer, question)
print(answer)
Your model should generate a response in the same style as the corresponding output in the training dataset, as shown below:
Describe a time when you had to make a difficult decision.
What did you do? How did it turn out?
[/INST] I remember a time when I had to make a difficult decision about
my career. I had been working in the same job for several years and had
grown tired of it. I knew that I needed to make a change, but I was unsure of what to do. I weighed my options carefully and eventually decided to take a leap of faith and start my own business. It was a risky move, but it paid off in the end. I am now the owner of a successful business and
Please note that the answer may appear incomplete or truncated due to the number of tokens we have specified. Feel free to adjust the “max_length” value to allow for a larger response.
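For example, you could ask for a longer completion like this:
answer = ask(model, tokenizer, question, max_length=256)
print(answer)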
If you've made it this far, congratulations!
You have successfully fine-tuned a state-of-the-art language model, leveraging the power of Mistral 7B v0.2 along with the capabilities of Hugging Face.
But the journey doesn't end here.
As a next step, I recommend experimenting with different datasets or modifying certain training parameters to optimize model performance. Fine-tuning models at a larger scale will improve their usefulness, so try experimenting with larger datasets or varied formats, such as PDFs and text files.
This experience is invaluable when working with real-world data in organizations, which is often messy and unstructured.
Natassha Selvaraj is a self-taught data scientist with a passion for writing. Natassha writes on all things data science, a true master of all things data. You can connect with her on LinkedIn or check out her YouTube channel.