With the growing number of embedding models available, choosing the right one for your machine learning applications can be challenging. Fortunately, the MTEB leaderboard provides a wide range of evaluation metrics for various natural language processing tasks.
When you visit the site, you will notice that the top five embedding models are Generative Pre-trained Transformers (GPTs). This might lead you to think that GPT models are the best for embeddings. But is this really true? Let's run an experiment to find out.
Embeddings are tensor representations of text: the text is converted into token IDs, and those token IDs are projected into a tensor space.
By feeding text into a neural network model and performing a forward pass, you can obtain embedding vectors. However, the actual process is a little more involved. Let's break it down step by step:
- Convert the text to token IDs
- Pass the token IDs to a neural network
- Return the outputs of the neural network
In the first step, I will use a tokenizer to convert the text into token IDs. model_inputs is the tensor representation of the text content, "some questions."
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

messages = [
    {
        "role": "user",
        "content": "some questions.",
    },
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
model_inputs = encodeds.to("cuda")
The second step is simple: forward the model_inputs through the neural network. The logits of the generated tokens can be accessed through .logits. The torch.no_grad() context manager disables gradient tracking, since the model is running in inference mode and its weights are not being updated.
import torch

with torch.no_grad():
    logits = model(model_inputs).logits
The third step is a bit more involved. GPT models are decoder-only, and their token generation is autoregressive. In simple terms, the last token of a complete sentence has attended to all the previous tokens in the sentence. Therefore, the output of the last token contains the affinity scores (attention) for all the previous tokens.
Bingo! You are most interested in the last token because of the attention mechanism of transformers.
The output dimension of GPT models implemented in Hugging Face is (batch size, input token count, vocabulary size). To get the last token's output for every item in the batch, I can slice the tensor.
import torch

with torch.no_grad():
    # Keep only the output of the last input token: (batch size, vocabulary size)
    embedding = model(model_inputs).logits[:, -1, :]
To measure the quality of these GPT embeddings, you can use cosine similarity. The greater the cosine similarity, the closer the semantic meaning of the sentences.
import torch

def compute_cosine_similarity(vec1, vec2):
    cos = torch.nn.CosineSimilarity(dim=1, eps=1e-6)
    return cos(vec1, vec2)
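As a quick sanity check, the helper can be called directly on two tensors. The random vectors below are purely illustrative and not part of the experiment; their width simply mirrors the shape of the last-token logits (Mistral's vocabulary size is 32,000).

# Illustrative only: two random vectors shaped like the last-token logits (batch size, vocabulary size)
vec1 = torch.randn(1, 32000)
vec2 = torch.randn(1, 32000)
print(compute_cosine_similarity(vec1, vec2))  # one similarity value per batch element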
Let's create some helper functions that loop through a list of question and answer pairs and print the results. Mistral-7B-Instruct-v0.1, one of the great open-source models, is used for this experiment.
import torch
from termcolor import colored
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1"
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

def generate_last_token_embeddings(question, max_new_tokens=30):
    messages = [
        {
            "role": "user",
            "content": question,
        },
    ]
    encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
    model_inputs = encodeds.to("cuda")
    with torch.no_grad():
        # Use the logits of the last input token as the embedding
        return model(model_inputs).logits[:, -1, :]

def get_similarities(questions, answers):
    for question in questions:
        for answer in answers:
            q_embedding, a_embedding = (
                generate_last_token_embeddings(question),
                generate_last_token_embeddings(answer),
            )
            similarity = compute_cosine_similarity(q_embedding, a_embedding)
            print(colored(f"question: {question} and ans: {answer}", "green"))
            print(colored(f"result: {similarity}", "blue"))

questions = ["Where is the headquarters of OpenAI?", "What is GPU?"]
answers = [
    "OpenAI is based in San Francisco.",
    "A graphics processing unit (GPU) is an electronic circuit that can perform mathematical calculations quickly.",
]
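With the helper functions and the question and answer lists in place, the comparison that produces the numbers below is simply a call to the helper defined above:

get_similarities(questions, answers)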
For the first question and answer pair:
- Question: “Where is the headquarters of OpenAI?”
- Answer: “OpenAI is based in San Francisco.”
- Cosine similarity: 0.96
For the second question and answer pair:
- Question: “What is GPU?”
- Answer: “A graphics processing unit (GPU) is an electronic circuit that can perform mathematical calculations quickly.”
- Cosine similarity: 0.94
For an irrelevant pair:
- Question: “Where is the headquarters of OpenAI?”
- Answer: “A graphics processing unit (GPU) is an electronic circuit that can perform mathematical calculations quickly.”
- Cosine similarity: 0.90
For the other irrelevant pair:
- Question: “What is GPU?”
- Answer: “OpenAI is based in San Francisco.”
- Cosine similarity: 0.93
These results suggest that using GPT models, in this case Mistral-7B-Instruct-v0.1, as embedding models without any adaptation may not produce great results in terms of distinguishing between relevant and irrelevant pairs. But why are GPT-based models still among the top five embedding models?
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-mistral-7b-instruct")
model = AutoModelForCausalLM.from_pretrained(
"intfloat/e5-mistral-7b-instruct"
)
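Because generate_last_token_embeddings and get_similarities reference the module-level model and tokenizer, swapping in the new checkpoint above is all that is needed; the evaluation is then re-run with the same call (added here for completeness):

# Re-run the same question/answer comparison against e5-mistral-7b-instruct
get_similarities(questions, answers)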
By repeating the same evaluation procedure with a different model, e5-mistral-7b-instruct, one of the best open-source models on the MTEB leaderboard and fine-tuned from Mistral-7B, I found that the cosine similarity for the relevant question and answer pairs is 0.88 and 0.84 for the OpenAI and GPU questions, respectively. For the irrelevant question-answer pairs, the similarity drops to 0.56 and 0.67. These findings suggest that e5-mistral-7b-instruct is a much better model for embeddings. What makes such an improvement possible?
Digging deeper into the paper behind e5-mistral-7b-instruct, the key is the use of contrastive loss to further fine-tune the Mistral model.
Unlike GPTs, which are trained or further fine-tuned using a cross-entropy loss between predicted tokens and labeled tokens, contrastive loss aims to maximize the distance between negative pairs and minimize the distance between positive pairs.
This blog post covers the concept in greater detail. The sim function computes the cosine similarity between two vectors. In the contrastive loss, the denominator contains the similarity scores of both the positive example and the negative examples. The intuition behind contrastive loss is that we want the ratio for the positive pair to be as close to 1 as possible, since the loss is its negative log and -log(1) = 0 is the optimal loss.
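To make this concrete, here is a minimal sketch of an InfoNCE-style contrastive loss in PyTorch. It is not the exact formulation from the e5-mistral-7b-instruct paper; the temperature value, the tensor shapes, and the use of in-batch negatives are assumptions made for illustration.

import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    # query_emb, doc_emb: (batch size, embedding dim); row i of doc_emb is the
    # positive document for row i of query_emb, all other rows act as negatives.
    query_emb = F.normalize(query_emb, dim=1)
    doc_emb = F.normalize(doc_emb, dim=1)
    # Cosine similarities between every query and every document, scaled by temperature
    logits = query_emb @ doc_emb.T / temperature
    # Cross entropy over each similarity row: the loss is minimized when the positive
    # pair dominates the denominator, i.e. the softmax ratio approaches 1 and -log(1) = 0.
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)

Training with an objective like this pulls matching question and answer embeddings together and pushes mismatched ones apart, which is exactly the discriminative behaviour the raw GPT logits above were missing.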
In this post, I have highlighted a common mistake: using GPTs as embedding models without any fine-tuning. My evaluation suggests that by fine-tuning GPTs with contrastive loss, the embeddings can become more meaningful and discriminative. By understanding the strengths and limitations of GPT models and leveraging a custom loss, such as contrastive loss, you can make more informed decisions when selecting and using embedding models for your machine learning projects. I hope this post helps you choose GPT models wisely for your applications, and I look forward to hearing your feedback. 🙂