Current text embedding models, such as BERT, are limited to processing 512 tokens at a time, which makes them far less effective on long documents. This limitation often results in lost context and a shallower understanding of the text. Jina Embeddings v2 addresses this issue by supporting sequences of up to 8,192 tokens, preserving context and improving the accuracy and relevance of the information extracted from long documents. This marks a substantial improvement in handling complex text data.
Learning objectives
- Understand the limitations of traditional text embedding models like BERT in handling long documents.
- Learn how Jina Embeddings v2 overcomes these limitations with its 8,192-token support and advanced architecture.
- Explore the key innovations behind Jina Embeddings v2, including ALiBi, GLU, and its three-stage training process.
- Discover real-world applications of Jina Embeddings v2 in fields such as legal research, content management, and generative AI.
- Get practical knowledge on how to integrate Jina Embeddings v2 into your projects using the Hugging Face libraries.
This article was published as part of the Data Science Blogathon.
The challenges of embedding long documents
Long documents pose unique challenges in NLP. Traditional models process text in fragments, truncating context or producing fragmented embeddings that misrepresent the original document. This results in:
- Increased computational overhead
- Increased memory usage
- Decreased performance on tasks that require a holistic understanding of the text.
Jina Embeddings v2 directly addresses these issues by expanding the token limit to 8,192, eliminating the need for excessive segmentation and preserving the semantic integrity of the document.
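To get a feel for the difference, the short sketch below (a minimal illustration, not part of the original guide; the repeated sentence simply stands in for a real multi-page document, and it assumes the model's tokenizer can be loaded directly from Hugging Face) counts the tokens in a long text and compares how many 512-token segments it would need against a single 8,192-token pass.
from math import ceil
from transformers import AutoTokenizer
# Tokenizer bundled with the Jina model (any BERT-style tokenizer shows the same effect)
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-en')
# Stand-in for a long report or legal brief loaded from disk
long_text = " ".join(["Long documents need holistic context to be embedded well."] * 600)
n_tokens = len(tokenizer(long_text, add_special_tokens=False)["input_ids"])
print(f"Document length: {n_tokens} tokens")
print(f"512-token segments needed: {ceil(n_tokens / 512)}")
print(f"8,192-token segments needed: {ceil(n_tokens / 8192)}")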
Architecture paradigm and innovative training
Jina Embeddings v2 builds on a BERT-style backbone and extends it with several key innovations. Here is how it works:
- Attention with Linear Biases (ALiBi): ALiBi replaces traditional positional embeddings with a linear bias applied to attention scores. This allows the model to efficiently extrapolate to sequences much longer than those observed during training. Unlike previous implementations designed for unidirectional generative tasks, Jina Embeddings v2 employs a bidirectional variant, ensuring support for encoding-based tasks.
- Gated Linear Units (GLU): The feed-forward layers use GLU, which is known to improve transformer efficiency. The model uses variants such as GEGLU and ReGLU, chosen according to model size, to optimize performance (see the sketch after this list).
- Optimized training process: Jina Embeddings v2 follows a three-stage training paradigm:
- Pre-training: The model is pre-trained on the Colossal Clean Crawled Corpus (C4), using masked language modeling (MLM) to build a solid foundation.
- Fine-tuning with text pairs: Aligns embeddings for semantically similar text pairs.
- Hard negative fine-tuning: Incorporates challenging distractor examples to improve the model's ranking and retrieval capabilities.
- Memory-efficient training: Techniques such as mixed-precision training and activation checkpointing ensure scalability to larger batch sizes, which is critical for contrastive learning tasks.
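For readers curious about the GLU variants mentioned above, here is a minimal PyTorch sketch of a GEGLU feed-forward block. It illustrates the general technique only; the layer sizes are invented for the example, and this is not the exact implementation used inside Jina Embeddings v2.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GEGLUFeedForward(nn.Module):
    """Feed-forward block with a GEGLU gate: (x W) * GELU(x V), followed by a projection back."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.proj_in = nn.Linear(d_model, 2 * d_hidden)  # produces the value and the gate in one matmul
        self.proj_out = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = self.proj_in(x).chunk(2, dim=-1)
        return self.proj_out(value * F.gelu(gate))

# Toy usage: a batch of 2 sequences, 16 tokens each, hidden size 768 (illustrative numbers)
x = torch.randn(2, 16, 768)
ffn = GEGLUFeedForward(d_model=768, d_hidden=2048)
print(ffn(x).shape)  # torch.Size([2, 16, 768])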
With ALiBi attention, a linear bias is added to each attention score before the softmax operation. Each attention head uses a different constant scalar, m, which diversifies its computation. The model adopts the encoder variant, in which all tokens attend to each other, in contrast to the causal variant originally designed for language modeling, where a causal mask restricts each token to attend only to the tokens that precede it in the sequence.
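To make this concrete, the following sketch (illustrative only; the head slopes and tensor shapes are simplified toy values) builds the symmetric, distance-based ALiBi bias and adds it to raw attention scores before the softmax, in the spirit of the bidirectional encoder variant described above.
import torch

def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    # Bidirectional ALiBi: bias = -m * |i - j| for each head, with a different slope m per head
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).abs()   # (seq_len, seq_len)
    return -slopes[:, None, None] * distance                     # (num_heads, seq_len, seq_len)

# Toy example: add the bias to raw attention scores before the softmax
seq_len, num_heads, head_dim = 6, 4, 16
q = torch.randn(num_heads, seq_len, head_dim)
k = torch.randn(num_heads, seq_len, head_dim)
scores = q @ k.transpose(-1, -2) / head_dim ** 0.5
attn = torch.softmax(scores + alibi_bias(seq_len, num_heads), dim=-1)
print(attn.shape)  # torch.Size([4, 6, 6])
Because the penalty depends only on the distance between tokens, the same bias formula applies to sequences longer than any seen during training, which is what enables extrapolation.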
Performance Benchmarks
Jina Embeddings v2 delivers state-of-the-art performance on multiple benchmarks, including the Massive Text Embedding Benchmark (MTEB) and newly designed long document datasets. Highlights include:
- Classification: It achieves top-tier accuracy on tasks such as Amazon Polarity classification and Toxic Conversations classification, demonstrating strong semantic understanding.
- Clustering: It outperforms competing models at grouping related texts, as validated on tasks such as PatentClustering and WikiCitiesClustering.
- Retrieval: It excels at retrieval tasks like NarrativeQA, where full-document context is essential.
- Handling long documents: It maintains MLM accuracy even on sequences of 8,192 tokens, demonstrating that it generalizes effectively to long inputs.
The graph compares the performance of embedding models on retrieval and clustering tasks at different sequence lengths. Text-embedding-ada-002 performs strongly, especially at its 8,191-token limit, showing significant gains on long-context tasks. Other models, such as e5-base-v2, show consistent but less dramatic improvements with longer sequences, possibly because prefixes such as "query:" were not used in their configuration. Overall, handling longer sequences is essential to maximizing performance on these tasks.
Applications in real-world scenarios
- Legal and academic research: Jina Embeddings v2's ability to encode long documents makes it ideal for searching and analyzing legal briefs, academic articles, and patent applications. It ensures context-rich and semantically accurate embeddings, crucial for detailed comparisons and retrieval tasks.
- Content Management Systems: Companies that manage large repositories of articles, manuals, or multimedia captions can take advantage of Jina Embeddings v2 for efficient tagging, grouping, and retrieval.
- Generative AI: With its extended context handling, Jina Embeddings v2 can significantly improve generative AI applications (a retrieval sketch follows this list). For example:
- Improve the quality of AI-generated summaries by providing richer, more context-aware embeddings.
- Enable more relevant and accurate completions for prompt-based models.
- E-commerce: Advanced product search and recommendation systems benefit from embeddings that capture nuanced details in long product descriptions and user reviews.
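As a quick illustration of the retrieval pattern behind these applications (a minimal sketch; the documents and query are invented for the example), long passages can be embedded once and then matched against a user query by cosine similarity to select context for a prompt or a product search result.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)

# Invented stand-ins for long passages (in practice these could be multi-page documents)
documents = [
    "Clause 14 of the agreement covers early termination and the associated penalties...",
    "The quarterly report details revenue growth across the EMEA region...",
    "This manual section describes how to reset the device to factory settings...",
]

doc_embeddings = model.encode(documents)
query_embedding = model.encode("How do I restore the device to its default configuration?")

# Rank documents by cosine similarity and keep the best match as context for a prompt
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = int(scores.argmax())
print(f"Best match (score {scores[best].item():.2f}): {documents[best]}")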
Comparison with existing models
Jina Embeddings v2 stands out not only for its ability to handle extended sequences but also for its competitive performance against proprietary models such as OpenAI's text-embedding-ada-002. While many open-source models cap their sequence length at 512 tokens, the 16x longer context of Jina Embeddings v2 enables entirely new use cases in NLP.
Furthermore, its open-source availability makes it accessible to a wide range of organizations and projects. The model can be fine-tuned for specific applications using the resources in its Hugging Face repository.
How to use Jina Embeddings v2 with Hugging Face?
Step 1: Installation
!pip install transformers
!pip install -U sentence-transformers
Step 2: Using Jina Embeddings with Transformers
You can use Jina embeddings directly through the transformers library:
import torch
from transformers import AutoModel
from numpy.linalg import norm
# Define cosine similarity function
cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
# Load the Jina embedding model
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
# Encode sentences
embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
# Calculate cosine similarity between the two sentence embeddings
print(cos_sim(embeddings[0], embeddings[1]))
Output: the script prints a single cosine similarity score; since the two sentences are near-paraphrases, the value is close to 1.
Handling long sequences
To process longer sequences, specify the max_length parameter:
embeddings = model.encode(['Very long ... document'], max_length=2048)
Step 3: Using Jina Embeddings with Sentence-Transformers
Alternatively, use Jina embeddings with the sentence-transformers library:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
# Load the Jina embedding model
model = SentenceTransformer('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
# Encode sentences
embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
# Calculate cosine similarity between the two sentence embeddings
print(cos_sim(embeddings[0], embeddings[1]))
Setting the maximum sequence length
Control the maximum input sequence length as needed:
model.max_seq_length = 1024 # Set maximum sequence length to 1024 tokens
Important notes
- Make sure you are logged in to Hugging Face to access private models. Provide an access token if necessary.
- This guide uses the English model; for other languages, use the appropriate model identifier (for example, the Chinese or German variants).
Conclusion
Jina Embeddings v2 marks a major advancement in NLP by addressing the challenges of embedding long documents. By supporting sequences of up to 8,192 tokens and delivering robust performance, it enables a variety of applications, including academic research, enterprise search, and generative AI. As NLP tasks increasingly involve processing long, complex texts, innovations like Jina Embeddings v2 will become essential. Its capabilities not only improve current workflows but also open up new possibilities for working with long-form textual data in the future.
For more details, or to integrate Jina Embeddings v2 into your projects, visit its Hugging Face page.
Key takeaways
- Jina Embeddings v2 supports up to 8,192 tokens, addressing a key limitation in long-document NLP tasks.
- ALiBi (Attention with Linear Biases) replaces traditional positional embeddings, allowing the model to process longer sequences effectively.
- Gated Linear Units (GLU) improve transformer efficiency, with variants such as GEGLU and ReGLU selected according to model size.
- The three-stage training process (pre-training, text-pair fine-tuning, and hard negative fine-tuning) ensures that the model produces robust and accurate embeddings.
- Jina Embeddings v2 performs exceptionally well on tasks such as classification, clustering, and retrieval, especially for long documents.
Frequently asked questions
Q1. How many tokens can Jina Embeddings v2 process?
A. Jina Embeddings v2 supports sequences of up to 8,192 tokens, exceeding the 512-token limit of traditional models such as BERT. This allows it to handle long documents without segmenting them, preserving global context and improving semantic representation.
Q2. What makes Jina Embeddings v2 efficient at handling long texts?
A. The model incorporates cutting-edge innovations such as Attention with Linear Biases (ALiBi), Gated Linear Units (GLU), and a three-stage training paradigm. These optimizations enable efficient handling of long texts while maintaining high performance.
Q3. How can I use Jina Embeddings v2 in my own project?
A. You can integrate it using the transformers or sentence-transformers libraries. Both provide easy-to-use APIs for encoding text, handling long sequences, and performing similarity calculations. Detailed setup steps and example code are provided in the guide above.
Q4. What should I check before loading the model from Hugging Face?
A. Ensure you are logged in to Hugging Face to access private models, and provide an access token if required. Additionally, confirm that the model matches your language requirements by selecting the appropriate identifier (for example, the Chinese or German variants).