Current text embedding models, such as BERT, are limited to processing 512 tokens at a time, which makes them far less effective on long documents. This limitation often results in lost context and a shallower understanding of the text. Jina Embeddings v2 addresses this issue by supporting sequences of up to 8,192 tokens, preserving context and improving the accuracy and relevance of the information extracted from long documents. This marks a substantial improvement in handling complex text data.
Learning objectives
- Understand the limitations of traditional text embedding models like BERT in handling long documents.
- Learn how Jina Embeddings v2 overcomes these limitations with its 8,192-token support and advanced architecture.
- Explore the key innovations behind Jina Embeddings v2, including ALiBi, GLU, and its three-stage training process.
- Discover real-world applications of Jina Embeddings v2 in fields such as legal research, content management, and generative AI.
- Get practical knowledge on how to integrate Jina Embeddings v2 into your projects using the Hugging Face libraries.
This article was published as part of the Data Science Blogathon.
The challenges of embedding long documents
Long documents pose unique challenges in NLP. Traditional models process text in fragments, truncating context or producing fragmented embeddings that misrepresent the original document. This results in:
- Increased computational overhead
- Increased memory usage
- Decreased performance on tasks that require a holistic understanding of the text.
Jina Embeddings v2 directly addresses these issues by expanding the token limit to 8,192, eliminating the need for excessive segmentation and preserving the semantic integrity of the document.
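To get a feel for the difference, the short sketch below (a minimal illustration, not part of the original guide; the repeated sentence simply stands in for a real multi-page document, and it assumes the model's tokenizer can be loaded directly from Hugging Face) counts the tokens in a long text and compares how many 512-token segments it would need against a single 8,192-token pass.
from math import ceil
from transformers import AutoTokenizer
# Tokenizer bundled with the Jina model (any BERT-style tokenizer shows the same effect)
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-en')
# Stand-in for a long report or legal brief loaded from disk
long_text = " ".join(["Long documents need holistic context to be embedded well."] * 600)
n_tokens = len(tokenizer(long_text, add_special_tokens=False)["input_ids"])
print(f"Document length: {n_tokens} tokens")
print(f"512-token segments needed: {ceil(n_tokens / 512)}")
print(f"8,192-token segments needed: {ceil(n_tokens / 8192)}")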
Architecture paradigm and innovative training
Jina Embeddings v2 builds on a BERT-style backbone and extends it with several key innovations. Here is how it works:
- Attention with Linear Biases (ALiBi): ALiBi replaces traditional positional embeddings with a linear bias applied to attention scores. This allows the model to efficiently extrapolate to sequences much longer than those observed during training. Unlike previous implementations designed for unidirectional generative tasks, Jina Embeddings v2 employs a bidirectional variant, ensuring support for encoding-based tasks.
- Gated Linear Units (GLU): The feed-forward layers use GLU, which is known to improve transformer efficiency. The model uses variants such as GEGLU and ReGLU, chosen according to model size, to optimize performance (see the sketch after this list).
- Optimized training process: Jina Embeddings v2 follows a three-stage training paradigm:
- Pre-training: The model is pre-trained on the Colossal Clean Crawled Corpus (C4), using masked language modeling (MLM) to build a solid foundation.
- Fine-tuning with text pairs: Aligns embeddings for semantically similar text pairs.
- Hard negative fine-tuning: Incorporates challenging distractor examples to improve the model's ranking and retrieval capabilities.
- Memory-efficient training: Techniques such as mixed-precision training and activation checkpointing ensure scalability to larger batch sizes, which is critical for contrastive learning tasks.
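For readers curious about the GLU variants mentioned above, here is a minimal PyTorch sketch of a GEGLU feed-forward block. It illustrates the general technique only; the layer sizes are invented for the example, and this is not the exact implementation used inside Jina Embeddings v2.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GEGLUFeedForward(nn.Module):
    """Feed-forward block with a GEGLU gate: (x W) * GELU(x V), followed by a projection back."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.proj_in = nn.Linear(d_model, 2 * d_hidden)  # produces the value and the gate in one matmul
        self.proj_out = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = self.proj_in(x).chunk(2, dim=-1)
        return self.proj_out(value * F.gelu(gate))

# Toy usage: a batch of 2 sequences, 16 tokens each, hidden size 768 (illustrative numbers)
x = torch.randn(2, 16, 768)
ffn = GEGLUFeedForward(d_model=768, d_hidden=2048)
print(ffn(x).shape)  # torch.Size([2, 16, 768])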
With ALiBi attention, a linear bias is added to each attention score before the softmax operation. Each attention head uses a different constant scalar, m, which diversifies its computation. The model adopts the encoder variant, in which all tokens attend to each other, in contrast to the causal variant originally designed for language modeling, where a causal mask restricts each token to attend only to the tokens that precede it in the sequence.
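To make this concrete, the following sketch (illustrative only; the head slopes and tensor shapes are simplified toy values) builds the symmetric, distance-based ALiBi bias and adds it to raw attention scores before the softmax, in the spirit of the bidirectional encoder variant described above.
import torch

def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    # Bidirectional ALiBi: bias = -m * |i - j| for each head, with a different slope m per head
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).abs()   # (seq_len, seq_len)
    return -slopes[:, None, None] * distance                     # (num_heads, seq_len, seq_len)

# Toy example: add the bias to raw attention scores before the softmax
seq_len, num_heads, head_dim = 6, 4, 16
q = torch.randn(num_heads, seq_len, head_dim)
k = torch.randn(num_heads, seq_len, head_dim)
scores = q @ k.transpose(-1, -2) / head_dim ** 0.5
attn = torch.softmax(scores + alibi_bias(seq_len, num_heads), dim=-1)
print(attn.shape)  # torch.Size([4, 6, 6])
Because the penalty depends only on the distance between tokens, the same bias formula applies to sequences longer than any seen during training, which is what enables extrapolation.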
Performance Benchmarks
Jina Embeddings v2 delivers state-of-the-art performance on multiple benchmarks, including the Massive Text Embedding Benchmark (MTEB) and newly designed long document datasets. Highlights include:
- Classification: It achieves top-tier accuracy on tasks such as Amazon Polarity classification and Toxic Conversations classification, demonstrating strong semantic understanding.
- Clustering: It outperforms competing models at grouping related texts, as validated on tasks such as PatentClustering and WikiCitiesClustering.
- Retrieval: It excels at retrieval tasks like NarrativeQA, where full-document context is essential.
- Handling long documents: It maintains MLM accuracy even on sequences of 8,192 tokens, demonstrating that it generalizes effectively to long inputs.
The graph compares the performance of embedding models on retrieval and clustering tasks at different sequence lengths. Text-embedding-ada-002 performs strongly, especially at its 8,191-token limit, showing significant gains on long-context tasks. Other models, such as e5-base-v2, show consistent but less dramatic improvements with longer sequences, possibly because prefixes such as "query:" were not used in their configuration. Overall, handling longer sequences is essential to maximizing performance on these tasks.
Applications in real-world scenarios
- Legal and academic research: Jina Embeddings v2's ability to encode long documents makes it ideal for searching and analyzing legal briefs, academic articles, and patent applications. It ensures context-rich and semantically accurate embeddings, crucial for detailed comparisons and retrieval tasks.
- Content Management Systems: Companies that manage large repositories of articles, manuals, or multimedia captions can take advantage of Jina Embeddings v2 for efficient tagging, grouping, and retrieval.
- Generative AI: With its extended context handling, Jina Embeddings v2 can significantly improve generative AI applications (a retrieval sketch follows this list). For example:
- Improve the quality of AI-generated summaries by providing richer, more context-aware embeddings.
- Enable more relevant and accurate completions for prompt-based models.
- E-commerce: Advanced product search and recommendation systems benefit from embeddings that capture nuanced details in long product descriptions and user reviews.
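As a quick illustration of the retrieval pattern behind these applications (a minimal sketch; the documents and query are invented for the example), long passages can be embedded once and then matched against a user query by cosine similarity to select context for a prompt or a product search result.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)

# Invented stand-ins for long passages (in practice these could be multi-page documents)
documents = [
    "Clause 14 of the agreement covers early termination and the associated penalties...",
    "The quarterly report details revenue growth across the EMEA region...",
    "This manual section describes how to reset the device to factory settings...",
]

doc_embeddings = model.encode(documents)
query_embedding = model.encode("How do I restore the device to its default configuration?")

# Rank documents by cosine similarity and keep the best match as context for a prompt
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = int(scores.argmax())
print(f"Best match (score {scores[best].item():.2f}): {documents[best]}")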
Comparison with existing models
Jina Embeddings v2 stands out not only for its ability to handle extended sequences but also for its competitive performance against proprietary models such as OpenAI's text-embedding-ada-002. While many open-source models cap their sequence length at 512 tokens, the 16x longer context of Jina Embeddings v2 enables entirely new use cases in NLP.
Furthermore, its open-source availability makes it accessible to a wide range of organizations and projects. The model can be fine-tuned for specific applications using the resources in its Hugging Face repository.
How to use Jina Embeddings v2 with Hugging Face?
Step 1: Installation
!pip install transformers
!pip install -U sentence-transformers
Step 2: Using Jina Embeddings with Transformers
You can use Jina embeddings directly through the transformers library:
import torch
from transformers import AutoModel
from numpy.linalg import norm
# Define cosine similarity function
cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
# Load the Jina embedding model
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
# Encode sentences
embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
# Calculate cosine similarity between the two sentence embeddings
print(cos_sim(embeddings[0], embeddings[1]))
Output: the script prints a single cosine similarity score; since the two sentences are near-paraphrases, the value is close to 1.
Handling long sequences
To process longer sequences, specify the max_length parameter:
embeddings = model.encode(['Very long ... document'], max_length=2048)
Step 3: Using Jina Embeddings with Sentence-Transformers
Alternatively, use Jina embeddings with the sentence-transformers library:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
# Load the Jina embedding model
model = SentenceTransformer('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
# Encode sentences
embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
# Calculate cosine similarity between the two sentence embeddings
print(cos_sim(embeddings[0], embeddings[1]))
Setting the maximum sequence length
Control the maximum input sequence length as needed:
model.max_seq_length = 1024 # Set maximum sequence length to 1024 tokens
Important notes
- Make sure you are logged in to Hugging Face to access private models. Provide an access token if necessary.
- This guide uses the English model; for other languages, use the appropriate model identifier (for example, the Chinese or German variants).
Conclusion
Jina Embeddings v2 marks a major advancement in NLP by addressing the challenges of embedding long documents. By supporting sequences of up to 8,192 tokens and delivering robust performance, it enables a variety of applications, including academic research, enterprise search, and generative AI. As NLP tasks increasingly involve processing long, complex texts, innovations like Jina Embeddings v2 will become essential. Its capabilities not only improve current workflows but also open up new possibilities for working with long-form textual data in the future.
For more details, or to integrate Jina Embeddings v2 into your projects, visit its Hugging Face page.
Key takeaways
- Jina Embeddings v2 supports up to 8,192 tokens, addressing a key limitation in long-document NLP tasks.
- ALiBi (Attention with Linear Biases) replaces traditional positional embeddings, allowing the model to process longer sequences effectively.
- Gated Linear Units (GLU) improve transformer efficiency, with variants such as GEGLU and ReGLU selected according to model size.
- The three-stage training process (pre-training, text-pair fine-tuning, and hard negative fine-tuning) ensures that the model produces robust and accurate embeddings.
- Jina Embeddings v2 performs exceptionally well on tasks such as classification, clustering, and retrieval, especially for long documents.
Frequently asked questions
Q1. How many tokens can Jina Embeddings v2 process?
A. Jina Embeddings v2 supports sequences of up to 8,192 tokens, exceeding the 512-token limit of traditional models such as BERT. This allows it to handle long documents without segmenting them, preserving global context and improving semantic representation.
Q2. What makes Jina Embeddings v2 efficient at handling long texts?
A. The model incorporates cutting-edge innovations such as Attention with Linear Biases (ALiBi), Gated Linear Units (GLU), and a three-stage training paradigm. These optimizations enable efficient handling of long texts while maintaining high performance.
Q3. How can I use Jina Embeddings v2 in my own project?
A. You can integrate it using the transformers or sentence-transformers libraries. Both provide easy-to-use APIs for encoding text, handling long sequences, and performing similarity calculations. Detailed setup steps and example code are provided in the guide above.
Q4. What should I check before loading the model from Hugging Face?
A. Ensure you are logged in to Hugging Face to access private models, and provide an access token if required. Additionally, confirm that the model matches your language requirements by selecting the appropriate identifier (for example, the Chinese or German variants).