Information retrieval (IR) models face significant challenges in delivering transparent and intuitive search experiences. Current methodologies rely primarily on a single semantic similarity score to match queries to passages, resulting in a potentially opaque user experience. This approach often requires users to engage in a cumbersome process of finding specific keywords, applying various filters in the advanced search settings, and iteratively refining their queries based on previous search results. The need for users to construct the “perfect” query to retrieve desired passages highlights the limitations of existing IR systems in providing efficient and easy-to-use search capabilities.
Recent developments in IR models have introduced the use of instructions, moving beyond traditional dense-retriever training, which optimized similarity functions akin to sentence-level semantic matching. Early efforts such as TART and Instructor incorporated simple task prefixes during training. More recent models such as E5-Mistral, GritLM, and NV-Retriever have scaled up this approach, enlarging both the dataset and the model, and typically adopt the instruction set proposed by E5-Mistral. However, while these models represent progress, they still rely primarily on a single set of instructions and do not fully address the challenge of providing users with a more transparent and flexible search experience.
Researchers at Johns Hopkins University and Samaya AI have introduced Promptriever, a unique approach to information retrieval that enables control using natural language prompts. This model allows users to dynamically adjust relevance criteria using conversational descriptions, eliminating the need for multiple searches or complex filters. For example, when searching for James Cameron movies, users can simply specify criteria such as “Relevant documents are not co-directed and were created before 2022.” Promptriever is based on a bi-encoder retrieval architecture and uses large language models such as LLaMA-2 7B as its backbone. While pretrained language models can adapt to natural language instructions, traditional IR training often compromises this ability by focusing solely on optimizing query-passage semantic similarity scores. Promptriever addresses this limitation, preserving the ability to follow instructions even after IR training.
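The mechanics of this setup can be illustrated with a minimal sketch. Here a hash-based bag-of-words embedder stands in for Promptriever's actual LLaMA-2 7B backbone (the `embed` function and `DIM` size are illustrative assumptions, not the paper's implementation); what the sketch does capture is the bi-encoder pattern: the natural-language instruction is appended to the query before encoding, so the relevance criteria travel inside the query vector itself, and passages are ranked by a single dot-product score.

```python
import numpy as np

DIM = 256  # illustrative embedding size; the real model uses a learned LLM encoder

def embed(text: str) -> np.ndarray:
    """Map text to a unit vector (toy stand-in for a learned encoder)."""
    vec = np.zeros(DIM)
    for token in text.lower().split():
        vec[hash(token) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def search(query: str, instruction: str, passages: list[str]) -> list[tuple[float, str]]:
    # The instruction is simply concatenated onto the query before encoding,
    # so a single vector carries both the topic and the relevance criteria.
    q = embed(f"{query} {instruction}")
    scored = [(float(q @ embed(p)), p) for p in passages]
    return sorted(scored, reverse=True)  # highest dot-product score first

passages = [
    "Avatar, directed by James Cameron, released in 2009.",
    "Avatar: The Way of Water, a co-directed sequel released in 2022.",
]
results = search(
    "James Cameron movies",
    "Relevant documents are not co-directed and were created before 2022.",
    passages,
)
```

With a real trained encoder, changing only the instruction string would reorder the results; that is the controllability the model is after.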
Promptriever uses a two-part data generation process to train its bi-encoder for instruction-based retrieval. The model is trained on the MS MARCO dataset, using the tevatron-msmarco-aug version with hard negatives. The first step is instruction generation, where Llama-3-70B-Instruct creates diverse, query-specific instructions varying in length and style. These instructions preserve the relevance of the original positive passages, as verified by FollowIR-7B.
The second step, instruction negative mining, introduces passages that are relevant to the query alone but irrelevant once the instruction is applied. This forces the model to attend to both the query and the instruction during training. GPT-4 generates these passages, which are then filtered using FollowIR-7B to ensure accuracy. Human validation confirms the effectiveness of this filtering, with model-human agreement reaching 84%.
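The two steps above can be sketched as a small pipeline. The callables `generate_instruction`, `generate_instruction_negative`, and `judge_relevance` are hypothetical stand-ins for calls to Llama-3-70B-Instruct, GPT-4, and the FollowIR-7B filter respectively; this is a structural sketch of the augmentation flow, not the authors' code.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TrainingExample:
    query: str
    instruction: str
    positive: str
    negatives: list

def augment(query, positive, hard_negatives,
            generate_instruction: Callable,        # stand-in: Llama-3-70B-Instruct
            generate_instruction_negative: Callable,  # stand-in: GPT-4
            judge_relevance: Callable):            # stand-in: FollowIR-7B filter
    # Step 1: create a query-specific instruction, and keep it only if the
    # judge still marks the original positive passage as relevant under it.
    instruction = generate_instruction(query)
    if not judge_relevance(query, instruction, positive):
        return None  # instruction broke relevance; discard this example

    # Step 2: mine an "instruction negative" -- a passage relevant to the
    # query alone but irrelevant once the instruction is applied.
    candidate = generate_instruction_negative(query, instruction)
    negatives = list(hard_negatives)
    if not judge_relevance(query, instruction, candidate):
        negatives.append(candidate)  # candidate passed the filter as a negative

    return TrainingExample(query, instruction, positive, negatives)

# Toy stand-in components for demonstration only.
gen_inst = lambda q: "Relevant documents were created before 2022."
gen_neg = lambda q, i: "A 2023 passage about the same topic."
judge = lambda q, i, p: "2023" not in p  # crude proxy for a relevance judge

example = augment("james cameron movies", "Avatar (2009) plot summary.",
                  ["unrelated hard negative"], gen_inst, gen_neg, judge)
```

Training on such triples (query + instruction, positive, negatives that flip under the instruction) is what teaches the bi-encoder to condition on the instruction rather than ignore it.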
This comprehensive data augmentation approach allows Promptriever to adapt its relevance criteria dynamically based on natural language instructions, significantly improving its retrieval capabilities compared to traditional IR models.
Promptriever demonstrates superior performance in following instructions while maintaining strong standard retrieval capabilities. It outperforms the original RepLLaMA by a significant margin, with improvements of +14.3 p-MRR and +3.1 nDCG/MAP, establishing itself as the best-performing dense retriever. While cross-encoder models achieve the best overall results thanks to their richer query-passage interactions (at a far higher computational cost), Promptriever's performance as a bi-encoder is comparable while remaining much more efficient.
On standard retrieval tasks without instructions, Promptriever performs on par with RepLLaMA both in-domain (MS MARCO) and out-of-domain (BEIR). Furthermore, Promptriever shows 44% less variance across prompt phrasings than RepLLaMA and 77% less than BM25, indicating greater robustness to input variations. These results underscore the effectiveness of Promptriever's instruction-based approach in improving both retrieval accuracy and adaptability to diverse queries.
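The robustness claim above can be made concrete with a small sketch of how such a comparison might be computed: run the same queries under paraphrased prompts, collect a retrieval metric per paraphrase, and compare the spread across models. The metric values below are purely illustrative, not numbers from the paper.

```python
import statistics

def prompt_variance(metric_per_prompt: list[float]) -> float:
    """Population std. dev. of a retrieval metric across prompt paraphrases."""
    return statistics.pstdev(metric_per_prompt)

def relative_reduction(model_var: float, baseline_var: float) -> float:
    """Fractional reduction in prompt variance relative to a baseline retriever."""
    return 1.0 - model_var / baseline_var

# Illustrative nDCG@10 values for the same query set under four prompt paraphrases.
promptriever_ndcg = [0.52, 0.53, 0.51, 0.52]
repllama_ndcg = [0.48, 0.55, 0.44, 0.51]

reduction = relative_reduction(prompt_variance(promptriever_ndcg),
                               prompt_variance(repllama_ndcg))
```

A lower spread for the same mean performance is what "more robust to input variations" cashes out to in practice.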
This study presents Promptriever, a significant advance in information retrieval and the first retrieval model that can be prompted zero-shot in natural language. Developed using a unique instruction-based dataset derived from MS MARCO, the model demonstrates superior performance on both standard retrieval and instruction-following tasks. By dynamically adapting its relevance criteria based on per-query instructions, Promptriever shows that prompting techniques from language models can be successfully carried over to dense retrieval. This innovation paves the way for more flexible and easy-to-use information retrieval systems, bridging the gap between natural language processing and efficient search capabilities.
Check out the Paper. All credit for this research goes to the researchers of this project.
Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching applications of machine learning in healthcare.