Information retrieval (IR) models rank documents by their relevance to a user query, giving efficient and effective access to information. One of the most interesting applications of IR is in biomedicine, where it can be used to search the scientific literature and help medical professionals make evidence-based decisions.
However, since most existing IR systems in this field are keyword-based, they may miss relevant articles that do not share exactly the same keywords. Dense retrievers, meanwhile, are mostly trained on general-domain datasets and tend to perform poorly on domain-specific tasks. There is also a paucity of domain-specific datasets, which restricts the development of generalizable models.
To address these issues, the researchers behind the paper have introduced MedCPT, an IR model trained on 255 million query-article pairs from anonymized PubMed search logs. In traditional IR systems, the retriever and re-ranker modules are often mismatched, which hurts performance. MedCPT, on the other hand, is described as the first IR model that integrates these two components through contrastive learning. This keeps the re-ranking step closely aligned with the characteristics of the retrieved articles, making the entire system more efficient.
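To make the idea of contrastive learning on query-article pairs concrete, here is a minimal sketch of one common formulation: a softmax cross-entropy loss with in-batch negatives. This is an illustration only; the function name, the dot-product similarity, and the `temperature` parameter are assumptions for the example and are not taken from the MedCPT paper or code.

```python
import numpy as np

def in_batch_contrastive_loss(query_emb, article_emb, temperature=1.0):
    """Hypothetical helper: row i of query_emb is paired with row i of article_emb.

    Every other article in the batch acts as a negative for query i.
    """
    # Similarity of every query to every article in the batch (dot product).
    logits = query_emb @ article_emb.T / temperature
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching article for query i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))

# Toy batch: 4 query embeddings and 4 article embeddings of dimension 8.
rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
articles = rng.normal(size=(4, 8))
print(in_batch_contrastive_loss(queries, articles))
```

The loss is low when each query scores its own article higher than the other articles in the batch, and high otherwise, which is what pushes matching queries and articles together in the embedding space.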
As mentioned above, MedCPT consists of a first-stage retriever and a second-stage re-ranker. The retriever is a bi-encoder, which scales well because documents can be encoded offline and only the user query needs to be encoded at inference time. It then uses nearest neighbor search to identify the documents most similar to the encoded query. The re-ranker, which is a cross-encoder, further refines the ranking of the top articles returned by the retriever and produces the final ranking.
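The two-stage flow described above can be sketched in a few lines. This is not the MedCPT implementation: `toy_encode` and `toy_cross_score` are hypothetical stand-ins (a hashed bag-of-words vector and a token-overlap score) for the trained bi-encoder and cross-encoder, and the brute-force inner-product search stands in for whatever nearest neighbor index a real system would use.

```python
import zlib
import numpy as np

DIM = 64  # toy embedding size, chosen arbitrarily for this sketch

def toy_encode(text):
    """Stand-in for a trained encoder: hashed bag-of-words, L2-normalised."""
    vec = np.zeros(DIM)
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def toy_cross_score(query, article):
    """Stand-in for a cross-encoder: scores the (query, article) pair jointly."""
    q, a = set(query.lower().split()), set(article.lower().split())
    return len(q & a) / len(q | a) if q | a else 0.0

def build_index(articles):
    # Offline step: encode every article once and stack the vectors.
    return np.stack([toy_encode(a) for a in articles])

def retrieve(query, index, k):
    # Online step: one query encoding, then exact nearest neighbor search
    # by inner product over the precomputed article vectors.
    scores = index @ toy_encode(query)
    return np.argsort(-scores)[:k].tolist()

def search(query, articles, index, k=3):
    candidates = retrieve(query, index, k)
    # Re-rank only the k candidates with the more expensive pairwise scorer.
    return sorted(candidates,
                  key=lambda i: toy_cross_score(query, articles[i]),
                  reverse=True)

corpus = [
    "aspirin reduces heart attack risk",
    "insulin therapy for type 1 diabetes",
    "statins lower cholesterol levels",
    "metformin in type 2 diabetes",
]
corpus_index = build_index(corpus)
for i in search("type 2 diabetes treatment", corpus, corpus_index, k=2):
    print(corpus[i])
```

The design point is that the expensive pairwise scorer runs only on the `k` candidates, while the cheap vector comparison covers the whole collection.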
Although the re-ranker is computationally expensive, the overall MedCPT architecture remains efficient because only one query encoding and one nearest neighbor search are required before re-ranking. MedCPT was evaluated on a wide range of zero-shot biomedical IR tasks, with the following results:
- MedCPT achieved state-of-the-art document retrieval performance on three of the five biomedical tasks in the BEIR benchmark. It outperformed much larger models such as Google’s GTR-XXL (4.8B parameters) and OpenAI’s cpt-text-XL (175B parameters).
- The MedCPT article encoder outperforms other models such as SPECTER and SciNCL when evaluated on the RELISH article similarity task. It also achieves SOTA performance on the MeSH prediction task in SciDocs.
- The MedCPT query encoder can effectively encode biomedical and clinical sentences.
In conclusion, MedCPT is the first information retrieval model that integrates a paired retriever and re-ranker. This architecture balances efficiency and performance: MedCPT achieves SOTA performance on numerous biomedical tasks and outperforms many larger models. The model has the potential to be applied to various biomedical applications, such as recommending related articles, retrieving similar sentences, and searching for relevant documents, making it a valuable asset for both biomedical knowledge discovery and clinical decision support.