Generating hypotheses from the scientific literature is the central aim of literature-based discovery (LBD). With drug discovery as its main field of application, LBD focuses on hypothesizing links between concepts that have not been examined together before (such as new drug-disease links).
Although these systems have since adopted machine learning methods, this formulation has serious limitations. Hypotheses cannot be expected to be expressive when the “language of scientific ideas” is reduced to its most basic form, a link between two concepts. In addition, LBD does not model factors that human scientists consider throughout the ideation process, such as the setting, requirements and constraints, motivations, and challenges of the intended application. Finally, the inductive and generative nature of science, where new concepts and their recombinations are continuously developed, is not captured in the transductive LBD setting, where all concepts are known a priori and only need to be connected.
Researchers from the University of Illinois at Urbana-Champaign, the Hebrew University of Jerusalem, and the Allen Institute for Artificial Intelligence (AI2) attempt to address these limitations with contextualized literature-based discovery (C-LBD), a new setting and modeling paradigm. They are the first to use a natural language context to constrain the LBD generation space, and they also depart from classic LBD on the output side by generating sentences.
The inspiration for C-LBD comes from the idea of an AI-powered assistant that can provide suggestions in plain language, including novel ideas and connections. The assistant accepts as input (1) relevant background, such as current challenges, motivations, and constraints, and (2) a seed term that should be the main focus of the generated scientific idea. Given this information, the team investigates two forms of C-LBD: one that generates a complete sentence describing an idea, and another that generates only a salient component of the idea.
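The two inputs and two output forms described above can be pictured as a small data structure plus a prompt builder. This is only an illustrative sketch: the names `CLBDInput`, `focus`, and `build_prompt`, as well as the prompt wording, are assumptions made here and are not taken from the paper or its code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CLBDInput:
    """Hypothetical container for the two inputs the assistant receives."""
    background: str  # challenges, motivations, constraints
    focus: str       # the item the generated idea should center on


def build_prompt(example: CLBDInput, mode: str) -> str:
    """Build a text prompt for one of the two task forms.

    mode="sentence"  -> ask for a full sentence describing an idea
    mode="component" -> ask only for a salient component of the idea
    The instruction wording is an assumption for illustration.
    """
    if mode == "sentence":
        instruction = "Write one sentence describing a new idea."
    elif mode == "component":
        instruction = "Name only the key component of a new idea."
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return (
        f"Background: {example.background}\n"
        f"Focus: {example.focus}\n"
        f"{instruction}"
    )


example = CLBDInput(
    background="Labeled data for relation extraction is scarce.",
    focus="relation extraction",
)
sentence_prompt = build_prompt(example, "sentence")
component_prompt = build_prompt(example, "component")
```

Keeping both task forms behind one function makes the difference explicit: the context is identical, and only the requested output granularity changes.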
To this end, they present a novel modeling framework for C-LBD that can retrieve inspirations from disparate sources (such as a scientific knowledge graph) and use them to form novel hypotheses. They also introduce an in-context contrastive model that uses sentences from the input context as negatives, discouraging the model from copying its input and encouraging novelty. Unlike most LBD research, which is aimed at biomedical applications, these experiments apply to papers in computer science. From the 67,408 articles in the ACL Anthology, the team automatically curated a new dataset using information extraction (IE) systems, complete with task, method, and background sentence annotations.
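To make the idea of using input text as negatives concrete, here is a minimal sketch of a generic InfoNCE-style contrastive loss. It is not the loss from the paper, which this article does not reproduce; the function names, the toy vectors, and the temperature value are all assumptions. The point is only that the loss grows when the model's output embedding sits closer to the input-derived negatives than to the target.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(
    output: Sequence[float],
    target: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.1,  # assumed value, for illustration only
) -> float:
    """Generic InfoNCE-style loss.

    output:    embedding of the generated hypothesis
    target:    embedding of the reference hypothesis (the positive)
    negatives: embeddings of input-context text the model should not copy
    """
    pos = math.exp(cosine(output, target) / temperature)
    neg = sum(math.exp(cosine(output, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy 2-d embeddings: the target lies on one axis, the input context on the other.
target_vec = [1.0, 0.0]
context_vec = [0.0, 1.0]

# Output close to the target: small loss.
loss_novel = contrastive_loss([1.0, 0.0], target_vec, [context_vec])
# Output that mirrors the input context: large loss.
loss_copy = contrastive_loss([0.0, 1.0], target_vec, [context_vec])
```

In this toy setup `loss_copy` is much larger than `loss_novel`, which is the pressure away from copying that the paragraph above describes.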
Because the work focuses specifically on NLP, researchers in that area will find it easier to analyze the results. Experimental results from automated and human evaluations reveal that retrieval-augmented hypothesis generation significantly outperforms previous methods, but that current state-of-the-art generative models are still inadequate for this task.
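Augmenting generation with retrieved text can be sketched in a few lines. The example below uses a simple bag-of-words cosine similarity to pick the most related past sentences and appends them to the input as "inspirations". The retriever, the function names, and the prompt layout are assumptions for illustration; the actual system may use a different retriever and other sources such as a knowledge graph.

```python
import math
from collections import Counter
from typing import List


def bow(text: str) -> Counter:
    """Lowercased whitespace-token counts (a deliberately crude representation)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Return the k corpus sentences most similar to the query."""
    q = bow(query)
    ranked = sorted(corpus, key=lambda s: cosine(q, bow(s)), reverse=True)
    return ranked[:k]


def augment(query: str, corpus: List[str], k: int = 2) -> str:
    """Append retrieved sentences to the query as extra context for a generator."""
    lines = [f"- {s}" for s in retrieve(query, corpus, k)]
    return query + "\nInspirations:\n" + "\n".join(lines)


corpus = [
    "graph neural networks for relation extraction",
    "data augmentation for low-resource translation",
    "beam search decoding",
]
prompt = augment("relation extraction with limited data", corpus, k=2)
```

The augmented `prompt` would then be passed to whatever generative model is in use; that step is left out here because it depends on the model.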
The team believes that extending C-LBD with multimodal analysis of formulas, tables, and figures, to provide richer and more complete background context, is an intriguing direction for future work. Using advanced LLMs such as GPT-4, which are developing rapidly, is another avenue to investigate.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a strong interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advances in technology and their real-life applications.