Try taking a photo of each of North America's roughly 11,000 tree species, and you will have a mere fraction of the millions of photographs contained in nature image datasets. These huge collections of snapshots, ranging from butterflies to humpback whales, are a great research tool for ecologists because they provide evidence of organisms' unique behaviors, rare conditions, migration patterns, and responses to pollution and other environmental changes.
While comprehensive, nature image datasets are still not as useful as they could be. It takes a long time to search these databases and retrieve the images most relevant to your hypothesis. You would be better off with an automated research assistant, or perhaps artificial intelligence systems called multimodal vision language models (VLMs). They are trained on both text and images, making it easier for them to identify finer details, such as the specific trees in the background of a photo.
But how well can VLMs help nature researchers with image retrieval? A team from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), University College London, iNaturalist, and elsewhere designed a performance test to find out. Each VLM's task: locate and rerank the most relevant results within the team's “INQUIRE” dataset, composed of 5 million wildlife images and 250 search prompts from ecologists and other biodiversity experts.
Looking for that special frog
In these evaluations, the researchers found that larger, more advanced VLMs, which are trained on far more data, can sometimes get researchers the results they want to see. The models performed reasonably well on simple queries about visual content, such as identifying debris on a reef, but struggled significantly with queries that require expert knowledge, such as identifying specific biological conditions or behaviors. For example, the VLMs found examples of jellyfish on the beach with some ease, but struggled with more technical prompts such as “axanthism in a green frog,” a condition that limits a frog's ability to turn its skin yellow.
Their findings indicate that the models need much more domain-specific training data to process difficult queries. MIT doctoral student Edward Vendrow, a CSAIL affiliate who co-led work on the dataset in a new paper, believes that by becoming familiar with more informative data, the VLMs could one day be excellent research assistants. “We want to build retrieval systems that find the exact results scientists seek when they monitor biodiversity and analyze climate change,” says Vendrow. “Multimodal models don't yet understand more complex scientific language, but we believe INQUIRE will be an important benchmark for tracking how they improve in understanding scientific terminology and, ultimately, helping researchers automatically find the exact images they need.”
The team's experiments illustrated that larger models tended to be more effective for both simpler and more complex searches, thanks to their extensive training data. They first used the INQUIRE dataset to test whether VLMs could narrow a pool of 5 million images down to the 100 most relevant results for a prompt (a task known as “ranking”). For straightforward search queries such as “a reef with artificial structures and debris,” relatively large models such as SigLIP found matching images, while smaller CLIP models struggled. According to Vendrow, larger VLMs are “only starting to be useful” for ranking tougher queries.
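The ranking step described above follows the standard text-to-image retrieval recipe for CLIP-style models: embed the query and the images in a shared space, then keep the closest matches. Below is a minimal sketch of that idea, not the authors' code; the model name, file paths, and helper function are illustrative assumptions.

```python
# Sketch of CLIP-style "ranking": embed one text query and many images,
# then keep the top-k images by cosine similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def top_k_images(query: str, image_paths: list[str], k: int = 100):
    # Encode the text query once and normalize it.
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(**text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    scores = []
    for path in image_paths:
        image_inputs = processor(images=Image.open(path), return_tensors="pt")
        with torch.no_grad():
            img_emb = model.get_image_features(**image_inputs)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        # Cosine similarity between the image and the query.
        scores.append((path, float(img_emb @ text_emb.T)))

    # Highest similarity first; keep the k best candidates.
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]

# Example: top_k_images("a reef with artificial structures and debris", paths)
```

In practice the image embeddings would be precomputed and indexed so that each new query only requires one text encoding and a nearest-neighbor lookup.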
Vendrow and his colleagues also evaluated how well multimodal models could rerank those 100 results, reordering which images were most pertinent to a search. In these tests, even large LLMs trained on more curated data, such as GPT-4o, struggled: its accuracy score was just 59.6 percent, the highest achieved by any model.
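One common way to rerank with a multimodal LLM is to show it each candidate image alongside the query and ask for a relevance judgment. The sketch below uses the OpenAI API with GPT-4o for illustration; the prompt wording and 0–10 scoring scheme are assumptions, not the paper's exact protocol.

```python
# Sketch of reranking: score each top-ranked image with a multimodal LLM,
# then sort the candidates by that score.
import base64
from openai import OpenAI

client = OpenAI()

def relevance_score(query: str, image_path: str) -> float:
    # Encode the image for the chat API.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"On a scale of 0 to 10, how well does this image match: '{query}'? Reply with only the number."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return float(response.choices[0].message.content.strip())

def rerank(query: str, candidates: list[str]) -> list[str]:
    # Most relevant candidates first.
    return sorted(candidates, key=lambda p: relevance_score(query, p), reverse=True)
```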
The researchers presented these results at the Neural Information Processing Systems (NeurIPS) Conference earlier this month.
Querying for INQUIRE
The INQUIRE dataset includes search queries based on discussions with ecologists, biologists, oceanographers, and other experts about the types of images they would look for, including the animals' unique physical conditions and behaviors. A team of annotators then spent 180 hours searching the iNaturalist dataset for these prompts, carefully reviewing approximately 200,000 results to label 33,000 matches that fit the prompts.
For example, annotators used queries such as “a hermit crab that uses plastic debris as a shell” and “a California condor tagged with a green '26'” to identify the subsets of the larger image dataset that depict these specific, rare events.
The researchers then used the same search queries to see how well the VLMs could retrieve images from iNaturalist. The annotators' labels revealed when the models had difficulty understanding the scientists' keywords, as their results included images previously labeled as irrelevant to the search. For example, VLM results for “fire-scarred redwoods” sometimes included images of trees without any markings.
“This careful curation of data, with a focus on capturing real examples of scientific inquiry across research areas in ecology and environmental science, has proven vital to expanding our understanding of the current capabilities of VLMs in these potentially impactful scientific settings,” says Sara Beery, the Homer A. Burnell Career Development Assistant Professor at MIT, CSAIL principal investigator, and co-senior author of the work. “It has also outlined gaps in current research that we can now work to address, particularly for complex compositional queries, technical terminology, and the fine-grained, subtle differences that delineate the categories of interest to our collaborators.”
“Our findings imply that some vision models are already precise enough to help wildlife scientists retrieve some images, but many tasks remain too difficult for even the largest, best-performing models,” says Vendrow. “Although INQUIRE focuses on ecology and biodiversity monitoring, the wide variety of its queries means that VLMs that perform well on INQUIRE are likely to excel at analyzing large image collections in other observation-intensive fields.”
Inquiring minds want to see
Taking their project further, the researchers are working with iNaturalist to develop a query system that better helps scientists and other curious minds find the images they actually want to see. Their working demo allows users to filter searches by species, enabling quicker discovery of relevant results such as, say, the various eye colors of cats. Vendrow and co-senior author Omiros Pantazis, who recently received his PhD from University College London, also aim to improve the reranking system by augmenting current models to provide better results.
University of Pittsburgh associate professor Justin Kitzes highlights INQUIRE's ability to uncover secondary data. “Biodiversity datasets are becoming too large for any individual scientist to review,” says Kitzes, who was not involved in the research. “This paper draws attention to a difficult and unsolved problem: how to effectively search such data with questions that go beyond simply 'who is here' to instead ask about individual characteristics, behavior, and interactions between species. Being able to efficiently and accurately uncover these more complex phenomena in biodiversity image data will be critical to fundamental science and real-world impacts in ecology and conservation.”
Vendrow, Pantazis, and Beery wrote the paper with iNaturalist software engineer Alexander Shepard, University College London professors Gabriel Brostow and Kate Jones, University of Edinburgh associate professor and co-senior author Oisin Mac Aodha, and University of Massachusetts at Amherst assistant professor Grant Van Horn, who was co-lead author. Their work was supported, in part, by the Generative AI Laboratory at the University of Edinburgh, the US National Science Foundation/Natural Sciences and Engineering Research Council of Canada Global Center on AI and Biodiversity Change, a Royal Society research grant, and the Biome Health Project funded by the World Wildlife Fund of the United Kingdom.