iRangeGraph: A dynamic approach to improve nearest neighbor search performance using range filtering through efficient graph construction and lower memory footprint on large-scale data systems

Graph-based methods have gained increasing importance in data retrieval and machine learning, particularly in nearest neighbor (NN) search. NN search helps to identify the closest data points to a given query, which becomes critical with high-dimensional data such as text, images, or audio. Approximate nearest neighbor (ANN) methods emerged due to the inefficiency of exact searches in high-dimensional spaces. ANN methods, especially graph-based approaches, balance response time and accuracy, making them widely used in real-world applications such as recommendation engines, e-commerce platforms, and ai-based search systems. These systems rely heavily on timely and accurate retrieval of relevant data from large datasets.

A major challenge in NN search arises when there is a need to combine vector-based search with additional numerical attribute constraints. For example, a user on an e-commerce platform might want to find products similar to a specific item within a given price range. Traditional ANN methods either filter out irrelevant data before the search or search without considering the constraints and filter afterwards. Both approaches face performance issues. Pre-filtering can become inefficient for large data sets, while post-filtering can return many irrelevant results, wasting computational resources. The need for efficient search techniques that incorporate vector similarity and numerical constraints has become increasingly important, especially in systems that handle massive volumes of data across various industries.

Existing methods for range filtering approximate nearest neighbor (RFANN) queries include pre-filtering and post-filtering, where numerical constraints are applied before or after an ANN search. Another method, internal filtering, attempts to integrate these numerical constraints during the search, aiming to visit only data points within the numerical range of the query. However, these methods struggle to provide optimal performance under different query scenarios. For example, pre-filtering becomes slow when the numerical constraint is not selective enough, while post-filtering results in wasted effort when too many irrelevant data points are visited. The inherent shortcomings of these strategies have prompted researchers to explore alternative approaches, particularly for cases where query workloads vary in size and complexity.

Researchers from Nanyang Technological University and Aalborg University have introduced a new method called i rank chart to address the limitations of existing pipelines. Instead of pre-computing graphs for every possible numerical range, iRangeGraph materializes elementary graphs for only a few ranges. These graphs can then be used to dynamically build a dedicated graph for any query range during execution, reducing the need for large-scale index storage. The technique has caught the attention of industry players such as Apple and Alibaba, who use similar methods for their large-scale search systems. The main advantage of iRangeGraph is its ability to reduce memory consumption while maintaining high query performance, making it an attractive solution for enterprises with large data sets.

The iRangeGraph technique involves dynamic construction of graph-based indexes during query processing. Instead of creating and storing an index for every possible range, the method builds these graphs on an as-needed basis, leveraging pre-built elementary graphs that cover a moderate number of ranges. This approach conserves memory and ensures that query response time remains efficient. iRangeGraph is particularly useful in scenarios where the numerical constraints applied to the search are neither highly selective nor unselective and where existing methods tend to perform poorly. iRangeGraph can handle multi-attribute RFANN queries, meaning that queries involving more than one numerical constraint can be processed efficiently. For example, a query may search for data points within a specific price and date range, and iRangeGraph can handle such scenarios effectively.

Performance testing of iRangeGraph was performed on several real-world datasets including the WIT-Image, TripClick, Redcaps, and YouTube datasets. These datasets included high-dimensional vector data and numerical attributes such as image size, post date, and number of likes. The tests showed that iRangeGraph significantly outperformed existing methods. With a recall of 0.9, iRangeGraph achieved 2-5x better query per second (qps) performance than its competitors. Memory usage was consistently lower, a key advantage when working with large-scale systems where storage is a critical concern. Compared to dedicated graph-based indexes materialized for each query range, iRangeGraph was less than 2x slower and consumed much less memory. For multi-attribute RFANN queries, iRangeGraph demonstrated a 2-4x performance improvement in qps compared to the most competitive baseline methods.

In conclusion, iRangeGraph presents a novel and efficient solution for range filtering in approximate nearest neighbor queries. By dynamically constructing graph indexes during query execution and using elementary graphs to reduce memory requirements, this method successfully addresses the shortcomings of existing RFANN techniques. iRangeGraph’s ability to deliver high performance on diverse query workloads while significantly reducing memory consumption makes it an ideal choice for large-scale data systems. The flexibility of the method to handle multi-attribute queries extends its applicability in real-world scenarios. The research findings highlight the potential of iRangeGraph to revolutionize range filtering in nearest neighbor search, especially for systems managing high-dimensional data with numerical constraints.

Take a look at the PaperAll credit for this research goes to the researchers of this project. Also, don't forget to follow us on twitter.com/Marktechpost”>twitter and join our Telegram Channel and LinkedIn GrAbove!. If you like our work, you will love our fact sheet..

Don't forget to join our SubReddit of over 50,000 ml

FREE ai WEBINAR: 'SAM 2 for Video: How to Optimize Your Data' (Wednesday, September 25, 4:00 am – 4:45 am EST)

Aswin AK is a Consulting Intern at MarkTechPost. He is pursuing his dual degree from Indian Institute of technology, Kharagpur. He is passionate about Data Science and Machine Learning and has a strong academic background and hands-on experience in solving real-world interdisciplinary challenges.

Join the fastest growing ai research newsletter read by researchers from Google + NVIDIA + Meta + Stanford + MIT + Microsoft and many others…

iRangeGraph: A dynamic approach to improve nearest neighbor search performance using range filtering through efficient graph construction and lower memory footprint on large-scale data systems

Technical Terrence Team

If I invest £5,000 in Airtel Africa, how much passive income would I get?

Leave a Reply Cancel reply

Recommended.

The CFPB finished more than 1,400 workers, leaving about 200 employees and effectively dethripating the agency that Elon Musk said previously “eliminate” (Makena Kelly/Wired)

8 Types of Chunking for RAG Systems

Mike Tyson joins boxing blockchain project

Meet PromptingWhisper: Using prompt engineering to tailor the whisper model for invisible tasks, prompting prompts improve performance by 10-45% on three zero-trigger tasks

DeGods Co-Founder Leaves the Project

Categories

Important Links

iRangeGraph: A dynamic approach to improve nearest neighbor search performance using range filtering through efficient graph construction and lower memory footprint on large-scale data systems

Related

Technical Terrence Team

If I invest £5,000 in Airtel Africa, how much passive income would I get?

Leave a Reply Cancel reply

Recommended.

The CFPB finished more than 1,400 workers, leaving about 200 employees and effectively dethripating the agency that Elon Musk said previously “eliminate” (Makena Kelly/Wired)

8 Types of Chunking for RAG Systems

Mike Tyson joins boxing blockchain project

Meet PromptingWhisper: Using prompt engineering to tailor the whisper model for invisible tasks, prompting prompts improve performance by 10-45% on three zero-trigger tasks

DeGods Co-Founder Leaves the Project

Categories

Important Links

Get daily news updates to your inbox!