Rebuilding an HNSW index is one of the most resource-intensive aspects of running HNSW in production workloads. Unlike traditional databases, where a deletion can be handled by simply removing a row from a table, a vector database using HNSW typically needs a full index rebuild to maintain optimal performance and accuracy.
Why is reconstruction necessary?
Due to its layered graph structure, HNSW is not inherently designed for dynamic data sets that change frequently. Yet adding new vectors and removing stale ones is essential to keeping an index up to date, especially for use cases like RAG, where search relevance depends on current data.
Most databases distinguish between “hard” and “soft” deletion. A hard deletion removes data permanently, while a soft deletion marks data as “to be deleted” and removes it later. The problem with soft deletion is that the marked data continues to consume a significant amount of memory until it is permanently removed. This is particularly problematic in vector databases using HNSW, where memory consumption is already a major concern.
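To make that memory cost concrete, here is a minimal sketch of a toy in-memory vector store (the class and method names are hypothetical, not from any particular database): soft-deleted vectors are hidden from search results, but they keep occupying memory until a compaction or rebuild removes them.

```python
import numpy as np

class ToyVectorStore:
    """A toy, illustration-only vector store contrasting hard and soft deletion."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = {}        # id -> vector, still resident in memory
        self.tombstones = set()  # ids marked "to be deleted" (soft-deleted)

    def add(self, vec_id, vec):
        self.vectors[vec_id] = vec

    def hard_delete(self, vec_id):
        # Memory for the vector is released immediately.
        self.vectors.pop(vec_id, None)

    def soft_delete(self, vec_id):
        # The vector stays in memory; it is only hidden from search results.
        self.tombstones.add(vec_id)

    def live_count(self):
        return len(self.vectors) - len(self.tombstones)

    def memory_resident_count(self):
        # Soft-deleted vectors still occupy memory until compaction/rebuild.
        return len(self.vectors)


store = ToyVectorStore(dim=128)
for i in range(1000):
    store.add(i, np.random.rand(128).astype(np.float32))

for i in range(400):
    store.soft_delete(i)

print(store.live_count())             # 600 vectors remain searchable
print(store.memory_resident_count())  # but all 1000 still consume memory
```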
HNSW builds a graph in which nodes (vectors) are connected based on their proximity in the vector space, and the graph is traversed much like a skip list. To make this work, the layers are designed so that the upper layers contain very few nodes. When vectors are removed, especially those in sparse layers that serve as critical connectors, the HNSW structure can become fragmented. This fragmentation can leave nodes (or entire layers) disconnected from the main graph, which either requires rebuilding the entire graph or, at the very least, degrades search efficiency.
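The sketch below is a toy single-layer graph rather than real HNSW code, but it shows the failure mode: when one connector node is removed, a breadth-first reachability check from the entry point can no longer see the nodes on the far side of the graph.

```python
from collections import deque

# Undirected adjacency list; node 4 is the sole bridge between {0,1,2,3} and {5,6,7}.
graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4},
    4: {3, 5},                 # critical connector node
    5: {4, 6}, 6: {5, 7}, 7: {6},
}

def reachable(adj, start, removed=frozenset()):
    """Return the set of nodes reachable from `start`, skipping removed nodes."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in seen and nbr not in removed:
                seen.add(nbr)
                queue.append(nbr)
    return seen

print(len(reachable(graph, start=0)))               # 8: every node is reachable
print(len(reachable(graph, start=0, removed={4})))  # 4: nodes 5-7 are now unreachable
```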
For this reason, HNSW relies on a soft-pruning technique that marks vectors for deletion but does not remove them immediately. This approach avoids the expense of frequent full rebuilds, although periodic rebuilding is still necessary to keep the graph in an optimal state.
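A common way to schedule those periodic rebuilds is to trigger one when the share of soft-deleted vectors crosses a threshold. The sketch below illustrates that policy with hypothetical names and an arbitrary 20% threshold; real systems choose the trigger based on their own memory and recall trade-offs.

```python
def needs_rebuild(total_vectors, soft_deleted, threshold=0.2):
    """Return True once soft-deleted vectors exceed `threshold` of the index."""
    if total_vectors == 0:
        return False
    return soft_deleted / total_vectors >= threshold

def maybe_rebuild(vectors, tombstones, threshold=0.2):
    """Compact the index down to live vectors when the policy fires."""
    if needs_rebuild(len(vectors), len(tombstones), threshold):
        live = {vid: vec for vid, vec in vectors.items() if vid not in tombstones}
        # In a real system this is where HNSW construction would re-run over `live`;
        # here we simply return the compacted data and an empty tombstone set.
        return live, set(), True
    return vectors, tombstones, False

vectors = {i: [float(i)] for i in range(10)}
deleted = {0, 1, 2}   # 30% soft-deleted, which exceeds the 20% threshold
vectors, deleted, rebuilt = maybe_rebuild(vectors, deleted)
print(rebuilt, len(vectors))  # True 7
```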