Hashing is a core operation in most online databases, such as a library catalog or an e-commerce website. A hash function generates codes that directly determine the location where the data will be stored, so using these codes makes it easier to find and retrieve the data.
However, because traditional hash functions generate codes randomly, two pieces of data can sometimes be hashed to the same value. This causes collisions: a search for one item leads the user to many pieces of data with the same hash value, and it takes much longer to find the right one, resulting in slower searches and reduced performance.
Certain types of hash functions, known as perfect hash functions, are designed to place data in a way that prevents collisions. But they are time-consuming to construct for each data set and take longer to compute than traditional hash functions.
Since hashing is used in so many applications, from database indexing to data compression to cryptography, fast and efficient hashing functions are critical. So researchers at MIT and elsewhere set out to see if they could use machine learning to build better hash functions.
They found that in certain situations, using learned models instead of traditional hash functions could lead to half as many collisions. These learned models are created by running a machine learning algorithm on a data set to capture specific features. The team’s experiments also showed that learned models were often more computationally efficient than perfect hashes.
“What we found in this work is that, in some situations, we can reach a better balance between the computation of the hash function and the collisions that we will face. In these situations, the computation time for the hash function can be increased a bit, but at the same time its collisions can be reduced very significantly,” says Ibrahim Sabek, a postdoc in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) Data Systems Group.
Their research, which will be presented at the 2023 International Conference on Very Large Databases, demonstrates how a hash function can be designed to significantly speed up searches in a huge database. For example, their technique could speed up the computational systems scientists use to store and analyze DNA, amino acid sequences, or other biological information.
Sabek is the co-lead author of the paper with Department of Electrical Engineering and Computer Science (EECS) graduate student Kapil Vaidya. They are joined by co-authors Dominick Horn, a graduate student at the Technical University of Munich; Andreas Kipf, an MIT postdoc; Michael Mitzenmacher, professor of computer science at the Harvard John A. Paulson School of Engineering and Applied Sciences; and senior author Tim Kraska, an EECS associate professor at MIT and co-director of the Data, Systems, and Artificial Intelligence Laboratory.
Hashing it out
Given a data input, or key, a traditional hash function generates a random number, or code, that corresponds to the slot where that key will be stored. To use a simple example, if there are 10 keys to be placed in 10 slots, the function would output an integer between 1 and 10 for each input. With a random assignment, it is very likely that two keys will end up in the same slot, causing collisions.
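As a rough illustration (not from the paper; the keys and slot count below are made up), the following Python sketch buckets a handful of keys with a generic hash function and counts how many land in an already-occupied slot:

```python
# Minimal sketch: a traditional hash function maps each key to a
# pseudo-random slot, so distinct keys can land in the same slot.
from collections import defaultdict

def traditional_hash(key, num_slots=10):
    # Python's built-in hash() stands in for any generic hash function.
    return hash(key) % num_slots

keys = ["apple", "banana", "cherry", "date", "elderberry",
        "fig", "grape", "honeydew", "kiwi", "lemon"]

slots = defaultdict(list)
for key in keys:
    slots[traditional_hash(key)].append(key)

collisions = sum(len(bucket) - 1 for bucket in slots.values())
print(f"{collisions} keys collided across {len(slots)} occupied slots")
```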
Perfect hash functions provide a collision-free alternative. The researchers give the function some additional knowledge, such as the number of slots the data will be placed in. The function can then perform additional computations to figure out where to put each key so that collisions are avoided. However, these additional computations make the function harder to build and less efficient.
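To get a feel for that extra construction cost, the toy sketch below simply searches for a hashing seed that happens to place every known key in its own slot. Real perfect-hashing algorithms are far more sophisticated; the key set, slot count, and brute-force seed search here are illustrative assumptions, not the researchers' method.

```python
import hashlib

def seeded_slot(key, seed, num_slots):
    # A seeded, deterministic hash derived from SHA-256.
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_slots

def build_collision_free_hash(keys, num_slots):
    # Build-time search: try seeds until every known key gets its own slot.
    for seed in range(1_000_000):
        if len({seeded_slot(k, seed, num_slots) for k in keys}) == len(keys):
            return lambda key, s=seed: seeded_slot(key, s, num_slots)
    raise RuntimeError("no collision-free seed found")

keys = ["apple", "banana", "cherry", "date", "elderberry"]
h = build_collision_free_hash(keys, num_slots=8)
print({k: h(k) for k in keys})  # five keys, five distinct slots
```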
“We were wondering, if we know more about the data, that it will come from a particular distribution, can we use learned models to build a hash function that can actually reduce collisions?” Vaidya says.
A data distribution shows all possible values in a data set and how often each value occurs. The distribution can be used to calculate the probability that a particular value will be in a sample of data.
The researchers took a small sample from a data set and used machine learning to approximate the shape of the data's distribution, or how the data are spread out. The learned model then uses this approximation to predict the location of a key in the data set.
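The sketch below is a minimal illustration of that idea rather than the paper's actual models: it approximates the data's cumulative distribution function (CDF) from a small sample and maps each key to a slot in proportion to its estimated CDF value. The synthetic Gaussian keys and sample size are assumptions chosen for demonstration.

```python
import bisect
import random

random.seed(0)
# Synthetic, predictably distributed keys (an assumption for demonstration).
data = sorted(random.gauss(1_000, 200) for _ in range(100_000))
sample = sorted(random.sample(data, 1_000))   # small sample of the data set

def approx_cdf(key):
    # Piecewise-linear interpolation of the sample's empirical CDF stands in
    # for the learned model of the data distribution.
    i = bisect.bisect_left(sample, key)
    if i == 0:
        return 0.0
    if i == len(sample):
        return 1.0
    lo, hi = sample[i - 1], sample[i]
    return (i - 1 + (key - lo) / (hi - lo)) / (len(sample) - 1)

def learned_hash(key, num_slots):
    # Map the key to a slot in proportion to its estimated CDF value.
    return min(int(approx_cdf(key) * num_slots), num_slots - 1)

for rank in (1_000, 25_000, 50_000, 75_000, 99_000):
    key = data[rank]
    print(f"true position {rank:>6}  predicted slot {learned_hash(key, len(data)):>6}")
```

Because the estimated CDF roughly tracks each key's rank, the predicted slot lands near the key's true position in the data set.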
They found that learned models were easier to build and faster to run than perfect hash functions, and that they generated fewer collisions than traditional hash functions if the data were distributed in a predictable manner. But if the data were not predictably distributed, because the gaps between data points varied too much, using learned models could cause more collisions.
“We may have a large number of data inputs, and the gaps between consecutive inputs are very different, so learning a model to capture the data distribution of these inputs is quite difficult,” Sabek explains.
Fewer collisions, faster results
When data were distributed predictably, learned models could reduce the proportion of colliding keys in a data set from 30 percent to 15 percent, compared with traditional hash functions. They were also able to achieve better performance than perfect hash functions. In the best cases, learned models cut execution time by nearly 30 percent.
As they explored the use of learned models for hashing, the researchers also found that performance was affected most by the number of submodels. Each learned model is composed of smaller linear models that approximate the data distribution for different parts of the data. With more submodels, the learned model produces a more accurate approximation, but it takes more time to run.
“At a certain threshold of submodels, you get enough information to build the approximation you need for the hash function. But after that, it won’t lead to further improvements in collision reduction,” says Sabek.
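The simplified sketch below illustrates that trade-off with a synthetic, skewed data set and plain piecewise-linear submodels (assumptions for demonstration, not the paper's exact architecture): adding submodels makes the approximation track the distribution more closely, so predicted positions drift less from the true ones, but each extra piece adds work to build, store, and evaluate.

```python
import bisect
import random

random.seed(1)
# Synthetic skewed keys; the distribution and sizes are illustrative only.
keys = sorted(random.expovariate(0.001) for _ in range(10_000))

def build_submodels(keys, num_submodels):
    # Boundaries of equal-size key ranges; each adjacent pair of boundaries
    # defines one linear submodel of the empirical CDF.
    step = len(keys) // num_submodels
    return [keys[i * step] for i in range(num_submodels)] + [keys[-1]]

def predicted_position(key, boundaries, num_keys):
    # Locate the submodel covering the key, then interpolate linearly.
    i = min(max(bisect.bisect_right(boundaries, key) - 1, 0), len(boundaries) - 2)
    lo, hi = boundaries[i], boundaries[i + 1]
    frac = 0.0 if hi == lo else min(max((key - lo) / (hi - lo), 0.0), 1.0)
    return (i + frac) / (len(boundaries) - 1) * num_keys

for num_submodels in (2, 8, 32, 128):
    boundaries = build_submodels(keys, num_submodels)
    error = sum(abs(predicted_position(k, boundaries, len(keys)) - rank)
                for rank, k in enumerate(keys)) / len(keys)
    print(f"{num_submodels:>4} submodels -> mean placement error {error:8.1f} slots")
```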
Building on this analysis, the researchers want to use learned models to design hash functions for other types of data. They also plan to explore learned hashing for databases in which data can be inserted or deleted. When data are updated in this way, the model must change accordingly, but changing the model while maintaining accuracy is a difficult problem.
“We want to encourage the community to use machine learning within more fundamental data structures and algorithms. Any kind of core data structure presents us with the opportunity to use machine learning to capture data properties for better performance. There is still a lot we can explore,” says Sabek.
“Hashing and indexing functions are fundamental to many database functions. Given the variety of users and use cases, there is no one-size-fits-all hashing, and learned models help tailor the database to a specific user. This paper is a great balanced analysis of the feasibility of these new techniques, and it does a good job of rigorously talking about the pros and cons, and helps us build our understanding of when these methods can be expected to work well,” says Murali Narayanaswamy, a principal machine learning scientist at Amazon, who was not involved in this work. “Exploring these kinds of enhancements is an exciting area of research in both academia and industry, and the kind of rigor displayed in this work is critical for these methods to have a big impact.”
This work was supported, in part, by Google, Intel, Microsoft, the U.S. National Science Foundation, the U.S. Air Force Research Laboratory, and the U.S. Air Force Artificial Intelligence Accelerator.