Intelligently transform a hash table into a probabilistic data structure to trade precision for large memory gains
The hash table is one of the most well-known and widely used data structures. With a smart choice of hash function, a hash table delivers constant-time performance for insert, lookup, and delete queries.
The main drawback of the hash table is the possibility of collisions. A standard way to avoid them is to increase the size of the hash table. While this approach works well in most cases, sometimes we are still constrained by the amount of available memory.
It is important to remember that a hash table always provides a correct answer to any query. It may suffer from collisions and be slow at times, but it always guarantees 100% correct answers. It turns out that some systems do not always need exact answers to queries, and this small loss of accuracy can be traded for improvements in other aspects of the system.
In this article, we will discover a clever data structure called the Bloom filter. In simple terms, it is a modified version of a standard hash table that trades a small decrease in precision for large memory savings.
The Bloom filter is organized as a Boolean array of size m. Initially, all of its elements are marked as 0 (false). In addition, k hash functions must be chosen; each takes an object as input and maps it to an integer in the range [0, m - 1]. Each output value then corresponds to the array element at that index.
For best results, it is recommended that hash functions generate values whose distribution is nearly uniform.
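As a sketch, this setup can be expressed in a few lines. Deriving k hash functions by salting a single base hash (SHA-256 here) is one common trick; the names and constants below are illustrative, not part of any canonical implementation.

```python
import hashlib

m, k = 1024, 4          # array size and number of hash functions (toy values)
bits = [False] * m      # the Boolean array, all elements initially 0 (false)

def hash_indices(obj: str) -> list[int]:
    """Map an object to k indices in [0, m - 1] via k salted hashes."""
    return [
        int(hashlib.sha256(f"{i}|{obj}".encode()).hexdigest(), 16) % m
        for i in range(k)
    ]
```

Because SHA-256 output is close to uniform, the resulting indices are spread nearly uniformly over the array, which matches the recommendation above.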
Insertion
Whenever a new object needs to be added, it is passed through the k predefined hash functions. For each output hash value, the array element at that index is set to 1 (true).
If an array element whose index was generated from a hash function has already been set to 1, then it simply stays at 1.
Basically, a 1 in an array element acts as partial proof that an object hashing to that index actually exists in the Bloom filter.
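A minimal insertion routine under the same kind of illustrative salted-hash scheme (the helper name and parameters are assumptions, not a standard API) could look like this:

```python
import hashlib

M, K = 64, 3            # toy array size and number of hash functions
bits = [False] * M

def _indices(obj: str) -> list[int]:
    # K salted hashes mapped into [0, M - 1]
    return [int(hashlib.sha256(f"{i}|{obj}".encode()).hexdigest(), 16) % M
            for i in range(K)]

def insert(obj: str) -> None:
    for idx in _indices(obj):
        bits[idx] = True   # bits that are already 1 simply stay 1
```

Note that re-inserting the same object is a no-op: the same indices come out of the hash functions, and the bits there are already set.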
Lookup
To check if an object exists, its k hash values are calculated. There can be two possible scenarios:
If there is at least one hash value for which the respective array element equals 0, the object definitely does not exist.
During insertion, an object is associated with several array elements that are all set to 1. If the object had actually been inserted into the filter, the hash functions would deterministically generate the same sequence of indices, all pointing to 1. Therefore, hitting an array element equal to 0 clearly means the object is not present in the data structure.
If for all hash values the respective array elements are equal to 1, the object probably exists (but not with 100% certainty).
This statement is exactly what makes the Bloom filter a probabilistic data structure. If an object was added earlier, then during a search the Bloom filter guarantees that its hash values will be the same, so the object will always be found.
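The two lookup scenarios can be sketched as follows. The snippet repeats the illustrative salted-hash helper so that it is self-contained:

```python
import hashlib

M, K = 64, 3
bits = [False] * M

def _indices(obj: str) -> list[int]:
    return [int(hashlib.sha256(f"{i}|{obj}".encode()).hexdigest(), 16) % M
            for i in range(K)]

def insert(obj: str) -> None:
    for idx in _indices(obj):
        bits[idx] = True

def lookup(obj: str) -> bool:
    # Any 0 among the K positions => definitely absent.
    # All 1 => probably present (false positives are possible).
    return all(bits[idx] for idx in _indices(obj))
```

An inserted object is always found, while a never-inserted object is usually, but not always, rejected.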
However, the Bloom filter can produce a false positive response: an object does not actually exist, but the Bloom filter claims otherwise. This happens when all of the object's hash functions point to positions that were set to 1 by other objects already inserted into the filter.
False positive responses tend to occur when the number of inserted objects becomes relatively high compared to the size of the Bloom filter array.
Estimating false positive errors
It is possible to estimate the probability of obtaining a false positive error given the structure of the Bloom filter. For n inserted objects, an array of size m, and k hash functions, the probability of a false positive is approximately p ≈ (1 - e^(-kn/m))^k.
The full proof of this formula can be found on Wikipedia. Based on that expression, we can make a couple of interesting observations:
- The probability of a false positive decreases as the array size m grows and as the number of inserted objects n shrinks. Increasing the number of hash functions k also reduces it, but only up to an optimal value of k.
- Before inserting objects into the Bloom filter, we can find the optimal number of hash functions k that minimizes the probability of a false positive, provided we know the array size m and can estimate the number of objects n to be inserted: the optimum is k = (m / n) · ln 2.
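Both observations can be checked numerically with the standard approximation p ≈ (1 - e^(-kn/m))^k and the optimal k = (m / n) · ln 2 (function names are mine):

```python
import math

def fp_rate(n: int, m: int, k: int) -> float:
    # Approximate false positive probability for n inserted objects,
    # an array of size m, and k hash functions.
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_k(n: int, m: int) -> int:
    # The k that minimizes fp_rate for fixed n and m: (m / n) * ln 2
    return max(1, round(m / n * math.log(2)))
```

For example, with m = 10,000 bits and n = 1,000 objects, the optimal number of hash functions is about 10 · ln 2 ≈ 7.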
Another option to reduce the probability of a false positive is to combine (AND conjunction) several independent Bloom filters: an element is considered present in the data structure only if it is present in all of the Bloom filters.
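A sketch of this AND combination, using a tiny illustrative filter class (the seed parameter keeps the filters independent; all names here are assumptions):

```python
import hashlib

class TinyBloom:
    """Minimal illustrative Bloom filter; not production code."""

    def __init__(self, m: int, k: int, seed: int) -> None:
        self.m, self.k, self.seed = m, k, seed
        self.bits = [False] * m

    def _indices(self, obj: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{self.seed}|{i}|{obj}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, obj: str) -> None:
        for idx in self._indices(obj):
            self.bits[idx] = True

    def contains(self, obj: str) -> bool:
        return all(self.bits[idx] for idx in self._indices(obj))

def combined_contains(obj: str, filters) -> bool:
    # AND conjunction: present only if every independent filter agrees.
    return all(f.contains(obj) for f in filters)
```

A false positive now requires every filter to report a false positive simultaneously, which is far less likely than for any single filter, at the cost of extra memory and hashing.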
Limitations
- Unlike hash tables, the standard implementation of a Bloom filter does not support deletion.
- The number of hash functions k and the array size m are chosen at the beginning and cannot be changed later. If such a need arises, the only option is to create a new Bloom filter with the new settings and re-insert all previously stored objects.
According to Wikipedia, the Bloom filter is widely used in large systems:
- Databases like Apache HBase, Apache Cassandra, and PostgreSQL use Bloom filters to check for non-existent rows or columns. This approach is considerably faster than performing disk lookups.
- Medium uses the Bloom filter to filter out pages that have already been recommended to a user.
- Google Chrome used the Bloom filter in the past to identify malicious URLs. A URL was considered safe if the Bloom filter returned a negative response. Otherwise, full verification was performed.
In this article, we covered an alternative approach to building hash tables. When a small decrease in precision can be tolerated in exchange for more efficient memory use, the Bloom filter proves to be a solid solution in many distributed systems.
Varying the number of hash functions along with the size of the Bloom filter array allows us to find the most appropriate balance between precision and performance requirements.
All images, unless otherwise noted, are the author's.