Intelligently transform a hash table into a probabilistic data structure to trade precision for large memory gains
The hash table is one of the most well-known and widely used data structures. With a smart choice of hash function, a hash table delivers constant-time performance for insert, lookup, and delete queries.
The main drawback of the hash table is the possibility of collisions. A standard way to avoid them is to increase the size of the hash table. While this approach works well in most cases, sometimes we are still constrained by the amount of available memory.
It is important to remember that a hash table always provides a correct answer to any query. It may suffer from collisions and be slow at times, but it always guarantees 100% correct answers. It turns out that some systems do not always need exact answers to queries, and this small loss of accuracy can be traded for improvements in other aspects of the system.
In this article, we will discover a clever data structure called the Bloom filter. In simple terms, it is a modified version of a standard hash table that trades a small decrease in precision for large memory savings.
The Bloom filter is organized as a Boolean array of size m. Initially, all of its elements are marked as 0 (false). In addition, k hash functions must be chosen; each takes an object as input and maps it to an integer in the range [0, m - 1]. Each output value then corresponds to the array element at that index.
For best results, it is recommended that hash functions generate values whose distribution is nearly uniform.
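As a sketch, this setup can be expressed in a few lines. Deriving k hash functions by salting a single base hash (SHA-256 here) is one common trick; the names and constants below are illustrative, not part of any canonical implementation.

```python
import hashlib

m, k = 1024, 4          # array size and number of hash functions (toy values)
bits = [False] * m      # the Boolean array, all elements initially 0 (false)

def hash_indices(obj: str) -> list[int]:
    """Map an object to k indices in [0, m - 1] via k salted hashes."""
    return [
        int(hashlib.sha256(f"{i}|{obj}".encode()).hexdigest(), 16) % m
        for i in range(k)
    ]
```

Because SHA-256 output is close to uniform, the resulting indices are spread nearly uniformly over the array, which matches the recommendation above.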
Insertion
Whenever a new object needs to be added, it is passed through the k predefined hash functions. For each output hash value, the array element at that index is set to 1 (true).
If an array element whose index was generated from a hash function has already been set to 1, then it simply stays at 1.
Basically, a 1 in an array element acts as partial proof that an object hashing to that index actually exists in the Bloom filter.
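A minimal insertion routine under the same kind of illustrative salted-hash scheme (the helper name and parameters are assumptions, not a standard API) could look like this:

```python
import hashlib

M, K = 64, 3            # toy array size and number of hash functions
bits = [False] * M

def _indices(obj: str) -> list[int]:
    # K salted hashes mapped into [0, M - 1]
    return [int(hashlib.sha256(f"{i}|{obj}".encode()).hexdigest(), 16) % M
            for i in range(K)]

def insert(obj: str) -> None:
    for idx in _indices(obj):
        bits[idx] = True   # bits that are already 1 simply stay 1
```

Note that re-inserting the same object is a no-op: the same indices come out of the hash functions, and the bits there are already set.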
Lookup
To check if an object exists, its k hash values are calculated. There can be two possible scenarios:
If there is at least one hash value for which the respective array element equals 0, the object definitely does not exist.
During insertion, an object is associated with several array elements that are all set to 1. If the object had actually been inserted into the filter, the hash functions would deterministically generate the same sequence of indices, all pointing to 1. Therefore, hitting an array element equal to 0 clearly means the object is not present in the data structure.
If for all hash values the respective array elements are equal to 1, the object probably exists (but not with 100% certainty).
This statement is exactly what makes the Bloom filter a probabilistic data structure. If an object was added earlier, then during a search the Bloom filter guarantees that its hash values will be the same, so the object will always be found.
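The two lookup scenarios can be sketched as follows. The snippet repeats the illustrative salted-hash helper so that it is self-contained:

```python
import hashlib

M, K = 64, 3
bits = [False] * M

def _indices(obj: str) -> list[int]:
    return [int(hashlib.sha256(f"{i}|{obj}".encode()).hexdigest(), 16) % M
            for i in range(K)]

def insert(obj: str) -> None:
    for idx in _indices(obj):
        bits[idx] = True

def lookup(obj: str) -> bool:
    # Any 0 among the K positions => definitely absent.
    # All 1 => probably present (false positives are possible).
    return all(bits[idx] for idx in _indices(obj))
```

An inserted object is always found, while a never-inserted object is usually, but not always, rejected.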
However, the Bloom filter can produce a false positive response: an object does not actually exist, but the Bloom filter claims otherwise. This happens when all of the object's hash functions point to positions that were set to 1 by other objects already inserted into the filter.
False positive responses tend to occur when the number of inserted objects becomes relatively high compared to the size of the Bloom filter array.
Estimating false positive errors
It is possible to estimate the probability of obtaining a false positive error given the structure of the Bloom filter. For n inserted objects, an array of size m, and k hash functions, the probability of a false positive is approximately p ≈ (1 - e^(-kn/m))^k.
The full proof of this formula can be found on Wikipedia. Based on that expression, we can make a couple of interesting observations:
- The probability of a false positive decreases as the array size m grows and as the number of inserted objects n shrinks. Increasing the number of hash functions k also reduces it, but only up to an optimal value of k.
- Before inserting objects into the Bloom filter, we can find the optimal number of hash functions k that minimizes the probability of a false positive, provided we know the array size m and can estimate the number of objects n to be inserted: the optimum is k = (m / n) · ln 2.
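Both observations can be checked numerically with the standard approximation p ≈ (1 - e^(-kn/m))^k and the optimal k = (m / n) · ln 2 (function names are mine):

```python
import math

def fp_rate(n: int, m: int, k: int) -> float:
    # Approximate false positive probability for n inserted objects,
    # an array of size m, and k hash functions.
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_k(n: int, m: int) -> int:
    # The k that minimizes fp_rate for fixed n and m: (m / n) * ln 2
    return max(1, round(m / n * math.log(2)))
```

For example, with m = 10,000 bits and n = 1,000 objects, the optimal number of hash functions is about 10 · ln 2 ≈ 7.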
Another option to reduce the probability of a false positive is to combine (AND conjunction) several independent Bloom filters: an element is considered present in the data structure only if it is present in all of the Bloom filters.
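A sketch of this AND combination, using a tiny illustrative filter class (the seed parameter keeps the filters independent; all names here are assumptions):

```python
import hashlib

class TinyBloom:
    """Minimal illustrative Bloom filter; not production code."""

    def __init__(self, m: int, k: int, seed: int) -> None:
        self.m, self.k, self.seed = m, k, seed
        self.bits = [False] * m

    def _indices(self, obj: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{self.seed}|{i}|{obj}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, obj: str) -> None:
        for idx in self._indices(obj):
            self.bits[idx] = True

    def contains(self, obj: str) -> bool:
        return all(self.bits[idx] for idx in self._indices(obj))

def combined_contains(obj: str, filters) -> bool:
    # AND conjunction: present only if every independent filter agrees.
    return all(f.contains(obj) for f in filters)
```

A false positive now requires every filter to report a false positive simultaneously, which is far less likely than for any single filter, at the cost of extra memory and hashing.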
Limitations
- Unlike hash tables, the standard implementation of a Bloom filter does not support deletion.
- The number of hash functions k and the array size m are chosen at the beginning and cannot be changed later. If such a need arises, the only option is to create a new Bloom filter with the new settings and re-insert all previously stored objects.
According to Wikipedia, the Bloom filter is widely used in large systems:
- Databases like Apache HBase, Apache Cassandra, and PostgreSQL use Bloom filters to check for non-existent rows or columns. This approach is considerably faster than performing disk lookups.
- Medium uses the Bloom filter to filter out pages that have already been recommended to a user.
- Google Chrome used the Bloom filter in the past to identify malicious URLs. A URL was considered safe if the Bloom filter returned a negative response. Otherwise, full verification was performed.
In this article, we covered an alternative approach to building hash tables. When a small decrease in precision can be tolerated in exchange for more efficient memory use, the Bloom filter proves to be a solid solution in many distributed systems.
Varying the number of hash functions along with the size of the Bloom filter array allows us to find the most appropriate balance between precision and performance requirements.
All images, unless otherwise noted, are the author's.