The Bloom Filter was invented by Burton Bloom in 1970. The central intent of the algorithm is to answer one seemingly simple question:
“Is needle contained in haystack?”
Hmm... how about using a linear search, binary search, or hash lookup, you ask? Well, all are valid methods, but the Bloom Filter stands out in that it is extremely efficient at answering this question (especially if the set you are searching is really large).
There is another important difference. Most other search data structures store the data elements themselves. A Bloom filter DOES NOT STORE DATA! This makes it incredibly space efficient. It only stores an array of bits – 0’s and 1’s. One caveat is that the Bloom filter might return incorrect answers – also known as false positives. If needle is present in haystack, it will definitely return ‘true’. However, if we query for a non-existent element, it might incorrectly return ‘true’. For applications that make use of the Bloom filter, that degree of error is acceptable.
Google makes ample use of Bloom Filters. For example, they are used:
- In Google’s BigTable distributed storage system to minimize disk lookups
- In Google cache routing algorithms for knocking precious milliseconds off their retrieval time
- In Google Chrome browser for implementing “Safe browsing”
There is an informative video on Bloom filters on the Google Tech Talks YouTube channel.
Bloom filters basically support two operations: adding an element to the set, and querying whether an element is (probably) in the set.
Initially, the filter is set to an array of ‘m’ bits (say 20), all initialized to 0. (The actual value of ‘m’ is quite critical to the functioning of the Bloom filter, and depends on the number of elements in the haystack and the acceptable rate of errors/false positives.)
Internally, hash functions are used to map added elements to indices into the filter array. It is important to use high-quality hash functions in the filter to ensure that the output is uniformly distributed over the entire index space.
Let’s say that we use 3 hash functions (h1, h2 and h3) inside our algorithm, each of which generates numbers in the range (0..m-1), which is 0 to 19 in our case. There is some math and probability calculation that goes into determining the number of hash passes required and the optimal filter size, but we can skip that for the purposes of this tutorial.
(From an implementation perspective, note that libraries already exist for generating high quality hash values, and running multiple hash functions simply means using different salts/initialization values in a loop)
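That loop-over-salts idea can be sketched as follows. Note that the use of SHA-256 and this particular "salt:item" encoding are illustrative assumptions of mine, not a prescription; any high-quality hash family would do.

```python
import hashlib

def bloom_hashes(item, k=3, m=20):
    """Derive k hash values in 0..m-1 from one item by varying a salt.

    Sketch only: SHA-256 and the "salt:item" encoding are arbitrary
    illustrative choices, not part of the tutorial's own code.
    """
    indices = []
    for salt in range(k):
        digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
        indices.append(int(digest, 16) % m)  # fold into the filter's index space
    return indices
```

Each distinct salt effectively gives us an independent hash function, which is exactly the trick the parenthetical above describes.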
Further, let us assume that we add an element and the three hash functions return three positions, with h2 and h3 both happening to map to position 7. The bits at each of these positions/indexes are then set to ‘1’.
Note: A particular bit may be set to 1 multiple times due to hash collisions (like position 7 above, where both h2 and h3 map into it), but that is OK.
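The add operation described so far might look like this minimal sketch, using m = 20 and 3 hash functions as in the running example. The salted SHA-256 scheme and the names `bits`, `indices` and `add` are assumptions of mine:

```python
import hashlib

M = 20            # filter size from the example above
bits = [0] * M    # all bits start out as 0

def indices(item, k=3):
    # k salted hashes, each folded into 0..M-1 (illustrative scheme)
    return [int(hashlib.sha256(f"{salt}:{item}".encode()).hexdigest(), 16) % M
            for salt in range(k)]

def add(item):
    for i in indices(item):
        bits[i] = 1   # setting an already-set bit again is harmless
```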
A query is positive if and only if ALL bits returned by the hash functions are 1. If any of them is 0, then certainly the search term is not contained in the set.
For example, to check the existence of “bar” in our set:
Compute the 3 hashes of “bar”, and say we get values of 1, 4 and 5. Checking the bit at position 1 in the array, we see that it is a ‘0’, so “bar” CANNOT be present. We can return FALSE immediately; there is no need to check the other bits.
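Putting the add and query halves together, here is a self-contained sketch (same illustrative salted-hash scheme as above; `might_contain` is a name I chose to stress that a positive answer is only probable):

```python
import hashlib

M = 20
bits = [0] * M

def indices(item, k=3):
    # k salted hashes folded into 0..M-1 (illustrative, not canonical)
    return [int(hashlib.sha256(f"{salt}:{item}".encode()).hexdigest(), 16) % M
            for salt in range(k)]

def add(item):
    for i in indices(item):
        bits[i] = 1

def might_contain(item):
    # positive only if ALL k bits are set; a single 0 proves absence
    return all(bits[i] == 1 for i in indices(item))
```

With an empty filter, `might_contain` is always False; after adding “foo”, querying “foo” returns True, while querying a string that was never added usually (but not always) returns False.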
It is easy to see that the probability of false positives grows with the number of elements added to the set. As elements are added, more bits are flipped, which makes it more likely that all hashes of a non-existent value will map to cells containing a ‘1’, resulting in false positives. It is therefore imperative to estimate the size of the data set properly during initialization.
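That “estimate properly” step has well-known closed-form answers: for n expected elements and a target false-positive rate p, the textbook formulas are m = -n·ln(p) / (ln 2)² bits and k = (m/n)·ln 2 hash functions. A small helper to compute them (the function name is my own, and these formulas are the standard textbook result rather than anything specific to this tutorial):

```python
import math

def optimal_parameters(n, p):
    """Textbook Bloom filter sizing for n elements and a target
    false-positive probability p; returns (bit count m, hash count k)."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k
```

For example, 1,000 elements at a 1% error rate work out to roughly 9.6 kilobits and 7 hash functions.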