A bloom filter is a probabilistic data structure that allows you to efficiently test an element for probable set membership. It supports the operations add(item) and contains(item), where add performs as expected and contains either yields a definite no or a probable yes, meaning false positives are possible but false negatives are not.

A newly initialized bloom filter is an all-zeros array of length \(m\) along with \(k\) different uniform hash functions, which hash set items to positions in the array.

How to add an item

To add an element \(x\) to a bloom filter simply hash it with all the different hash functions and set the array at those resulting indices \(h_1(x), h_2(x), \ldots, h_k(x)\) to \(1\).

How to test membership of an item

To test whether an item has previously been added to the bloom filter simply hash it with all the hash functions and check whether the array is all-ones at those indices. If it is not then the item definitely hasn't been added since those indices would have been set to ones. If all the indices are set to one the item may have been added or it may be a coincidence and those indices were set to ones while adding one or more other items.

Applications of Bloom Filters

A bloom filter is a fast way to determine whether an element does or does not belong to a set. Bloom filters can be used to optimize disk lookups, database or web requests, improve caching systems, search chemical structures and much more. Amongst others they have been applied at Medium to avoid recommending previously viewed articles to users, at Google to identify malicious URLs and at Bing to improve their page index.[1]
The fact that bloom filters sometimes return false positives is usually not a problem. For example when optimizing database or web requests one might want to use a bloom filter to determine whether an item is actually available there before sending a time-consuming lookup request. In this case a false positive would mean that the bloom filter says the item is stored in the database when it in reality is not. Therefore the false positive would only cause an unnecessary web request but no information would be lost.

How to choose parameters \(m\), \(k\) and the hash functions

When creating a bloom filter, one needs to specify the array size \(m\) and the number of hash functions \(k\). These together determine the expected false positive rate \(\lambda\) for the bloom filter after adding \(n\) items to it. Therefore the parameters should be chosen with the required false positive rate in mind.

  1. Estimate the approximate total size of the set \(n\).
  2. Choose an array size \(m\), for example \(10n\).
  3. Set the number of hash functions, \(k\) to \(\frac{m}{n}ln(2)\).
  4. Decide if the false positive rate \(\lambda = (1-e^{-kn/m})^k\) is acceptable, if not, go back and adjust \(m\).
The hash functions should be independent and uniformally distributed but do not need to be cryptographically secure. A well-suited hash function would be FNV or FNV-1 for example. [2]

How to use a Bloom Filter in Python

To use a bloom filter in Python you could either implement it yourself or use a publicly available library such as bloom-filter (Pypi).
from bloom_filter import BloomFilter

# instantiate BloomFilter with custom settings,
# max_elements is how many elements you expect the filter to hold.
# error_rate defines accuracy; You can use defaults with
# BloomFilter() without any arguments. Following example
# is same as defaults:
bloom = BloomFilter(max_elements=10000, error_rate=0.1)

# Test whether the bloom-filter has seen a key:
print("test-key" in bloom)

# Mark the key as seen
bloom.add("test-key")

# Now check again
print("test-key" in bloom)
(Example is modified from project readme)

References


  1. Wikipedia - Bloom Filter
  2. Modern, high performance bloom filter in Python?