Bloom filters are a data structure developed by Burton Howard Bloom in 1970. You can see them as a hash tables’ cousin. They also allow for efficient insert and lookup operations while occupying very little space. The price to pay is false positives. The probability of a false positive is a function of the size of the filter and the number of hash functions, as we will see in the next section.
In contrast with hash tables, Bloom filters do not store objects. They only remember whether an object has been added to the filter. In this regard, they behave more like sets. Furthermore, they do not allow deletions.
After introducing the ideas behind a Bloom FIlter and some example applications, I will focus on how they are used by Bitcoin light clients.
You can implement a Bloom filter as an array of
To add a new element to the filter
Similarly, you iterate through all the hash functions. If an element is in the set, the output of all the hash functions will be indices where the bit is set to 1. If any of the indices is set to 0, the element is not part of the filter.
Consider the scenario depicted in the image above. The filter is an array of 10 elements and we are using two hash functions, h1 and h2. We initialize the array to
To check if a word is part of the filter, we check the entries corresponding to the output of all hash functions. For
Suppose we want to see if
Now, let’s lookup
Now, let’s imagine
Let’s analyze the probability of a false positive.
Let’s assume we have a dataset S, k hash functions, and a filter F of n bits. After inserting all the elements in S, the probability that any bit is still set to 0 is
where
Therefore, the probability of a given bit is 1 after all elements have been inserted in F is
We can approximate this using the exponential function since
where
Let’s take an element X that does not belong to the filter. Let’s also assume that each hash function will produce a different output for X. For X to be a false positive, its
We can fix
Thus, the probability of false positives as a function of the number of bits per object is
We can also fix this probability of false positives (for instance, to 1%) and solve for
If we wish to insert
Before diving into Bitcoin, let’s consider some other applications of Bloom Filters.
This was one of its earliest applications. You could create a filter and insert all words in a dictionary. When you want to check if a word is correctly spelled, check if it is in the filter. If it is not, you know it is not properly spelled.
If the word belongs to the filter, there is a tiny probability that it is misspelled (false positive).
Imagine you have a huge userbase, like Twitter. When a new user signs up, he has to choose a Twitter handle.
Looking up in the database if the handle is available is slower than using memory. We could add all existing Twitter handles to a hash set, but that would take a lot of space. Since we can afford false positives, we can give Bloom Filters a try.
Firstly, you insert into the Bloom filter the handles that are already in use. Over time, you can add more as users sign up. When a client tries to register a certain handle, you look it up in the Bloom filter. If you get a “False” result, you can let the user pick that handle.
If you get a “True” result, either the handle is unavailable or we have a false positive. In this situation, we can check in the database where we store all handles in use.
As a result, Bloom Filters would reduce the number of lookups in the database.
Elements cannot be deleted from a Bloom Filter, only inserted. Therefore, if closing an account makes the handle available again, we need to periodically recreate the Filter.
We can apply similar ideas to a system that needs to send network calls to another system. For example, Alexa uses blacklists to ensure it does not show certain types of information. Imagine a service B that keeps a database of items to be blacklisted. Service A makes lots of calls to service B.
To reduce the number of calls, service A could use a Bloom Filter, adding all entities from the blacklist. A negative result from the lookup means the element is not on the blacklist and therefore safe to process. A positive result means that the item might be on the blacklist. To avoid false positives, we can make a call to service B.
The Bloom Filter needs to be synced periodically with the database in service B.
In conclusion, using a Bloom Filter reduces the number of network calls. On the other hand, it adds complexity. Nonetheless, depending on the requirements, it could be a trade-off worth considering.
To sum up, Bloom Filters are a good solution for applications that:
Otherwise, hash tables or hash sets might be a better alternative. However, what if you could leverage false positives?
In my previous article about Merkle trees, I introduced the concept of light nodes. Light nodes only download block headers and query other nodes for a Merkle path to verify transactions.
Imagine a light node that wants to receive transactions to go to addresses contained in its wallet. The node could talk to full nodes asking for transactions of interest. However, the rest of the nodes would know what addresses this light node is interested in, which is a concern from a privacy perspective. How can Bloom filters help with this?
Bloom filters are a data structure that allows for a probabilistic check of membership. You cannot tell with 100% probability if an element present in a Bloom filter is really there or a false positive. Therefore, light nodes can use them to ask their peers for transactions, without revealing exactly what information the light node is interested in.
Light nodes will create a Bloom filter and insert into it the addresses that its wallets contain. When a light client connects to a full node, it sends this Bloom filter. In turn, whenever the full node receives a transaction, it checks if its input or output addresses match the filter. Peers will only send transactions that match the filter.
To be more precise, they send the block header and the Merkle path of that transaction to the root of the tree. Using the Merkle path, the node can verify that the transaction belongs to the block. Use the block header, the node can link this block to the rest of the blockchain. All this using an infinitesimal fraction of the space that the full blockchain requires.
A light node will discard false positives and use the rest transactions to update its UTXO set. UTXO stands for Unspent Transaction Output and the UTXO set is “the way Bitcoin keeps track of who owns what”. I will cover this concept in more detail in an upcoming article about blocks and transactions.
Using Bloom Filters, a light node does not need to reveal to the rest of the participants which addresses it is interested in. This would not be a security leak per se, but a privacy leak that could end up compromising security.
For more information, have a look at this paper.
Instead of using
The following equation defines the ” k different hash functions”:
where
You can choose two parameters to define your Bloom Filter:
which are familiar to you now from the previous sections.
I hope now you have a better understanding of Bloom Filters, their design trade-offs, and how they can be used in different contexts. As a next step, I recommend you check Bitcoin’s source code and BIP0037, which defines the specifications for Bloom Filters in light nodes.
I hope you found this article helpful. Share it, because you could help someone pass their exams or get a job.
You have probably heard lately the terms tokens, crypto tokens, and cryptocurrencies. In this article,…
Here you are going to see how to approach coding challenges like a pro and…
UTXO vs Account-Based Blockchains Introduction Bitcoin and Ethereum differ in many ways. In this article,…
This guide is a summary of my study notes for the Certified Kubernetes Application Developer…
Merkle trees, also known as binary hash trees, are a type of binary tree. They…
In this article, I gave you an introduction to Dynamic Programming with several examples. Here…