I need to create a bloom filter of 208 million URLs. What would be a good choice
ID: 649165 • Letter: I
Question
I need to create a bloom filter of 208 million URLs. What would be a good choice of bit vector size and number of hash functions? I tried a bit vector of size 1 GB and 4 hash functions, but it resulted in too many false positives while reading.
I have a huge web corpus containing web content of billions of URLs. I need to process the web content of URLs satisfying certain criteria: the URL should have appeared in web search results in the past 7 days at least 5 times. This is represented by a list of 208 million URLs. Joining the list directly with the web corpus is not feasible because of volume. So I am considering creation of a bloom filter out of the list and then using the bloom filter to prune out unnecessary URLs from the web corpus
Explanation / Answer
Using the formula from wikipedia for Bloom filter false positives, your proposal would have a false positive probability of about 0.00726%. This assumes, among other things, that good hash functions are used. The formula is:
(1?(1?[1/m])kn)k
where m is the number of bits in the filter, k is the number of hash functions and n is the number of entries in the filter.
Because items cannot be removed from a typical Bloom filter, if generation of the filter is too expensive, you might consider a counting Bloom filter to allow deletions.
(Although I have not read of it being used and do not know if it would be effective, one might use ORed signatures in each field instead of a set bit, where each signature has half or fewer of the bits set. A possible match is found when ((entry_signature ^ test_signature) & test_signature) == 0 for each entry selected by a hash function. If every signature has the same number of set bits, this would be like nesting a Bloom filter with a size equal to the signature size and the number of hash functions equal to the number of set bits.)
Related Questions
drjack9650@gmail.com
Navigate
Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.