Academic Integrity: tutoring, explanations, and feedback — we don’t complete graded work or submit on a student’s behalf.

I need to create a bloom filter of 208 million URLs. What would be a good choice

ID: 649165 • Letter: I

Question

I need to create a bloom filter of 208 million URLs. What would be a good choice of bit vector size and number of hash functions? I tried a bit vector of size 1 GB and 4 hash functions, but it resulted in too many false positives while reading.

I have a huge web corpus containing web content of billions of URLs. I need to process the web content of URLs satisfying certain criteria: the URL should have appeared in web search results in the past 7 days at least 5 times. This is represented by a list of 208 million URLs. Joining the list directly with the web corpus is not feasible because of volume. So I am considering creation of a bloom filter out of the list and then using the bloom filter to prune out unnecessary URLs from the web corpus

Explanation / Answer

Using the formula from wikipedia for Bloom filter false positives, your proposal would have a false positive probability of about 0.00726%. This assumes, among other things, that good hash functions are used. The formula is:

(1?(1?[1/m])kn)k

where m is the number of bits in the filter, k is the number of hash functions and n is the number of entries in the filter.

Because items cannot be removed from a typical Bloom filter, if generation of the filter is too expensive, you might consider a counting Bloom filter to allow deletions.

(Although I have not read of it being used and do not know if it would be effective, one might use ORed signatures in each field instead of a set bit, where each signature has half or fewer of the bits set. A possible match is found when ((entry_signature ^ test_signature) & test_signature) == 0 for each entry selected by a hash function. If every signature has the same number of set bits, this would be like nesting a Bloom filter with a size equal to the signature size and the number of hash functions equal to the number of set bits.)

Hire Me For All Your Tutoring Needs
Integrity-first tutoring: clear explanations, guidance, and feedback.
Drop an Email at
drjack9650@gmail.com
Chat Now And Get Quote