Gary Illyes from Google was asked why the filtered data is sometimes higher than the overall data within Google Search Console. In his answer, Gary explained how the filtering works: specifically, it uses a "Bloom filter."
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set.
Gary said Google uses Bloom filters because they are a fast and storage-efficient way to look things up across a very large amount of data.
Gary said at the 1:13 mark into the Google SEO office hours video, "The short answer is that we make heavy use of something called Bloom filters because we need to handle a lot of data and Bloom filters can save us lots of time and basically storage."
He added, "The long answer is still that we make heavy use of Bloom filters because, again, we need to handle a lot of data but I also want to say a few words about Bloom filters. When you handle a large number of items in a set, and I mean billions of items if not trillions, sometimes looking up things fast becomes super hard. This is where Bloom filters come in handy. They allow you to consult a different set that contains a hash of possible items in the main set, and you look up the data there in your smaller set since you are looking up hashes first."
"It’s pretty fast, but hashing sometimes comes with data loss, either purposefully or not. And this missing data is what you're experiencing. Less data to go through means more accurate predictions about whether something exists in the main set or not. Basically, Bloom filters to speed up lookups by predicting if something exists in a data set but at the expense of accuracy, and the smaller the data set is, the more accurate the predictions are," he added.
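To make Gary's description a bit more concrete, here is a minimal Python sketch of a textbook Bloom filter. It is an illustration only and not Google's implementation: the `BloomFilter` class, the `false_positive_rate` helper, the bit-array size, the number of hashes, and the use of salted SHA-256 are all assumptions chosen for readability. The sketch shows the trade-off he mentions: with the filter size held fixed, the more items you add, the more often the filter wrongly reports that an absent item might be present.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a fixed-size bit array plus several hash positions per item."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability only

    def _positions(self, item: str):
        # Derive num_hashes positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely never added"; True only means "possibly added".
        return all(self.bits[pos] for pos in self._positions(item))


def false_positive_rate(num_items: int, num_bits: int = 10_000,
                        num_hashes: int = 4, probes: int = 2_000) -> float:
    """Fill a filter with num_items strings, then probe it with strings never added."""
    bf = BloomFilter(num_bits, num_hashes)
    for n in range(num_items):
        bf.add(f"query-{n}")
    hits = sum(bf.might_contain(f"absent-{n}") for n in range(probes))
    return hits / probes


if __name__ == "__main__":
    # Same filter size, growing data set: the measured error rate climbs.
    for n in (500, 2_000, 8_000):
        print(n, false_positive_rate(n))
```

One thing worth noting about the textbook structure: it can answer "maybe present" for an item that was never added, but it never answers "absent" for an item that was added. How Search Console's pipeline turns that kind of approximation into the numbers you see in reports is not something this sketch tries to model.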
Here is the video embed at the start time:
Oh, the jokes on the Google Bloom filter have begun:
Forum discussion at Twitter.