As part of a discussion of clustering and canonicalization in Google Search today, Allan Scott from Google explained what he called "marauding black holes" in Google Search. These form when Google's clustering takes in some error pages, and those pages end up in a black hole of sorts in Google Search.
This came up in the excellent Search Off The Record interview of Allan Scott from the Google Search team, who works specifically on duplication within Google Search. Martin Splitt and John Mueller from Google interviewed Allan.
Allan explained that these "marauding black holes" happen because of how error pages interact with clustering in some cases. Allan said, "Error pages and clustering have an unfortunate relationship where undetected error pages just get a checksum like any other page would, and then cluster by checksum, and so error pages tend to cluster with each other. That makes sense at this point, right?"
Martin Splitt from Google summed it up with an example, "Is that these cases where you have like a website that has, I don't know, like 20 products that are no longer available and they have like replaced it with this item is no longer available. It's kind of an error page, but it doesn't serve as an error page because it serves as an HTTP 200. But then the content is all the same, so the checksums will be all the same. And then weird things happen, right?"
I think this means Google treats those error pages as the same page, because their checksums are all the same.
What is a checksum? A checksum is a small-sized block of data derived from another block of digital data for the purpose of detecting errors that may have been introduced during its transmission or storage. By themselves, checksums are often used to verify data integrity but are not relied upon to verify data authenticity.
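To make the clustering idea concrete, here is a minimal sketch of grouping pages by checksum. It is not Google's implementation: the SHA-256 hash, the function names, and the sample URLs and page bodies are all assumptions for illustration, and Google has not said which checksum it uses or how it normalizes content first.

```python
import hashlib
from collections import defaultdict


def checksum(body: str) -> str:
    # SHA-256 is an assumption; any content hash shows the same effect.
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def cluster_by_checksum(pages: dict[str, str]) -> list[list[str]]:
    # Pages whose bodies hash to the same value land in the same cluster.
    clusters: dict[str, list[str]] = defaultdict(list)
    for url, body in pages.items():
        clusters[checksum(body)].append(url)
    return list(clusters.values())


# Hypothetical site: two "gone" products serve identical content with a 200.
pages = {
    "/products/1": "<h1>Blue widget</h1>",
    "/products/2": "<h1>This item is no longer available</h1>",
    "/products/3": "<h1>This item is no longer available</h1>",
    "/products/4": "<h1>Red widget</h1>",
}

clusters = cluster_by_checksum(pages)
for group in clusters:
    print(group)
```

The two "no longer available" URLs end up in one group even though they are different products, which is the situation Martin describes.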
Back to Allan, he responded to Martin saying, "So that's a good example. Yes, that that is exactly what I'm talking about. Now, in that case, the webmaster might not be too concerned because these products, if they're permanently gone, then they want them gone, so it's not a big deal. Now, if they're temporarily gone though, this is a problem because now they've all been sucked into this cluster. They're probably not coming back out because crawl really doesn't like dups. They're like, "Oh, that page is a dup. Forget it. I never need to crawl it again." That's why it's a black hole."
It goes into this black hole where Google might not ever look at that page again. Well, maybe not forever.
Allan said, "only the things that are very towards the top of the cluster are likely to get back out."
So why is Allan talking about this? He said, "where this really worries me is sites with transient errors, like what you're describing there is sort of a like an intentional transient error." "Well, one out of every thousand times, you're going to serve us your error. Now you've got a marauding black hole of dead pages. It gets worse because you're also serving a bunch of JavaScript dependencies," he added.
Here is more back and forth with Allan and Martin on this:
Allan:
If those fail to fetch, they might break your render, in which case we'll look at your page, and we'll think it's broken. The actual reliability of your page, after it's gone through those steps, is not necessarily very high. We have to worry a lot about getting these kinds of marauding black hole clusters from taking over a site because stuff just gets dumped in them, like there were social media sites where I would look at the, you know, the most prominent profiles, and they would just have reams of pages underneath them, some of them fairly high profile themselves that just did not belong in that cluster.
Martin:
Oh, boy. Okay. Yeah. I've seen something like that when someone was A/B testing a new version of their website, and then certain links would break with error messages because the API had changed and the calls no longer worked or something like that. And then, in like 10% of the cases, you would get like an error message for pretty much all of their content. Yeah, getting back out of that was tricky I guess.
John Mueller brought up the cases where this can be an issue with CDNs:
I've also seen something that I assume is similar to this where, if a site has some kind of a CDN in front of it where the CDN does some kind of bot detection or DDoS detection and then serves something like, "Oh, it looks like you're a bot," and Googlebot is, "Yes, I'm a bot." But then all of those pages, I guess, end up being clustered together and probably across multiple sites, right?
Allan confirmed and said Gary Illyes from Google has been working on this here and there:
Yes, basically. Gary has actually been doing some outreach for us on this subject. You know, we come across instances like this, and we do try to get providers of these sorts of services to work with us, or at least work with Gary. I don't know what he does with them. He's in charge of that. But not all of them are as cooperative. So that's something to be aware of.
So how do you stay out of these Google black holes? Allan said, "The easiest way is to serve correct HTTP codes so, you know, send us a 404 or a 403 or a 503. If you do that, you're not going to cluster. We can only cluster pages that serve a 200. Only 200s go into black holes."
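As a rough illustration of serving correct HTTP codes, here is a small sketch using Python's standard http.server module. The catalog, the product states, and the names status_for and ProductHandler are hypothetical; the choice of 404 for permanently gone products and 503 for temporarily unavailable ones is my reading of Allan's advice, not something he spelled out per case.

```python
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler

# Hypothetical catalog mapping product id to its current state.
CATALOG = {
    "1": "available",
    "2": "discontinued",
    "3": "temporarily_unavailable",
}


def status_for(product_id: str) -> HTTPStatus:
    # Pick the status code that matches what actually happened.
    state = CATALOG.get(product_id)
    if state == "available":
        return HTTPStatus.OK
    if state == "temporarily_unavailable":
        return HTTPStatus.SERVICE_UNAVAILABLE  # 503: expected to come back
    return HTTPStatus.NOT_FOUND  # 404: discontinued or unknown


class ProductHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Treat the last path segment as the product id, e.g. /products/3.
        product_id = self.path.rstrip("/").rsplit("/", 1)[-1]
        status = status_for(product_id)
        body = f"{status.value} - {status.phrase}".encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, pass ProductHandler to http.server.HTTPServer and call serve_forever(). The point is only that the "item is no longer available" page goes out with a non-200 status, so by Allan's description it would not be clustered.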
Allan also described a second option:
The other option here is, if you are doing JavaScript foo, in which case you might not be able to send us an HTTP code. Might be a little too late for that. What you can do there is you can attempt to service an actual error message, something that is very discernibly an error like, you know, you could literally just say, you know, 503 - we encountered a server error or 403 - you were not authorized to view this or 404 - we could not find the correct file. Any of those things would work. You know, you don't even need to use HTTP code. Obviously, you could just say something. Well, we have a system that's supposed to detect error pages, and we want to improve its recall beyond what it currently does to try to tackle some of these bad renders and these bot-served pages type things. But, in the meantime, it's generally safest to take things into your own hands and try to make sure that Google understands your intent as well as possible.
They go on and on about this, and it all starts at around the 16:22 mark. Here is the video embed:
Forum discussion at X.