Google Goes Deep On Dupe Detection & Canonicalization

Nov 4, 2020 - 7:21 am


This morning, our Google friends John Mueller, Martin Splitt, Gary Illyes, and Lizzi Harvey (Google's technical writer) posted a new podcast episode. It was fun to listen to, and in it Gary Illyes went deep on how Google handles duplicate content detection (dupe detection) and then canonicalization. They are not the same thing.

The short version is that Google creates a checksum for each page, which is basically a fingerprint of the document based on the words on the page. If two pages have the same checksum, Google treats them as duplicates of each other. More generally, a checksum is a small piece of data derived from a block of digital data for the purpose of detecting errors that may have been introduced during its transmission or storage. Checksums are often used to verify data integrity but are not relied upon to verify data authenticity.

Dupe detection and canonicalization are not the same thing. Gary said, "first you have to detect the dupes, basically cluster them together, saying that all of these pages are dupes of each other, and then you have to basically find a leader page for all of them." He added, "And that is canonicalization. So, you have the duplication, which is the whole term, but within that you have cluster building, like dupe cluster building, and canonicalization."

How does dupe detection work? Gary said "for dupe detection what we do is, well, we try to detect dupes. And how we do that is perhaps how most people at other search engines do it, which is, basically, reducing the content into a hash or checksum and then comparing the checksums. And that's because it's much easier to do that than comparing perhaps 3,000 words, which is the minimum to rank well in any search engine."

The 3,000-word figure was a joke, and they riffed on it for a bit.

Gary goes on to explain "we are reducing the content into a checksum. And we do that because we don't want to scan the whole text, because it just doesn't make sense, essentially. It takes more resources and the result would be pretty much the same. So, we calculate multiple kinds of checksums about the textual content of the page and then we compare the checksums."
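To make the checksum comparison concrete, here is a minimal sketch of grouping pages into dupe clusters by fingerprint. It is only an illustration: Gary did not say which hash functions Google uses, so SHA-256, the whitespace and case normalization, and the function names `checksum` and `build_dupe_clusters` are all assumptions.

```python
import hashlib
from collections import defaultdict


def checksum(text: str) -> str:
    """Reduce page text to a fixed-size fingerprint (SHA-256 is an assumption)."""
    # Normalize case and whitespace so trivial formatting differences do not matter.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_dupe_clusters(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose text produces the same checksum."""
    clusters = defaultdict(list)
    for url, text in pages.items():
        clusters[checksum(text)].append(url)
    # Only groups with more than one URL count as dupe clusters.
    return [sorted(urls) for urls in clusters.values() if len(urls) > 1]


# Hypothetical pages: /a and /b differ only in case and spacing.
pages = {
    "https://example.com/a": "Blue widgets are great.",
    "https://example.com/b": "blue   widgets are GREAT.",
    "https://example.com/c": "Red widgets are fine.",
}
print(build_dupe_clusters(pages))
# [['https://example.com/a', 'https://example.com/b']]
```

Comparing short fingerprints instead of full text is what keeps this cheap: each page is hashed once, and finding its cluster is a dictionary lookup.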

It is not just exact duplicates that get caught but near-duplicates too, Gary explained. He said "it can catch both" and "it can also catch near duplicates." He added, "We have several algorithms that, for example, try to detect and then remove the boilerplate from the pages. So, for example, we exclude the navigation from the checksum calculation, we remove the footer as well, and then we are left with what we call the centerpiece, which is the central content of the page."
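As a rough illustration of the centerpiece idea, the sketch below drops navigation and footer text and then compares only the main content. Gary did not name the algorithms Google uses for near-duplicates, so word shingling with Jaccard similarity is a swapped-in textbook technique here, and the `nav`/`main`/`footer` keys, the shingle size, and the 0.6 threshold are hypothetical.

```python
def centerpiece(page: dict[str, str]) -> str:
    """Keep only the main content; the 'nav'/'main'/'footer' keys are hypothetical."""
    return page.get("main", "")


def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams in the text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Share of shingles the two sets have in common (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(page_a: dict, page_b: dict, threshold: float = 0.6) -> bool:
    """Compare centerpieces only, so differing boilerplate is ignored."""
    sim = jaccard(shingles(centerpiece(page_a)), shingles(centerpiece(page_b)))
    return sim >= threshold


# Hypothetical pages: different boilerplate, nearly identical centerpiece.
page_a = {
    "nav": "Home | About",
    "main": "the quick brown fox jumps over the lazy dog near the river",
    "footer": "Copyright A",
}
page_b = {
    "nav": "Start | Contact",
    "main": "the quick brown fox jumps over the lazy dog near the bridge",
    "footer": "Copyright B",
}
page_c = {"nav": "Home", "main": "completely different text about something else", "footer": ""}

print(is_near_duplicate(page_a, page_b))  # True
print(is_near_duplicate(page_a, page_c))  # False
```

Because the navigation and footer never reach the comparison, two pages with different templates but the same central content still land together.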

Then they went off on meat jokes and vegetarian jokes, since many of them are vegetarian. Gary, Lizzi, and many others apparently filter out emails from their boss, Sundar Pichai. :)

Gary goes deeper on this dupe detection:

Yeah. And then, basically, if the number changes, then the dupe cluster would be, again, different, because the contents of the two clusters would be different, because you have a new number in the cluster. So, that would just go into another cluster, essentially, one that's relevant to that number.

And then, once we calculated these checksums and we have the dupe cluster, then we have to select one document that we want to show in the search results.

Why do we do that? We do that because, typically, users don't like it when the same content is repeated across many search results. And we do that also because our storage space in the index is not infinite.

Basically, why would we want to store duplicates in our index when users don't like it anyway? So, we can, basically, just reduce the index size.

But calculating which one to be the canonical, which page to lead the cluster, is actually not that easy, because there are scenarios where even for humans it would be quite hard to tell which page should be the one that is in the search results.

So, we employ, I think, over 20 signals. We use over 20 signals to decide which page to pick as canonical from a dupe cluster.

And most of you can probably guess like what these signals would be. Like one is, obviously, the content. But it could be also stuff like page rank, for example, like which page has higher page rank, because we still use page rank after all these years.

It could be, especially on same site, which page is on an HTTPS URL, which page is included in a sitemap. Or, if one page is redirecting to the other page, then that's a very clear signal that the other page should become canonical.

The rel=canonical attribute, that's also-- Is it an attribute? Tag. It's not a tag.

So after dupe detection, Google does the canonicalization part, where it takes all the duplicate URLs in a cluster and decides which one to show in search. How does Google decide which one to show? That decision is based on over 20 different signals, Gary said. The signals include:

  • Content
  • PageRank
  • HTTPS
  • Whether the page is in a sitemap file
  • A server redirect
  • rel=canonical

They do not assign weights to these signals manually; they use machine learning for this. Why not assign weights manually? Because hand-tuning one weight can cause issues with the others. That said, a redirect and a rel=canonical are weighted more heavily by the machine learning.
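To picture what "weighted signals" means, here is a small sketch that scores each page in a dupe cluster and picks the highest scorer as canonical. Every number and name in it is invented for illustration: the `Candidate` fields, the `WEIGHTS` values, and the simple weighted sum are assumptions, not Google's actual signals, weights, or formula. The only detail taken from the podcast is that redirects and rel=canonical count for more than the other signals.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    url: str
    pagerank: float  # normalized to 0..1 for this sketch
    is_https: bool
    in_sitemap: bool
    is_redirect_target: bool
    is_rel_canonical_target: bool


# Hypothetical weights, invented for illustration only.
WEIGHTS = {
    "pagerank": 1.0,
    "is_https": 0.5,
    "in_sitemap": 0.5,
    "is_redirect_target": 2.0,
    "is_rel_canonical_target": 2.0,
}


def score(candidate: Candidate) -> float:
    """Weighted sum of the candidate's signals (booleans count as 0 or 1)."""
    return sum(w * float(getattr(candidate, name)) for name, w in WEIGHTS.items())


def pick_canonical(cluster: list[Candidate]) -> Candidate:
    """Return the highest-scoring page in a dupe cluster."""
    return max(cluster, key=score)


cluster = [
    Candidate("http://example.com/page", 0.6, False, False, False, False),
    Candidate("https://example.com/page", 0.5, True, True, True, True),
]
print(pick_canonical(cluster).url)  # https://example.com/page
```

With hand-picked numbers like these, changing any one weight shifts the relative influence of all the others, which is the tuning problem Gary describes.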

Gary explained why they use machine learning:

So, that's a very good question. And a few years ago, I worked on canonicalization because I was trying to introduce a GIF link into the calculation as a signal and it was a nightmare to fine-tune the weights manually.

Because even if you change the weight by 0.1 number-- I don't think that it has a measure-- then it can throw off some other number and then suddenly, pages that, for example, whose URL is shorter might show up or more likely to show up in the search results, which is kind of silly because like, why would you look at that, like who cares about the URL length?

So, it was an absolute nightmare to find the right weight when you were introducing, for example, a new signal. And then you can also see bugs. I know that, for example, John escalates quite a bit to index dupes, basically, based on what he picks up on Twitter or the forums or whatever.

And then, sometimes, he escalates an actual bug where the dupe's team says that... Why are you laughing, John? You shouldn't laugh. This is about you. I'm putting you on the spot, you should appreciate this. But, anyway.

So, then he escalates a potential bug, and it's confirmed that it's a bug and it's related to a weight. Let's say that we use, I don't know, the sitemap signal to-- or the weight of the sitemap signal is too high.

And then let's say that the dupe's team says that, "Okay, let's reduce that signal a tiny bit." But then, when they reduce that signal a tiny bit, then some other signal becomes more powerful. But you can't actually control which signal because there are like 20 of them.

And then you tweak that other signal that suddenly became more powerful or heavier, and then that throws up yet another signal. And then you tweak that one and, basically, it's a never-ending game, essentially.

So, it's a whack-a-mole. So, if you feed all these signals to a machine learning algorithm plus all the desired outcomes, then you can train it to set these weights for you and then use those weights that were calculated or suggested by a machine learning algorithm.
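The idea of feeding signals plus desired outcomes to a learning algorithm can be sketched with a tiny logistic regression trained by gradient ascent. This is a toy under stated assumptions: Gary did not say which model Google uses, and the signal names, the six training rows, the learning rate, and the epoch count are all hypothetical.

```python
import math

SIGNALS = ["is_https", "in_sitemap", "is_redirect_target", "short_url"]

# Hypothetical training data: one row per candidate page (1 = signal present),
# labeled 1 when that page was the desired canonical and 0 otherwise.
EXAMPLES = [
    ([1, 1, 1, 0], 1),
    ([0, 0, 0, 1], 0),
    ([1, 0, 1, 1], 1),
    ([0, 1, 0, 0], 0),
    ([0, 0, 1, 0], 1),
    ([1, 1, 0, 1], 0),
]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights: list[float], features: list[int]) -> float:
    """Probability that a page with these signals should be canonical."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)))


def train(examples, epochs: int = 2000, lr: float = 0.1) -> list[float]:
    """Fit one weight per signal with batch gradient ascent on the log-likelihood."""
    weights = [0.0] * len(SIGNALS)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for features, label in examples:
            error = label - predict(weights, features)
            for j, x in enumerate(features):
                gradient[j] += error * x
        weights = [w + lr * g for w, g in zip(weights, gradient)]
    return weights


weights = train(EXAMPLES)
for name, weight in zip(SIGNALS, weights):
    print(f"{name}: {weight:+.2f}")
```

In this toy data the desired outcome always agrees with the redirect signal, so the redirect ends up with the largest learned weight without anyone setting it by hand. Adding a new signal means adding a column and retraining, rather than re-tuning every weight manually.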

Of course, John knows what SEOs are thinking, so he asked Gary a softball question: "are those weights also like a ranking factor? Like, you mentioned like is it in the sitemap file, would we say, 'Well, if it's in a sitemap file, it'll rank better.' Or is canonicalization kind of independent of ranking?"

Gary responded "so, canonicalization is completely independent of ranking. But the page that we choose as canonical, that will end up in the search result pages and that will be ranked, but not based on these signals."

Here is the podcast audio; this discussion starts at about 6:05 into the podcast:

Forum discussion at Twitter.

 
