Martin Splitt shared a lot of information about how Google detects duplicate pages and then selects the canonical page to include in search results.
He also shared how at least twenty different signals are weighted to identify the canonical page, and why machine learning is used to adjust those weights.
How Google deals with canonicalization
Martin begins by explaining how websites are crawled and documents are indexed, then moves on to the next step: canonicalization and duplicate detection.
He goes into detail on reducing content to a checksum, a number that is then compared with the checksums of other pages to identify duplicates.
“We are collecting signals and have now reached the next step, namely canonicalization and dupe detection.
… First you have to detect the dupes, basically cluster them together, saying that all of these pages are dupes of each other, and then you basically have to find a leader page for all of them.
And the way we do it, which is probably how most people and other search engines do it, is to reduce the content to a hash or checksum and then compare the checksums.
And that’s because it’s a lot easier to do than comparing, say, three thousand words …
… And so we reduce the content to a checksum, and we do that because we don’t want to scan the whole text; it just doesn’t make sense. Essentially, it takes more resources and the result would be pretty much the same. So we calculate several kinds of checksums for the textual content of the page and then compare the checksums.”
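The basic idea of comparing checksums instead of full texts can be sketched like this (a minimal illustration using Python's standard `hashlib`; the normalization step and hash choice here are assumptions, since Google's actual checksum algorithms are not public):

```python
import hashlib

def content_checksum(text: str) -> str:
    # Normalize whitespace and case so trivial formatting
    # differences don't change the fingerprint.
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

page_a = "Google  uses checksums to detect duplicate pages."
page_b = "Google uses checksums to detect duplicate pages."

# Identical normalized content yields identical checksums,
# so the pages can be flagged as dupes without a full text comparison.
print(content_checksum(page_a) == content_checksum(page_b))  # True
```

Comparing two fixed-length digests is far cheaper than diffing thousands of words, which is the efficiency argument Martin makes above.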
Martin was next asked whether this process catches exact duplicates or near-duplicates:
“Good question. It can catch both. It can also catch near-duplicates.
We have several algorithms that try, for example, to recognize the boilerplate and then remove it from the pages.
For example, we exclude navigation from the checksum calculation. We also remove the footer. And then what we have left is what we call the centerpiece, which is the core content of the page, sort of the meat of the page.
If we calculate the checksums and compare them with each other, we put those that are quite similar, or at least a little bit similar, together into a dupe cluster.”
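The centerpiece-plus-clustering idea might be sketched roughly as follows (the page structure, section names, and exact-match grouping here are illustrative assumptions, not Google's implementation, which also clusters merely similar checksums):

```python
import hashlib
from collections import defaultdict

def centerpiece_checksum(page: dict) -> str:
    # Drop boilerplate sections (navigation, footer) and hash only
    # the remaining core content -- the "centerpiece".
    core = " ".join(
        text for section, text in page.items()
        if section not in {"nav", "footer"}
    )
    return hashlib.sha256(core.encode("utf-8")).hexdigest()

pages = {
    "https://example.com/a": {"nav": "Home | Shop", "main": "Red shoes, size 42.", "footer": "© Example"},
    "https://example.com/b": {"nav": "Home | Blog", "main": "Red shoes, size 42.", "footer": "Imprint"},
    "https://example.com/c": {"nav": "Home", "main": "Blue hats on sale.", "footer": "© Example"},
}

# Pages whose centerpiece checksums match land in the same dupe cluster,
# even though their navigation and footers differ.
clusters = defaultdict(list)
for url, page in pages.items():
    clusters[centerpiece_checksum(page)].append(url)

for urls in clusters.values():
    print(urls)
```

Because the boilerplate is excluded before hashing, pages `/a` and `/b` cluster together despite having different navigation and footers.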
Martin was then asked what a checksum is:
“A checksum is basically a hash of the content, basically a fingerprint. In this case it is a fingerprint of the content of the file …
And once we have calculated those checksums, we have the dupe cluster. Then we need to select one document that we want to show in the search results.”
Martin then discussed why Google prevents duplicate pages from showing up in the SERPs:
“Why are we doing this? We do this because users typically don’t like it when the same content is repeated across many search results. And we also do it because our storage space in the index is not infinite. Why would we want to keep duplicates in our index?”
Next, he returns to the core of the topic: detecting duplicates and choosing the canonical page:
“However, it is not that easy to calculate which page should be canonical and lead the cluster, because there are scenarios where even humans find it difficult to tell which page should be in the search results.
I think we use over twenty signals to decide which page from a dupe cluster to pick as the canonical.
And most of you can probably guess what those signals look like. One is obviously the content.
But it could also be things like PageRank, for example, as in which page has higher PageRank, because we still use PageRank after all these years.
Especially on the same website, it can be whether the page is on an https URL, which page is included in the sitemap, or whether one page redirects to the other. That is a very clear signal that the other page should become the canonical. The rel=canonical attribute … is again a pretty strong signal … because … someone has indicated that this other page should be the canonical.
And once we have compared all these signals for all the page pairs, we end up with an actual canonical. And each of these signals has its own weight, and we use machine learning to calculate the weights for these signals.”
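The weighted-signal selection Martin describes could be sketched like this (the signal names, weight values, and linear scoring are hypothetical illustrations; Google's real weights are machine-learned and unpublished):

```python
# Hypothetical per-signal weights standing in for the learned weights
# Martin describes. The real values and signal set are not public.
WEIGHTS = {
    "is_https": 1.0,
    "in_sitemap": 2.0,
    "redirect_target": 5.0,       # redirects carry a lot of weight
    "rel_canonical_target": 4.0,  # rel=canonical is also a strong signal
    "pagerank": 3.0,
}

def canonical_score(signals: dict) -> float:
    # Weighted sum over whichever signals fire for this page.
    return sum(WEIGHTS[name] * value for name, value in signals.items())

# A toy dupe cluster: the http URL redirects to the https URL.
dupe_cluster = {
    "http://example.com/page":  {"is_https": 0, "in_sitemap": 1, "pagerank": 0.4},
    "https://example.com/page": {"is_https": 1, "redirect_target": 1,
                                 "rel_canonical_target": 1, "pagerank": 0.4},
}

# The page with the highest combined score becomes the canonical.
canonical = max(dupe_cluster, key=lambda url: canonical_score(dupe_cluster[url]))
print(canonical)  # https://example.com/page
```

Here the https page wins despite the http page being in the sitemap, because the heavily weighted redirect and rel=canonical signals both point to it.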
He then went into detail, explaining why Google gives redirects a higher weight than the http/https URL signal:
“To give you an idea, a 301 redirect, or any kind of redirect, should carry much more weight in canonicalization than whether the page is on an http or https URL.
Because eventually the user would see the redirect target anyway, so it doesn’t make sense to include the redirect source in the search results.”
Mueller asks why Google uses machine learning to adjust the signal weights:
“Do we sometimes get it wrong? Why do we need machine learning? Surely we just write these weights down once and then it’s perfect, right?”
Martin then related an anecdote about working on canonicalization and trying to introduce hreflang as a signal in the calculation. He said that adjusting the weights manually was a nightmare: tweaking one weight by hand throws the other weights off, leading to unexpected outcomes such as weird search results that don’t make sense.
He shared an example of such a bug, where pages with short URLs suddenly ranked better, which Martin called silly.
He also shared an anecdote about manually reducing the weight of the sitemap signal to fix a canonicalization bug. However, doing so made another signal stronger, which in turn caused other problems.
The point is that the signal weights are tightly interrelated, and machine learning is needed to adjust them successfully.
“Let’s assume that … the weight of the sitemap signal is too high. And then let’s say the dupes team says, okay, let’s reduce that signal a bit.
But if you reduce that signal a little, another signal becomes stronger.
And you can’t control which signal that is, because there are about twenty of them.
And then you tweak the other signal, which suddenly got stronger or heavier, and that throws off yet another signal. And then you tweak that one, and basically it’s a never-ending game, basically it’s whack-a-mole.
So if you feed all these signals into a machine learning algorithm, along with the desired outcomes, you can train it to set these weights for you and then use the weights the algorithm calculates or suggests.”
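The train-the-weights idea can be sketched with a toy logistic regression fitted by gradient descent (everything here is an illustrative assumption: the three signals, the hand-made training labels, and the model; Google has not disclosed which learning method it uses):

```python
import math

# Toy training data: each row is a page's signal vector, and the label
# says whether that page was chosen as the canonical (1) or not (0).
# Hypothetical signal order: [is_https, in_sitemap, is_redirect_target]
examples = [
    ([1, 0, 1], 1), ([0, 1, 0], 0), ([1, 1, 1], 1),
    ([0, 0, 0], 0), ([1, 0, 0], 0), ([0, 1, 1], 1),
]

def train_weights(examples, epochs=2000, lr=0.5):
    # Logistic regression via gradient descent: the learned weights play
    # the role of the per-signal weights Martin describes.
    w = [0.0, 0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability
            err = p - y                      # gradient of the log loss
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w

weights = train_weights(examples)
# In this toy data the redirect-target signal perfectly predicts the
# label, so its learned weight comes out largest -- matching the
# intuition that redirects should weigh heavily in canonicalization.
print(weights)
```

Instead of hand-tuning one weight and watching the others drift, the algorithm balances all of them jointly against the desired outcomes, which is exactly the whack-a-mole problem Martin says machine learning solves.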
John Mueller next asks whether these twenty weighted signals, like the sitemap signal mentioned earlier, could also be viewed as ranking signals:
“Are these weights also a ranking factor? … Or is canonicalization independent of ranking?”
“Canonicalization is completely independent of ranking. But the page we choose as the canonical ends up on the search results pages and is ranked, just not based on those signals.”
Martin shared a lot about how canonicalization works, including how complex it is. They discussed writing this information up at a later date, but sounded daunted by the task of writing it all down.
The podcast episode was titled “How Technical Search Content Is Written and Published on Google, and More!” But I have to say that by far the most interesting part was Martin’s description of how canonicalization works at Google.
Listen to the entire podcast:
Search Off the Record Podcast