Google: Block Duplicate Content & Don't Block Duplicate Content

Apr 13, 2011 - 8:16 am

A very active SEO posted a thread at the Google Webmaster Help forums asking why Google seems to contradict itself in its advice on how to handle duplicate content.

He points out two different help documents:

(1) Google-friendly sites #40349:

Don't create multiple copies of a page under different URLs. Many sites offer text-only or printer-friendly versions of pages that contain the same content as the corresponding graphic-rich pages. To ensure that your preferred page is included in our search results, you'll need to block duplicates from our spiders using a robots.txt file.

(2) Duplicate content #66359:

Google no longer recommends blocking crawler access to duplicate content on your website, whether with a robots.txt file or other methods. If search engines can't crawl pages with duplicate content, they can't automatically detect that these URLs point to the same content and will therefore effectively have to treat them as separate, unique pages. A better solution is to allow search engines to crawl these URLs, but mark them as duplicates by using the rel="canonical" link element, the URL parameter handling tool, or 301 redirects.
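
For illustration, the rel="canonical" link element that the second document mentions is a single tag placed in the head of the duplicate page, pointing at the version you want indexed. A minimal sketch, using hypothetical URLs:

<!-- In the <head> of the printer-friendly duplicate, e.g. http://www.example.com/page-print.html -->
<link rel="canonical" href="http://www.example.com/page.html" />

With that in place, Google can still crawl the printer-friendly URL but treats it as a duplicate of the canonical page rather than a separate page.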

So which is it? Do we block the duplicate content or not? With Panda rolling out globally and Google advising sites to remove duplicate and non-original content, what is one to do?

New Googler Pierre Far passed this on to Googler JohnMu, who replied:

Just to be clear -- using a robots.txt disallow is not a recommended way to handle duplicate content. By using a disallow, we won't be able to recognize that it's duplicate content and may end up indexing the URL without having crawled it.

For example, assuming you have the same content at:

A) http://example.com/page.php?id=12
B) http://example.com/easter/eggs.htm

... and assuming your robots.txt file contains:

user-agent: *
disallow: /*?

... that would disallow us from crawling URL (A) above. However, doing that would block us from being able to recognize that the two URLs are actually showing the same content. In case we find links going to (A), it's possible that we'll still choose to index (A) (without having crawled it), and those links will end up counting for a URL that is basically unknown.

On the other hand, if we're allowed to crawl URL (A), then our systems will generally be able to recognize that these URLs are showing the same content, and will be able to forward context and information (such as the links) about one URL to the version that's indexed. Additionally, you can use the various canonicalization methods to make sure that we index the version that you prefer.
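
To make John's suggested alternative concrete, here is a minimal sketch of the canonical markup he alludes to, using his example URLs: leave URL (A) crawlable (drop the disallow: /*? rule) and mark it as a duplicate of (B).

<!-- In the <head> of http://example.com/page.php?id=12 -->
<link rel="canonical" href="http://example.com/easter/eggs.htm" />

That way Google can crawl (A), recognize that it shows the same content as (B), and forward signals such as links to the version that gets indexed.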

But is Google still contradicting itself in those two help articles?

Forum discussion at Google Webmaster Help.

Image credit to JUNG HEE PARK on Flickr.