A very active SEO posted a thread in the Google Webmaster Help forums asking why Google seems to be contradicting itself in its advice on how to handle duplicate content.
He points out two different help documents:
(1) Google-friendly sites #40349:
Don't create multiple copies of a page under different URLs. Many sites offer text-only or printer-friendly versions of pages that contain the same content as the corresponding graphic-rich pages. To ensure that your preferred page is included in our search results, you'll need to block duplicates from our spiders using a robots.txt file.
(2) Duplicate content:
Google no longer recommends blocking crawler access to duplicate content on your website, whether with a robots.txt file or other methods. If search engines can't crawl pages with duplicate content, they can't automatically detect that these URLs point to the same content and will therefore effectively have to treat them as separate, unique pages. A better solution is to allow search engines to crawl these URLs, but mark them as duplicates by using the rel="canonical" link element, the URL parameter handling tool, or 301 redirects.
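For reference, the rel="canonical" link element mentioned there is a single line placed in the head of the duplicate page, pointing at the version you prefer. A minimal sketch, using hypothetical URLs for the printer-friendly scenario from the first help document:

  <!-- in the <head> of the printer-friendly duplicate, e.g. http://www.example.com/widget/print (hypothetical URL) -->
  <link rel="canonical" href="http://www.example.com/widget">

The duplicate stays crawlable, but Google is told which version should show up in search results.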
So which is it? Do we block the duplicate content or not? With Panda rolling out globally and Google advising site owners to remove duplicate and non-original content, what is one to do?
New Googler Pierre Far passed this on to fellow Googler JohnMu, who replied:
Just to be clear -- using a robots.txt disallow is not a recommended way to handle duplicate content. By using a disallow, we won't be able to recognize that it's duplicate content and may end up indexing the URL without having crawled it. For example, assuming you have the same content at:
A) http://example.com/page.php?id=12
B) http://example.com/easter/eggs.htm
... and assuming your robots.txt file contains:
user-agent: *
disallow: /*?
... that would disallow us from crawling URL (A) above. However, doing that would block us from being able to recognize that the two URLs are actually showing the same content. In case we find links going to (A), it's possible that we'll still choose to index (A) (without having crawled it), and those links will end up counting for a URL that is basically unknown.
On the other hand, if we're allowed to crawl URL (A), then our systems will generally be able to recognize that these URLs are showing the same content, and will be able to forward context and information (such as the links) about one URL to the version that's indexed. Additionally, you can use the various canonicalization methods to make sure that we index the version that you prefer.
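To make John's example concrete: if you would rather consolidate everything at (B) with a 301 redirect instead of a robots.txt disallow, the rule might look roughly like the following. This is only a sketch that assumes an Apache server with mod_rewrite enabled; the thread itself doesn't mention any particular server setup.

  # in .htaccess: send /page.php?id=12 to the preferred URL with a 301
  RewriteEngine On
  RewriteCond %{QUERY_STRING} ^id=12$
  RewriteRule ^page\.php$ /easter/eggs.htm? [R=301,L]

Either way, Google can still crawl (A), see that it resolves to the same content as (B), and forward signals such as links to the version that ends up indexed.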
But is Google still contradicting itself in those two help articles?
Forum discussion at Google Webmaster Help.
Image credit to JUNG HEE PARK on Flickr.