Moderator: Rand Fishkin
Rahul Lahiri, Vice President of Search Product Management at Ask, is a maybe...
Ben D'Angelo, Software Engineer at Google, is kicking things off.
Duplicate content issues include multiple URLs pointing to the same page or to very similar pages, and sites aimed at different countries in the same language. Duplicate content also spans other sites, in the form of syndicated content and scraped content.
The ideal situation is one URL for one piece of content.
Examples of duplicates include www vs. no-www, session IDs, URL parameters, print-version pages, and CNAMEs. Then there is similar content on different URLs: using a manufacturer's database of pictures and descriptions, or sites in different countries with the same language.
How does Google handle duplicate content? The general idea is that they cluster duplicate pages together and choose the "best" representative page. They have different filters for different types of duplicate content. This is not a penalty, just a filter.
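None of the panelists showed code, but a rough, hypothetical sketch of that clustering idea might look like the following. The shingle comparison, the 0.9 threshold, and the "shortest URL wins" rule are my own illustrative assumptions, not Google's actual filter.

```python
# Hypothetical sketch of duplicate clustering (not Google's real algorithm):
# pages whose word shingles overlap heavily are grouped, and one URL per
# group is kept as the representative.

def shingles(text, k=4):
    """Return the set of k-word shingles for a piece of text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Similarity between two shingle sets (0.0 = disjoint, 1.0 = identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(pages, threshold=0.9):
    """Greedily group near-duplicate pages; keys are URLs, values are page text."""
    sets = {url: shingles(text) for url, text in pages.items()}
    groups = []
    for url in pages:
        for group in groups:
            if jaccard(sets[url], sets[group[0]]) >= threshold:
                group.append(url)
                break
        else:
            groups.append([url])
    # Pick the shortest URL in each group as the "best" representative.
    return [min(group, key=len) for group in groups]

pages = {
    "http://example.com/article": "the quick brown fox jumps over the lazy dog today",
    "http://example.com/article?sessionid=42": "the quick brown fox jumps over the lazy dog today",
    "http://example.com/other": "an entirely different piece of content lives here instead",
}
print(cluster(pages))  # one representative URL per content cluster
```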
What can you do about this?
- For exact duplicates, use a 301 redirect (a minimal sketch follows this list).
- For near duplicates, noindex them or exclude them with robots.txt.
- For domains by country: a different language is not duplicate content; use unique content specific to each country, different TLDs, and the geographic targeting setting in Webmaster Tools.
- Try not to put extraneous parameters in your URLs.
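For the www vs. no-www case, here is a minimal sketch of a 301 redirect, assuming www.example.com is the canonical host. In practice you would usually configure this in your web server or CMS rather than in application code.

```python
# Illustrative sketch only: a tiny Python server that 301-redirects any
# non-canonical host to the www host, so "www vs. no-www" duplicates
# collapse to a single URL.
from http.server import BaseHTTPRequestHandler, HTTPServer

CANONICAL_HOST = "www.example.com"  # assumed canonical host for this example

class RedirectToCanonical(BaseHTTPRequestHandler):
    def do_GET(self):
        host = self.headers.get("Host", "")
        if host != CANONICAL_HOST:
            # A permanent redirect tells engines which URL is the real one.
            self.send_response(301)
            self.send_header("Location", f"http://{CANONICAL_HOST}{self.path}")
            self.end_headers()
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"canonical page content")

if __name__ == "__main__":
    HTTPServer(("", 8000), RedirectToCanonical).serve_forever()
```

The same permanent-redirect idea applies to retiring old parameterized or print-version URLs.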
There are also things like duplicate meta tags and titles.
What about other sites that cause duplicate content? What if you syndicate your content out? One tip: make sure the syndicated copy includes a link back to the original article or content. You might also give syndication partners only a summary. If you syndicate others' content, apply the same advice in reverse.
Scrapers are unlikely to impact you; it is possible, but rare. If they do, you can file a DMCA request and/or a spam report.
Priyank Garg, Director of Product Management, Yahoo! Search, is keeping his presentation short because he lost his voice.
Yahoo filters duplicates at every step of the pipeline. He shows some examples... They classify most duplicate content as "accidental." Soft 404s (pages that should return a 404 but don't) are one of the largest sources of duplicates. There are also abusive forms, like scrapers.
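A quick, hypothetical way to check whether your own site serves soft 404s: request a URL that should not exist and see whether the server still answers 200. The domain and the junk path below are placeholders.

```python
# Probe for "soft 404s": a real 404 lets engines drop the page; a 200 here
# means every bad URL looks like a unique, thin, duplicate page.
from urllib.request import urlopen
from urllib.error import HTTPError

def serves_soft_404s(base_url):
    probe = base_url.rstrip("/") + "/this-page-should-not-exist-xyz123"
    try:
        status = urlopen(probe).status
    except HTTPError as e:
        status = e.code
    return status == 200

print(serves_soft_404s("http://example.com"))
```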
He then links to Yahoo Tools, like Site Explorer. The dynamic URL rewrite tool rocks, so does URL removal.
Derrick Wheeler, Senior Search Engine Optimization Architect at Microsoft, is last up.
Duplicate content is his worst nightmare. CIRTA = crawl, index, rank, traffic, action. They have 180 million URLs in Live Search, 80 million in Google, and a few in Yahoo, because each engine filters them out differently.
- Consider that you might need to detect when an engine is coming to your site, like cloaking; in very specific cases it is helpful, like with session IDs.
- Know your parameters.
- Always link to your parameters in the same order (see the sketch after this list).
- Dig into the search results for your site and you can find things there.
- Exclude duplicates using robots.txt, noindex, nofollow, etc.
- Don't assume engines can't find JavaScript.
- Find a tool that will crawl your site, so you can see how an engine will look at it.
- Focus on your strong URLs first.
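A minimal sketch of the parameter tips: normalize internal links so the same parameters always appear in the same order and extraneous ones are dropped. The parameter names below are made-up examples.

```python
# Normalize query strings before emitting internal links: drop parameters
# known to be extraneous and keep the rest in a fixed order.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

EXTRANEOUS = {"sessionid", "ref", "utm_source"}  # assumed throwaway params

def normalize(url):
    parts = urlsplit(url)
    params = [(k, v) for k, v in parse_qsl(parts.query) if k not in EXTRANEOUS]
    params.sort()  # same parameters always appear in the same order
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                       urlencode(params), ""))

print(normalize("http://Example.com/p?color=red&sessionid=42&size=m"))
print(normalize("http://example.com/p?size=m&color=red"))
# Both print http://example.com/p?color=red&size=m -> one URL, one page.
```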
Those are his key points. I'm heading out now; I have a meeting.