This room is PACKED; I feel like I'm at a rock concert, there are so many people in here. It's an obvious testament to the interest in duplicate content issues.
Shari Thurow is first up to talk about the issues and says the panelists all agree with each other and are here to share various outlooks on the same topic. She puts up a slide and says she will show five ways search engines detect duplicate content. What is duplicate content? The definition is unclear. A duplicate can be a replica of exact syntactic terms and sequence of terms, with or without formatting differences; the problem can be a single formatting change, such as CSS. Search engines don't like redundant content in their indexes because it slows the information retrieval process and degrades the search results. Searchers rarely wish to see duplicate content in the search results. Search engines use clustering to limit each represented web site to one or two results, and she shows an example of clustering. The search engines also filter out duplicate content in their news sections; she gives the example of BMW getting banned from Google News.
Shari says that search engines look at content properties such as boilerplate stripping/removal. Boilerplate is a section of HTML code that is common to many different documents. She says they look for a unique fingerprint and heavy HTML density. Collection and filtering happen first; indexing is one thing and adding to the index is another. Another item search engines look at is linkage properties, and whether those linkage properties are too similar. Press releases are a concern, she says: external third-party links going to PRWeb and to the National Cancer site are different and not the same. She loves Yahoo Site Explorer, a big groupie apparently. The other thing search engines are looking at is content evolution. In general, 65% of web content will not change on a weekly basis, and 0.8% of web content will change completely every one to two weeks. Search engines are also looking at host name resolution, since many host names will resolve to the same web servers, and they try to tell genuinely redundant content from content that is not. The last is shingle comparison: every web document has a unique signature or fingerprint.
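For anyone curious what a shingle comparison looks like in practice, here is a rough Python sketch of the general idea (w-word shingles plus a Jaccard-style overlap score). This is only an illustration of the technique Shari describes, not any engine's actual algorithm.

```python
# Rough sketch of shingle-based duplicate detection (illustrative only;
# real engines hash shingles and use far more sophisticated pipelines).

def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(doc_a, doc_b, w=4):
    """Jaccard similarity of the two documents' shingle sets (0.0 to 1.0)."""
    a, b = shingles(doc_a, w), shingles(doc_b, w)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

page1 = "Our widget is the best widget for cleaning windows quickly and safely."
page2 = "Our widget is the best widget for cleaning mirrors quickly and safely."

print(round(resemblance(page1, page2), 2))  # high score -> near-duplicate pages
```

Two pages that share most of their shingle sets score close to 1.0 and look like near-duplicates; unrelated pages score close to 0.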
She gives an example of a shingle comparison with a word set, showing three pages from the same site with slightly different product descriptions. The content is similar but varies in placement. She recommends finding the page with the highest conversion rate, letting Google spider it, and using nofollow or robots.txt on the pages with similar content. Robots meta tags are a good way to manage this. Some duplicate content is consistent spam and some is not. She puts up a university page (Norwich). It looks like a good site, she says, but it is indeed spam; it even stumped Danny Sullivan. She found a hallway page, a page with links to many other doorway pages. Universities do spam. Some duplicate content is copyright infringement, things such as scraper sites and link farms. You can use DMCA reporting, and Shari also recommends registering your copyright. It gives you the keys to the courthouse, so to say.

Mikkel deMib Svendsen is up second. He says there are a million ways you can create duplicate content problems, but he is going to highlight the popular ones. Some sites resolve to both www and non-www. Most engines have a problem with this, though they are getting smart to the fact that they are the same site. There may not be an issue with indexing, but are you leveraging the value of the links the best way? He recommends picking one version, www or non-www, and sticking with it. Session IDs are also a problem; he puts up an example of a site that had 200,000 versions of the same page spidered. This is a resource problem for the engines. Dump all session information in a cookie for all users, or identify spiders and strip the session ID for them only. In any case: deal with it. He mentions using WordPress and using permalinks. He says don't leave it to the engines to decide which URL is supposed to be used. The solution is to 301 the non-official version of the URL to the official URL, for example http://www.domain.com/sessionid?=33 to http://www.domain.com/this-url.
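To make that 301 advice concrete, here is a minimal sketch of that kind of canonicalization as Python WSGI middleware. The host name www.domain.com and the sessionid parameter are made-up examples, and most sites would do this in the web server configuration rather than in application code.

```python
# Minimal WSGI sketch: 301-redirect non-canonical requests (non-www host,
# URLs carrying a session ID) to a single official URL. Illustrative only.
from urllib.parse import parse_qsl, urlencode

CANONICAL_HOST = "www.domain.com"   # hypothetical canonical host
SESSION_PARAM = "sessionid"         # hypothetical session-ID query parameter

def canonicalize(app):
    """Wrap a WSGI app so non-official URLs answer with a 301 to the official one."""
    def middleware(environ, start_response):
        host = environ.get("HTTP_HOST", "")
        path = environ.get("PATH_INFO", "/")
        params = parse_qsl(environ.get("QUERY_STRING", ""))
        cleaned = [(k, v) for k, v in params if k != SESSION_PARAM]
        if host != CANONICAL_HOST or cleaned != params:
            location = "http://" + CANONICAL_HOST + path
            if cleaned:
                location += "?" + urlencode(cleaned)
            start_response("301 Moved Permanently", [("Location", location)])
            return [b""]
        return app(environ, start_response)
    return middleware
```

The same logic is usually expressed as a rewrite rule in the web server, which is where Mikkel's advice would normally be implemented.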
He says you can use a server header check to look at the various responses you get and diagnose any problems. The header check should show a 301 redirect from the non-official URL to the official URL if you did it properly. He next gives an example of many-to-one URLs in forums, putting up a SearchEngineWatch.com example. It's not a problem for the site right now because you have to register and log in, but when somebody links to such a URL it can create a problem and a way in for the search engine spider. The solution for the forum example is to detect bots and redirect them to the official URLs.
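A quick way to run that kind of header check yourself is a short Python script that fetches a URL without following redirects and prints the status code and Location header. The URLs below are hypothetical.

```python
# Header check: request a URL, do NOT follow redirects, and show the status
# code plus Location header so you can confirm the 301 is actually in place.
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # refuse to follow the redirect; we want to inspect it

opener = urllib.request.build_opener(NoRedirect)

def check(url):
    try:
        resp = opener.open(url)
        print(url, "->", resp.status)
    except urllib.error.HTTPError as e:   # 3xx/4xx/5xx responses land here
        print(url, "->", e.code, e.headers.get("Location"))

check("http://domain.com/this-url")       # should print 301 and the www URL
check("http://www.domain.com/this-url")   # should print 200
```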
Breadcrumb navigation can also be an issue. The problem comes when you use breadcrumbs that reflect the URL structure of the site. He recommends having a product or article in one physical location, and not putting multiple types of URLs in the breadcrumbs. His last pearl of wisdom: don't ever leave the search engines to make decisions for you. There are a couple of ways they can approach the website, and usually they pick the wrong one.
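One way to read that advice in code: build the breadcrumb from a single canonical category assigned to the product, rather than from whatever path the visitor happened to browse through. The names below are made up, just a sketch of the idea.

```python
# Hypothetical sketch: breadcrumbs derived from each product's one official
# category, so every breadcrumb link points at a single URL per page no
# matter how the visitor navigated to it.
CANONICAL_CATEGORY = {          # product id -> its one official category path
    "red-widget": ["tools", "widgets"],
}

def breadcrumb(product_id):
    trail = [("Home", "/")]
    path = ""
    for segment in CANONICAL_CATEGORY[product_id]:
        path += "/" + segment
        trail.append((segment.title(), path))
    trail.append((product_id, path + "/" + product_id))
    return trail

print(breadcrumb("red-widget"))
# [('Home', '/'), ('Tools', '/tools'), ('Widgets', '/tools/widgets'),
#  ('red-widget', '/tools/widgets/red-widget')]
```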
Anne Kennedy from Beyond Ink is up next. Why is duplicate content a problem? Because the search engines said so, and they are clear on what you should and should not do. What happens when you have a site in two different languages? International sites are generally OK, since Google identifies users by IP. But the US is a single region, and Google aims to return only one result for a set of content. You need one canonical domain and should link all internal pages on the site to it. Exclude tracking landing pages from search engines using robots.txt. Use 302 redirects ONLY for temporary content that is going to change.
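If you go the robots.txt route for tracking landing pages, you can sanity-check the rules with Python's standard robots.txt parser. The rules and paths below are hypothetical.

```python
# Sanity-check a robots.txt exclusion for tracking landing pages using the
# standard library's parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /landing/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "/landing/ppc-campaign-1"))  # False: excluded
print(rp.can_fetch("Googlebot", "/products/widget"))         # True: crawlable
```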