There is no doubt that much of the bot activity on one's sites comes from rogue spiders: bots that pretend to be legitimate crawlers but are really there to steal your content. We have covered this in several sessions in the past; here are some:
- The Bot Obedience Course - August 8, 2006
- The Bot Obedience Course - December 5, 2006
- Scrape Bots Vs. Search Bots :: Fighting the Battle - September 12, 2006
- Spider and DOS Defense - Rebels, Renegades, and Rogues - November 16, 2006
Matt Cutts posted a detailed How to verify Googlebot over at the Webmaster Central Blog on 9/20/2006, explaining how to do a reverse DNS lookup and then a forward DNS->IP lookup:
Telling webmasters to use DNS to verify on a case-by-case basis seems like the best way to go. I think the recommended technique would be to do a reverse DNS lookup, verify that the name is in the googlebot.com domain, and then do a corresponding forward DNS->IP lookup using that googlebot.com name; e.g.:

    > host 66.249.66.1
    1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

    > host crawl-66-249-66-1.googlebot.com
    crawl-66-249-66-1.googlebot.com has address 66.249.66.1
I don't think just doing a reverse DNS lookup is sufficient, because a spoofer could set up reverse DNS to point to crawl-a-b-c-d.googlebot.com.
Of course, there are ways to automate this: code it yourself, buy CrawlWall, or implement a solution similar to Ekstreme's PHP Search Engine Bot Authentication.
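If you want to code it yourself, the double-lookup check is straightforward to script. Here is a minimal Python sketch of the technique Matt Cutts describes (the function names are my own, not from CrawlWall or Ekstreme's tool): reverse-resolve the IP, check the name is under googlebot.com (or google.com), then forward-resolve that name and confirm it round-trips to the original IP.

```python
import socket

# Domains Google's crawlers resolve under; a spoofer can fake the PTR
# record, but cannot make the forward lookup of a *.googlebot.com name
# point back at his own IP.
GOOGLE_DOMAINS = (".googlebot.com", ".google.com")

def hostname_in_google_domain(host):
    """True if a reverse-DNS name falls under a Google-controlled domain."""
    return host.rstrip(".").endswith(GOOGLE_DOMAINS)

def is_verified_googlebot(ip):
    """Verify a crawler IP with the reverse-then-forward DNS check."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)          # step 1: reverse lookup
    except socket.herror:
        return False                                   # no PTR record at all
    if not hostname_in_google_domain(host):
        return False                                   # wrong domain: not Google
    try:
        _, _, forward_ips = socket.gethostbyname_ex(host)  # step 2: forward lookup
    except socket.gaierror:
        return False
    return ip in forward_ips                           # step 3: must round-trip
```

The forward lookup in step 2 is what defeats the spoofer mentioned above: he can point his reverse DNS at crawl-a-b-c-d.googlebot.com, but only Google controls what that name forward-resolves to.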
Rogue spiders are no fun, as we have seen in cases with some forums.
Forum discussion at Cre8asite Forums.