Google posted this morning that it is going to stop unofficially supporting the noindex, nofollow and crawl-delay directives within robots.txt files. Google has actually been saying not to do this for years and hinted this was coming really soon, and now it is here.
Google wrote: "While open-sourcing our parser library, we analyzed the usage of robots.txt rules. In particular, we focused on rules unsupported by the internet draft, such as crawl-delay, nofollow, and noindex. Since these rules were never documented by Google, naturally, their usage in relation to Googlebot is very low. Digging further, we saw their usage was contradicted by other rules in all but 0.001% of all robots.txt files on the internet. These mistakes hurt websites' presence in Google's search results in ways we don't think webmasters intended."
In short, if you use crawl-delay, nofollow or noindex in your robots.txt file, Google will stop honoring them on September 1, 2019. Google currently does honor some of those implementations, even though they are "unsupported and unpublished rules."
Google may send out notifications via Google Search Console if you are using these unsupported commands in your robots.txt files.
That sounds like a good idea. Are you reading our email?
-- John (@JohnMu) July 2, 2019
/turns around slowly to scan the room
Like I said above, Google has been telling webmasters and SEOs not to use noindex in robots.txt:
Well, we've been saying not to rely on it for years now :).
-- John (@JohnMu) July 2, 2019
You do realize that we've been telling people not to rely on this for many years now?
-- John (@JohnMu) July 2, 2019
Google told us this change would happen eventually:
As promised a few weeks ago, i ran the analysis about noindex in robotstxt. The number of sites that were hurting themselves very high. I honestly believe that this is for the better for the Ecosystem & those who used it correctly will find better ways to achieve the same thing. https://t.co/LvdhsN2pIE
-- Gary "鯨理" Illyes (@methode) July 2, 2019
Gary Illyes is to blame for this:
Sorry in advance... pic.twitter.com/IhT8zUzhK1
-- Gary "鯨理" Illyes (@methode) July 2, 2019
He said he is honestly sorry:
Honestly... Right now... Yes
-- Gary "鯨理" Illyes (@methode) July 2, 2019
But Google analyzed the impact and saw a small one, if any. In fact, they are not making the change for a few months and, like I said above, may email those who will be impacted:
Yep! We don't really make these kinds of changes willy-nilly :-).
-- John (@JohnMu) July 2, 2019
So now is the time to bulk up your audits to make sure that your clients are not depending on these unsupported commands in their robots.txt files.
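If you want to automate that part of an audit, something like the following could be a starting point. It is only a rough sketch in Python: the function name `find_unsupported_rules` and the sample file are made up for illustration, and it simply flags any line whose directive is noindex, nofollow or crawl-delay, without trying to mimic how Google's own parser reads the file.

```python
# Directives Google says it will stop honoring in robots.txt.
UNSUPPORTED = {"noindex", "nofollow", "crawl-delay"}


def find_unsupported_rules(robots_txt):
    """Return (line_number, directive, value) for each unsupported rule found."""
    hits = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        # Drop anything after a '#', which starts a comment in robots.txt.
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        key = key.strip().lower()  # directive names are matched case-insensitively
        if key in UNSUPPORTED:
            hits.append((number, key, value.strip()))
    return hits


# Hypothetical robots.txt content, used only to demonstrate the check.
sample = """\
User-agent: *
Disallow: /private/
Noindex: /drafts/
Crawl-delay: 10  # slow down
"""

for hit in find_unsupported_rules(sample):
    print(hit)
```

Run it against each client's robots.txt and treat any hit as something to move to one of the supported methods before September 1.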
Here is what Google posted in terms of noindex directive alternatives:
- Noindex in robots meta tags: Supported both in the HTTP response headers and in HTML, the noindex directive is the most effective way to remove URLs from the index when crawling is allowed.
- 404 and 410 HTTP status codes: Both status codes mean that the page does not exist, which will drop such URLs from Google's index once they're crawled and processed.
- Password protection: Unless markup is used to indicate subscription or paywalled content, hiding a page behind a login will generally remove it from Google's index.
- Disallow in robots.txt: Search engines can only index pages that they know about, so blocking the page from being crawled usually means its content won’t be indexed. While the search engine may also index a URL based on links from other pages, without seeing the content itself, we aim to make such pages less visible in the future.
- Search Console Remove URL tool: The tool is a quick and easy method to remove a URL temporarily from Google's search results.
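To confirm that a page really carries the first alternative above, you can check both places the noindex directive may live: the X-Robots-Tag HTTP response header and the robots meta tag in the HTML. Here is a minimal sketch in Python using only the standard library. The helper name `has_noindex` is hypothetical, you would fetch the headers and HTML yourself, and the check is deliberately simple: it ignores user-agent prefixes in the header (such as `googlebot: noindex`) and the `none` shorthand.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(headers, html):
    """Return True if noindex appears in the X-Robots-Tag header or a robots meta tag."""
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":  # header names are case-insensitive
            header_value = value
    header_directives = [d.strip().lower() for d in header_value.split(",")]

    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in header_directives or "noindex" in parser.directives


# Made-up inputs, for illustration only.
print(has_noindex({"X-Robots-Tag": "noindex, nofollow"}, ""))
print(has_noindex({}, '<head><meta name="robots" content="noindex"></head>'))
print(has_noindex({"Content-Type": "text/html"}, "<head><title>Hi</title></head>"))
```

Keep in mind that Google has to be able to crawl the page to see either signal, so a URL that is also disallowed in robots.txt will never have its noindex read.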
Forum discussion at Twitter.