Running a website can be exhausting. New technologies are always emerging, and finding the best ones is key to gaining a competitive edge. Google is always changing, too, which means that best practices keep shifting and that we’re sometimes left scrambling to update our sites to meet not only best practices but new baseline standards.
We were hit with another one of these changes recently when Google announced that it will be ending support for the robots.txt noindex directive. This is a big change for the sites it affects, so take a look at what it means for you and how you can prepare.
What Exactly is Robots.txt Noindex?
Robots.txt is a text file that site owners can use to tell web robots (most often search crawlers like Google’s) how they should crawl pages on a website. The file is part of the Robots Exclusion Protocol (REP), a set of standards that regulates exactly how these bots crawl, access, index, and serve content to searchers.
In many cases, a robots.txt file will tell specific bots whether they can or can’t crawl or index certain parts of a site. That’s the case with “robots.txt noindex,” which tells crawlers not to index certain pages, preventing them from appearing in search results.
This can serve a number of different purposes but is often used for pages like order confirmation pages and special-offer landing pages. The idea is that you only want certain pages to be reached through exact means (such as a link you send to specific customers after a triggered action), rather than letting people find them by searching for your site.
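As a concrete (and now historical) illustration, a site might have added lines like these to its robots.txt file. The paths here are hypothetical, and the noindex rule itself was never officially documented:

```
User-agent: Googlebot
noindex: /order-confirmation/
noindex: /special-offer/
```

This is exactly the kind of rule that will stop working after the change described below.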
So What’s Happening with the Robots.txt Noindex?
Google has historically respected robots.txt directives when it comes to indexing (or in this case, not indexing). Soon, that won’t be the case anymore.
Starting on September 1st of this year, Google won’t follow the noindex directive anymore, at least when it comes to robots.txt files. It will crawl whatever it likes and index it as it sees fit. If publishers and site owners want to keep certain pages out of the index, they’ll need to find an alternative.
Why Did Google Cancel Its Support?
While Google has typically followed robots.txt noindex commands, that’s been more of a courtesy than anything else; noindex was never an official, documented directive, which is why they’re canceling support now.
Their official announcement stressed this, alerting users that Google was “saying goodbye to undocumented and unsupported rules in robots.txt.” They want to maintain a healthier ecosystem, and following a defined set of rules is much easier to scale than trying to honor everything that people come up with on their own. Standardizing the rules across the board also allows for more open source releases moving forward.
Keep in mind that even before this update, Google only followed robots.txt noindex instructions about 11 times out of 12. That’s a solid track record, but it’s not perfect, and it left some sites with pages indexed that they’d wanted to keep off the search engine grid.
So What Now?
There are plenty of reasons why sites want to control which of their pages are crawled and indexed, and the good news is that there are still five ways you can do so even without robots.txt noindex.
In its official announcement, Google listed five different ways to manage indexing. These include:
- Noindex in robots meta tags. These are different from basic robots.txt files, and the noindex rule is supported both in HTML meta tags and in HTTP response headers. It’s the most effective way to remove URLs from the index when the pages can still be crawled. In HTML, the tag goes in the head section of the page, making it easy for Google to read.
- 404 and 410 HTTP status codes. These status codes tell Google that the page doesn’t exist, so the page is dropped from Google’s index after it’s re-crawled.
- Disallow in robots.txt. Disallowing is different from deindexing: it prevents Google from ever crawling the pages in the first place (though a disallowed URL can still be indexed if other sites link to it). Even though this is another robots.txt rule, Google still supports it and recommends it as an option.
- Password protection. In most cases, locking pages behind a login that requires a password removes them from Google’s index.
The exception is if you use markup to indicate paywalled or subscription content; that’s why you’ll see news stories from sites like the Wall Street Journal show up in results and then be asked to log in, but this behavior requires that extra markup.
- Using the removal tool in the Search Console. You can find a URL from your site in Google’s Search Console and simply remove it. This prevents it from showing up in search results, and it’s fast and easy, but it may only remove the URL temporarily, until Google recrawls your site at a later date.
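The first option above, noindex in a robots meta tag, looks like this in practice (the page it lives on is hypothetical, but this is the documented form of the tag):

```html
<!-- Placed in the <head> of the page you want kept out of the index -->
<meta name="robots" content="noindex">
```

For non-HTML resources like PDFs, the equivalent is the X-Robots-Tag: noindex HTTP response header.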
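The disallow option stays inside robots.txt but uses a documented rule (again with hypothetical paths):

```
User-agent: *
Disallow: /order-confirmation/
Disallow: /special-offer/
```

Remember that this blocks crawling rather than indexing, so it works best for pages that aren’t linked to from elsewhere on the web.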
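The 404/410 option can also be sketched on the server side. This is a minimal illustration using only Python’s standard library, not a production setup; the paths are hypothetical:

```python
# Minimal sketch: answer 410 Gone for pages removed on purpose, so Google
# drops them from the index on the next crawl; everything else gets 404.
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical URLs of pages that were deliberately taken down.
REMOVED_PAGES = {"/old-offer", "/retired-landing-page"}

class GoneHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # 410 signals permanent, intentional removal; 404 just "not found".
        status = 410 if self.path in REMOVED_PAGES else 404
        self.send_response(status)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet

# To run it: HTTPServer(("", 8000), GoneHandler).serve_forever()
```

Either status code works for deindexing; 410 is simply the more explicit signal that the removal is permanent.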
Making fast changes and site adjustments is something most of us are familiar with, and fortunately, Google has given us more warning about this canceled support than it sometimes does with algorithm changes that can quickly affect ranking and positioning. Take a careful look at your site (or have a qualified site designer do so) and make sure that any pages you don’t want showing up in search engines have one of these methods in place.
The cutoff is September 1st of this year, so make this a priority now, before it’s too late.
Looking for new ways to get the pages you want to appear higher in search results (or stash away those you don’t)? Get in touch with us and see how we can help.