It keeps amazing me that people use robots.txt files to prevent sites from being indexed and thus from showing up in the search engines. You know why it keeps amazing me? Because robots.txt doesn’t actually do the latter, even though it does prevent your site from being indexed.
Let’s go through some terms here:
Indexed / Indexing
The process of downloading a site or a page’s content to the server of the search engine, thereby adding it to its “index”.
Ranking / Listing / Showing
Showing a site in the search result pages (aka SERPs).
So, while the most common process goes from Indexing to Listing, a site doesn’t have to be indexed to be listed. If a link points at a page, domain or wherever, the search engine will follow that link. If the robots.txt on that domain prevents the search engine from indexing that page, it’ll still show the URL in the results if it can gather from other signals, such as the links pointing at that URL, that it might be worth looking at.
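To make that concrete, here’s a minimal Python sketch of what a well-behaved crawler does with a Disallow: / rule. It uses the standard library’s urllib.robotparser; the ExampleBot user agent and the example.com URL are made up for illustration. The point to take away: the crawler isn’t allowed to download the page, so it never gets to read anything on it, including any instruction not to list it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks every crawler from the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt allows user_agent to download url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/secret.html"  # assumed URL, for illustration
    if can_fetch(ROBOTS_TXT, "ExampleBot", url):
        print("Crawler may download the page and read its robots directives.")
    else:
        print("Crawler may not download the page, so it never sees a noindex.")
```

Note that nothing in this check stops the search engine from listing the bare URL; it only stops the download.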
If my explanation above doesn’t make sense, have a look at Matt Cutts’ video explanation:
So, if you want to effectively hide pages from the search engines, and this might seem contradictory, you need them to index those pages. Why? Because when they index those pages, you can tell them not to list them. There are two ways of doing that. The first is a robots meta tag, like this (and I’ve got an article on robots meta tags that’s more extensive):
<meta name="robots" content="noindex,nofollow"/>
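If you want to check what a crawler would read from a tag like that, here’s a rough Python sketch using the standard library’s html.parser. The class and function names are my own invention for this example, and real search engines will certainly be more forgiving about messy HTML than this is.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here by default.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_meta_directives(html: str) -> set:
    """Return the set of robots directives found in the given HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex,nofollow"/></head></html>'
    print(sorted(robots_meta_directives(page)))
```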
The issue with a robots meta tag is that you have to add it to each and every page. That’s why the search engines came up with the second way: the X-Robots-Tag HTTP header. You send an HTTP header called X-Robots-Tag and set its value as you would the content value of the meta robots tag. The cool thing about this is that you can do it for an entire site. So, if your site is running on Apache, and mod_headers is enabled (it usually is), you could add the following single line to your .htaccess file:
Header set X-Robots-Tag "noindex, nofollow"
And it’d have the effect that the entire site can be indexed, but will never be shown in the search results. So, get rid of that robots.txt file with Disallow: / in it, and use the X-Robots-Tag instead!
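Once the header is in place, it’s worth checking that your server really sends it. Below is a small Python sketch of how a crawler might decide whether a response may be listed, given its HTTP headers as a plain dict. The function names are hypothetical, and the sketch deliberately ignores the more advanced forms of the header (for instance values aimed at one specific crawler), so treat it as an illustration rather than a full implementation.

```python
def parse_robots_directives(header_value: str) -> set:
    """Split a value like "noindex, nofollow" into a set of lowercase directives."""
    return {part.strip().lower() for part in header_value.split(",") if part.strip()}


def may_be_listed(headers: dict) -> bool:
    """Return False if any X-Robots-Tag header in headers contains noindex."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "x-robots-tag":
            if "noindex" in parse_robots_directives(value):
                return False
    return True


if __name__ == "__main__":
    # Headers as they would look with the .htaccess line from this post.
    response_headers = {
        "Content-Type": "text/html",
        "X-Robots-Tag": "noindex, nofollow",
    }
    print(may_be_listed(response_headers))
```

To get the headers of a live page you could feed this the output of whatever HTTP client you like; I’ve left that part out to keep the example self-contained.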