Fewer 404 errors by adding robots.txt to your site

Even if you don’t think you need it, it’s still good practice to provide a “robots.txt” file in the root directory of your site for search engine spiders to find. Not only will it remove the 404s from your error_log (one is logged every time a spider/bot looks for the file and it doesn’t exist), but it also provides a quick and efficient way to tell spiders which sections of your site to stay out of. For whole sections of a site, this is by far a better method than adding rel=“nofollow” to your links or the following meta tag to the header of each page in question.

<meta name="robots" content="noindex, nofollow" />

The most basic robots.txt file would include the following. This tells the search engines they may crawl and index everything they can find.

User-agent: *
Disallow:

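The original robots.txt standard has no “Allow” directive (major crawlers such as Googlebot do honor one these days, but you can’t count on every bot understanding it), so you can either enter a blank “Disallow” entry (which basically equates to allowing everything) or list the sections spiders should exclude. The empty “Disallow:” is only needed (in my opinion) when you don’t have any other rules to exclude content; once you list real exclusions, as in the following examples, you can leave it out.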

To block certain areas of your site from being indexed, you can use something more like the following example.

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /somepage.html
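
If you want to double-check rules like these before uploading the file, here is a minimal sketch using Python’s standard urllib.robotparser module. The test paths (/index.html, /images/logo.png and so on) and the build_parser helper are made up for illustration, and Python’s parser does simple prefix matching, so treat the result as a sanity check rather than a guarantee of how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the example above, kept inline so nothing is fetched.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /somepage.html
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text from a string (hypothetical helper name)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(RULES)

# Example paths are assumptions; swap in real paths from your own site.
for path in ("/index.html", "/images/logo.png", "/cgi-bin/form.pl", "/somepage.html"):
    verdict = "allowed" if parser.can_fetch("*", path) else "blocked"
    print(f"{path}: {verdict}")
```

Only /index.html should come back as allowed; the other three paths match a Disallow line.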

Now, since robots.txt is plain text and viewable by anyone, do not include areas of your site that are not already publicly accessible. This would include site administration areas, protected folders and the like. If it is being hidden from the public, don’t announce where it is in a robots.txt file 😉

Now, let’s say for example you want to keep Google from indexing your images for inclusion in their image search section. Just use the following…

User-agent: Googlebot-Image
Disallow: /images/
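
To see that a rule like this only applies to the named bot, you can run it through Python’s standard urllib.robotparser. This is just a sketch: “SomeOtherBot” and the photo path are made-up names, and Python matches user agents by a simple case-insensitive substring test, which is not necessarily how Google itself resolves them.

```python
from urllib.robotparser import RobotFileParser

# The Googlebot-Image rule from above, parsed from a string.
RULES = """\
User-agent: Googlebot-Image
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Hypothetical image path and bot names, for illustration only.
path = "/images/photo.jpg"
results = {
    agent: parser.can_fetch(agent, path)
    for agent in ("Googlebot-Image", "Googlebot", "SomeOtherBot")
}

for agent, allowed in results.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Here only Googlebot-Image is blocked. Since the file has no “User-agent: *” section, every other bot falls through to the default of being allowed.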

A more extreme example would be to add the following to your robots.txt file. This will tell every spider/bot not to index the content of your entire site, regardless of the filename or directory it resides in. Do not, I repeat DO NOT, put this on a site you want indexed by the search engines. It will simply tell them to go away.

User-agent: *
Disallow: /
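
Since this one is so easy to leave in place by accident (on a copy of a staging site, for example), a small guard can help. The sketch below uses Python’s standard urllib.robotparser; the function name blocks_everything is my own invention, and it simply asks whether the site root “/” may be fetched, which under Python’s prefix matching is only refused by a blanket “Disallow: /”.

```python
from urllib.robotparser import RobotFileParser


def blocks_everything(robots_txt: str, agent: str = "*") -> bool:
    """Return True if the given robots.txt text shuts `agent` out of the whole site.

    Hypothetical helper: it checks the site root, which only a blanket
    "Disallow: /" rule matches under Python's prefix-based parser.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


go_away = "User-agent: *\nDisallow: /\n"
index_all = "User-agent: *\nDisallow:\n"
partial = "User-agent: *\nDisallow: /images/\n"

print(blocks_everything(go_away))    # True: every path is disallowed
print(blocks_everything(index_all))  # False: empty Disallow allows everything
print(blocks_everything(partial))    # False: only /images/ is off limits
```

You could run a check like this against the file’s contents before you upload it and stop if it returns True for a site you want indexed.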

That’s all there is to it!
