Why Your Website Needs the Robots.txt File

The robots.txt file is a set of instructions that governs the interaction between your website and the search engine crawlers. The crawlers should follow the instructions in your robots.txt file, and in most cases they do honor your wishes. There might be rare occasions where you see the search engines ignoring your robots.txt commands, but that is not a reason to abandon the file. If anything, it should encourage you to create Google Webmaster Tools and Bing Webmaster Tools accounts and check whether the engines are reporting specific problems with your site.

Creating a robots.txt file is very simple. Open up Notepad, enter your instructions, and save the file as robots.txt. If you name the file anything other than robots.txt, the crawlers will not see or acknowledge your commands. The engines will assume you do not have a robots.txt file and meander through your website as they wish. Hope you do not have duplicate content from tracking parameters, and hope you have another way for the engines to discover the location of your sitemap.xml files. If you just noted my sarcasm, then you probably understand the importance of naming the file properly.

If you want the search crawlers to consume every file on your site then you will create the following robots.txt file:

User-agent: *
Disallow:
SITEMAP: http://www.yourdomain.com/sitemap.xml

If you want to exclude some sections and block the crawlers from analyzing them, then you would write the following robots.txt file:

User-agent: *
Disallow: /advertisements/
Disallow: /cgi-bin/
SITEMAP: http://www.yourdomain.com/sitemap.xml

In that version I am telling the crawlers to stay away from content in my cgi-bin and advertisements folders.
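Because Disallow rules match against the beginning of the URL path, those two lines cover every URL living under either folder. For example, hypothetical pages such as the following would be off limits to the crawlers:

http://www.yourdomain.com/advertisements/banner-rotation.html
http://www.yourdomain.com/cgi-bin/formmail.pl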

What if you are running tracking campaigns that utilize various URL parameters? You need to block the crawlers from crawling those URLs because you do not want to create duplicate content. If you have five different tracking parameters pointing to the same URL, you will create six versions of that page: the original page plus five more, each appending its unique parameter to the end of the original URL. You can rely on the rel="canonical" tag to help resolve this issue, but you can also enter instructions in your robots.txt file.
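To illustrate, assume a page lives at a hypothetical address like /article.html and your campaigns append the same tracking parameters used in the rules below. The identical content then becomes reachable at several different addresses:

http://www.yourdomain.com/article.html
http://www.yourdomain.com/article.html?ocid=email
http://www.yourdomain.com/article.html?icid=sidebar
http://www.yourdomain.com/article.html?idic=affiliate

The robots.txt rules below keep the crawlers away from the parameterized copies while leaving the original page crawlable.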

User-agent: *
Disallow: /*ocid=
Disallow: /*icid=
Disallow: /*idic=
SITEMAP: http://www.yourdomain.com/sitemap.xml

Above, I am instructing the search engines to ignore any URLs with ocid=, icid= and idic= within the URL itself. Those three identifiers are the tracking parameters I want to block from being indexed.

Notice the SITEMAP: command in each of the examples above? That line tells the search engines the location of your sitemap.xml file, which makes it very easy for the crawlers to locate the sitemap and consume all of the URLs listed in that file. The easier you make life for the crawlers, the better the search engines will treat you in the search engine results pages.
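If your site splits its URLs across more than one sitemap, you can list each file on its own line. The file names here are hypothetical:

SITEMAP: http://www.yourdomain.com/sitemap-articles.xml
SITEMAP: http://www.yourdomain.com/sitemap-images.xml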

Once you have created your robots.txt file, upload it to your root directory. Before you do so, make sure you have entered the proper commands. One enterprise site I worked on moved a new CMS platform from beta to release, and the team DID NOT update the beta version of the robots.txt file. That file instructed every search engine not to index and not to follow the entire site. Search referral traffic disappeared instantly, and as you can imagine, panic ensued. If you see that kind of anomaly in your search traffic, always check your robots.txt file. Perhaps someone made a mistake, and correcting it will quickly resolve your search referral conundrum.
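A robots.txt file that locks every crawler out of an entire site, as the one left over from that beta launch almost certainly did, needs nothing more than this:

User-agent: *
Disallow: /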

Visit robots.txt for more information about everything you can do with that glorious file.
