Tuesday, February 22, 2011

Robots File


A friend of mine brought up robot files in a discussion, so I decided to write about that topic here. I'll provide a definition, display a sample, list some benefits as well as some problems of the robots file. Sources will be cited below.

Essentially, in order to prevent web crawlers from crawling certain parts of a website, the administrator can write a "robots.txt" file formatted under guidelines found in the "Robots Exclusion Protocol." With the robots file, one can specify which files certain web crawlers should be unable to access. The benefits of this type of control will be outlined in the section below, "Benefits."

1. User-agent: GoogleBot
2. Disallow: /cgi-bin/
3. Disallow: /temp/
If a web crawler comes across this file, it would read the user-agent line and check - via a substring test - whether that line applies to the crawler. If the crawler is mentioned in the user-agent line, it knows not to access whatever files or directories are found in the disallow statements.
General Structure:
1. User-agent: [Crawler name; asterisk means all crawlers]
2. Disallow: [Directory or Filename]

(NOTE: One file or directory per "Disallow" statement.)
(NOTE: The crawler mentioned in the user-agent line is only supposed to be disallowed from accessing each file or directory mentioned until the next user-agent line.)

Explanation of the code:
Line 1 specifies to which crawlers the following code applies. If an asterisk (*) is used, as in the above example, the following code applies to every web crawler.
Line 2 and 3 both start off with "Disallow:". The directory (e.g. /cgi-bin/) that follows the colon is the disallowed file/directory. In other words, line 2 tells the blocked web crawler (in this case all, because of the asterisk) to not access the "/cgi-bin/" directory. Line 3 does the same for the "/temp/" directory.

The robots file serves a multitude of purposes and can greatly assist any web developer, depending on his or her needs.
  1. Robots files can protect your site from resource-hungry crawlers. Essentially, when a crawler visits your site, in order to thoroughly crawl and round up the data, the crawler may execute all of your scripts. Some scripts however, like Facebook's account-creating script, usually do not need to be crawled, for it probably does not serve the search engine a purpose. In that case, a robots file can block it, and thus there will be more resources available for human users. In addition, it can also protect the integrity of an online vote, in that the crawler will, hopefully, not affect the results, if it is asked to not run the scripts.
  2. In another instance, if you do not have a robots file, a crawler might continually click a broken link on your site. When it does not find the page, an error page will be sent back. Since, I would say, most big websites have customized error pages, sending out those pages to a bunch of different crawlers will drain server resources, and waste the crawler's time. In sum, a robots file can prevent drainage of both a server's and a crawler's resources. Therefore, blocking access (through the robots file) to files that may very well contain broken links would be prudent.
  3. Outfront.net (resource link 2) provided another great point. One can use a robots file, while they are slowly developing a site, if they do not want unfinished elements present in a search engine searches - or as previously mentioned, if they do not want to have to waste resources sending out error pages for broken links. This also benefits the search engine, in that the crawler will waste less of its time fruitlessly searches for inexistent pages.

There are two problems associated with the robots file:
  1. A robots file does not definitely mean that a web crawler will not interact with the files or directories that are disallowed. It merely provides a list for a well-intentioned crawler to work with. For example, when a crawler accesses your site, it reads the list and can decide where not to go; however, it can completely disregard it. A robots file is like a note, in which you tell your kid not to have a party, while you are out grocery shopping. Your kid may listen to it, but he or she can just as easily decide not to. If you want to block a website from accessing files, you would have to work with the ".htaccess" file, which is a different story.
  2. A robots file is not a place to hide files. It really just tells a crawler where the administrator does not want the crawler to go. It is as if you tell a robber, "do not take the key that I leave under the mat." Now, the robber knows where the key is, and there is nothing actually preventing the robber from taking the key.
In sum, the Robot Exclusion Protocol (REP) is a set of guidelines, by which well-intentioned web crawlers are expected to abide. REP lists files and/or directories that a crawler should not crawl; however, it does not enforce blocking crawlers.


No comments:

Post a Comment

Search This Blog