
Tuesday, February 22, 2011

Robots File

Hello!

A friend of mine brought up robots files in a discussion, so I decided to write about that topic here. I'll provide a definition, show a sample, and list some benefits as well as some problems of the robots file. Sources are cited below.

Definition:
Essentially, in order to prevent web crawlers from crawling certain parts of a website, the administrator can write a "robots.txt" file formatted according to the guidelines of the Robots Exclusion Protocol. With the robots file, one can specify which files certain web crawlers should not access. The benefits of this kind of control are outlined in the "Benefits" section below.

Sample:
1. User-agent: GoogleBot
2. Disallow: /cgi-bin/
3. Disallow: /temp/
If a web crawler comes across this file, it would read the user-agent line and check - via a substring test - whether that line applies to the crawler. If the crawler is mentioned in the user-agent line, it knows not to access whatever files or directories are found in the disallow statements.
General Structure:
1. User-agent: [Crawler name; asterisk means all crawlers]
2. Disallow: [Directory or Filename]

(NOTE: One file or directory per "Disallow" statement.)
(NOTE: The Disallow statements apply only to the crawler named in the preceding user-agent line, and only until the next user-agent line, which starts a new group of rules.)

Explanation of the code:
Line 1 specifies which crawlers the following rules apply to. In the above example, they apply to GoogleBot; if an asterisk (*) were used instead, the rules would apply to every web crawler.
Lines 2 and 3 both start with "Disallow:". The directory or file that follows the colon (e.g. /cgi-bin/) is what the crawler is asked to stay out of. In other words, line 2 tells the named web crawler (in this case GoogleBot) not to access the "/cgi-bin/" directory. Line 3 does the same for the "/temp/" directory.
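To make this concrete, here is a minimal sketch, using Python's standard urllib.robotparser module, of how a well-behaved crawler could apply the sample rules above. The example URLs and the second crawler name ("OtherBot") are made up purely for illustration.

from urllib.robotparser import RobotFileParser

# The sample robots.txt from above, given as a list of lines.
rules = [
    "User-agent: GoogleBot",
    "Disallow: /cgi-bin/",
    "Disallow: /temp/",
]

parser = RobotFileParser()
parser.parse(rules)  # a live crawler would use set_url(...) and read() instead

# GoogleBot matches the User-agent line, so the Disallow rules apply to it.
print(parser.can_fetch("GoogleBot", "http://example.com/cgi-bin/script.cgi"))  # False
print(parser.can_fetch("GoogleBot", "http://example.com/index.html"))          # True

# A crawler not named in any User-agent line is not restricted by this file.
print(parser.can_fetch("OtherBot", "http://example.com/cgi-bin/script.cgi"))   # True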

Benefits:
The robots file serves a multitude of purposes and can greatly assist any web developer, depending on his or her needs.
  1. Robots files can protect your site from resource-hungry crawlers. Essentially, when a crawler visits your site, in order to thoroughly crawl it and round up the data, the crawler may execute all of your scripts. Some scripts, however, such as Facebook's account-creation script, usually do not need to be crawled, because they serve no purpose to a search engine. In that case, a robots file can block them, leaving more resources available for human users. It can also protect the integrity of an online vote, in that a crawler that is asked not to run the voting scripts will, hopefully, not affect the results.
  2. In another instance, if you do not have a robots file, a crawler might repeatedly follow a broken link on your site. Each time the page is not found, an error page is sent back. Since most large websites have customized error pages, sending those pages out to many different crawlers drains server resources and wastes the crawlers' time. In short, blocking access (through the robots file) to sections that may well contain broken links prevents both the server and the crawler from wasting resources.
  3. Outfront.net (resource link 2) provided another great point. One can use a robots file while slowly developing a site, if one does not want unfinished pages showing up in search engine results - or, as previously mentioned, if one does not want to waste resources sending out error pages for broken links. This also benefits the search engine, in that the crawler wastes less of its time fruitlessly searching for nonexistent pages.

Problems:
There are two problems associated with the robots file:
  1. A robots file does not guarantee that a web crawler will stay out of the files or directories that are disallowed. It merely provides a list for a well-intentioned crawler to work with. When a crawler accesses your site, it reads the list and can decide where not to go; however, it can also completely disregard it. A robots file is like a note telling your kid not to have a party while you are out grocery shopping: your kid may listen, but he or she can just as easily decide not to. If you want to actually block access to files, you have to work with something like the ".htaccess" file, which is a different story.
  2. A robots file is not a place to hide files. It really just tells a crawler where the administrator does not want the crawler to go. It is as if you tell a robber, "do not take the key that I leave under the mat." Now, the robber knows where the key is, and there is nothing actually preventing the robber from taking the key.
In sum, the Robots Exclusion Protocol (REP) is a set of guidelines by which well-intentioned web crawlers are expected to abide. A robots file lists the files and/or directories that a crawler should not crawl; however, nothing in it actually enforces the restriction.
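To underline that last point, here is a tiny Python sketch (the URL and crawler name are made up) showing that nothing technical stops a script from requesting a disallowed path; only the server itself can actually refuse the request.

import urllib.error
import urllib.request

# The sample robots.txt above disallows /cgi-bin/ for GoogleBot, but a script
# that never reads robots.txt can still send the request anyway.
req = urllib.request.Request(
    "http://example.com/cgi-bin/script.cgi",        # hypothetical URL
    headers={"User-Agent": "IllManneredBot/1.0"},   # made-up crawler name
)
try:
    with urllib.request.urlopen(req) as response:
        print(response.status, "- page fetched despite the Disallow rule")
except urllib.error.HTTPError as err:
    # Any refusal here comes from the server's response, not from robots.txt.
    print("Server refused or could not serve the request:", err.code)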

Sources:

Thursday, December 23, 2010

Proxy Server

General Information:
Essentially, a proxy server is a server that sits between a computer and another server and connects the two. Proxy servers are often used to get access to a website whose server is blocked.

Let me explain - as the previous definition can be confusing without an example. To open a web page, a computer must connect to a server. (Think of a server as another computer that stores a website in a file; picture the file as an object.) With that said, when you want to load a website, the computer you are using connects to that website server and downloads the file. Now, in a school - or work - setting, the administrators (the people that control the computer system) can block access to a server; and thus, block access to a webpage file.

To bypass that, someone can connect to a proxy server, which is NOT blocked - as long as the administrators do not know about it. The computer tells the proxy server which site it wants to reach. The proxy server then downloads the appropriate website server's web page data/file and sends it back to the computer that could not access it directly.

Basically, the proxy server downloads the webpage file from the blocked server and stores it on the proxy server so that any computer can access it. That way, your computer gets the same webpage it wanted from the blocked website, without ever actually connecting to that site's server.
As you can see, someone can access a blocked website through a proxy server.
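For the technically inclined, here is a very simplified Python sketch of that idea, using only the standard library. The port number, the "/?url=..." request format, and all URLs are arbitrary choices for illustration; a real proxy would also have to handle errors, rewrite links, support forms, and so on.

from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs
import urllib.request

class TinyProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expect requests of the form: /?url=http://www.yahoo.com
        query = parse_qs(urlparse(self.path).query)
        target = query.get("url", [None])[0]
        if not target:
            self.send_error(400, "missing url parameter")
            return
        # The proxy machine, not the blocked computer, connects to the website...
        with urllib.request.urlopen(target) as upstream:
            body = upstream.read()
            content_type = upstream.headers.get("Content-Type", "text/html")
        # ...and relays the downloaded page back to the computer that asked for it.
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Listen on all interfaces so other computers can use the proxy.
    HTTPServer(("", 8080), TinyProxyHandler).serve_forever()

Running this on an unblocked machine lets a blocked computer visit http://<proxy-machine>:8080/?url=http://www.yahoo.com and receive Yahoo's page without ever connecting to yahoo.com itself.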

Mind you that proxy servers are not impervious: once administrators notice that a proxy server is being used, they can block it as well.

Real life example:
Let me, now, give you a real life example. Say I am at school and want to go to "http://www.yahoo.com".
Now, let's pretend my school blocks "http://www.yahoo.com". What I would do is go to the webpage of a proxy server - just like I would with any other website - and type in "http://www.yahoo.com". The proxy server would go to "http://www.yahoo.com", download the webpage, and send it back to me. Now I have the "http://www.yahoo.com" webpage, which was blocked, and I never had to actually visit "http://www.yahoo.com".

Simplified definition:
In sum, a proxy server duplicates the website that you want by copying the website's files and then sending those files back to the user.

--------------

Extra:
Of course, in real life, proxy servers are more complex: they have to present an interface to the user; they have to interact with other servers; and they have to adjust the links and functions in the page so that all interactions are directed through the proxy server (directing links and functions through the blocked website would not work, because that website is blocked). However, the above examples should be enough to give someone a general understanding of proxy servers.


Here I just want to note some of the extra considerations and specifics of running a proxy server.
The programmer of the proxy server has to adjust all of the links on the webpage. A simple way to achieve this is to remove the original, blocked website's domain name from the links, which tells the browser to resolve those links against the current - proxy - domain name. Mind you, well-built websites often leave the domain name out of their links already, because a link is implied to resolve against the active website unless a domain is explicitly specified.
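As a rough sketch of that link adjustment (with a made-up domain and function name), the proxy could strip the blocked site's domain out of the page it downloaded before sending it on:

BLOCKED_SITE = "http://www.yahoo.com"   # the site the proxy fetched on the user's behalf

def rewrite_links(html: str) -> str:
    # 'href="http://www.yahoo.com/news"' becomes 'href="/news"', which the
    # browser then resolves against the proxy's own domain instead.
    return html.replace(BLOCKED_SITE, "")

page = '<a href="http://www.yahoo.com/news">News</a>'
print(rewrite_links(page))   # prints: <a href="/news">News</a>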
Also, the proxy server has to crawl the website to download all of the necessary files. This would only be hard if the website's robots file prevented crawling and, as a result, kept the proxy from finding and downloading the necessary files.

UPDATES:
This article may or may not be updated in the future. If I decide to make my own proxy server for experience purposes, I will most likely provide an update that will include my experiences, lessons I've learned, etc.
