Monday, February 28, 2011

Watson: So Much Potential

Hello.

I'm sure that a large portion of our society is familiar with Watson, IBM's question-answering machine, who was recently featured on Jeopardy!, where he obliterated the reigning champions.

Development Process To Maximize Potential
I have an intriguing idea regarding Watson's potential.

First, of course, Watson has to be made to answer every question accurately. For the questions he cannot answer, the algorithms must be tweaked and any underlying problems fixed. (Please excuse me if I make this sound like an easy task.)

At this point, these algorithms should be further refined until Watson can "understand" allusions and wordplay, and resolve references (for example, when a sentence that follows another reads, "He...," Watson can work out what the "He" refers to; essentially, he can mathematically grasp the dynamic nature of language). Then he could combine his answers into a well-structured and cohesive paragraph - or maybe even an essay-sized discourse.

Now notice that this sort of question-answering system would be almost the polar opposite of the Jeopardy style. Instead of being given clues and producing a one-word answer, Watson would be given a one-word prompt and return a larger, more substantive piece of writing.

Ultimately, Watson could be picking apart the references between all sorts of sentences and finding connections among millions of articles to produce one mega-article. He could perform that kind of extensive research quickly, far beyond what a single human mind could manage.

The Potential
Therefore, we may pose grand questions such as "What is life?" Now, of course, he wouldn't have access to any documentation or theory that we do not have access to. However, if he were capable of analyzing more sources than a single human could in his or her lifetime, and he could understand allusions, find synonyms with real accuracy (unlike the middle schooler who wants Microsoft Word to give him bigger words without considering how closely they match the initial word), and incorporate studies from other disciplines, Watson may be able to arrange words in a way that changes the current perspective on "life" - or whatever the question was.

Why that works
Now, one might ask, why do we need to change the current perspective? In his book "The Origins of Modern Science," Mr. Herbert Butterfield provides an answer. He argues that the whole transition to and development of modern science can be traced back to one thing - and one thing alone: a shift in perspective. I read the book a while ago, so I do not remember all of the specifics, but in the study of oxygen, gravity, and the universe, the people who viewed those problems differently in their own minds were able to understand the world better than the scientists who clung to incorrect theories. Watson's discourse may be the necessary tool to jump-start us toward - or, for the Star Wars fans, put us into hyperdrive toward - the future. Watson may be the force whose change in scope sparks a multi-discipline paradigm shift in understanding. (Eventually, he may even be throwing his own articles into his repertoire of "knowledge.")

Let's be practical
Let's consider a more practical question: "How do we cure AIDS?" Watson would then research AIDS, other fields, other disciplines, and historical examples, and give "his own opinion" about the elusive cure. I by no means believe that he would spit out the cure, but his answer could prompt researchers to take a new look at AIDS and understand the implications of a minor shift in meaning. Einstein took a similar approach - without the computer - when he asked, "What is gravity?" (I still need to track down a source for this.)

Closing thoughts
I understand that this will most definitely NOT be easy, but it's worth a shot. I also understand the negative effects - someone could have a thesis at their disposal without exerting any effort - but attempting to develop Watson in this way would, at the very least, benefit the computer science field and help people think logically even when they don't have a real human there to contemplate ideas with.

In sum, these are great uses that can definitely help the world in the aforementioned ways. Even if this idea, which I find intriguing, does not succeed in its long-term, perspective-changing goals, Watson still has massive potential in those surgeon-performing-surgery scenarios - he may even save your life one day.


Coming Later:
A debate on Watson's consciousness, the reason why I put quotation marks around so many of the words for human understanding, etc.

Tuesday, February 22, 2011

Robots File

Hello!

A friend of mine brought up robots files in a discussion, so I decided to write about that topic here. I'll provide a definition, display a sample, and list some benefits as well as some problems of the robots file. Sources are cited below.

Definition:
Essentially, in order to prevent web crawlers from crawling certain parts of a website, the administrator can write a "robots.txt" file formatted according to the guidelines of the Robots Exclusion Protocol. With the robots file, one can specify which files certain web crawlers should not access. The benefits of this type of control are outlined in the "Benefits" section below.

Sample:
1. User-agent: GoogleBot
2. Disallow: /cgi-bin/
3. Disallow: /temp/
If a web crawler comes across this file, it reads the user-agent line and checks - via a substring test - whether that line applies to it. If the crawler is mentioned in the user-agent line, it knows not to access the files or directories listed in the disallow statements that follow.
General Structure:
1. User-agent: [Crawler name; asterisk means all crawlers]
2. Disallow: [Directory or Filename]

(NOTE: One file or directory per "Disallow" statement.)
(NOTE: The Disallow rules apply to the crawler named in the most recent User-agent line; a new User-agent line starts a new group of rules.)

Explanation of the code:
Line 1 specifies to which crawlers the following rules apply - in this sample, GoogleBot. If an asterisk (*) is used instead, the rules apply to every web crawler.
Lines 2 and 3 both start with "Disallow:". The directory (e.g. /cgi-bin/) that follows the colon is the disallowed file or directory. In other words, line 2 tells the matched crawler (in this case GoogleBot) not to access the "/cgi-bin/" directory. Line 3 does the same for the "/temp/" directory.
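To see these rules in action without writing a crawler, here is a minimal sketch in Python using the standard library's urllib.robotparser module. The sample rules above are fed in directly via parse(), so no website or network access is needed; the paths being tested are made-up examples.

from urllib import robotparser

# The sample robots file from above, supplied as literal lines.
rules = [
    "User-agent: GoogleBot",
    "Disallow: /cgi-bin/",
    "Disallow: /temp/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# GoogleBot is named in the User-agent line, so the Disallow rules apply to it.
print(rp.can_fetch("GoogleBot", "/cgi-bin/search.cgi"))   # False - disallowed
print(rp.can_fetch("GoogleBot", "/temp/old-page.html"))   # False - disallowed
print(rp.can_fetch("GoogleBot", "/index.html"))           # True  - no rule matches

# A crawler with a different name has no matching group, so nothing is disallowed for it.
print(rp.can_fetch("SomeOtherBot", "/cgi-bin/search.cgi"))  # True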

Benefits:
The robots file serves a multitude of purposes and can greatly assist any web developer, depending on his or her needs.
  1. Robots files can protect your site from resource-hungry crawlers. Essentially, when a crawler visits your site, in order to thoroughly crawl and round up the data, it may execute all of your scripts. Some scripts, however, like Facebook's account-creation script, usually do not need to be crawled, because they serve no purpose to a search engine. In that case, a robots file can block them, and thus more resources will be available for human users. It can also protect the integrity of an online vote, in that a crawler that is asked not to run the voting scripts will, hopefully, not affect the results.
  2. In another instance, if you do not have a robots file, a crawler might repeatedly follow a broken link on your site. When it does not find the page, an error page is sent back. Since, I would say, most big websites have customized error pages, sending those pages out to a bunch of different crawlers drains server resources and wastes the crawler's time. In sum, a robots file can prevent drainage of both a server's and a crawler's resources, so blocking access (through the robots file) to files that may very well contain broken links would be prudent.
  3. Outfront.net (resource link 2) provided another great point. One can use a robots file while slowly developing a site, if one does not want unfinished pages showing up in search engine results - or, as previously mentioned, if one does not want to waste resources sending out error pages for broken links. This also benefits the search engine, in that the crawler wastes less of its time fruitlessly searching for nonexistent pages.
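If you use a robots file this way, it is worth sanity-checking it before deploying. Below is a small, hypothetical Python sketch (again using urllib.robotparser) that a site owner could adapt to confirm that the paths they meant to shield really are disallowed; the script names and directories here are made up purely for illustration.

from urllib import robotparser

# Hypothetical rules for a site with a signup script, a voting script,
# and a half-finished section that is still under development.
rules = [
    "User-agent: *",
    "Disallow: /signup.php",
    "Disallow: /vote.php",
    "Disallow: /under-construction/",
]

# Paths we expect well-behaved crawlers to skip.
sensitive_paths = [
    "/signup.php",
    "/vote.php",
    "/under-construction/draft.html",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

for path in sensitive_paths:
    if rp.can_fetch("AnyExampleBot", path):
        print("WARNING: " + path + " is still crawlable")
    else:
        print("OK: " + path + " is disallowed")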

Problems:
There are two problems associated with the robots file:
  1. A robots file does not guarantee that a web crawler will not interact with the files or directories that are disallowed. It merely provides a list for a well-intentioned crawler to work with. When a crawler accesses your site, it can read the list and decide where not to go; however, it can also completely disregard it. A robots file is like a note in which you tell your kid not to have a party while you are out grocery shopping: your kid may listen, but he or she can just as easily decide not to. (The sketch after this list shows that the choice to comply lives entirely in the crawler's own code.) If you actually want to block access to files, you have to work with the ".htaccess" file, which is a different story.
  2. A robots file is not a place to hide files. It really just tells a crawler where the administrator does not want it to go. It is as if you tell a robber, "Do not take the key that I leave under the mat." Now the robber knows where the key is, and there is nothing actually preventing the robber from taking it.
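To make the first point concrete, here is a short Python sketch of where that choice actually lives. The function names and the "ExampleBot" user agent are made up; the point is that honoring the robots file is just an extra check a crawler performs in its own code before downloading a page - a check a rude crawler can simply skip.

from urllib import request, robotparser
from urllib.parse import urljoin, urlparse

def polite_fetch(url, user_agent="ExampleBot"):
    """Fetch a URL only if the site's robots.txt allows it."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(root, "/robots.txt"))
    rp.read()  # download and parse the site's robots file
    if not rp.can_fetch(user_agent, url):
        return None  # the polite crawler voluntarily backs off
    req = request.Request(url, headers={"User-Agent": user_agent})
    return request.urlopen(req).read()

def rude_fetch(url, user_agent="ExampleBot"):
    """Nothing stops a crawler from skipping the check entirely."""
    req = request.Request(url, headers={"User-Agent": user_agent})
    return request.urlopen(req).read()

Both functions end up asking the server for the same page; the only difference is whether the crawler bothers to consult the robots file first.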
In sum, the Robots Exclusion Protocol (REP) is a set of guidelines by which well-intentioned web crawlers are expected to abide. A robots file lists the files and/or directories that a crawler should not crawl; however, nothing in the protocol enforces that blocking.

Sources:
