CA437 Multimedia Information Retrieval Essay: "Searching the Web"
An Insight into Search Engines, WebCrawlers, Meta-Search Engines and Web Query
Languages.
Name: David Reade
Student No: 99755629
Class: CASE4
This essay aims to give an informative description of the current state of the
Internet, along with the new and old technologies being used throughout the
area of information retrieval in relation to the World Wide Web. Some of the
ideas and thoughts in this publication stem from my own experiences, while
others stem from the knowledge and experience of others. A list of the
references consulted throughout the production of this publication is given in
section 7.
In this piece I have chosen to produce a detailed analysis of search engines,
specifically relating to their history, architecture, ranking algorithms and
linkage analysis. I have also discussed the work done by WebCrawlers and given
a short history of the first WebCrawler, created by Brian Pinkerton. Finally, I
have chosen to analyse the use of newer technologies, such as Web query
languages, in Internet searching.
The final section presents my conclusions, developed from my own observations
and from relevant material I have read in books and on the World Wide Web
itself. It finishes by discussing the future development of the Web and the
role which search engines will have to take.
I hereby certify
that the work presented and the material contained herein is my own
except where references to other material are stated. General content
references are contained in section 7.
1. Introduction
2. Search Engines
   2.1 How they work
      2.1.1 Architecture
      2.1.2 Ranking Algorithms
   2.2 Search Engine History
      2.2.1 Timeline
3. WebCrawlers
   3.1 How they work
   3.2 History
4. Meta-Search Engines
5. Web Query Languages
6. Conclusions
7. References
1. Introduction
The World Wide Web has in just a few short years become a vast global
information resource which is not only of use to individual people but of
vital importance to the running of the ever-increasing business sector within
the global economy. The rapid growth of the Internet has resulted in the
development of a large, unstructured database with many different search and
retrieval methods and tools, all trying to optimise one core requirement: fast
retrieval of relevant information. For search engines, for example, the key
metrics of performance are relevance, freshness and index size.
Search engines, meta-search engines, WebCrawlers, Web query languages and
simple browsing are all methods which any user of the Web can avail of to
retrieve relevant information. Each of these methods has its advantages and
disadvantages, but none of them provides a completely effective tool, as it is
impossible to accurately measure the relevance of data all of the time. The Web
in its current form poses a number of challenges relating mainly to the data
itself and to the user's interaction with the retrieval system. In other words,
the problem with people and data is that people differ, and with that, opinions
on the relevance of data differ.
Some of the problems relating to the Web and the storage of data are detailed
in Table 1:

Table 1. Problems relating to the Web and the storage of data

1. Size: The Web is continuing to grow exponentially, with no one certain what
exactly it will end up as.

2. Distribution: The Web is spread out over various platforms and computers,
with different standards at each location. This came about because nobody
planned the development of the Web; nobody sat down and carefully developed a
set of guidelines which would have ensured a better state of affairs today.
This is easy to write with hindsight, but nobody foresaw the effect the Web
would have on the world, and it is doubtful that even careful planning at the
highest levels could have controlled something which grew globally so rapidly.

3. Volatility: The Web is easy to become a part of, but at the same time it is
easy to disappear from. Any user of the Web knows the frustration of following
dead links which could potentially have housed the data being searched for.

4. Structure: The Web contains a vast amount of repetition, somewhere in the
region of 30%, with semantic redundancy being possibly even larger. The Web is
a very poor implementation of a hypertext system, as it has no pre-defined
conceptual model behind it, which results in a weak structure with little
consistency in data and hyperlinks.

5. Quality: Anybody can submit content to the Web. This is all well and good
when it comes to freedom of speech, but when it comes to the quality of data
there are no pre-submission checks to ensure the validity and grammatical
quality of what is being submitted.

6. Heterogeneous Data: The world is not a one-dimensional place where
everything is standardised and working in harmony, but a balance needs to be
met in order for the world to function. The World Wide Web is exactly the same,
as it has to deal with various real-world problems such as the different
languages and alphabets of the world. These problems exist in the real world
but are made more acute by the fact that they are no longer restricted by the
physical barriers of space and distance. The Web also has to deal with
different media types in multiple formats and on multiple platforms.
The tools developed
thus far do however try to effectively overcome the problems of the
web and provide an easy to use service which is effective at retrieving
relevant information.
My first experience
of trying to find information on the internet was at secondary school
when it was still a relatively new phenomenon. At that time a teacher
introduced me to Alta Vista and I remember thinking that the computer
was smart, when at first it came back with relevant hits, but my opinion
changed and I became more frustrated when I felt that the computer
did not understand what I was asking, as it seemed to return documents
which were clearly not relevant to my query. This was however at a
time when I had no concept of browsers or crawlers or html or anything
to do with the web and I never stopped to think what was going on
behind the scenes: why did the computer not 'understand' what I was
asking it? This is still the case for many people when they first
attempt to search the web but understanding how search tools work
is vital in allowing the user to effectively retrieve data, or in
other words allowing the user to ask a question which the computer
will be able to 'understand' and hopefully 'answer'.
The exponential
growth of the internet has resulted in the need for new and different
search tools and methods being developed. These tools must be user
friendly and meet the needs of the time. Search Engines were one of
the first solutions to the problem of information retrieval and at
present, high speed, well structured engines such as Google provide
an effective searching tool most of the time. The problem is that
no-one can accurately predict what form the web will take over the
next few years therefore no-one can pre-judge the relevance of Google
in its current form to the web of tomorrow.
People now recognise
the need for developing effective search tools which structure relevant
data quickly from an ever growing range of unstructured and semi-structured
documents and records. This has led to information retrieval becoming
a key area of research with the development of new data models like
web query languages and the advent of XML and similar technologies
which aim to guide the future development of the Web in a more constructive
and structured fashion.
2. Searching the Web - Search Engines
For a vast number of people, search engines have become a part of daily life,
but not many people really know about the progression of these tools over the
past ten years or so, or about the underlying algorithms and constructs which
make them work. The majority of search engines are implemented as crawler-based
systems, but the term "search engine" often also covers human-powered systems
such as Open Directory. The main types of search engine are:
• Crawler-based systems - automatic; use spidering and ranking algorithms.
• Human-powered directories - human decisions on ranking and relevance.
• Hybrid search engines - a mix of the above two.
For the purposes of this publication I will only be dealing with the history
and architecture behind crawler-based systems such as Google, as they implement
automatic crawls and ranking algorithms.
2.1 How they work - Crawler-Based Search Engines
Crawler-based search engines can be broken down into three main elements:
• Spider - crawls through the Web gathering information.
• Search engine index - stores the gathered information.
• Search engine software - runs queries on the indexed data.
A "spider" is a computer program which "crawls" through the Web gathering
information and returning it to a central database which is then indexed.
These spiders which are also called "webcrawlers" or "robots" or "walkers"
or "Wanderers" are vital to the operation of most search engines.
The spider will often return to the site every one or two months
to check for any changes that may have been made.
All data that the spider finds goes into the second part of the search
engine, the index. The index contains a copy of every web page that
the spider finds and if the spider finds page changes when it returns
to an indexed site, the index is updated with the new information.
The final part of the engine is the Search engine software which is
used to search through the millions of pages recorded in the index
to find matches to a query. The software includes a ranking algorithm
which it uses to rank the pages in order of relevance. Each different
engine uses a different ranking algorithm which is why different engines
give different results to the same queries. The underlying methods
that search engines run on and rank pages by are often a closely guarded
secret and each search engine has its own search interface and uses
different criteria for matching searches with documents. Each may
also differ in terms of search speed and how it ranks results in order
of relevance.
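To make these three elements concrete, the following is a minimal Python sketch
(my own illustration, not the code of any real engine): a stub "spider"
supplies pages, an inverted index stores what it finds, and a naive query
routine matches and ranks pages. The URLs and page text are invented for the
example.

# Minimal sketch (not any real engine's code) of the three elements described
# above: a "spider" that supplies pages, an index that stores what it finds,
# and query software that matches queries against the index.
from collections import defaultdict

def spider(seed_pages):
    """Pretend crawl: in reality this would fetch pages over HTTP."""
    # seed_pages: {url: page_text} stands in for documents a crawler gathered
    return seed_pages

def build_index(pages):
    """Inverted index: term -> {url: term frequency}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term][url] = index[term].get(url, 0) + 1
    return index

def search(index, query):
    """Very naive 'search engine software': rank pages by summed term counts."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for url, freq in index.get(term, {}).items():
            scores[url] += freq
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

pages = {"http://example.org/a": "web search engines index the web",
         "http://example.org/b": "spiders crawl pages for the index"}
print(search(build_index(spider(pages)), "web index"))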
Standardisation of search engines would make searching the vast amount
of Web resources quicker and easier but when searching for uncommon
data it is often recommended that more than one search engine be used.
Meta-Search Engines allow you to search the results given by numerous
different engines to any query at one time.
2.1.1 Search Engine Architecture
The main difference between the Web and standard IR systems is that all Web
queries must be answered without accessing the text itself, i.e. purely through
the use of indices. If this were not the case, then either copies of pages
would need to be stored locally, which would result in high costs, or remote
pages would need to be accessed through the network at query time, which is not
practical as it would be far too slow. The architecture of search engines can
be dealt with under the following headings:
• Centralised architecture
• Distributed architecture
Centralised Architecture
This is the most common type of architecture employed by search engines. A
centralised crawler-indexer crawls the Web and sends new or updated pages to a
main server, where indexing takes place. A common misconception about crawlers
is that they move out and run on remote machines; in fact the crawler runs on a
local system and sends requests to remote Web servers. The index is then built
up, and it is from this index that any submitted queries are answered.
Problems associated with this architecture:
• Gathering data - the dynamic nature of the Web.
• Saturated communication links.
• High load at Web servers.
Fig 1. Crawler-Indexer
Architecture
Examples of search engines which use a centralised architecture:
• Google - the most widely used engine at present.
• Inktomi - powers Microsoft, HotBot and GoTo.
Distributed Architecture
There are several variations on the above crawler-indexer architecture. The
most important of these is Harvest, which uses a distributed architecture to
gather and distribute data. This architecture is more efficient than the
centralised crawler architecture because it addresses several problems:
• Web server loads
• Web traffic
• Coordination
Harvest solves these problems by introducing gatherers and brokers. A gatherer
retrieves and extracts indexing information from one or more Web servers; this
is done periodically, hence the name Harvest. A broker, on the other hand,
provides the indexing mechanism and the query interface to all the data
gathered. Brokers get their information from gatherers or other brokers and
update their indices incrementally.
A gatherer can run
on a Web server, generating no external traffic for that server and
can send information to several brokers, avoiding work repetition.
Fig 2. Distributed Architecture: Harvest
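As a rough illustration of the gatherer/broker split described above, the
following Python sketch separates the two roles. The class names, data shapes
and example sites are my own assumptions for illustration; they are not
Harvest's actual interfaces.

# Illustrative sketch of the gatherer/broker division of labour in a
# Harvest-style distributed architecture. Names and data shapes are
# assumptions for illustration, not Harvest's real interfaces.
class Gatherer:
    """Runs at (or near) a Web server and extracts indexing data locally."""
    def __init__(self, server_pages):
        self.server_pages = server_pages          # {url: text} on one server

    def harvest(self):
        # Periodically produce compact index summaries instead of raw pages,
        # so no external crawling traffic hits the server.
        return {url: set(text.lower().split())
                for url, text in self.server_pages.items()}

class Broker:
    """Merges summaries from several gatherers (or brokers) and answers queries."""
    def __init__(self):
        self.index = {}                           # url -> set of terms

    def update_from(self, gatherer):
        self.index.update(gatherer.harvest())     # incremental update

    def query(self, term):
        return [url for url, terms in self.index.items() if term in terms]

broker = Broker()
broker.update_from(Gatherer({"http://siteA/x": "harvest brokers and gatherers"}))
broker.update_from(Gatherer({"http://siteB/y": "distributed indexing"}))
print(broker.query("harvest"))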
Summary: The largest search engines on the
Web at the moment are:
1. Google (46.5%)
2. Yahoo (20.6%)
3. MSN Search (7.8%)
4. AltaVista (6.4%)
5. Terra Lycos (4.6%)
6. Ixquick (2.4%)
7. AOL Search (1.6%)
(Taken from: www.OneStat.com)
2.1.2 Ranking including Linkage Analysis
Ranking Algorithms
Most Search Engines
use either the Boolean or Vector model or variations of these models
to perform ranking. Like searching, ranking cannot access the text itself,
only the index. Ranking algorithms are
often closely guarded secrets and because of this there is not much
public information about the specific algorithms used by current search
engines. It is for this reason that it is difficult to fairly compare
different search engines given the differences between them and regular
improvements in the area.
Each search engine employs different algorithms for determining page position,
and these algorithms are subject to change over time. However, there are two
general features which they have in common:
• Word frequency - how often a query word occurs in a page is important,
particularly in the page title and near the top of the page.
• Number of links - the number of links pointing to a page is important,
particularly links from sites that are in some way recognised as being
authoritative and relevant.
Determining the relevance of a site is an important feature of search engine
design, and each search engine handles this differently. Criteria that might be
used in ranking include the following (a toy scoring sketch follows the list):
• The text used in links that point to a page.
• How close together the query words lie in the text, for searches on phrases
of two or more words.
• The layout and font used for the text.
• The use of H1-H6 heading tags.
• Whether a page is the root page of a domain.
In 1997 Yuwono and Lee proposed three ranking algorithms (in addition
to the tf-idf scheme):
o Boolean spread
(Classical Boolean plus "simplified link analysis")
o Vector spread (Vector space model plus "simplified link analysis")
o Most-cited (based only on the terms included in pages having
a link to the pages in the answer)
The first two are
extensions of the original algorithms which include pages pointed
to by a page in the answer or pages that point to a page in the answer.
The Most-Cited algorithm is based on the terms included in the pages
which have a link to the pages in the answer.
Linkage Analysis
The next generation of ranking algorithms includes the use of hyperlink
information to evaluate rank. This is an important difference between the Web and
normal information retrieval databases. Linkage Analysis works by
determining the popularity and quality of a page by measuring the
number of links that point to that page. A relationship is also often
defined between pages if they share common links and reference the
same pages.
Three examples of ranking techniques based on link analysis:
1. WebQuery
2. Kleinberg
3. PageRank
(1) WebQuery allows visual browsing of Web pages. It takes a set of pages,
typically the result of a query, and ranks them based on how connected each
page is. On top of that, it extends the set by finding pages that are highly
connected to the original set.
(2) The Kleinberg ranking scheme depends on the query and considers the set of
pages S that point to, or are pointed to by, pages in the result of the query.
Authorities are pages in S which have many links pointing to them, i.e. they
should have relevant content. Hubs are pages that have many outgoing links.
This results in better authority pages coming from the incoming edges of good
hubs, and better hub pages coming from the outgoing edges to good authorities.
Take H(p) and A(p) as the hub and authority values of a page p. The following
equations are satisfied for all pages p:

    H(p) = sum of A(u) over all pages u in S such that p points to u
    A(p) = sum of H(v) over all pages v in S such that v points to p
where H(p) and A(p)
for all the pages are normalised. These values can be determined through
an iterative algorithm, and they converge to the principal eigenvector
of the link matrix of S. In relation to the Web, a maximal number
of pages pointing to the answer can be defined in order to avoid an
explosion of the size of S. (Yates, Neto, 1999)
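A minimal sketch of this iterative computation is given below; the example link
graph and the fixed iteration count are illustrative assumptions, and a real
implementation would build S from the query result and test for convergence.

# Minimal sketch of the iterative hub/authority computation described above.
# The example graph is an illustrative assumption; a real system would build S
# from the query result and its neighbourhood and iterate until convergence.
from math import sqrt

def hits(links, iterations=20):
    """links: {page: [pages it points to]} restricted to the set S."""
    pages = set(links) | {u for targets in links.values() for u in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A(p): sum of hub values of pages pointing to p
        auth = {p: sum(hub[v] for v in pages if p in links.get(v, [])) for p in pages}
        # H(p): sum of authority values of pages p points to
        hub = {p: sum(auth[u] for u in links.get(p, [])) for p in pages}
        # normalise so the values converge instead of growing without bound
        na = sqrt(sum(a * a for a in auth.values())) or 1.0
        nh = sqrt(sum(h * h for h in hub.values())) or 1.0
        auth = {p: a / na for p, a in auth.items()}
        hub = {p: h / nh for p, h in hub.items()}
    return hub, auth

links = {"p1": ["p2", "p3"], "p2": ["p3"], "p4": ["p3"]}
hub, auth = hits(links)
print(max(auth, key=auth.get))   # p3 has the most incoming links -> top authority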
(3) PageRank
is part of the ranking algorithm which is used by Google. PageRank
works by simulating a user randomly browsing the Web. The model assumes that
the user either jumps to a random page with probability q or follows a random
hyperlink on the current page with probability 1 - q. It is also assumed that
the user never goes back to a previously visited page by following an already
traversed link backwards. This entire process can be modelled using a Markov
Chain, from where the stationary probability of being in each page
can be calculated. This value is then used as part of the ranking
algorithm of Google. Let C(a) be the number of outgoing links of page
a and suppose that a page a is pointed to by pages p1 to pn. Then, the
PageRank, PR(a), of a is defined as

    PR(a) = q + (1 - q) * ( PR(p1)/C(p1) + ... + PR(pn)/C(pn) )

where q must be set by the system; a typical value is q = 0.15.
(Yates, Neto, 1999)
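The recurrence above can be computed by simple iteration. The following sketch
is only an illustration of that calculation (the link graph and iteration count
are assumptions), not Google's actual implementation.

# Minimal sketch of the PageRank recurrence given above, computed by simple
# iteration. The example graph and iteration count are illustrative
# assumptions; this is not Google's actual implementation.
def pagerank(links, q=0.15, iterations=50):
    """links: {page: [pages it points to]}; returns {page: PR value}."""
    pages = set(links) | {u for targets in links.values() for u in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for a in pages:
            # sum PR(p_i)/C(p_i) over the pages p_i that point to a
            incoming = sum(pr[p] / len(links[p])
                           for p, targets in links.items() if a in targets)
            new_pr[a] = q + (1 - q) * incoming
        pr = new_pr
    return pr

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(links))   # page "c", with two inbound links, ends up ranked highest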
2.2 History of Search Engines
One of the most
exciting features of the Internet explosion is the timescale that
it all happened in and is still happening in. I can, as with most
people my age remember a time without the Internet and without search
engines and I believe that the Internet will probably be the "television"
of our time meaning that our children will look at us in amusement
when we tell them that we were alive before the Internet boom just
as we look at our parents in amusement when they tell us about the
times before television.
2.2.1 Timeline

1990 - Alan Emtage creates "Archie", the first search tool, which archives a
repository of Internet files using anonymous FTP. (McGill University)

1991 - Mark McCahill introduces "Gopher" as an alternative to Archie.
(University of Minnesota)

1992 - Introduction of "Veronica", a search tool that scans Gopher servers for
text files. (University of Nevada)

1993 - Introduction of "Jughead", an enhanced version of Gopher with keyword
search and Boolean operator capabilities, developed by Rhett Jones.
(University of Utah)
     - Matthew Gray creates the earliest widely acclaimed Web robot, the
"World Wide Web Wanderer". (M.I.T.)

1994 - Yahoo is created by David Filo and Jerry Yang as a way to keep track of
their favourite sites on the Internet. (Stanford)
     - Brian Pinkerton introduces WebCrawler; see below. (University of
Washington)
     - Michael Mauldin creates Lycos. (Carnegie Mellon)

1995 - Infoseek becomes the default search engine for Netscape.
     - Erik Selberg and Oren Etzioni introduce MetaCrawler. (University of
Washington)
     - Excite is launched and AltaVista is introduced to public acclaim.
     - The first meta-search engine, SearchSavvy, is introduced.

1996 - Developers discover the use of meta tags to improve search engine
rankings.
     - Inktomi is founded by Eric Brewer and Paul Gauthier.
     - HotBot is launched.
     - LookSmart, a categorised directory of Web site listings, is introduced.

1997 - Ask Jeeves is introduced with key aims of ease of use, relevance,
precision and the ability to learn.
     - Bill Gross' idealab introduces GoTo, the first pay-per-click search
engine.
     - Introduction of search engine ranking software which allows Web site
owners to manipulate their position and ranking within the major search
engines.

1998 - Open Directory is launched with the aim of becoming the most
comprehensive, robust directory on the Web.
     - Larry Page and Sergey Brin introduce Google, a search engine that takes
an innovative new approach to evaluating relevancy.
     - Direct Hit is introduced as a new technology which aims to change the
world of searching by analysing the activity of past Internet searches.
     - Ranking algorithms start dealing with "off the page" considerations,
which leads to Web sites trying to increase their "link popularity" by
increasing the links to them from external sites.

1999 - The Go network is introduced by Disney and utilises Infoseek search
technology.
     - The first public Internet company, NBCi, is launched by NBC and
includes the Web service Snap, which provides Internet search and directory
facilities.
     - Norwegian-based Fast Search is launched, the first engine to index
200 million Web pages.

2000 - Online marketers begin to recognise pay-per-click search engines as an
easy, yet expensive, approach to gaining top search rankings on important
keywords and phrases.

2001 - Ask Jeeves acquires the Teoma search property.
     - GoTo changes its name to Overture as the company re-brands itself as an
e-business services company.
At present, Web site owners and optimisers recognise that the best
way to obtain top search
engine rankings is by building and optimising Web sites that have
useful and relevant content. Page designers should include informative
titles, headings, and meta-fields as well as good links. This will
aid the process of ranking algorithms and will lead to more structured,
easily searchable Web indices. This is widely accepted as the best
way forward in web site design.
3. Crawling the Web - WebCrawlers
Web crawling plays a major part in searching the Web, as is evident from the
important role crawlers play in the operation of search engines. Brian
Pinkerton created the first WebCrawler in 1994, and the development of
WebCrawler over the past eight years is detailed in section 3.2 (WebCrawler
Timeline).
3.1 How they work
(A) URL Sets
The crawler starts off with a set of URLs, and from there it extracts further
URLs which are followed recursively in a breadth-first or depth-first fashion.
To make up the URL set, search engines allow users to submit top Web sites that
will be added to the set. The set could also be made up of popular, common URLs
because they will contain information that is frequently requested by users.
Both cases work well for one crawler, but it is difficult to coordinate several
crawlers so that they avoid visiting the same page more than once.
(B) Country codes and Internet names
Another technique
is to partition the Web using country codes or Internet names. Once
the partitions have been made one or more robots are assigned to each
partition which is then explored exhaustively.
(C) Crawling
The current fastest crawlers are able to traverse up to 10 million Web pages
per day. The order in which the URLs are traversed is important (a
breadth-first sketch is given after this list):
• Breadth-first - look at all the pages linked from the current page, then all
the pages they link to, and so on. This matches well with Web sites that are
structured by related topics; the coverage will be wide but shallow.
• Depth-first - follow the first link of a page, and do the same on that page
until we cannot go any deeper, returning recursively.
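The following is a minimal breadth-first crawling sketch; the fetch function is
a stub I have assumed for illustration, whereas a real crawler would issue HTTP
requests, honour robots.txt and fetch many pages in parallel.

# Minimal breadth-first crawl sketch. fetch_links() is a stand-in I have
# assumed for illustration; a real crawler would issue HTTP requests,
# honour robots.txt and crawl many pages concurrently.
from collections import deque

def fetch_links(url, web):
    """Stub: return the outgoing links of a page from a toy in-memory 'web'."""
    return web.get(url, [])

def crawl_breadth_first(seed_urls, web, max_pages=100):
    seen = set(seed_urls)
    queue = deque(seed_urls)        # FIFO queue gives breadth-first order
    visited_order = []
    while queue and len(visited_order) < max_pages:
        url = queue.popleft()
        visited_order.append(url)
        for link in fetch_links(url, web):
            if link not in seen:    # avoid visiting the same page twice
                seen.add(link)
                queue.append(link)
    return visited_order

toy_web = {"http://a": ["http://b", "http://c"], "http://b": ["http://d"]}
print(crawl_breadth_first(["http://a"], toy_web))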
(D) Updating
The Web Pages referenced in an index will be from one day to two months
old. They will also have been explored at different dates and may
in fact not exist any more. For this reason, most search engines show
the date when the page was indexed. It is said that search engines
store somewhere in the region of 2% to 9% of invalid links. There
are some engines that learn the change frequency of a page and visit
it accordingly.
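A simple way to picture an engine "learning" the change frequency of a page is
an adaptive revisit interval. The halve/double rule and the interval bounds
below are my own illustrative assumptions, not the policy of any particular
engine.

# Illustrative sketch of revisit scheduling based on observed change frequency.
# The interval bounds and the halve/double rule are assumptions for
# illustration, not the policy of any particular search engine.
def next_revisit_interval(current_interval_days, page_changed,
                          min_days=1, max_days=60):
    """Revisit sooner if the page changed since the last crawl, later if not."""
    if page_changed:
        interval = current_interval_days / 2    # page is volatile: check more often
    else:
        interval = current_interval_days * 2    # page is stable: check less often
    return max(min_days, min(max_days, interval))

interval = 14
for changed in [True, True, False, False, False]:
    interval = next_revisit_interval(interval, changed)
    print(interval)   # 7.0, 3.5, 7.0, 14.0, 28.0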
3.2 WebCrawler Timeline

January 27, 1994 - Brian Pinkerton, a University of Washington student, starts
WebCrawler in his spare time. It was originally developed as a desktop
application and not as the Web service it is today.

April 20, 1994 - WebCrawler goes live on the Web. It originally had a database
with pages from over six thousand sites.

November 14, 1994 - WebCrawler reaches its one millionth query. That was eight
years ago, while today an engine could reach that figure in just minutes.

December 1, 1994 - DealerNet and Starwave both decide to sponsor WebCrawler.
WebCrawler was fully supported by advertising, and this kept it in operation.

June 1, 1995 - WebCrawler is acquired by AOL. At that time AOL had fewer than
1 million users and no capability to access the Web. It was believed that
AOL's resources could help make the most of WebCrawler's future.

September 4, 1995 - WebCrawler changes its design and first introduces
"Spidey", the WebCrawler mascot.

April 1996 - The functionality of WebCrawler is extended to include the best
human-edited guide for the Web: GNN.

April 1, 1997 - WebCrawler is acquired by Excite from AOL. WebCrawler was
initially supported by its own dedicated team within Excite, but that was
eventually abandoned in favour of running both WebCrawler and Excite on the
same back end.

2001 - WebCrawler is acquired by InfoSpace after Excite files for bankruptcy.
Today InfoSpace runs WebCrawler as a meta-search engine, includes paid links
in the search results, and has changed the name of the mascot from "Spidey"
to "Hunter".
Source: http://www.thinkpink.com/bp/WebCrawler/History.html
4. Meta-Search Engines
A meta-search engine, or multi-threaded engine, works by sending a given query
simultaneously to several different search engines, directories and other
databases. After it collects the results it removes any duplicate links and,
according to its own ranking algorithm, presents them to the user in a single
merged list.
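A minimal sketch of that merge step might look as follows; the engine result
lists are invented data, and the "best rank wins" rule is my own simplification
of what a real meta-searcher's ranking algorithm would do.

# Minimal sketch of the meta-search merge step described above: query several
# engines, remove duplicate links, and merge into a single ranked list.
# The engine result lists below are made-up illustrative data, and the
# "best rank wins" merge rule is my own simplification.
def meta_search(query, engines):
    """engines: {engine_name: function(query) -> ordered list of URLs}."""
    best_rank = {}
    for name, engine in engines.items():
        for rank, url in enumerate(engine(query)):
            # keep the best (lowest) rank any engine gave this URL
            if url not in best_rank or rank < best_rank[url]:
                best_rank[url] = rank
    return sorted(best_rank, key=best_rank.get)

engines = {
    "engineA": lambda q: ["http://x.com", "http://y.com", "http://z.com"],
    "engineB": lambda q: ["http://y.com", "http://w.com"],
}
print(meta_search("web crawlers", engines))
# ['http://x.com', 'http://y.com', 'http://w.com', 'http://z.com'] (duplicates removed)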
The reason for the development of meta-searchers is that every search engine
indexes different Web pages, so if you use only one engine you could miss
relevant results that another engine might return.
Meta-search engines differ from ordinary search engines because they:
• Do not have their own databases, and
• Do not accept URL submissions.
Advantages of meta-searchers:
• Results can be sorted by different attributes such as host, date, etc.
• Good for obscure or uncommon queries.
• More informative than a single search.
• Could return more relevant pages than a single engine.
• Save time by running the query on multiple engines at once.
Disadvantages of meta-searchers:
• Not all engines may return results, for example because of differences in
how they handle quotes.
• Not all meta-searchers utilise a ranking algorithm to sort the merged data.
• The query language which is common to all engines could be small.
Some of the better meta-search engines are:
• DogPile [www.dogpile.com]
• ez2www [http://ez2www.com/]
• Vivísimo [http://vivisimo.com/]
5. Web Query Languages
Search queries do not always have to be based on the content of a page; they
can also be based on the link structure which connects Web pages. To be able
to pose such queries, a different model is needed than those used for
content-based queries. The most important models are:
(1) A labelled graph model, which can be broken up into three areas:
• Nodes - used to represent Web pages
• Edges - used to represent hyperlinks
• A semi-structured data model to represent the content of Web pages.
(2) A semi-structured data model:
• The schema is not given in advance, but is implicit in the data.
• The schema is relatively large and changes regularly.
• The schema describes the current structure, but allows for violations.
• Data is not strongly typed, which means that attributes with the same name
may change type as they are used in different places.
XML falls under the category of semi-structured data models; a small
illustration of both models is given below.
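The two models can be pictured with ordinary data structures. In the small
Python illustration below, the pages, link labels and content fields are all
invented for the example.

# Tiny illustration of the two models above using plain data structures.
# The pages, link labels and content fields are invented for illustration.

# (1) Labelled graph model: nodes are pages, edges are labelled hyperlinks.
nodes = {"http://site.com/", "http://site.com/about", "http://other.org/"}
edges = [
    ("http://site.com/", "http://site.com/about", "About us"),   # (from, to, anchor label)
    ("http://site.com/", "http://other.org/", "Partner site"),
]

# (2) Semi-structured content: no fixed schema, fields vary from page to page.
content = {
    "http://site.com/": {"title": "Home", "sections": ["news", "contact"]},
    "http://site.com/about": {"title": "About", "author": "J. Smith"},
    "http://other.org/": {"headline": "Welcome"},   # different attributes entirely
}

# A link-structure query: which pages does the home page point to, and with what label?
home = "http://site.com/"
print([(to, label) for (frm, to, label) in edges if frm == home])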
Some models for querying hypertext systems existed before the development of
the Web, but the first generation of Web query languages was developed for the
purpose of combining content with structure. These languages combine patterns
that appear within the result documents with graph queries describing the link
structure. They include:
• WebSQL
• W3QL
• WebLog
• WQL
Source: Modern Information Retrieval, 1999 [Ricardo Baeza-Yates, Berthier
Ribeiro-Neto]
WebSQL:
WebSQL models the Web as a relational database. The database has two relations,
Document and Anchor. Document contains one tuple for each document on the Web,
while Anchor has one tuple for each anchor in each document. All tuples are
virtual and cannot be enumerated. Symbols used by the language (illustrated in
the sketch below):
• -> for a link to a document on the same site,
• #> for a link within the same document,
• => for a link to another site.
W3QL:
Similar to WebSQL, but uses external programs for specifying content conditions
on files instead of including these in the language. The next generation will
replace these external programs with extensible methods based on the MIME
standard.
WebLog:
Uses a deductive rules language, Datalog, instead of an SQL-like syntax.
WQL:
The query language of the WebDB project; similar to WebSQL, but supports more
comprehensive SQL functionality such as aggregation and grouping. It also has
limited intra-document structure querying.
The second generation of Web query languages, called data manipulation
languages, concentrates on the semi-structured data model. They also extend
the first-generation languages by providing access to the structure of Web
pages. The main second-generation languages are:
• STRUQL
• FLORID
• WEBOQL
• ARANEUS
Source: Web Query Languages, Intelligent Information Integration
[Alan K. Dippel]
STRUQL
• Based on labelled directed graphs. Supports URLs, PostScript, text, image
and HTML files. Part of the Strudel Web site management system.
FLORID
• A prototype implementation of the deductive and object-oriented formalism
F-Logic.
• A Web document is modelled by two classes, url and webdoc, as strings.
WEBOQL
• Uses a hypertree data structure: ordered arc-labelled trees with two types
of arcs, internal and external.
• Internal arcs represent structured objects and external arcs represent
references (typically hyperlinks).
• Sets of hypertrees are collected into webs.
ARANEUS
• A database project that uses the Ulixes language to build relational views
of the data and then generates hypertextual views for the user using the
Penelope language.
Summary:
The Web query languages above are too complex to be used directly by
interactive users. Work is ongoing to make them suitable for casual users.
Overall, this is the area of data integration with the most potential for
making information on the Web available to the public.
Source: Web Query Languages, Intelligent Information Integration
[Alan K. Dippel]
6. Conclusions
Up to now the development of the Web has been sporadic and unplanned, and
search engines have provided an effective mechanism for searching this
unstructured mass of information. With the Web growing exponentially, however,
the focus needs to change from the way in which the Web is searched to the way
in which the Web is developed. Leading Web site developers recognise the need
for informative, well-structured sites with robust, relevant content, but
nobody can predict the form which the Web will take over the next few years.
At the moment Google is recognised as the world's most comprehensive search
engine because of its large index and the number of queries it handles on a
daily basis. Google also implements a good ranking algorithm, while its site is
simple and effective. Nobody knows what the future holds for the Web and for
engines like Google, but with the amount of data ever increasing, new
algorithms and data structures will need to be developed to ensure that the
ability to retrieve information grows at the same rate as new information is
added to this vast resource.
Because of the nature of the Web, planning its development is near impossible,
as it would require global support and cooperation from every corner of the
earth, something which has yet to be achieved on any issue. This means that
searching the Web is only going to become more tedious and difficult unless
some form of effective standardisation is eventually introduced to structure
and contain the vast amount of global information.
New search techniques such as those which examine hyperlinks are hailed as the
way forward when it comes to searching the Web. Again, the very nature of the
Web casts doubt in my mind over the long-term future of any of the current
methods of information retrieval, because of the sheer scale of the explosion
that has occurred. Development of the Web is still new to the world: ten years
ago nobody could have accurately predicted today's situation, and by the same
token nobody can safely guarantee the state or structure which the Web will
take in ten years' time.
Personally, I believe that there will be a revolution within the World Wide
Web: not a general revolution like before, when it became widespread throughout
the world, but an internal revolution which will redefine everything we
currently think about the storage and availability of information on a global
level.
Currently large governments such as that of China see the power behind
the Web and the power which search engines in their current form have.
The fact that, for the first time in history, people from all walks of life
have the opportunity to freely access information shows that the world has
been changed forever. Again, the only problem with
this is the lack of some form of effective standardisation which will
ensure the integrity of content being added to the World Wide Web.
My feelings on the future of searching the web can be summarised as
follows. I do not believe in the long term future of any of the current
search engines or techniques. The names and company logos may remain
but the fundamentals behind the way in which they work will have been
changed so much they will be unrecognisable in comparison. I do not
see this as a bleak future for searching the web but as an exciting
prospect, because the future development of the Web is going to force
the design of new and innovative technologies which can deal with
such a rapid growth. I believe that these new technologies will have
to be extreme and radical and will require both imagination and expertise
in order to be implemented.
My final thought is that the importance of the Web cannot be overstated, and
its future structure cannot be predicted. This has created a time of
uncertainty after the initial boom, in which developing new and energetic
search methods is a necessity in order to ensure that the vast wealth of
global knowledge remains accessible to every single person.
7. References
Modern Information Retrieval, 1999 [Ricardo Baeza-Yates, Berthier Ribeiro-Neto]
MMIR - Dublin City University, 2002: http://www.computing.dcu.ie/~cgurrin/
Search engine publications and news: http://www.searchenginewatch.com/
A Brief History of WebCrawler: http://www.thinkpink.com/bp/WebCrawler/History.html
Information relating to Gopher: http://www.knowalot.com/nova/gopher.html
Information on Google's features: http://www.google.com/help/features.html
Some Web searching tools:
Google - the best search engine on the Web.
Yahoo! - provider of comprehensive online products and services to consumers
and businesses worldwide.
Excite - provides search, news, email, personals, portfolio tracking, and
other services.
Lycos - develops and provides online guides to locate and filter information
on the Internet; its products enable users to accurately identify and select
information of interest to them.
Netscape Search - combines results from Netcenter, the Open Directory, and
the Web.
AltaVista - portal featuring a web and newsgroup search engine as well as paid
submission services.
HotBot - offers users a point-and-click interface, pulldown menus, and the
ability to use plain English terminology for constructing searches.
AOL Search - search engine and directory.
WebCrawler - see section 3.

David Reade, Computer Applications Software Engineering - Last updated
December 2002