``We had the sky up there, all speckled with stars, and we used to lay on our backs and look up at them, and discuss about whether they was made, or only just happened.'' - Huckleberry Finn
The Internet is booming, both in the number of hosts and in the number of users. In particular, the World Wide Web (WWW) is still growing exponentially [1,2]. However, this growth is difficult to measure, especially if a statement is to be made about the quality of the available information rather than its mere quantity. A host count in July 1998 estimated some 36,739,000 hosts worldwide, of which 6,529,000 replied to a ping [3]. The estimate of about 800 million Webpages in total is also well known. However, because of dynamic Webpages and mirror Websites, such totals are less relevant than the quality and currency of the pages that are actually available to crawl.
The challenge of creating a large-scale archive of the WWW to conserve the history
of its change is similar in magnitude to that of the top-secret NSA project
Echelon [4], or of search engines like Inktomi [5] and
Google [6,2], with the additional complication of a
time dimension.
Technically, all the above-mentioned examples are run on a network of
workstations [7,8].
The high-end Internet search engines of today include databases of about 150
million indexed Webpages, and they crawl more than 10 million
Webpages per day [9], which are stored in the database.
This crawling speed will most likely have to be increased in the future,
as the average lifetime of a Webpage is only 44 days [10], and the exponential growth
of the total storage size of all Webpages will be sustained for quite some time.
Most search engines store the page and perform some simple page-relevance
ranking.
However, the methods used for further postprocessing of the search information differ considerably.
According to the authors' experience with scanning Webpages with robots [11,1],
the average Webpage is about 300 words long. So a typical crawling system is
left with the processing of billions of words, and will also serve several
hundred thousand searches a day, which requires large-scale computing power
and fast access to massive storage systems.
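As a rough check on the scale involved, the figures quoted above can be combined in a back-of-envelope estimate. This is only a sketch: it assumes that the average of 300 words per Webpage applies uniformly, both to the roughly 150 million indexed pages and to the more than 10 million pages crawled per day. The symbols $W_{\mathrm{index}}$ and $W_{\mathrm{day}}$ are introduced here for illustration only.

\[
W_{\mathrm{index}} \approx 1.5 \times 10^{8}\ \text{pages} \times 300\ \frac{\text{words}}{\text{page}} = 4.5 \times 10^{10}\ \text{words},
\qquad
W_{\mathrm{day}} \approx 10^{7}\ \frac{\text{pages}}{\text{day}} \times 300\ \frac{\text{words}}{\text{page}} = 3 \times 10^{9}\ \frac{\text{words}}{\text{day}}.
\]

Under these assumptions, the stored index holds tens of billions of words, and the daily crawl adds or refreshes a few billion more.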