Next: ... and statistical methods Up: The WEB archives: A Previous: The Internet booms...

... and supercomputers and Web robots become more powerful

``Learning is a lifetime journey. . . growing older merely adds experience to knowledge and wisdom to curiosity.''
-C.E. Lawrence

Creating a large-scale archive of the WWW that conserves the history of its changes for years to come will require substantial computing power and fast access to massive storage systems (see previous section). Promising candidates are general-purpose supercomputers built with custom hardware and software [*].

For example, in the Swiss-Tx project, researchers and developers from the Swiss Federal Institute of Technology in Lausanne (EPFL) and Zürich (ETHZ), from the Swiss Center for Scientific Computing in Manno (SCSC), and from Supercomputing Systems AG in Zürich (SCS) are jointly developing such communication hardware and software [12,13]. The first Swiss-Tx prototype supercomputer, with 64 Gflop/s peak computing performance, is the Swiss-T1. This machine consists of 8 processing nodes of 4 dual-processor boxes each (Figure 1), for a total of 64 production processors, plus a four-processor frontend node and a development unit. The frontend handles resource management and all external interactions. The Swiss-T1 has 37 GByte of main memory, 800 GByte of local disk space, and 1 TByte of archiving space. For larger archiving purposes, every node could be connected to a RAID. The 12x12 crossbar switch doubly connects the boxes within a node; the remaining links interconnect the nodes. The system is unique because of its interconnect switch, which allows the connections between the machines to be changed according to the communication load. This makes it ideal for searching and for parallel databases. One of the authors (A.S.A. Roehrl) performed initial experiments with custom code in combination with the Oracle Parallel Server, in joint work with Oracle.


 
Figure 1: Illustration of the first prototype Swiss-T1 supercomputer with 64 Gflop/s peak performance.

The Swiss-Tx is an excellent framework for testing software Web robots for Web archiving purposes. The authors' modular, extensible, and scalable [14] crawler Ellen was developed as a search robot to build the database for a search engine. The crawling itself is done by only two to four processors. The other processors decide what to crawl, and they run the user queries that show what the ``WWW'' was like, e.g., two months ago. We use a combination of Google's PageRank [15], some extra analysis of the link structure [9,16], and capture-recapture methods (see Section 3) to weight the crawling results.
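To make the link-based weighting more concrete, the following is a minimal sketch of a PageRank-style power iteration on a toy link graph. It is not the weighting actually used by Ellen, which combines several methods. The function name `pagerank`, the damping factor 0.85, the iteration count, and the four-page graph `toy_web` are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}.

    Illustrative only: damping and iteration count are assumed values.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same small "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page Web: "c" is linked from three pages, "d" from none.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

ranks = pagerank(toy_web)
# Pages sorted from highest to lowest rank, e.g. as a crawl priority.
crawl_order = sorted(ranks, key=ranks.get, reverse=True)
print(crawl_order)
```

In this toy graph the most-linked page ends up first in `crawl_order` and the page without incoming links ends up last, which is the kind of ordering a robot could use to decide what to crawl next.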

Figure 2 shows a simplified illustration of the search scheme of the Web robot Ellen. It downloads a Web page, analyses it, stores the results, and continues to process the links on the page, provided the statistical analysis shows that it might be worthwhile to follow them. The relatively low data dependency makes this an excellent task to run in parallel. The different search results are synchronized from time to time to decide whether to continue the search process or to start a new thread. However, building a Web robot and launching it on a large scale also has many pitfalls. For example, a robot that requests too many pages from one server in quick succession (``rapid-fire'') is considered unacceptable in the Web community. Therefore, the parallel robot puts all newly found links from different servers into a workpool, from which new links are randomly drawn and examined in the next step.
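The workpool idea can be sketched as follows. This is a minimal single-threaded illustration under stated assumptions, not Ellen's implementation: the class `WorkPool`, the function `crawl`, the per-host cool-down `min_gap`, and the in-memory `FAKE_WEB` (which stands in for real downloads) are all hypothetical. Time is counted in abstract ticks rather than seconds so that the example runs instantly.

```python
import random
from urllib.parse import urlparse


class WorkPool:
    """Pool of pending URLs with a random draw and a per-host cool-down."""

    def __init__(self, min_gap=2, seed=0):
        self.pending = []       # URLs waiting to be fetched
        self.seen = set()       # every URL ever added, to avoid repeats
        self.last_hit = {}      # host -> tick of the most recent fetch
        self.min_gap = min_gap  # minimum ticks between fetches on one host
        self.rng = random.Random(seed)

    def add(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.pending.append(url)

    def draw(self, tick):
        """Return a random URL whose host has cooled down, or None."""
        eligible = [
            u for u in self.pending
            if tick - self.last_hit.get(urlparse(u).netloc, -self.min_gap)
            >= self.min_gap
        ]
        if not eligible:
            return None
        url = self.rng.choice(eligible)
        self.pending.remove(url)
        self.last_hit[urlparse(url).netloc] = tick
        return url


def crawl(fake_web, start, min_gap=2, max_ticks=100):
    """Crawl an in-memory link graph; returns a list of (tick, url)."""
    pool = WorkPool(min_gap=min_gap)
    pool.add(start)
    fetch_log = []
    for tick in range(max_ticks):
        if not pool.pending:
            break
        url = pool.draw(tick)
        if url is None:
            continue  # every pending host is cooling down; idle one tick
        fetch_log.append((tick, url))
        # "Download" the page and put its links into the workpool.
        for link in fake_web.get(url, []):
            pool.add(link)
    return fetch_log


# Hypothetical miniature Web with three hosts.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/1", "http://a.example/2",
                          "http://b.example/"],
    "http://a.example/1": ["http://b.example/x"],
    "http://a.example/2": [],
    "http://b.example/": ["http://a.example/", "http://c.example/"],
    "http://b.example/x": [],
    "http://c.example/": [],
}

log = crawl(FAKE_WEB, "http://a.example/")
for tick, url in log:
    print(tick, url)
```

Drawing at random from links on different servers spreads the requests across hosts, and the cool-down guarantees that no host is hit twice within `min_gap` ticks. A parallel version would share one such pool among the threads and protect it with a lock.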


 
Figure 2: Simplified search scheme of a multi-threaded version of the Web robot Ellen. The dashed lines indicate additional threads running in parallel.  

Because of insufficient bandwidth or Web server computing power relative to the Web traffic, one often encounters stalling messages. The WWW is then literally transformed into the World Wide Waiting. In the test, the robot Ellen had 40 Mbps of bandwidth available to the USA and a 100 Mbps connection to other Swiss universities and to SwitchNG. This bandwidth is of the same order of magnitude as that recently implemented among many US universities under the US ``Next Generation Internet Initiative'' [17]. However, it is at least one order of magnitude smaller than that achieved in commercial Internet backbones deployed by leading companies, and three or more orders of magnitude smaller than that achieved in current photonic testbeds in Europe [18], the US [19,20], and Japan [21].
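To give a feeling for these numbers, the following back-of-the-envelope sketch estimates how long it would take to fill the 1 TByte archiving space of the Swiss-T1 over each of the two links. It assumes that Mbps means 10^6 bits per second, that 1 TByte means 10^12 bytes, and that the link is fully and exclusively used, which is optimistic in practice. The function name `transfer_days` is hypothetical.

```python
def transfer_days(n_bytes, link_mbps):
    """Days needed to move n_bytes over a fully used link of link_mbps Mbit/s."""
    seconds = n_bytes * 8 / (link_mbps * 1_000_000)
    return seconds / 86_400  # 86,400 seconds per day


# 1 TByte archiving space; decimal units are an assumption.
ARCHIVE_BYTES = 1_000_000_000_000

for mbps in (40, 100):
    print(f"{mbps} Mbit/s: {transfer_days(ARCHIVE_BYTES, mbps):.2f} days")
```

Under these assumptions the transfer takes about 2.31 days at 40 Mbit/s and about 0.93 days at 100 Mbit/s, so the network link rather than the storage is likely to be the limiting factor.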

The power of Web archives, and of the approach outlined in this paper, lies in the fact that the WWW is the biggest database ever, although it is not yet organized as a database. About 90% of today's data reside outside of real databases, which makes it impossible, or at least very hard, to query these external data.

Some current Web software developments are designed to facilitate Web archiving and data interchange in the future. The big success of HTML, the most widely used language for creating Web pages, is unfortunately also its biggest drawback. It is based on a simple and rigid set of tags, which does not allow data to be described properly for data interchange. A new technology, XML (Extensible Markup Language) [22,23,24,25], does not have these limitations, since it was specifically designed for more complex data interchange purposes [26], e.g., with other standard software packages. Today, most software companies, including Microsoft, Oracle, SAP, and IBM, support XML [27,28]. XML is designed to make it ``easy and straightforward to use SGML on the Web: easy to define document types, easy to author and manage SGML-defined documents, and easy to transmit and share them across the Web'' [29]. Once software robots are able to better understand the contents of Web pages [30], much more sophisticated data mining can be done much faster.
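The advantage of self-describing tags for a software robot can be illustrated with a small sketch. The XML vocabulary below (`archive`, `page`, `title`, `size`) and the URLs are invented for this example and do not correspond to any existing standard or to the format used by the authors. The parsing uses the `xml.etree.ElementTree` module from the Python standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical archive record: the tags name the data they contain,
# which fixed presentation tags such as <b> or <table> cannot do.
XML_RECORD = """
<archive>
  <page url="http://www.example.org/" fetched="2000-02-14">
    <title>Example page</title>
    <size unit="byte">2048</size>
  </page>
  <page url="http://www.example.org/news" fetched="2000-02-14">
    <title>News</title>
    <size unit="byte">5120</size>
  </page>
</archive>
"""

root = ET.fromstring(XML_RECORD.strip())

# Turn every <page> element into a plain record that can be queried.
records = [
    {
        "url": page.get("url"),
        "fetched": page.get("fetched"),
        "title": page.findtext("title"),
        "bytes": int(page.findtext("size")),
    }
    for page in root.findall("page")
]

total_bytes = sum(r["bytes"] for r in records)
print(len(records), "pages,", total_bytes, "bytes")
```

Because the element names carry the meaning, the robot can extract fields such as the title or the size directly, without guessing them from the page layout.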


A. S. A. Roehrl
2/14/2000