Taking an interdisciplinary approach,
the authors discuss both the technical issues of creating
archives of the World Wide Web (as suggested at www.archive.org)
and the possible socio-political relevance of such archives in the future.
As the Internet becomes the Ever- and Everywherenet, Web archives may
become a memory of mankind, a sort of time machine for going back into the
past.
The authors present the hardware and software concepts,
and an initial analysis, of a highly scalable and
extensible approach to archiving a fully queryable copy of the ever-changing Web.
The purpose is not to compete with the efforts at www.archive.org, but to
present research results that may be useful in any future Web archiving
project.
The authors' software approach is unique in that the search
strategy of the Web crawler is based
on capture-recapture techniques (from statistics), rather than the common
brute-force method of scanning as many Web pages as possible.
The approach also yields estimates of the quality of the scans, which are
difficult to obtain given the rapidly changing character of the Web.
The authors use the Swiss-Tx supercomputer as the hardware platform of choice
because its interconnect switch allows
the connections between the different nodes to be changed on the fly
to enhance the performance of the parallel search database.
Finally, the authors show that the creation of coherent Web archives
will necessarily be of interest to researchers of all specialities,
since the memory of humankind may be the best heritage to
leave to future generations.
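
To illustrate the capture-recapture idea mentioned above, the following is a
minimal sketch of the classical two-sample Lincoln-Petersen estimator from
statistics, together with Chapman's bias-corrected variant. It is an
assumption that an estimator of this kind applies here; the abstract does not
say which estimator the authors' crawler uses. The function names and the
idea of treating two crawls as two independent samples of page identifiers
are hypothetical and chosen only for illustration.

```python
def lincoln_petersen(first_crawl, second_crawl):
    """Estimate the total number of pages from two independent crawls.

    Classical Lincoln-Petersen estimator: N ~ n1 * n2 / m, where n1 and n2
    are the sample sizes and m is the number of pages seen in both crawls.
    Hypothetical helper, not the authors' implementation.
    """
    first, second = set(first_crawl), set(second_crawl)
    recaptured = len(first & second)  # pages found by both crawls
    if recaptured == 0:
        raise ValueError("no overlap between crawls; estimate is undefined")
    return len(first) * len(second) / recaptured


def chapman(first_crawl, second_crawl):
    """Chapman's variant, which is less biased for small overlaps
    and remains defined when the overlap is zero."""
    first, second = set(first_crawl), set(second_crawl)
    recaptured = len(first & second)
    return (len(first) + 1) * (len(second) + 1) / (recaptured + 1) - 1


if __name__ == "__main__":
    # Toy data: page identifiers are plain integers.
    crawl_a = range(0, 100)    # 100 pages seen in the first crawl
    crawl_b = range(50, 200)   # 150 pages seen in the second, 50 shared
    print(lincoln_petersen(crawl_a, crawl_b))
    print(chapman(crawl_a, crawl_b))
```

The ratio of pages actually scanned to the estimated total gives a rough
coverage figure, which is one way a quality estimate for a scan could be
derived. The estimator assumes a closed population and independent samples;
neither holds strictly for a rapidly changing Web.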