Abstract

Web archives attempt to preserve the fast changing web, yet they will always be incomplete. Due to restrictions in crawling depth, crawling frequency, and restrictive selection policies, large parts of the Web are unarchived and, therefore, lost to posterity. In this paper, we propose an approach to uncover unarchived web pages and websites and to reconstruct different types of descriptions for these pages and sites, based on links and anchor text in the set of crawled pages. We experiment with this approach on the Dutch Web Archive and evaluate the usefulness of page and host-level representations of unarchived content. Our main findings are the following: First, the crawled web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of a Web archive. Second, the link and anchor text have a highly skewed distribution: popular pages such as home pages have
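As a rough illustration of the underlying idea (not the authors' actual pipeline, which processes the archive's crawl data at scale), the sketch below aggregates anchor text and inlink counts from crawled pages for link targets that are missing from the archive. The function names and the shape of the returned representation are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorExtractor(HTMLParser):
    """Collect (target URL, anchor text) pairs from one crawled page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []            # list of (absolute href, anchor text)
        self._current_href = None
        self._current_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current_href = urljoin(self.base_url, href)
                self._current_text = []

    def handle_data(self, data):
        if self._current_href:
            self._current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href:
            text = " ".join("".join(self._current_text).split())
            self.links.append((self._current_href, text))
            self._current_href = None


def build_unarchived_representations(crawled_pages, archived_urls):
    """crawled_pages: iterable of (url, html) pairs for pages in the archive.
    archived_urls: set of URLs that were actually captured.
    Returns {unarchived_url: {"anchors": [...], "inlinks": n}}."""
    reps = defaultdict(lambda: {"anchors": [], "inlinks": 0})
    for url, html in crawled_pages:
        parser = AnchorExtractor(url)
        parser.feed(html)
        for target, anchor in parser.links:
            if target not in archived_urls:   # link points outside the crawl
                reps[target]["inlinks"] += 1
                if anchor:
                    reps[target]["anchors"].append(anchor)
    return dict(reps)
```

Each entry in the returned mapping is an implicit, anchor-text-based representation of a page that was linked to from the archive but never crawled itself.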

Highlights

  • The advent of the web has had a revolutionary impact on how we acquire, share, and publish information

  • We focus on the use case of the web archive, which differs from the live web in that we cannot go back and crawl the unarchived page and therefore have to rely exclusively on these implicit representations

  • We study RQ1: Can we uncover a significant fraction of unarchived web pages and websites based on references to them in the web archive? We investigate the contents of the Dutch web archive, quantifying and classifying the unarchived material that can be uncovered via the archive


Summary

Introduction

The advent of the web has had a revolutionary impact on how we acquire, share, and publish information. Digital-born content is rapidly taking over other forms of publishing, and the overwhelming majority of online publications has no parallel in a material format. Such digital content is as easily deleted as it is published, and the ephemerality of web content introduces unprecedented risks to the world's digital cultural heritage, severely endangering the future understanding of our era [31]. AlSum et al. [1] queried the Memento aggregator to profile and evaluate the coverage of twelve public web archives. Coverage (i.e. whether a resource is archived and in which archive its past versions are located) was calculated based on the HTTP header of host-level URLs.
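As a hedged illustration of such a coverage check, the sketch below asks a Memento aggregator for the TimeMap of a host-level URL and reports whether any archive holds a capture. The endpoint URL, helper name, and response handling are assumptions for illustration, not the procedure used in [1].

```python
import requests

# Assumed aggregator endpoint; the Memento "Time Travel" aggregator exposes
# link-format TimeMaps at a URL of roughly this shape.
TIMEMAP_ENDPOINT = "http://timetravel.mementoweb.org/timemap/link/"


def is_covered(url, timeout=10):
    """Return True if the aggregator lists at least one memento (capture)
    of `url` in any archive; a non-200 response is treated as no coverage."""
    resp = requests.get(TIMEMAP_ENDPOINT + url, timeout=timeout)
    if resp.status_code != 200:
        return False
    # In a link-format TimeMap, each entry whose rel value contains
    # "memento" describes one archived capture of the resource.
    entries = resp.text.replace("\n", "").split(",")
    return any("rel=" in entry and "memento" in entry for entry in entries)


# Example: check coverage of a host-level URL
# print(is_covered("http://www.example.com/"))
```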

