Abstract

Web archiving provides social scientists and digital humanities researchers with a data source that enables the study of a wealth of historical phenomena. One of the most notable efforts to record the history of the World Wide Web is the Internet Archive (IA) project, which maintains the largest repository of archived data in the world. Understanding the quality of archived data and the completeness of each record of a single website is a central issue for scholarly research, and yet there is no standard record of the provenance of digital archives. Indeed, although present-day records tend to be quite accurate, archived Web content deteriorates as one moves back in time. This paper analyzes a subset of archived Web data, measures the degree of degradation within it, and proposes statistical inference to overcome such limitations.
