Characterizing Web Archive Content

Andrew Boyko

doi:10.2352/issn.2168-3204.2005.2.1.art00010

Abstract

The Library of Congress has been collecting web content since 2000, first through its MINERVA project and, since 2004, as part of a broader Internet capture project. In addition to providing access to some collected content, we have begun to develop tools and techniques to better understand and preserve what we are collecting. When compared with other digital collections, content from the Web has some unique characteristics, such as naming issues and the varying types of relationships between items; nevertheless, when considered at the level of individual items, existing digital preservation approaches are entirely applicable.

In this article, we describe some initial results from examining selected content from this perspective, including the tools used in our analysis of the Library's Web collections, the approaches taken, and directions for further analysis. We intend that this information will be useful for guiding future web harvest and preservation efforts both within and outside the Library. Our goals include:

• Identifying and measuring the content types in the collection;
• Assessing the variation in file types and validity of “wild” Internet content; and
• Determining typical attributes of various file types, to generate predictors for future web harvests.

We describe web collections as a specific case of a collection of heterogeneous digital content, focusing on the content as received. We will not address issues relating to acquiring the content, such as retrieval problems and link detection during the web crawl, as these issues have been addressed in detail elsewhere and are ultimately orthogonal to preservation issues.
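As an illustration of the first goal (identifying and measuring content types), the following is a minimal sketch rather than the tooling used in the article. It assumes each harvested item is available as a hypothetical `(url, mime_type, size_in_bytes)` tuple; the record shape and the function name `summarize_content_types` are assumptions made for this example only.

```python
from collections import Counter, defaultdict


def summarize_content_types(records):
    """Tally item counts and total bytes per reported MIME type.

    `records` is an iterable of (url, mime_type, size_in_bytes) tuples.
    This record shape is an assumption for illustration, not the format
    of the Library's collections.
    """
    counts = Counter()
    total_bytes = defaultdict(int)
    for _url, mime_type, size in records:
        # Drop parameters such as "; charset=utf-8" and normalize case,
        # so that variants of the same type are counted together.
        key = mime_type.split(";", 1)[0].strip().lower() or "unknown"
        counts[key] += 1
        total_bytes[key] += size
    return {
        key: {
            "count": counts[key],
            "bytes": total_bytes[key],
            "mean_bytes": total_bytes[key] / counts[key],
        }
        for key in counts
    }
```

Per-type means of this kind are one simple example of the "typical attributes" that could serve as predictors for later harvests. Note that the reported type of "wild" content may not match the actual file format, so a tally like this measures what servers declared, not what was validated.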

More From: Archiving Conference
