Abstract

The Library of Congress has been collecting web content since 2000, first through its MINERVA project and, since 2004, as part of a broader Internet capture project. In addition to providing access to some collected content, we have begun to develop tools and techniques to better understand and preserve what we are collecting. When compared with other digital collections, content from the Web has some unique characteristics, such as naming issues and the varying types of relationships between items; nevertheless, when considered at the level of individual items, existing digital preservation approaches are entirely applicable.

In this article, we describe some initial results from examining some selected content from this perspective, including the tools used in our analysis of the Library's Web collections, the approaches taken, and directions for further analysis. We intend that this information will be useful for guiding future web harvest and preservation efforts both within and outside the Library. Our goals include:

• Identifying and measuring the content types in the collection;
• Assessing the variation in file types and validity of “wild” Internet content; and
• Determining typical attributes of various file types, to generate predictors for future web harvests.

We describe web collections as a specific case of a collection of heterogeneous digital content, focusing on the content as received. We will not address issues relating to acquiring the content, such as retrieval problems and link detection during the web crawl, as these issues have been addressed in detail elsewhere and are ultimately orthogonal to preservation issues.

