Abstract

Twenty years after the pioneering experiments performed by the Internet Archive and a few national libraries, web archiving has become a common activity of many scientific, cultural, and heritage institutions. They use a set of tools, generally open source, to identify, harvest, store, index, make available to end users, and preserve internet content over the long term. Institutions seeking to preserve web archives, however, face major challenges: not only the huge amount of collected data, but also the lack of fully reliable metadata, which are crucial for understanding the web archives and informing future preservation actions upon them. Web archives are generally stored in container formats, notably the ARC file format and its successor, the WARC format, which is an ISO standard. Context and provenance information, generated prior to or as part of the harvesting process, is stored in these container formats, but other metadata, especially information on the formats of the collected files, may be generated afterwards. To store and archive these assets in digital repositories, it is necessary to record and manage their metadata. Institutions therefore need to make data and metadata modeling choices, which should be consistent not only with the design of their own repository and the kind and amount of data they have to preserve, but also with their conceptual view of the nature of web archives. This paper presents the choices and achievements of the National Library of France, an approach called “container modeling”. It then compares this approach to those of other members of the International Internet Preservation Consortium and to the projects of the New York Art Resources Consortium. It underlines how the different solutions are implemented with PREMIS and concludes with the use of format identification tools and metadata vocabularies for emulation strategies.
