MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing

Sawood Alam

doi:10.25777/5vnk-s536

Abstract

With the proliferation of public web archives, it is becoming more important to better profile their contents, both to understand their immense holdings and to support routing of requests in Memento aggregators. A memento is a past version of a web page, and a Memento aggregator is a tool or service that aggregates mementos from many different web archives. To save resources, the Memento aggregator should only poll the archives that are likely to have a copy of the requested Uniform Resource Identifier (URI). Using the Crawler Index (CDX), we generate profiles of the archives that summarize their holdings and use them to inform routing of the Memento aggregator’s URI requests. Additionally, we use full-text search (when available) or sample URI lookups to build an understanding of an archive’s holdings. Previous work in profiling ranged from using full URIs (no false positives, but large profiles) to using only top-level domains (TLDs) (smaller profiles, but many false positives). This work explores strategies between these two extremes. For evaluation, we used CDX files from Archive-It, UK Web Archive, Stanford Web Archive Portal, and Arquivo.pt. Moreover, we used web server access log files from the Internet Archive’s Wayback Machine, UK Web Archive, Arquivo.pt, LANL’s Memento Proxy, and ODU’s MemGator Server. In addition, we utilized a historical dataset of URIs from DMOZ. In early experiments with various URI-based static profiling policies, we successfully identified about 78% of the URIs that were not present in the archive with less than 1% relative cost as compared to the complete knowledge profile, and 94% of the URIs with less than 10% relative cost, without any false negatives. In another experiment, we found that we can correctly route 80% of the requests while maintaining about 0.9 recall by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile.
We created MementoMap, a framework that allows web archives and third parties to express holdings and/or voids of an archive of any size with varying levels of detail to fulfil various application needs. Our archive profiling framework enables tools and services to predict and rank archives where mementos of a requested URI are likely to be present. In static profiling policies, we predefined, for each policy, the maximum depth of host and path segments of URIs that are used as URI keys. This gave us a good baseline for evaluation, but was not suitable for merging profiles with different policies. Later, we introduced a more flexible means to represent URI keys that uses wildcard characters to indicate whether a URI key was truncated. Moreover, we developed an algorithm to roll up URI keys dynamically at arbitrary depths when sufficient archiving activity is detected under certain URI prefixes. In an experiment with dynamic profiling of archival holdings we found that a MementoMap of less than 1.5% relative cost can correctly identify the…
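
To make the idea of depth-limited URI keys and wildcard truncation more concrete, the following is a minimal Python sketch, not the MementoMap implementation. The key syntax (reversed, comma-separated host segments, a `)/` separator before the path, and a trailing `*` for truncated keys) is an assumption modeled on SURT-style keys; the abstract does not specify it. The function names `uri_key`, `candidate_keys`, and `route`, the archive names, and the profile contents are all hypothetical. Query strings are ignored and no host canonicalization (such as stripping `www`) is attempted.

```python
from urllib.parse import urlsplit


def _segments(uri: str) -> tuple[list[str], list[str]]:
    """Split a URI into reversed host segments and path segments."""
    parts = urlsplit(uri if "://" in uri else "http://" + uri)
    host = (parts.hostname or "").lower()
    host_segs = host.split(".")[::-1]  # example.com -> ["com", "example"]
    path_segs = [s for s in parts.path.split("/") if s]
    return host_segs, path_segs


def uri_key(uri: str, max_host: int, max_path: int) -> str:
    """Build a key with at most max_host host and max_path path segments.

    A trailing '*' marks a key that was truncated (assumed syntax).
    """
    host_segs, path_segs = _segments(uri)
    if len(host_segs) > max_host:
        # Host truncated: the path is dropped entirely.
        return ",".join(host_segs[:max_host]) + ",*"
    key = ",".join(host_segs) + ")/"
    if len(path_segs) > max_path:
        kept = path_segs[:max_path]
        return key + "/".join(kept) + ("/" if kept else "") + "*"
    return key + "/".join(path_segs)


def candidate_keys(uri: str) -> set[str]:
    """All keys a profile could hold that cover this URI: the exact key
    plus every wildcard key truncated at a shallower host or path depth."""
    host_segs, path_segs = _segments(uri)
    n, m = len(host_segs), len(path_segs)
    keys = {uri_key(uri, n, m)}                       # exact, no wildcard
    keys |= {uri_key(uri, n, k) for k in range(m)}    # path-truncated
    keys |= {uri_key(uri, j, 0) for j in range(1, n)} # host-truncated
    return keys


def route(profiles: dict[str, set[str]], uri: str) -> list[str]:
    """Return the archives whose profile contains a key covering the URI."""
    keys = candidate_keys(uri)
    return [name for name, profile in profiles.items() if keys & profile]


if __name__ == "__main__":
    # Hypothetical profiles built with different truncation depths.
    profiles = {
        "archive_a": {"com,example)/a/*"},
        "archive_b": {"org,*"},
    }
    print(uri_key("https://www.example.com/a/b/c?x=1", 3, 1))
    print(route(profiles, "https://example.com/a/b/c"))
    print(route(profiles, "http://wikipedia.org/wiki/Memento"))
```

Because truncation is marked in the key itself, a lookup does not need to know which depth policy produced a given profile entry: it only checks whether any shallower wildcard form of the requested URI is present. This is one way profiles built with different depths could be consulted together, which is the limitation of fixed-depth policies that the abstract describes.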

