Profiling web archive coverage for top-level domain and content language

Ahmed AlSum, Michael L. Nelson, Herbert Van de Sompel, Michele C. Weigle

doi:10.1007/s00799-014-0118-y

Abstract

The Memento Aggregator currently polls every known public web archive when serving a request for an archived web page, even though some web archives focus on specific domains and ignore the others. Similar to query routing in distributed search, we investigate the impact on aggregated Memento TimeMaps (lists of when and where a web page was archived) of sending queries only to the archives likely to hold the archived page. We profile fifteen public web archives using data from a variety of sources (the web, archives' access logs, and full-text queries to archives) and use these profiles as resource descriptors. The profiles are used to match URI-lookup requests to the most probable web archives. We define $$Recall_{TM}(n)$$ as the percentage of a TimeMap that was returned using $$n$$ web archives. We find that sending queries to only the top three web archives for any request (i.e., an 80% reduction in the number of queries) reaches $$Recall_{TM}=0.96$$ on average. If we exclude the Internet Archive from the list, we can reach $$Recall_{TM}=0.647$$ on average using only the remaining top three web archives.
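To make the definition of $$Recall_{TM}(n)$$ concrete, the following is a minimal sketch, not the authors' implementation. It assumes a full TimeMap is represented as a mapping from archive name to the set of memento datetimes that archive holds, and that a profile yields a ranked list of archives for a request. The function name `recall_tm`, the archive names, and the memento data are all hypothetical.

```python
from typing import Dict, List, Set


def recall_tm(full_timemap: Dict[str, Set[str]], ranking: List[str], n: int) -> float:
    """Fraction of the full TimeMap returned when only the top-n ranked
    archives are queried (an illustrative reading of Recall_TM(n))."""
    # Total number of mementos across every archive in the full TimeMap.
    total = sum(len(mementos) for mementos in full_timemap.values())
    if total == 0:
        return 0.0
    # Query only the first n archives of the ranking; set() guards against
    # an archive appearing twice in the ranking.
    selected = set(ranking[:n])
    found = sum(len(full_timemap.get(archive, set())) for archive in selected)
    return found / total


# Hypothetical data: three archives holding 8, 1, and 1 mementos of one URI.
example_timemap = {
    "archive_a": {f"2010-01-0{d}" for d in range(1, 9)},
    "archive_b": {"2011-05-01"},
    "archive_c": {"2012-07-15"},
}
# Hypothetical profile output: archives ordered from most to least probable.
example_ranking = ["archive_a", "archive_b", "archive_c"]

for n in range(1, 4):
    print(n, recall_tm(example_timemap, example_ranking, n))
```

Under this reading, querying fewer archives trades a smaller number of requests for a possibly incomplete TimeMap, and the ranking produced by the profile determines how quickly the recall approaches 1 as $$n$$ grows.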

Full Text