Making Recommendations from Web Archives for "Lost" Web Pages

Lulwah M Alkwai,Michele C Weigle,Michael L Nelson

doi:10.1145/3383583.3398533

Abstract

When a user requests a web page from a web archive, the user will typically either get an HTTP 200 if the page is available, or an HTTP 404 if the web page has not been archived. This is because web archives are typically accessed by Uniform Resource Identifier (URI) lookup, and the response is binary: the archive either has the page or it does not, and the user will not know of other archived web pages that exist and are potentially similar to the requested web page. In this paper, we propose augmenting these binary responses with a model for selecting and ranking recommended web pages in a Web archive. This is to enhance both HTTP 404 responses and HTTP 200 responses by surfacing web pages in the archive that the user may not know existed. First, we check if the URI is already classified in DMOZ or Wikipedia. If the requested URI is not found, we use machine learning to classify the URI using DMOZ as our ontology and collect candidate URIs to recommended to the user. The classification is in two parts, a first-level classification and a deep classification. Next, we filter the candidates based on if they are present in the archive. Finally, we rank candidates based on several features, such as archival quality, web page popularity, temporal similarity, and URI similarity. We calculated the F1 score for different methods of classifying the requested web page at the first level. We found that using all-grams from the URI after removing numerals and the top-level domain (TLD) produced the best result with F1 =0.59. For the deep-level classification, we measured the accuracy at each classification level. For second-level classification, the micro-average F1=0.30 and for third-level classification, F1=0.15. We also found that 44.89% of the correctly classified URIs contained at least one word that exists in a dictionary and 50.07% of the correctly classified URIs contained long strings in the domain. In comparison with the URIs from our Wayback access logs, only 5.39% of those URIs contained only words from a dictionary, and 26.74% contained at least one word from a dictionary. These percentages are low and may affect the ability for the requested URI to be correctly classified.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

R Discovery Prime

R Discovery Prime

Making Recommendations from Web Archives for "Lost" Web Pages

Abstract

Talk to us

Similar Papers

Lead the way for us

Similar Papers

MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing

-

07 Jan 2021
07 Jan 2021

Bootstrapping Web Archive Collections From Micro-Collections in Social Media

-

15 Sep 2020
15 Sep 2020

Using Micro-Collections in Social Media to Generate Seeds for Web Archive Collections
Alexander Nwala ... Michael Nelson
-
Alexander Nwala, et. al.Alexander Nwala ... Michael Nelson
01 Jun 2019
01 Jun 2019

Metadata: A Fundamental Component of the Semantic Web
Jane Greenberg ... D Grant Campbell
Bulletin of the American Society for Information Science and Technology | VOL. 29
Jane Greenberg, et. al.Jane Greenberg ... D Grant Campbell
01 Apr 2003
Bulletin of the American Society for Information Science and Technology | VOL. 29

Editage

Paperpal

R Discovery

Mind the Graph

R Discovery Prime

R Discovery Prime

Making Recommendations from Web Archives for "Lost" Web Pages

Abstract

Talk to us

Similar Papers