Web archive profiling through CDX summarization

Sawood Alam, Michael L Nelson, Herbert Van De Sompel, David S H Rosenthal, Harihar Shankar, Lyudmila L Balakireva

doi:10.1007/s00799-016-0184-4

Abstract

With the proliferation of public web archives, it is becoming more important to better profile their contents, both to understand their immense holdings and to support routing of requests in the Memento aggregator. To save time, the Memento aggregator should only poll the archives that are likely to have a copy of the requested URI. Using the crawler index files produced after crawling, we can generate profiles of the archives that summarize their holdings and can be used to inform routing of the Memento aggregator's URI requests. Previous work in profiling ranged from using full URIs (no false positives, but with large profiles) to using only top-level domains (TLDs) (smaller profiles, but with many false positives). This work explores strategies in between these two extremes. In our experiments, we correctly identified about 78% of the URIs that were present or not present in the archive with less than 1% relative cost as compared to the complete knowledge profile, and 94% of the URIs with less than 10% relative cost, without any false negatives. With respect to the TLD-only profile, the registered domain profile doubled the routing precision, while complete hostname and one path segment gave a tenfold increase in the routing precision.
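The trade-off between profile size and false positives can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`profile_key`, `build_profile`, `should_poll`), the comma-separated reversed-host key format, and the sample URIs are all hypothetical, and the "registered domain" is approximated here by keeping the last two hostname labels, which is wrong for suffixes such as `co.uk` (a real profiler would need a public suffix list).

```python
from urllib.parse import urlsplit


def profile_key(uri, host_labels=None, path_segments=0):
    """Reduce a URI to a coarse lookup key (hypothetical key format).

    host_labels: number of hostname labels to keep, counted from the TLD;
                 None keeps the complete hostname.
    path_segments: number of leading path segments to keep.
    """
    parts = urlsplit(uri if "://" in uri else "http://" + uri)
    labels = (parts.hostname or "").lower().split(".")
    labels.reverse()  # TLD first, e.g. ["com", "example", "www"]
    if host_labels is not None:
        labels = labels[:host_labels]
    key = ",".join(labels) + ")"
    segments = [s for s in parts.path.split("/") if s][:path_segments]
    if segments:
        key += "/" + "/".join(segments)
    return key


def build_profile(archived_uris, **policy):
    """Summarize an archive's holdings as a set of keys under one policy."""
    return {profile_key(u, **policy) for u in archived_uris}


def should_poll(profile, uri, **policy):
    """Poll the archive only if the request's key appears in its profile."""
    return profile_key(uri, **policy) in profile


# Hypothetical holdings of one archive.
archived = [
    "http://www.example.com/news/2016/a.html",
    "http://blog.example.org/",
]

TLD_ONLY = {"host_labels": 1}
DOMAIN = {"host_labels": 2}  # assumption: last two labels = registered domain
HOST_PATH1 = {"host_labels": None, "path_segments": 1}

tld_profile = build_profile(archived, **TLD_ONLY)
domain_profile = build_profile(archived, **DOMAIN)
host_path_profile = build_profile(archived, **HOST_PATH1)
```

Under these assumptions, a request for `http://other.com/x` matches the TLD-only profile (a false positive) but not the registered-domain profile, and a request for `http://www.example.com/sports/b.html` is filtered out only by the hostname-plus-one-path-segment profile. Every archived URI still matches its own key under every policy, so coarser keys add false positives but never false negatives.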

Journal: International Journal on Digital Libraries
Publication Date: Jul 16, 2016
Citations: 11
