Abstract

Since the practice of web archiving, the act of preserving websites as historical, legal, and informational records, became more commonplace in the 2000s, web archives have become valuable sources for historical research. Unfortunately, many archived websites are of low quality and are missing crucial elements. In this paper, we examine the issue of quality and focus on visual correspondence: the similarity in appearance between the original website and its archived counterpart. We examine how the visual correspondence of an archived website can be measured using image similarity measures. Our results indicate that the Structural Similarity Index metric (SSIM) successfully measured visual correspondence. If applied to an institution's quality assurance process, this similarity metric could help web archivists quickly detect and fix quality problems, and so create high-quality web archives.

Highlights

  • In 1996, the Internet Archive began using a web crawler to periodically take snapshots of websites and store them as historical records

  • To detect quality problems, web archivists must carry out an onerous quality assurance (QA) process in which they manually inspect hundreds or even thousands of archived websites (Reyes Ayala, Phillips, and Ko, 2014)

  • When web archiving is done by national libraries that seek to capture and preserve their national domain, quality problems grow to such a scale that human intervention is no longer enough to detect and fix them


Summary

Introduction

In 1996, the Internet Archive began using a web crawler to periodically take snapshots of websites and store them as historical records. A high-quality archived website should be an accurate representation of the original website in content, form, and appearance. It should look and behave exactly like the original, but in practice this is rarely the case. This paper examines how the visual correspondence of an archived website can be measured using popular image similarity measures, originally employed in Computer Science to detect differences between images. Using these measures, we evaluate how visual correspondence can be used as an indication of overall archive quality.

  • Can an image similarity metric successfully distinguish between high-quality archived websites and lower-quality archived websites?
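
To make the idea of measuring visual correspondence concrete, the following is a minimal sketch of an SSIM-style comparison between a screenshot of an original page and a screenshot of its archived copy. It is an illustration only and not the implementation used in the paper, which this summary does not describe. Several things here are assumptions: the screenshots are taken to be same-sized 8-bit grayscale images stored as lists of rows, the function names `block_ssim` and `mean_ssim` are hypothetical, the test images are synthetic, and the score is averaged over non-overlapping 8x8 blocks rather than the Gaussian-weighted sliding window of the standard SSIM formulation.

```python
# Standard SSIM stabilising constants for 8-bit images (dynamic range 255).
DYNAMIC_RANGE = 255
C1 = (0.01 * DYNAMIC_RANGE) ** 2
C2 = (0.03 * DYNAMIC_RANGE) ** 2


def block_ssim(x, y):
    """SSIM of two equally sized flat lists of grayscale pixel values."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    var_x = sum((a - mx) * (a - mx) for a in x) / n
    var_y = sum((b - my) * (b - my) for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    numerator = (2 * mx * my + C1) * (2 * cov + C2)
    denominator = (mx * mx + my * my + C1) * (var_x + var_y + C2)
    return numerator / denominator


def mean_ssim(img_a, img_b, block=8):
    """Average block_ssim over non-overlapping blocks (a simplification)."""
    if len(img_a) != len(img_b) or len(img_a[0]) != len(img_b[0]):
        raise ValueError("screenshots must have the same dimensions")
    height, width = len(img_a), len(img_a[0])
    scores = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            xa = [p for row in img_a[top:top + block]
                  for p in row[left:left + block]]
            xb = [p for row in img_b[top:top + block]
                  for p in row[left:left + block]]
            scores.append(block_ssim(xa, xb))
    return sum(scores) / len(scores)


# Synthetic stand-in for a screenshot of the original page (32x32 pixels).
original = [[(x * 7 + y * 13) % 256 for x in range(32)] for y in range(32)]

# Synthetic stand-in for a poor capture: one 16x16 region rendered as blank
# white, as might happen when an embedded image was not archived.
damaged = [row[:] for row in original]
for y in range(16):
    for x in range(16):
        damaged[y][x] = 255

print("identical capture:", mean_ssim(original, original))
print("damaged capture:  ", mean_ssim(original, damaged))
```

In this sketch an identical capture scores 1.0 and the damaged capture scores lower, which is the property a QA workflow would rely on to flag captures for manual review. Any threshold separating acceptable from unacceptable scores would have to be chosen empirically.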
