Abstract

Web harvesting and archiving have been practiced for almost two decades, and the subject has been of leading interest to researchers, technologists, and librarians and information scientists. Web harvesting projects and pilot programs for archiving content found on the Web are becoming priorities for national libraries and cultural heritage organizations in the EU. This paper treats web harvesting as a process for mining data from the web and only through the web (the “pull” function). It elaborates upon research carried out within the funded research project “Web Archiving in Public Libraries and IP Law”, which focused on the processes of web harvesting and archiving as well as Text and Data Mining (TDM) operations in the national libraries of EU Member States. Web archiving, as an official operation in these national libraries, creates web collections and preserves them so that they remain accessible and usable in perpetuity. The research examined various components of web harvesting and archiving through an online survey (qualitative research) addressed to the national libraries of EU Member States. The research team posed seventeen questions, and the survey output comes from the answers of 22 national libraries. The questionnaire was created with Google Forms. The researchers contacted the libraries by email and follow-up telephone calls to seek their participation. The aim of the research was to examine the participant libraries’ Text and Data Mining operations that leverage web harvesting and web archiving technologies and operations.
Analysis of the results reveals that web harvesting is considered among national libraries’ top priorities: the relevant projects are increasing in number, the web collections are growing, and the technological infrastructures and tools for web harvesting are improving. Yet many issues remain unresolved. A significant number of the surveyed libraries consider legal and technical issues the most important to resolve. Access to harvested material is still subject to legal restrictions. Directive 2019/790/EU on Copyright in the Digital Single Market (DSM) creates a favorable legal foundation for the deployment of web harvesting operations in the national libraries of the EU Member States. TDM technologies make new areas of research possible. Web harvesting, initially aimed at preservation, now extends to unprecedented research on national heritage through state-of-the-art automated TDM processes.

Highlights

  • The authors’ research on EU national libraries’ TDM through Web harvesting and Web archiving technologies and operations was carried out between March and July 2019; a short questionnaire was prepared on the assumption that most EU national libraries might not be fully prepared for large-scale Web harvesting and Web archiving operations, given that the relevant EU legal framework had only just been set by the new EU Directive on Copyright in the Digital Single Market (DSM)

Summary

Introduction

From the Internet’s first appearance in the early 1990s, humanity realized that world culture had acquired a new “vehicle” for spreading information and disseminating knowledge, science and research (Masanès 2002); the Internet was seen as a means of transforming the economy, society and cooperation, and a need for new forms of management arose. Web content is changing at a pace that puts it at risk of extinction or falsification, while people will probably want to preserve it in the future as part of world cultural heritage. Web harvesting and web archiving have emerged as new official functions of organizations that preserve intellectual and cultural heritage, serving the need to manage content harvested from the web. According to the International Internet Preservation Consortium (IIPC), “Web archiving is the process of collecting portions of the World Wide Web, preserving the collections in an archival format, and serving the archives for access and use”.
