An open source infrastructure for quality assurance and preservation of a large digital book collection

Sven Schlarb

doi:10.2352/issn.2168-3204.2013.10.1.art00050

Journal: Archiving Conference

Publication Date: Jan 1, 2013

Abstract

This article presents an open source infrastructure for processing large collections of digital books available at the Austrian National Library, with a special focus on quality assurance tasks in the context of the European project SCAPE (www.scape-project.eu). It describes the cluster hardware and the software components used to build the experimental IT infrastructure. More concretely, it presents a set of best practices for analysing large document image collections with Apache Hadoop. Different types of Hadoop jobs (Hadoop Streaming API, Hadoop MapReduce, and Hive) serve as basic components, and the Taverna workflow description language and execution engine (www.taverna.org.uk) is used to orchestrate complex data processing tasks.
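
As a rough illustration of the Hadoop Streaming job type mentioned above, the following minimal sketch shows a mapper and reducer in Python. It is not taken from the article: the record layout (`book_id`, `page_id`, `width`, tab-separated), the function names, and the "average page width per book" check are all assumptions chosen only to make the pattern concrete. The one thing it relies on from Hadoop Streaming itself is the general contract that mappers and reducers exchange tab-separated key/value lines and that the reducer receives its input sorted by key.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'book_id<TAB>width' per input record.

    Assumed (hypothetical) input layout: 'book_id<TAB>page_id<TAB>width'.
    Malformed records are skipped rather than failing the whole job.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        book_id, _page_id, width = parts
        yield f"{book_id}\t{width}"


def reducer(lines):
    """Emit 'book_id<TAB>average_width' per book.

    Expects key-sorted 'book_id<TAB>width' lines, which is how a
    streaming reducer receives its input after the shuffle phase.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for book_id, group in groupby(pairs, key=lambda kv: kv[0]):
        widths = [int(value) for _, value in group]
        yield f"{book_id}\t{sum(widths) / len(widths):.1f}"


def simulate(records):
    """Run map, sort, and reduce locally to mimic a streaming job."""
    return list(reducer(sorted(mapper(records))))
```

In an actual streaming job, each function would sit in its own script that reads `sys.stdin` and prints to standard output; `simulate` only mimics the map, sort, and reduce phases on a single machine for quick checks.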
