Abstract

The paper will describe how web-based collaboration tools can engage users in the building of historical printed text resources created by mass digitisation projects. The drivers for developing such tools will be presented, identifying the benefits that can be derived for both the user community and cultural heritage institutions. The perceived risks, such as new errors introduced by the users, and the limitations of engaging with users in this way will be set out, together with the lessons that can be learned from existing activities, such as the National Library of Australia's newspaper website, which supports collaborative correction of Optical Character Recognition (OCR) output. The paper will present the work of the IMPACT (Improving Access to Text) project, a large-scale integrating project funded by the European Commission as part of the Seventh Framework Programme (FP7). One of the aims of the project is to develop tools that help improve OCR results for historical printed texts, specifically those works published before the industrial production of books from the middle of the 19th century. Technological improvements to image processing and OCR engine technology are vital to improving access to historic text, but engaging the user community also has an important role to play. Involving the intended users can help achieve the levels of accuracy currently found in born-digital materials. Improving OCR results will allow for better resource discovery and enhance the performance of text mining and accessibility tools. The IMPACT project will specifically develop a tool that supports collaborative correction and validation of OCR results, and a tool to allow user involvement in building historical dictionaries which can be used to validate word recognition. The technologies use the characteristics of human perception as a basis for error detection.

Highlights

  • In recent years, advanced libraries all over the world have been setting the pace for mass digitisation of entire collections

  • Several commercial software tools are available for Optical Character Recognition (OCR) that already achieve very accurate results on modern prints, but none of them performs well when applied to historic source material

  • In view of the enormous amounts of text that are to become available in digital format in the coming years, and the difficulties even elaborate OCR software has in dealing with historical material, ways must be found to spread the work across many shoulders

Summary

Introduction

In recent years, advanced libraries all over the world have been setting the pace for mass digitisation of entire collections. Mass digitisation projects still face many challenges, especially when dealing with historical printed material. One of the most important is the accurate transformation of digital images into the high-quality searchable text that researchers require to make proper use of these rich resources. Several commercial software tools are available for Optical Character Recognition (OCR) that already achieve very accurate results on modern prints, but none of them performs well when applied to historic source material. The EU-funded IMPACT project (Improving Access to Text) aims to address these challenges by developing a variety of software tools to enhance the digital image, improve state-of-the-art OCR software, and enrich the results of text recognition by making use of lexical resources and language technology. These efforts have to scale up to the millions of pages that are digitised every day.
