Abstract
The State and University Library Bremen (SuUB) is dedicated to the digitization of its historical collections. Digitization is an important instrument for improving the accessibility of valuable information contained in fragile historical documents. It facilitates academic research and teaching and is indispensable to the digital humanities. Research on digital serial publications benefits especially from ‘recent systematic digitization efforts, often initiated by libraries […]. More and more historical periodicals and other serial publications are now digitally available in full, i.e., all of their issues’ [Piotrowski, this volume]. The historical journal presented in this article is one of these, and the final section will discuss why it can be considered a complete corpus. Usually, digitization projects produce digital images, metadata for cataloguing and web-navigation purposes, and OCR full text for searching. This information is made available through the library's web portal for digital collections. However, digital humanists need high-quality full texts enriched with metadata in the appropriate format to analyse them with powerful software tools. The historical journal Die Grenzboten serves as an exemplary model to bridge the gap between digitization projects in libraries and research infrastructures. Die Grenzboten is a long-running serial publication (1841–1922). It can be classified as a literary journal that also covered politics and the arts. We demonstrate that OCR post-correction and page-wise structuring are prerequisites for the creation of a high-quality TEI version of a full text. The TEI version was created in cooperation with the Deutsches Textarchiv (DTA) at the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW). A fully automated OCR post-correction developed at the SuUB Bremen is freely available on GitHub.
To enable scientists to work with powerful software tools, the transfer of high-quality full texts to research infrastructures is a necessary step. We describe transfers of full text and the experience we have gained, but some general questions persist: What has to be done to prepare raw OCR output for this purpose in a reasonable and cost-effective manner? What quality is needed or expected? Which metadata and file formats are needed? Should there not be closer cooperation between research infrastructures and the libraries handling the digitization? OCR full texts, even post-corrected, are not perfect, but character recognition rates around 99% certainly provide more options than just being used as a search index. A vast amount of textual resources is available and ready to be made fully accessible for scientific research! Finally, some suggestions for scholars and researchers working on digital serial publications are given.
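To make the figure of "around 99%" concrete, the following minimal sketch shows one common way a character recognition rate could be computed: the edit distance between OCR output and a manually corrected reference, normalised by the length of the reference. The function names, the sample strings, and the choice of Levenshtein distance are assumptions made for illustration; the source does not state how the rate was measured.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_text: str) -> float:
    """Share of reference characters recognised correctly (assumed
    definition: 1 - edit distance / reference length, floored at 0)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_text)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # Hypothetical sample: one misrecognised letter in a 14-character title.
    reference = "Die Grenzboten"
    ocr_text = "Die Grenzbotcn"
    print(f"{character_accuracy(reference, ocr_text):.4f}")
```

Under this definition, a rate of 99% still means roughly one wrong character in every hundred, which is why the required quality depends on the intended analysis.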
Highlights
Since 1999, the State and University Library Bremen (SuUB) has been dedicated to the digitization of its historical collections, such as historical maps, publications of Bremen’s regional history, or material of interest to scientists, such as historical journals or German seventeenth-century newspapers
We demonstrate that optical character recognition (OCR) post correction and a page-wise structuring are prerequisites for the creation of a high-quality Text Encoding Initiative (TEI) version of a full text
We describe transfers of full text and the experience we have gained, but some general questions persist: What has to be done to prepare raw OCR output for this purpose in a reasonable and cost-effective manner? What quality is needed or expected? Which metadata and file formats are needed? Should there not be closer cooperation between research infrastructures and the libraries handling the digitization? OCR full texts, even post-corrected, are not perfect, but character recognition rates around 99% certainly provide more options than just being used as a search index.
Summary
Since 1999, the State and University Library Bremen (SuUB) has been dedicated to the digitization of its historical collections, such as historical maps, publications of Bremen’s regional history, or material of interest to scientists, such as historical journals or German seventeenth-century newspapers. There is a need for easily accessible, high-quality full texts enriched with metadata in the right format so that they can be analyzed with powerful software tools. This need is not restricted to the digital humanities. The historical journal Die Grenzboten serves as an exemplary model to bridge this gap between digitization projects in libraries and the requirements of the digital humanities. As a second aspect of text quality, we enhanced the level of document structure according to an agreed standard format in […] Supporting these processes as a digitizing library will result in considerably improved outcomes in all fields of automated and computer-aided research across disciplines working with digitized material. The SuUB is in contact with various research groups within the fields of German philology, linguistics, topic modeling, full text quality improvement, and research infrastructures. An example of the former is a cooperation with a research group at the University of Bremen conducting a project on the exploration of so-called ‘Bildprosa’.