Abstract

Great effort is being made to collect and preserve historic manuscripts from the early modern and eighteenth-century periods. Unfortunately, searching the Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO) collections can be extremely difficult for researchers, because current Optical Character Recognition (OCR) engines struggle to recognize many historic fonts, especially in manuscripts of declining quality. To address this problem, the Early Modern OCR Project (eMOP) at the Initiative for the Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University seeks to train OCR engines to read historic documents more effectively, so that the entirety of these collections becomes searchable. The first step in this project uses the Aletheia Desktop Tool, developed by the PRImA Research Lab at the University of Salford, to create training sets from documents in the EEBO and ECCO collections. These sets help OCR engines such as Google's Tesseract recognize features of early modern fonts, including ligatures, italics, and blackletter. In the year that the Aletheia team has been working to create these font training libraries, we have overcome several problems, including learning how to select, extract, and deliver the data that best suits Tesseract's training requirements. This work with Aletheia is part of a larger scholarly project that seeks not only to make the EEBO and ECCO collections more accessible to researchers for data mining, but also to make publicly available the methodologies, workflow, and digital tools developed during eMOP, to aid libraries, museums, and scholars in other fields in their efforts to preserve and study our combined cultural history.
