Book History in the Early Modern OCR Project, or, Bringing Balance to the Force

Todd Samuelson, Jacob Heil

doi:10.1353/jem.2013.0050

Abstract

The Early Modern OCR Project (eMOP), funded by a development grant from the Andrew Mellon Foundation to Texas A&M … instead, they find their power source in that bibliographical data. In the case of eMOP, as this essay will discuss, the relationship between the digital and the bibliographic is dialogic and reciprocal: while the project's goals, angles of approach, and ethos of interdisciplinarity are all characteristic of DH, it is only through an acknowledged utilization of book history scholarship and methods that the project's ends will be accomplished. Book history, our corner of eMOP, represents two foundational nodes of the project. In the first place, we are identifying specific, minutely variant typefaces in order to distinguish as best we can between the myriad versions of the standard Roman typeface in early modern books.1 Secondly, we are studying type founders and foundries to trace the flow of fonts into and through London. Through this research we hope to realize the goal of eMOP: the automation of a process by which trained optical character recognition (OCR) engines might more accurately read the images of early modern book pages in, for example, Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO). Ultimately our work will be formalized in a database that serves as the hub of this automated OCR process: the printer and typographical data will act as a traffic cop of sorts, directing the properly trained OCR engine to read the appropriate page images. In fact, a central tenet of the Early Modern OCR Project is that training OCR engines to recognize the letterforms in specific font sets will improve the accuracy of the OCR output (the resultant text files) when these engines are called upon to scan page images printed in that typeface. While this is only a cursory sketch of eMOP, it suggests that book history is central to solving the digital problem of using OCR software on early modern books.
Before printing became more regularized by technological advances in the nineteenth century, and before English typography approached something of a standard, national identity in the early eighteenth century, the typefaces and their settings on the printed page were highly variable. For this reason one might think that the most advanced OCR engines of our day might be rather effective, as they pull from their expansive font libraries to recognize different characters. However, part of the complication is that a large number of characters found in early English printing (including the ubiquitous long s, scribal abbreviations borrowed from the Latin, and even characters derived from Anglo-Saxon, such as the thorn) are unfamiliar to non-expert readers and not present in OCR libraries. Another challenge lies in the range of typefaces used by printers, which were for most stretches of English printing imported from various other countries and so display characteristics adopted from these national traditions. It is still another kind of noise, however, that makes them ineffective: because the engines cannot distinguish line divisions (cannot focus their field of vision, in other words), they are not able to discriminate between letterforms and the various other blots that cloud page images. On more modern, higher-quality images, of course, the OCR is more accurate, but those that have been preserved in mass-digitization projects (and which therefore will be central to eMOP's automated process) were limited by the technologies of their historical moments. …
