Abstract

Bibliothèque et Archives Nationales du Québec scanned a large collection of newspapers and converted them to text, creating a resource of tremendous potential value to historians. Unfortunately, the text files are difficult to search reliably because the optical character recognition (OCR) conversion process introduced many errors.
This digital history project applied natural language processing, in a program written in R, to build a new index of this digitized corpus that remains useful despite OCR-related errors. The project used editions of The Equity, a newspaper published in Shawville, Quebec since 1883.
The program extracted the names of all the people, locations and organizations (the entities) that appeared in each edition. Each entity was cataloged in a database and linked to the editions of the newspaper in which it appeared. The database was published on a public website so that other researchers can use it.
The resulting index, or finding aid, gives researchers a way to access The Equity other than full-text search. People, locations and organizations appearing in The Equity are listed on the website, and each entity links to a page that lists every issue in which that entity appeared, along with other entities that may be related to it.
Converting the text of each scanned newspaper into entities and indexing them in a database lets researchers work with the newspaper's content by entity name and type, rather than only as a set of large text files.
Website: http://www.jeffblackadar.ca/graham_fellowship/corpus_entities_equity/
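
The entity index described in this abstract can be pictured as three tables: entities, editions, and the mentions that link them. The sketch below is only an illustration of that idea, not the project's code. The project itself used R, while this sketch uses Python with SQLite. The table names, function names and sample rows are all invented for the example, and it assumes the entities have already been extracted from the OCR text.

```python
import sqlite3

# Invented sample rows: (entity name, entity type, edition date).
# These are placeholders, not data taken from The Equity.
SAMPLE = [
    ("Shawville", "location", "1900-01-04"),
    ("Shawville", "location", "1900-01-11"),
    ("Example Person", "person", "1900-01-04"),
    ("Example Person", "person", "1900-01-11"),
    ("Ottawa", "location", "1900-01-11"),
]


def build_index(mentions):
    """Load already-extracted entity mentions into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE entities (
            id INTEGER PRIMARY KEY, name TEXT, type TEXT, UNIQUE (name, type)
        );
        CREATE TABLE editions (id INTEGER PRIMARY KEY, date TEXT UNIQUE);
        CREATE TABLE mentions (
            entity_id INTEGER REFERENCES entities(id),
            edition_id INTEGER REFERENCES editions(id),
            UNIQUE (entity_id, edition_id)
        );
        """
    )
    for name, etype, date in mentions:
        # OR IGNORE keeps one row per entity, per edition, and per pairing.
        conn.execute(
            "INSERT OR IGNORE INTO entities (name, type) VALUES (?, ?)",
            (name, etype),
        )
        conn.execute("INSERT OR IGNORE INTO editions (date) VALUES (?)", (date,))
        conn.execute(
            """INSERT OR IGNORE INTO mentions (entity_id, edition_id)
               SELECT e.id, d.id FROM entities e, editions d
               WHERE e.name = ? AND e.type = ? AND d.date = ?""",
            (name, etype, date),
        )
    return conn


def editions_for(conn, name):
    """Return the dates of every edition in which the named entity appears."""
    rows = conn.execute(
        """SELECT DISTINCT d.date FROM editions d
           JOIN mentions m ON m.edition_id = d.id
           JOIN entities e ON e.id = m.entity_id
           WHERE e.name = ? ORDER BY d.date""",
        (name,),
    )
    return [row[0] for row in rows]


def related_entities(conn, name):
    """Return (other entity, shared edition count), most shared first."""
    rows = conn.execute(
        """SELECT other.name, COUNT(DISTINCT m2.edition_id) AS shared
           FROM entities e
           JOIN mentions m1 ON m1.entity_id = e.id
           JOIN mentions m2 ON m2.edition_id = m1.edition_id
                           AND m2.entity_id != e.id
           JOIN entities other ON other.id = m2.entity_id
           WHERE e.name = ?
           GROUP BY other.id
           ORDER BY shared DESC, other.name""",
        (name,),
    )
    return [(row[0], row[1]) for row in rows]


conn = build_index(SAMPLE)
print(editions_for(conn, "Shawville"))
print(related_entities(conn, "Shawville"))
```

In this sketch, two entities count as possibly related simply because they appear in the same edition. That is an assumption made for the example; the abstract does not say how the website determines related entities.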
