Abstract

AustLit is a major Australian cultural heritage database and the most comprehensive record of a nation’s literary history in the world. In this article we present the results of a project addressing the challenge of discovering and recording creative writing published in digitized historical Australian newspapers made available through the National Library of Australia’s Trove service. As a first step in identifying creative writing, we developed an automated method for identifying articles that are likely to be poems by searching for a number of signals embedded in articles. When this work began, AustLit contained more than 10,200 bibliographical records for poems published between 1803 and 1954 (75% prior to 1900), with links to the full text in 115 different newspapers. The aim of the project was to expand the number of bibliographical records in AustLit and to provide a foundation for analysing the importance of poetry in newspaper publishing of the period. Taking advantage of Ted Underwood’s work with seventeenth- and eighteenth-century full text in the HathiTrust collection (‘Getting Everything you Want from HathiTrust’ and ‘Open Data’, blog posts on The Stone and the Shell, 2012; both accessed 27 October 2015), we trained a naive Bayesian classifier, modifying code from Daniel Shiffman (‘Bayesian Filtering’, 2008; accessed 27 October 2015) and Paul Graham (‘A Plan for Spam’, 2002; accessed 27 October 2015), and improved the quality of the Optical Character Recognition (OCR) text by using the overProof correction algorithm. As a result, we identified large numbers of poems in the newspaper database, greatly expanding AustLit’s coverage of this important literary form. After suitable training, the classifier identified 88% of the newspaper articles that a knowledgeable human would classify as ‘poetry’.
Our results have encouraged us to consider enhancing and extending the techniques to aid the identification of other forms of literature and criticism.
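
To make the classification step concrete, the following is a minimal sketch of a word-based naive Bayesian classifier of the general kind described above. It is not the project's code: the class name, the tokenizer, the add-one smoothing, and the four toy training texts are all assumptions made for illustration, and the additional signals and the OCR correction step mentioned in the abstract are not modelled.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer: lower-case words, keeping apostrophes (e.g. "o'er").
TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


class PoemClassifier:
    """Toy multinomial naive Bayes over word counts (illustrative only)."""

    LABELS = ("poem", "other")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def log_score(self, text, label):
        # Vocabulary is every word seen in training, in either class.
        vocab = set(self.word_counts["poem"]) | set(self.word_counts["other"])
        total = sum(self.word_counts[label].values())
        # Log prior: share of training documents carrying this label.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for token in tokenize(text):
            if token not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = self.word_counts[label][token]
            score += math.log((count + 1) / (total + len(vocab)))
        return score

    def classify(self, text):
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


# Toy training data, invented for this sketch.
clf = PoemClassifier()
clf.train("O'er the hills the moon doth rise and softly sighs the sea", "poem")
clf.train("Thy heart my love shall dream beneath the stars", "poem")
clf.train("The council met on Tuesday to discuss the price of wool", "other")
clf.train("Shipping news the steamer arrived from Sydney with cargo", "other")

print(clf.classify("The moon and stars o'er the sea"))
print(clf.classify("The steamer arrived with wool cargo"))
```

Working in log space keeps the product of many small word probabilities from underflowing, which matters for article-length texts. In practice such a classifier would be trained on a large set of articles already labelled by hand, and its share of correctly identified poems measured against human judgement, as with the 88% figure reported above.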
