Machine learning in diachronic corpus phonology: mining verse data to infer trajectories in English phonotactics

Andreas Baumann

doi:10.2218/pihph.3.2018.2878

Abstract

Machine learning is a powerful method for working with large data sets such as diachronic corpora. However, compared with standard techniques from inferential statistics such as regression modeling, machine learning is less commonly used among phonological corpus linguists. This paper discusses three machine learning techniques (k-nearest-neighbor classifiers, naïve Bayes classifiers, and artificial neural networks) and shows how they can be applied to diachronic corpus data to address specific phonological questions. To illustrate the methodology, I investigate Middle English schwa deletion and when and how it potentially triggered the reduction of final /mb/ clusters in English.

Highlights

In this methodological paper, I demonstrate how machine learning techniques can be used to generate more nuanced data for research in diachronic corpus phonology.
For an English historical phonologist, it is important to know whether final schwa is present in a given period: (i) in metrical theory, it is relevant for investigating stress clashes or numbers of syllables (Burzio 2007, Dresher & Lahiri 2005); (ii) in cognitive phonology, one may be interested in diphones that function as cues for word segmentation (Dressler, Dziubalska-Kołaczyk & Pestal 2010, Daland & Pierrehumbert 2011); (iii) in phonotactics, we want to be certain about syllable structure (Hogg & McCully 1987, Dziubalska-Kołaczyk 2005), for instance whether there is a coda cluster like /mb/ in Middle English items like lambe or whether /b/ is the onset of a final syllable /bə/.
After running the machine learning (ML) algorithms, the models derived from the training data need to be evaluated.
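To make the train-then-evaluate step concrete, the following is a minimal sketch of a k-nearest-neighbor classifier scored by leave-one-out accuracy. It is an illustration under stated assumptions, not the procedure used in the paper: the two features (syllable count, whether the next word begins with a vowel), the toy observations, the choice of k = 3, and the leave-one-out scheme are all invented here for demonstration.

```python
import math
from collections import Counter

# Invented toy observations (NOT corpus data):
# features = (syllable_count, next_word_starts_with_vowel)
# label    = 1 if final schwa is realised, 0 if it is deleted
TOY_DATA = [
    ((1, 0), 1), ((1, 0), 1), ((1, 1), 0), ((2, 1), 0),
    ((2, 0), 1), ((3, 1), 0), ((3, 0), 0), ((1, 1), 0),
    ((2, 0), 1), ((3, 1), 0),
]


def knn_predict(train, x, k=3):
    """Return the majority label among the k training points nearest to x."""
    # Sort training items by Euclidean distance to x and keep the k closest.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(data, k=3):
    """Hold out each observation in turn, train on the rest, and score the prediction."""
    hits = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        if knn_predict(train, x, k) == y:
            hits += 1
    return hits / len(data)


if __name__ == "__main__":
    print("leave-one-out accuracy:", leave_one_out_accuracy(TOY_DATA))
```

With real verse data, the feature tuples would be replaced by whatever predictors the corpus annotation provides, and the same evaluation loop could be used to compare classifiers or values of k.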

Summary

Introduction

I demonstrate how machine learning techniques can be used to generate more nuanced data for research in diachronic corpus phonology. This is motivated by the following problem. In the diachronic study of English, final schwa deletion is difficult to trace: it is gradual (as most linguistic changes are); spelling does not provide reliable cues for phonological analysis (and there is no audio data for most periods to begin with); and it depends on many factors (e.g. phonological context, word length, morphosyntax, not to mention socio-geography; Minkova 1991). Verse data are arguably more suitable for studying phenomena like schwa deletion, because rhythm can be used as a diagnostic tool, but poetry data are sparse, especially for long-term studies spanning many centuries.
