Abstract
The time-consuming nature of coding sociophonetic variables that are typically treated as categorical represents an impediment to addressing research questions around these variables that require large volumes of data. In this paper, we apply a machine learning method, random forest classification (Breiman, 2001), to automate coding (categorical prediction) of two English sociophonetic variables traditionally treated as categorical, non-prevocalic /r/ and word-medial intervocalic /t/, based on tokens’ acoustic signatures. We found good performance for binary classifiers of non-prevocalic /r/ (Absent versus Present) and medial /t/ (Voiced versus Voiceless), but not for medial /t/ with a six-way coding distinction (largely due to some codes being sparsely represented in the training data). This method also yields rankings of acoustic measures in terms of importance in classification. Beyond any individual measures, this method generates probabilistic predictions of variation (classifier probabilities) that represent a composite of the acoustic cues fed into the model. In a listening experiment, we found that not only did classifier probabilities significantly capture gradience in trained listeners’ perceptions of rhoticity, they better predicted listeners’ perceptions than individual acoustic measures. This method thus represents a new approach to reconciling the categorical and continuous dimensions of sociophonetic variation.
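As a rough illustration of where a classifier probability comes from, the toy sketch below builds a forest of one-split trees and reports the share of trees voting Present for a token. It is not the implementation used in the study. The acoustic values, the two measures (F3 and F3-F2 distance), and all function names are hypothetical, and real random forests grow deeper trees and resample candidate features at every split.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """Fit a one-split tree: pick the cut on `feature` with the fewest errors."""
    values = sorted(set(row[feature] for row in rows))
    best = None
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2
        left = Counter(l for r, l in zip(rows, labels) if r[feature] <= cut)
        right = Counter(l for r, l in zip(rows, labels) if r[feature] > cut)
        left_label = left.most_common(1)[0][0]
        right_label = right.most_common(1)[0][0]
        errors = (sum(left.values()) - left[left_label]
                  + sum(right.values()) - right[right_label])
        if best is None or errors < best[0]:
            best = (errors, cut, left_label, right_label)
    if best is None:
        # Every sampled token has the same value: always predict the majority.
        majority = Counter(labels).most_common(1)[0][0]
        return (feature, float("inf"), majority, majority)
    return (feature, best[1], best[2], best[3])

def fit_forest(rows, labels, n_trees=200, seed=1):
    """Each tree sees a bootstrap sample and one randomly chosen measure."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump([rows[i] for i in sample],
                                [labels[i] for i in sample], feature))
    return forest

def classifier_probability(forest, token, target="Present"):
    """Proportion of trees that vote for `target`."""
    votes = sum(1 for feature, cut, left, right in forest
                if (left if token[feature] <= cut else right) == target)
    return votes / len(forest)

# Hypothetical training tokens: (F3 in Hz, F3-F2 distance in Hz).
# Assumption for this toy data: rhotic tokens have a lower F3.
tokens = [(1750, 400), (1800, 450), (1900, 500), (1850, 420), (2000, 600), (2100, 700),
          (2600, 1200), (2700, 1300), (2500, 1000), (2800, 1400), (2400, 900), (2300, 800)]
codes = ["Present"] * 6 + ["Absent"] * 6

forest = fit_forest(tokens, codes)
for name, token in [("clearly rhotic", (1700, 350)),
                    ("intermediate", (2200, 750)),
                    ("clearly non-rhotic", (2900, 1500))]:
    print(name, classifier_probability(forest, token))
```

Because each tree is trained on a different bootstrap sample, the cut points differ from tree to tree, so a token near the class boundary can split the vote. That split vote is one way a single composite score can reflect gradience rather than a hard category.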
Highlights
Unlike techniques such as generalized linear modeling, random forests do not suffer from overfitting when predictors are collinear (Dormann et al., 2013; Matsuki et al., 2016; Strobl & Zeileis, 2008); as a result, collinearity does not hinder random forests’ ability to predict unseen data and to determine the relative importance of independent variables. This property is important for the present study, as the acoustic complexity of /r/ and /t/ led us to include many acoustic measures, including those that are naturally correlated with one another, in our random forests; we demonstrate in Section 4.3.2 that our data is characterized by a considerable degree of collinearity.
In simulations with subsets of the /r/ and two-class /t/ data, we found high correlations between the variable importance scores, indicating stability between estimates of variable importance. (Details of these simulations can be found in the online supplementary materials.) This result validates our use of random forests for sociophonetic applications of feature selection, such as determining which acoustic features are most influential in classifying variants of sociophonetic variables like /r/ and /t/.
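A stability check of this kind can be sketched by correlating the importance scores that two forests, trained on different subsets, assign to the same acoustic measures. The sketch below uses a rank (Spearman) correlation, which is an assumption: the text does not say which correlation was used. The measure names and scores are invented for illustration.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical importance scores from forests trained on two data subsets.
measures = ["F3_min", "F3_F2_distance", "intensity", "duration", "F1_mean"]
subset_a = {"F3_min": 0.31, "F3_F2_distance": 0.27, "intensity": 0.12,
            "duration": 0.08, "F1_mean": 0.05}
subset_b = {"F3_min": 0.29, "F3_F2_distance": 0.30, "intensity": 0.10,
            "duration": 0.09, "F1_mean": 0.04}

rho = spearman([subset_a[m] for m in measures],
               [subset_b[m] for m in measures])
print(round(rho, 3))
```

In this made-up example the two subsets disagree only on the order of the top two measures, so the correlation stays high. A value near zero would suggest that the importance ranking depends heavily on which tokens happened to be sampled.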
The performance results for the binary classifiers stand as a proof of concept for using a random forest classifier to automatically code unseen data. (We explore a different means of validating classifier performance, comparing predictions of unseen data to human listeners’ judgments, in Section 5.) Both classifiers achieved overall accuracy rates that rival inter-rater reliability for human listeners’ coding of acoustically complex variables such as these.
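One way to put classifier performance and inter-rater reliability on the same footing is to score automatic codes against hand codes with the measures one would use for two human coders. The sketch below computes overall accuracy, per-class accuracy, and Cohen's kappa. The kappa statistic is an addition for illustration rather than a measure named in the text, and the codes are invented.

```python
from collections import Counter

def overall_accuracy(gold, predicted):
    """Share of tokens where the automatic code matches the hand code."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def class_accuracy(gold, predicted):
    """Accuracy computed separately for each hand-coded class."""
    totals = Counter(gold)
    hits = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {label: hits[label] / totals[label] for label in totals}

def cohens_kappa(codes_a, codes_b):
    """Chance-corrected agreement between two sets of codes."""
    n = len(codes_a)
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    labels = set(codes_a) | set(codes_b)
    expected = sum(freq_a[k] * freq_b[k] for k in labels) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical held-out tokens of non-prevocalic /r/.
hand_codes = ["Present", "Present", "Absent", "Absent", "Absent",
              "Present", "Absent", "Absent", "Present", "Absent"]
auto_codes = ["Present", "Absent", "Absent", "Absent", "Absent",
              "Present", "Absent", "Present", "Present", "Absent"]

print(overall_accuracy(hand_codes, auto_codes))
print(class_accuracy(hand_codes, auto_codes))
print(round(cohens_kappa(hand_codes, auto_codes), 3))
```

Reporting per-class accuracy alongside the overall rate matters when one variant is much rarer than the other, since a classifier can score well overall while performing poorly on the minority class.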
Summary
Researchers in sociophonetics and variationist sociolinguistics have increasingly turned to computational methods to automate time-consuming research tasks such as data extraction (e.g., Fromont & Hay, 2012), phonetic alignment (e.g., McAuliffe, Socolof, Mihuc, Wagner, & Sonderegger, 2017; Rosenfelder, Fruehwald, Evanini, & Yuan, 2011), transcription (e.g., Reddy & Stanford, 2015), and measurement of vowels (e.g., Labov, Rosenfelder, & Fruehwald, 2013), consonants (e.g., Schuppler, Ernestus, Scharenborg, & Boves, 2011; Schuppler, van Dommelen, Koreman, & Ernestus, 2012; Sonderegger & Keshet, 2012), and suprasegmentals (e.g., Rosenberg, 2017).
Villarreal et al.: From categories to gradience
[...] allophones), to variable data based on acoustic features. Complicating this process are two common properties of sociophonetic variables: acoustic complexity and gradient variability. When listeners (whether trained or lay) hear tokens of a sociophonetic variable, they do not hear one or two salient acoustic measures, but rather a constellation of acoustic cues. These cues’ distributions seldom divide neatly into bins corresponding to individual allophonic variants, but are rather characterized by continuous, gradient variation; this fact belies the long-standing categorical treatment of variables like /r/, a treatment that hides from researchers’ view the potential meaningfulness of tokens that exist in the gray area between cardinal variants. Just as human coders contend with acoustic complexity and gradient variability in making coding judgments (not always with success, as evidenced by low inter-rater reliability for variables like /r/; see Section 3.1), so too must an automated method for sociophonetic coding contend with acoustic complexity and gradient variability if its predictions are to have any validity.
More From: Laboratory Phonology: Journal of the Association for Laboratory Phonology