Automated accent rating using deep neural networks

Tyler T Schnoor, Matthew C Kelley, Benjamin V Tucker

doi:10.1121/10.0008581

Abstract

Automated accentedness rating has the potential to improve many human-computer interactions involving speech, including the adaptation of automatic speech recognition or other artificial intelligence models to the speaker's accent. Accent ratings may also be used as a metric by which language learners can quantify their progress. This study employs bidirectional long short-term memory layers in a neural network to predict human ratings of the accentedness of recorded speech. Speech data are extracted in 5-s segments from over 2000 first- and second-language English speakers from multiple corpora. Human ratings are obtained in an online experiment where participants rate the accentedness of a given speech recording on a 9-point Likert scale. Mel-frequency cepstral coefficients and mel-filterbank energy features are tested as speech input representations for the neural network. When inference is tested using 10-fold cross-validation, the mean correlation between the model's predictions and human ratings is high (r = 0.74). While previous methods attained a similar correlation by automatically comparing speech that has been transcribed [Wieling et al., Lang. Dyn. Chang. 4, 253–269 (2014)] or by making accent-specific Gaussian mixture models [Cheng et al., Interspeech 2013 (2013), pp. 2574–2578], the present model requires no transcription and can perform accent-general inference.
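The evaluation metric reported above, a mean Pearson correlation over 10 cross-validation folds, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the study's implementation: the toy data are synthetic, the one-dimensional `features` stand in for acoustic input, and `fit_line` is a simple least-squares line used as a placeholder where the study trains a bidirectional LSTM network. The helper names (`pearson_r`, `k_fold_indices`, `mean_cv_correlation`) are hypothetical.

```python
import random
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle the indices 0..n_items-1 and deal them into k disjoint folds."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def mean_cv_correlation(features, ratings, fit, k=10, seed=0):
    """Train on k-1 folds, predict the held-out fold, and average Pearson r."""
    scores = []
    for held_out in k_fold_indices(len(ratings), k, seed):
        held = set(held_out)
        train = [i for i in range(len(ratings)) if i not in held]
        predict = fit([features[i] for i in train], [ratings[i] for i in train])
        preds = [predict(features[i]) for i in held_out]
        scores.append(pearson_r(preds, [ratings[i] for i in held_out]))
    return mean(scores)


def fit_line(xs, ys):
    """Placeholder model (assumption): least-squares line, not the paper's network."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)


# Synthetic toy data (assumption): one scalar feature per recording and a
# noisy rating clipped to the 9-point scale used in the rating experiment.
rng = random.Random(1)
features = [rng.uniform(0.0, 1.0) for _ in range(200)]
ratings = [min(9.0, max(1.0, 1.0 + 8.0 * f + rng.gauss(0.0, 1.0))) for f in features]

score = mean_cv_correlation(features, ratings, fit_line, k=10)
print(f"mean Pearson r over 10 folds: {score:.2f}")
```

Because the correlation is computed only on recordings held out of training in each fold, the averaged value reflects how well predictions generalize to unseen recordings rather than how well the model fits its training data.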
