Abstract

Measurements of the physical outputs of speech (vocal tract geometry and acoustic energy) are high-dimensional, but linguistic theories posit a low-dimensional set of categories such as phonemes and phrase types. How can it be determined when and where in high-dimensional articulatory and acoustic signals there is information related to theoretical categories? For a variety of reasons, it is problematic to quantify mutual information between hypothesized categories and signals directly. To address this issue, a multi-scale analysis method is proposed for localizing category-related information in an ensemble of speech signals using machine learning algorithms. By analyzing how classification accuracy on unseen data varies as the temporal extent of training input is systematically restricted, inferences can be drawn regarding the temporal distribution of category-related information. The method can also be used to investigate redundancy between subsets of signal dimensions. Two types of theoretical categories are examined in this paper: phonemic/gestural categories and syntactic relative clause categories. In addition, two different machine learning algorithms are examined: linear discriminant analysis and neural networks with long short-term memory units. Both algorithms detected category-related information earlier and later in signals than would be expected given standard theoretical assumptions about when linguistic categories should influence speech. The neural network algorithm identified category-related information to a greater extent than the discriminant analysis.
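To make the procedure concrete, the sketch below (in Python, not the authors' code) trains a linear discriminant classifier on a temporally restricted window of each signal and records accuracy on held-out trials. The function name windowed_accuracy, the array shapes, the number of repeats, the test fraction, and the use of scikit-learn are illustrative assumptions.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

def windowed_accuracy(signals, labels, win_start, win_end, n_repeats=50):
    # signals: array of shape (n_trials, n_frames, n_dims); labels: (n_trials,)
    # Restrict every trial to the frames [win_start, win_end), flatten the
    # window, and estimate classification accuracy on unseen trials.
    X = signals[:, win_start:win_end, :].reshape(len(signals), -1)
    accuracies = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, labels, test_size=0.25, stratify=labels, random_state=seed)
        clf = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
        accuracies.append(clf.score(X_te, y_te))
    return np.mean(accuracies)

# Sweeping win_start and win_end over many window widths and onsets, and
# comparing the resulting accuracies to chance, indicates where in time
# category-related information is located.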

Highlights

  • What does it mean to “localize” category-related information in speech, and why is this a worthwhile goal? Theoretical models of speech often posit discrete categories, such as phones, articulatory gestures, syllables, moras, words, pitch accents, and a hierarchy of phrase types.

  • How can we test whether speech signals contain evidence for these theoretical constructs, and if they do, how can we determine when in time that evidence is located? This is not a trivial problem, and it requires careful attention to our assumptions about the nature of speech and about the concept of information.

  • The multi-scale analysis technique presented here was shown to be useful for localizing category-related information in articulatory and acoustic signals.

Introduction

Theoretical models of speech often posit discrete categories, such as phones, articulatory gestures, syllables, moras, words, pitch accents, and a hierarchy of phrase types. None of these theoretical entities is self-evidently “present” at any given point in time in acoustic or articulatory signals. To test whether and when the signals contain evidence of such categories, a machine learning algorithm is trained to learn mappings from signal inputs to category labels, and its classification accuracy on unseen data is recorded. This training-testing procedure is repeated many times. We examine two such algorithms: linear discriminant analysis (LDA) and deep neural network classification using two layers of bidirectional long short-term memory units (biLSTM).
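For the second classifier type, a minimal sketch in PyTorch is given below: it stacks two bidirectional LSTM layers over the frames of a signal and reads out a category label. The class name BiLSTMClassifier, the hidden size, and the final-frame readout are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, n_dims, n_classes, hidden=64):
        super().__init__()
        # Two stacked bidirectional LSTM layers over the time (frame) axis
        self.lstm = nn.LSTM(n_dims, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                # x: (batch, frames, n_dims)
        h, _ = self.lstm(x)              # h: (batch, frames, 2 * hidden)
        return self.out(h[:, -1, :])     # class scores from the final frame

# As with the LDA sketch, the network would be fit on a subset of trials and
# scored on unseen trials, with the train/test procedure repeated many times.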
