Rethinking classification results based on read speech, or: why improvements do not always transfer to other speaking styles

Barbara Schuppler

doi:10.1007/s10772-017-9436-y

Abstract

With the growing interest among speech scientists in working with natural conversations, the popularity of articulatory–acoustic features (AFs) as the basic unit has also increased. They have been shown to be more suitable than purely phone-based approaches. Even though the motivation for AF classification is driven by the properties of conversational speech, most new methods continue to be developed on read speech corpora (e.g., TIMIT). In this paper, we show in two studies that the improvements obtained on read speech do not always transfer to conversational speech. The first study compares four different variants of acoustic parameters for AF classification of both read and conversational speech using support vector machines. Our experiments show that the proposed set of acoustic parameters substantially improves AF classification for read speech, but only marginally for conversational speech. The second study investigates whether labeling inaccuracies can be compensated for by a data selection approach. Again, although a substantial improvement was found with the data selection approach for read speech, this was not the case for conversational speech. Overall, these results suggest that we cannot continue to develop methods for one speech style and expect that improvements transfer to other styles. Instead, the nature of the application data (here: read vs. conversational) should be taken into account already when defining the basic assumptions of a method (here: segmentation in phones), and not only when applying the method to the application data.

Highlights

Speech science and technology used to rely on the assumption that speech utterances can be described as a sequence of words and that words are composed of a sequence of phones, known as the ‘beads on a string’ model of speech (Ostendorf 1999).
In order to compare the results with those obtained on TIMIT, the classifiers were tested on material with automatically created articulatory–acoustic feature (AF) labels.
Our results show that for conversational speech, our set of acoustic parameters Both did not yield an improvement in comparison to Baseline (i.e., F = 0.65 vs. F = 0.66).
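
The F values quoted above are classification scores. As a minimal sketch, assuming F denotes the balanced F-measure (the harmonic mean of precision and recall; the text here does not spell out the definition), such a score could be computed for one AF value as follows. The function name `f_score` and the label sequences are hypothetical and are not data from the paper.

```python
def f_score(true_labels, predicted_labels, positive):
    """Balanced F-measure (F1) for one class, computed from two label sequences."""
    pairs = list(zip(true_labels, predicted_labels))
    # Count true positives, false positives and false negatives for the class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # Precision and recall are both zero (or undefined); report 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical frame-level labels for a binary voicing feature (illustration only).
reference = ["voiced", "voiced", "unvoiced", "voiced", "unvoiced", "unvoiced"]
hypothesis = ["voiced", "unvoiced", "unvoiced", "voiced", "voiced", "unvoiced"]

score = f_score(reference, hypothesis, positive="voiced")
print(round(score, 2))
```

Under this definition, the 0.01 gap between Both and Baseline corresponds to a very small change in the balance of correct and incorrect decisions.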

Summary

Introduction

Speech science and technology used to rely on the assumption that speech utterances can be described as a sequence of words and that words are composed of a sequence of phones, known as the ‘beads on a string’ model of speech (Ostendorf 1999). This model works satisfactorily for carefully produced speech, but it runs into problems with conversational speech, mainly due to the high pronunciation variability (Saraçlar et al. 2000). Greenberg (1999) reports an average of 22.2 pronunciation
