Posterior-thresholding feature extraction for paralinguistic speech classification

Gábor Gosztolya

doi:10.1016/j.knosys.2019.104943

Abstract

The standard approach for handling computational paralinguistic speech tasks is to extract several thousand utterance-level features from the speech excerpts, and then use machine learning methods such as Support Vector Machines and Deep Neural Networks (DNNs) for the actual classification task. In contrast, Automatic Speech Recognition handles the speech signal in small, equal-sized parts called frames. Although the speech community has developed techniques for efficient frame classification, these efforts have mostly been ignored in computational paralinguistics. In this study, we propose a simple, three-step technique to utilize frame-level DNN training know-how in computational paralinguistics. We show that this method by itself provides good accuracy scores, and that by combining it with the standard paralinguistic classification approach, we get close to the performance of heavyweight, state-of-the-art techniques such as Fisher vector analysis. However, our approach has the advantage that it can be easily realized using standard speech recognition tools. To demonstrate the general applicability of the proposed three-step method, we performed our experiments on four different corpora containing different paralinguistic tasks. Overall, we were able to achieve improvements over the baseline score in all four cases, leading to relative error reductions of up to 19%.
