Abstract

Many features have been proposed for speech-based emotion recognition, and a majority of them are frame-based or statistics estimated from frame-based features. Temporal information is typically modelled on a per-utterance basis, with either functionals of frame-based features or a suitable back-end. This paper investigates an approach that combines both, using temporal contours of parameters extracted from a three-component model of speech production as features in an automatic emotion recognition system with a hidden Markov model (HMM)-based back-end. Consequently, the proposed system models information on a segment-by-segment scale that is larger than the frame scale but smaller than the utterance level. Specifically, linear approximations to temporal contours of formant frequencies, glottal parameters and pitch are used to model short-term temporal information over individual segments of voiced speech. This is followed by the use of HMMs to model longer-term temporal information contained in sequences of voiced segments. Listening tests were conducted to validate the use of linear approximations in this context. Automatic emotion classification experiments were carried out on the Linguistic Data Consortium emotional prosody speech and transcripts corpus and the FAU Aibo corpus to validate the proposed approach.
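
As a rough illustration of the front-end idea (not the authors' exact implementation), the sketch below approximates the pitch contour of each voiced segment with a first-order least-squares fit and uses the resulting slope, intercept and segment duration as segment-level features; the contour values and segment boundaries are hypothetical placeholders, and the same procedure would apply to formant and glottal parameter contours.

```python
import numpy as np

def segment_contour_features(contour, segments):
    """Approximate each voiced segment's contour by a straight line.

    contour  : 1-D array of frame-level parameter values (e.g. pitch in Hz)
    segments : list of (start_frame, end_frame) tuples marking voiced segments
    Returns one row per segment: [slope, intercept, duration_in_frames].
    """
    features = []
    for start, end in segments:
        frames = np.arange(start, end)
        values = contour[start:end]
        # first-order least-squares fit: values ~ slope * frame + intercept
        slope, intercept = np.polyfit(frames, values, deg=1)
        features.append([slope, intercept, end - start])
    return np.asarray(features)

# Hypothetical example: a rising then falling pitch contour over two voiced segments.
f0 = np.concatenate([np.linspace(120, 150, 40), np.linspace(180, 160, 30)])
voiced_segments = [(0, 40), (40, 70)]
print(segment_contour_features(f0, voiced_segments))
```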

Highlights

  • Human speech is an acoustic waveform generated by the vocal apparatus, whose parameters are modulated by the speaker to convey information

  • Classification experiments were performed with the hidden Markov model (HMM)-based system using features based on pitch contours, glottal parameter contours and formant contours individually

  • In this paper we explore a combined approach, extracting ‘short-term’ temporal information in the front-end and modelling ‘longer-term’ temporal information with the back-end (a minimal sketch of such a back-end follows this list)
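
A minimal sketch of the back-end idea, assuming Gaussian-emission HMMs from the hmmlearn library (not necessarily the configuration used in the paper): one HMM is trained per emotion class on sequences of segment-level feature vectors, and a test utterance is assigned to the class whose model gives the highest log-likelihood. The number of states, feature dimensionality and toy data below are illustrative only.

```python
import numpy as np
from hmmlearn import hmm

def train_class_models(train_data, n_states=3):
    """Train one Gaussian HMM per emotion class.

    train_data maps a class label to a list of utterances, each utterance
    being an (n_segments, n_features) array of segment-level features.
    """
    models = {}
    for label, utterances in train_data.items():
        X = np.vstack(utterances)               # stack segments from all utterances
        lengths = [len(u) for u in utterances]  # number of segments per utterance
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag", n_iter=50)
        model.fit(X, lengths)
        models[label] = model
    return models

def classify(models, utterance):
    """Assign the class whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda label: models[label].score(utterance))

# Hypothetical toy data: two emotion classes, 3-dimensional segment features.
rng = np.random.default_rng(0)
train = {"neutral": [rng.normal(0.0, 1.0, (12, 3)) for _ in range(5)],
         "angry":   [rng.normal(2.0, 1.0, (12, 3)) for _ in range(5)]}
models = train_class_models(train)
print(classify(models, rng.normal(2.0, 1.0, (10, 3))))
```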

Summary

Introduction

Human speech is an acoustic waveform generated by the vocal apparatus, whose parameters are modulated by the speaker to convey information. The physical characteristics and the mental state of the speaker determine how these parameters are affected and how speech conveys the intended, and on occasion unintended, information. Information about emotional state is expressed via speech through numerous cues, ranging from low-level acoustic ones to high-level linguistic content; several approaches to speech-based automatic emotion recognition, each taking advantage of a few of these cues, have been explored [1-9]. In conventional front-ends, the glottal model is assumed to be a two-pole low-pass system (typically with both poles at unity) whose effects are ‘removed’ at the pre-emphasis stage of feature extraction.
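
As context for that last point, the sketch below shows the standard first-order pre-emphasis filter y[n] = x[n] - a * x[n-1] (with a close to 1), which is the usual way such a low-pass glottal/lip-radiation contribution is compensated before feature extraction; the coefficient 0.97 is a common illustrative choice, not a value taken from this paper.

```python
import numpy as np
from scipy.signal import lfilter

def pre_emphasis(signal, coeff=0.97):
    """Apply the first-order pre-emphasis filter y[n] = x[n] - coeff * x[n-1]."""
    return lfilter([1.0, -coeff], [1.0], signal)

# Hypothetical example: pre-emphasise a short random 'speech' frame.
x = np.random.default_rng(1).normal(size=400)
y = pre_emphasis(x)
print(y[:5])
```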
