Abstract

This study examines both linguistic and acoustic feature modeling for anger classification. For acoustic modeling we derive statistics from acoustic audio descriptors such as pitch, loudness and spectral characteristics. Feature ranking shows that loudness and MFCC features appear most promising for all databases; for the English database, pitch features are also important. For linguistic modeling we apply probabilistic and entropy-based models of words and phrases, namely Bag-of-Words (BOW), Term Frequency (TF), Term Frequency–Inverse Document Frequency (TF.IDF) and Self-Referential Information (SRI). SRI clearly outperforms the vector-space models, and modeling phrases slightly improves the scores. After classifying acoustic and linguistic information separately, we fuse the two streams at the decision level by adding confidence scores. We compare the resulting scores on three databases: two from the IVR customer-care domain and one from a Wizard-of-Oz (WoZ) data collection, all recorded under realistic speech conditions. We observe promising results for the IVR databases, while the WoZ database yields lower scores overall. To allow comparison between results we evaluate classification success using the F1 measure in addition to overall accuracy. Acoustic modeling clearly outperforms linguistic modeling, and fusion slightly improves the overall scores. Over a baseline of approximately 60% accuracy and .40 F1 obtained by constant majority-class voting, we reach 75% accuracy with .70 F1 on the WoZ database. For the IVR databases we reach approximately 79% accuracy with .78 F1 over a baseline of 60% accuracy with .38 F1.
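
The decision-level fusion described above can be illustrated with a minimal sketch in Python. The two-class setup, the function name and the confidence values below are illustrative assumptions for demonstration, not the authors' implementation:

    # Minimal sketch: decision-level fusion by adding per-class confidences.
    # Each classifier (acoustic, linguistic) is assumed to output one
    # confidence score per class for a single utterance; values are made up.

    def fuse_decisions(acoustic_conf, linguistic_conf):
        """Add per-class confidences and return the winning class plus the fused scores."""
        fused = {c: acoustic_conf[c] + linguistic_conf[c] for c in acoustic_conf}
        return max(fused, key=fused.get), fused

    acoustic_conf = {"anger": 0.71, "non-anger": 0.29}
    linguistic_conf = {"anger": 0.55, "non-anger": 0.45}

    label, fused = fuse_decisions(acoustic_conf, linguistic_conf)
    print(label, fused)  # -> anger {'anger': 1.26, 'non-anger': 0.74}

A weighted sum of confidences would be a natural variant; since the abstract only states that confidences are added, equal weights are used in this sketch.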

Highlights

  • Detecting emotions in vocal human-computer interaction (HCI) is gaining increasing attention in speech research

  • We investigate the performance of word modeling for anger recognition using four different feature spaces, i.e. Bag-of-Words (BOW), Term Frequency (TF), Term Frequency–Inverse Document Frequency (TF.IDF) and Self-Referential Information (SRI); a minimal sketch of the vector-space variants follows this list

  • It should be noted that all of the experimental databases in those works were of low average word-per-turn rate as well
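
To make the four feature spaces in the second highlight concrete, the following sketch builds BOW, TF and TF.IDF vectors over a handful of toy utterance transcripts. The transcripts and function names are illustrative assumptions; SRI is omitted here because its exact entropy-based formulation is given in the paper itself:

    import math
    from collections import Counter

    # Toy utterance transcripts (illustrative only, not taken from the corpora).
    utterances = [
        "i want to talk to an agent",
        "this is not working at all",
        "talk to an agent now",
    ]

    # Shared vocabulary over the whole collection.
    vocab = sorted({w for u in utterances for w in u.split()})

    def bow_vector(utterance):
        """Bag-of-Words: raw term counts per vocabulary entry."""
        counts = Counter(utterance.split())
        return [counts[w] for w in vocab]

    def tf_vector(utterance):
        """Term Frequency: counts normalised by utterance length."""
        words = utterance.split()
        counts = Counter(words)
        return [counts[w] / len(words) for w in vocab]

    def idf(word):
        """Inverse Document Frequency over the utterance collection."""
        df = sum(1 for u in utterances if word in u.split())
        return math.log(len(utterances) / df) if df else 0.0

    def tfidf_vector(utterance):
        """TF.IDF: term frequency weighted by inverse document frequency."""
        return [tf * idf(w) for tf, w in zip(tf_vector(utterance), vocab)]

    print(tfidf_vector(utterances[0]))

Each resulting vector would then serve as the linguistic feature representation of one utterance turn for the classifier.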


Summary


Tim Polzehl (a), Alexander Schmitt (b), Florian Metze (c), Michael Wagner (d)

(a) Quality and Usability Lab, Technische Universität Berlin / Deutsche Telekom Laboratories, Ernst-Reuter-Platz 7, D-10587 Berlin, Germany
(b) Dialogue Systems Group / Institute of Information Technology, University of Ulm, Albert-Einstein-Allee 43, D-89081 Ulm, Germany
(c) Language Technologies Institute, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, U.S.A.
(d) National Centre for Biometric Studies, University of Canberra, ACT 2601, Australia

Introduction
Selected Corpora
Feature Definition
Fusion of Linguistic and Acoustic Features
Discussion
Signal Quality
Scenario
Speakers
Class Labels and Agreement